How to host your models on resources.
Resources represent dedicated GPUs to host and use your models. Resources help you scale by providing:
1. More stable API response speeds and throughput
2. An alternative payment model that can save up to 90% in costs compared to pay-per-token
When using pay-per-token ("PPT"), your requests are sent to shared GPUs used by all other PPT users on the Forefront platform. Spikes in traffic on these shared resources can cause slowdowns in response speeds.
When using resources, your requests are sent to a GPU dedicated to your team. Instead of paying per token, you are billed a flat hourly rate, which lets you process as many requests or tokens as the GPU(s) can handle in that time. You can control the state and scaling settings of your resources to match your use case and request volume. If a resource is live for only a fraction of an hour, usage is prorated to the minute. View resource rates
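The prorated billing described above can be sketched as follows. This is a minimal illustration, not Forefront's billing logic: the function name and the hourly rate are hypothetical, and it assumes usage is counted in whole minutes per GPU.

```python
def prorated_cost(hourly_rate: float, minutes_live: int, num_gpus: int = 1) -> float:
    """Estimate the cost of keeping `num_gpus` GPUs live for `minutes_live` minutes.

    Assumes a flat hourly rate per GPU, prorated to the minute.
    """
    if hourly_rate < 0 or minutes_live < 0 or num_gpus < 0:
        raise ValueError("inputs must be non-negative")
    per_minute_rate = hourly_rate / 60
    return round(per_minute_rate * minutes_live * num_gpus, 2)


# Hypothetical rate of $1.20 per GPU-hour; one GPU live for 45 minutes.
print(prorated_cost(1.20, 45))
```

Substitute the actual rate for your chosen model and performance option.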
Hosting your models on Forefront saves you the time of building and maintaining infrastructure. Running inference on Forefront is also typically more cost efficient than self-hosting, thanks to the model optimizations we've made.
Host multiple models on a single GPU. A feature we've coined "hotswapping" lets you host multiple models of the same model type on a single GPU. This enables many use cases that are not possible when each GPU can host only one model, which is the case without our proprietary optimizations.
Every model. 2x faster than HuggingFace. It's not a simple task to host many different large language models. Each model comes with unique challenges during implementation, and the larger the model, the more complex it is to host and inference cost efficiently. We've relentlessly optimized our proprietary implementations of each model to decrease latency and optimize GPU usage, outperforming model implementations on HuggingFace by at least 2x.
Resources are the best option for high-volume (many requests per minute) batch or real-time use cases. At scale, the cost per request on resources can be many times lower than with pay-per-token.
At a high level, using models on a resource involves the following steps:
1. Add a resource
2. Set scaling settings
3. Use models on a resource
You can add a resource in your dashboard. To begin, navigate to "Resources -> Add resource". Then complete the following steps:
1. Select the model you'd like to host. Each resource can host all models of a matching type. For example, a GPT-J resource will host all of your GPT-J models.
2. Select the performance option. Depending on the model, you will have 1-3 performance options to choose from. Each performance option represents a hardware configuration to host and inference your models.
Click "Add resource" and you will now have a resource added to your "Resources" tab.
Scaling settings for your resources work similarly to Amazon EC2: you can set a fixed number of GPUs or autoscale, so that you have the right number of GPUs available to handle your application's load.
The following scaling settings can be configured:
1. Turn autoscaling on or off
2. If autoscaling is off, set a fixed number of GPUs
3. If autoscaling is on, set a minimum and maximum number of GPUs
4. You can enable scale-to-zero by setting the minimum number of GPUs to 0
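One way to picture how these settings relate is a small data model. This is only a sketch: `ScalingSettings` and its field names are hypothetical and are not part of any Forefront API. The settings themselves are configured in the dashboard. The validation rules (non-negative counts, maximum at least the minimum) are assumptions as well.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ScalingSettings:
    """Hypothetical model of the scaling settings described above."""

    autoscaling: bool
    fixed_gpus: Optional[int] = None  # used only when autoscaling is off
    min_gpus: Optional[int] = None    # used only when autoscaling is on
    max_gpus: Optional[int] = None    # used only when autoscaling is on

    def validate(self) -> None:
        if self.autoscaling:
            if self.min_gpus is None or self.max_gpus is None:
                raise ValueError("autoscaling needs both min_gpus and max_gpus")
            if self.min_gpus < 0 or self.max_gpus < self.min_gpus:
                raise ValueError("need 0 <= min_gpus <= max_gpus")
        else:
            if self.fixed_gpus is None or self.fixed_gpus < 0:
                raise ValueError("fixed scaling needs a non-negative fixed_gpus")

    @property
    def scale_to_zero(self) -> bool:
        # Scale-to-zero is simply autoscaling with a minimum of 0 GPUs.
        return self.autoscaling and self.min_gpus == 0


settings = ScalingSettings(autoscaling=True, min_gpus=0, max_gpus=4)
settings.validate()
print(settings.scale_to_zero)
```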
The scaling settings you should set depend on whether you have a batch or real-time use case.
A batch job is one where you need to send a fixed number of requests on a one-time or recurring basis.
To run a batch job, you should set the following settings:
1. Turn autoscaling off
2. Set a fixed number of GPUs based on the number of requests you want to process per minute.
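As a rough sketch of how the fixed GPU count for a batch job could be sized: divide the request rate you want by the rate a single GPU sustains, and round up. The function names and the figures in the example are hypothetical. The per-GPU rate depends on your model, performance option, and request size, so it should come from your own benchmark.

```python
import math


def gpus_for_batch(target_rpm: int, rpm_per_gpu: int) -> int:
    """Smallest GPU count whose combined throughput reaches `target_rpm`."""
    if target_rpm <= 0 or rpm_per_gpu <= 0:
        raise ValueError("rates must be positive")
    return math.ceil(target_rpm / rpm_per_gpu)


def batch_minutes(total_requests: int, num_gpus: int, rpm_per_gpu: int) -> float:
    """Estimated minutes to finish the batch at full throughput."""
    if num_gpus <= 0 or rpm_per_gpu <= 0:
        raise ValueError("num_gpus and rpm_per_gpu must be positive")
    return total_requests / (num_gpus * rpm_per_gpu)


# Hypothetical batch: 60,000 requests, aiming for 500 requests per minute,
# with one GPU measured at 100 requests per minute.
gpus = gpus_for_batch(target_rpm=500, rpm_per_gpu=100)
print(gpus, batch_minutes(60_000, gpus, 100))
```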
A real-time use case is where you serve requests for end-users on a consistent basis. In other words, a use case that requires your models be available 24/7.
For a real-time use case, you should set the following settings:
1. Turn autoscaling on
2. Set a minimum number of GPUs so available throughput matches your minimum request volume.
3. Set a maximum number of GPUs so available throughput is above your peak request volume.
Available throughput can be defined as num_gpus * requests_per_minute. To get the requests per minute that can be processed by a single GPU based on your chosen model, resource, and request size, use the script below.
Take the following scenario:
An application at its lowest usage processes 200 requests per minute and at peak usage processes 1500 requests per minute. Based on its model, resource, and request size, 100 requests per minute can be processed on a single GPU.
The following settings should be set:
Minimum GPUs: 2
Maximum GPUs: 16
The maximum number should be above peak usage. How far above depends on how volatile usage is for a given application.
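The arithmetic behind this scenario can be sketched as below. The function name and the `headroom_gpus` parameter are hypothetical. One extra GPU of headroom is assumed here, which reproduces the settings in the scenario, but the right amount depends on how volatile your traffic is.

```python
import math
from typing import Tuple


def autoscaling_bounds(
    min_rpm: int, peak_rpm: int, rpm_per_gpu: int, headroom_gpus: int = 1
) -> Tuple[int, int]:
    """Return (min_gpus, max_gpus) for an autoscaled resource.

    min_gpus covers the lowest request volume; max_gpus covers the peak
    plus `headroom_gpus` extra GPUs so throughput stays above peak.
    """
    if rpm_per_gpu <= 0:
        raise ValueError("rpm_per_gpu must be positive")
    if min_rpm < 0 or peak_rpm < min_rpm:
        raise ValueError("need 0 <= min_rpm <= peak_rpm")
    min_gpus = math.ceil(min_rpm / rpm_per_gpu)
    max_gpus = math.ceil(peak_rpm / rpm_per_gpu) + headroom_gpus
    return min_gpus, max_gpus


# 200 requests/minute at the low point, 1500 at peak, 100 per GPU.
print(autoscaling_bounds(200, 1500, 100))  # (2, 16)
```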
Here's a script to benchmark latency and throughput based on your chosen model, resource, and request size.
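A minimal sketch of such a benchmark, using only the Python standard library. Everything specific in it is an assumption: the function names are hypothetical, and the `Authorization` header format and payload fields in `make_http_sender` are placeholders that should be replaced with the values from the example snippet shown for your model in the dashboard. The measured requests per minute also depends on the concurrency you choose.

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def make_http_sender(url: str, api_key: str, payload: dict) -> Callable[[], None]:
    """Build a callable that POSTs `payload` as JSON to `url`.

    The header format and payload shape are assumptions, not a documented API.
    """
    body = json.dumps(payload).encode("utf-8")
    headers = {
        "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        "Content-Type": "application/json",
    }

    def send() -> None:
        request = urllib.request.Request(url, data=body, headers=headers, method="POST")
        with urllib.request.urlopen(request, timeout=120) as response:
            response.read()  # wait for the full response body

    return send


def benchmark(
    send: Callable[[], None], num_requests: int = 20, concurrency: int = 4
) -> Dict[str, float]:
    """Call `send` `num_requests` times and report latency and throughput."""

    def timed_call(_: int) -> float:
        start = time.perf_counter()
        send()
        return time.perf_counter() - start

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed_call, range(num_requests)))
    wall_seconds = time.perf_counter() - wall_start

    return {
        "mean_latency_s": sum(latencies) / len(latencies),
        "max_latency_s": max(latencies),
        "requests_per_minute": num_requests / wall_seconds * 60,
    }


if __name__ == "__main__":
    # Stand-in sender that sleeps instead of calling an API. Swap in
    # make_http_sender(...) with your model's API URL, key, and a
    # representative request to benchmark a live resource.
    print(benchmark(lambda: time.sleep(0.05), num_requests=8, concurrency=4))
```

Run it against a resource scaled to a single GPU, with requests similar in size to your real traffic, so the resulting requests per minute can be used as the per-GPU figure in the throughput formula.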
To start using models on your resource, click "Resources -> Select resource -> Turn on". It can take a few minutes for the GPU(s) to turn on. At first, you'll see a gray offline indicator that will switch to a green online indicator when the resource is live.
Once the green online indicator is visible, you can use any of the models in the playground or via API. A resource will automatically host all models of the same model type from your "Models" list.
To use the API of any model hosted on the resource, click the model in "Resources -> Select resource -> Models" to view its API URL and an example code snippet.
While your resource is turned on, you can navigate to "Resources -> Select resource -> Metrics" to view resource metrics. The following metrics are provided:
1. Request volume: the number of requests processed per second
2. Replica count: the number of live GPUs
3. Response time: the 50th and 95th percentile of response speed (in milliseconds)
4. Error count: the number of requests resulting in an error
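To make the response time metric concrete, here is a sketch of how 50th and 95th percentiles can be computed from a list of response times. It uses the nearest-rank method, which is an assumption: the dashboard may compute percentiles differently. The sample latencies are made up for illustration.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of
    the samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Made-up response times in milliseconds.
latencies_ms = [120, 135, 150, 160, 180, 200, 220, 250, 400, 900]
print(percentile(latencies_ms, 50), percentile(latencies_ms, 95))
```

The 95th percentile is much more sensitive to slow outliers than the 50th, which makes it the more useful of the two for spotting a resource that is close to saturation.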