Resources
How to host your models on resources.

Introduction

Resources represent dedicated GPUs to host and use your models. Resources help you scale by providing:
1. More stable API response speeds and throughput
2. An alternative payment model that can save up to 90% in costs compared to pay-per-token
Large language models require computing resources to provide responses, namely a GPU. As discussed in Pricing, you can use your models on a pay-per-token basis or host your models on resources.
When using pay-per-token ("PPT"), your requests are sent to shared GPUs used by all other PPT users on the Forefront platform. Spikes in traffic on these shared resources can cause slowdowns in response speeds.
When using resources, your requests are sent to GPUs dedicated to your team. Instead of paying per token, resources are billed at a flat hourly rate, allowing you to process as many requests or tokens as the GPU(s) can handle in that time. You can control the state and scaling settings of the resources to meet your use case and request volume. If a resource is live for only a fraction of an hour, usage is prorated to the minute. View resource rates
Forefront hosting vs. self-hosting
Resources are the best option for high-volume (many requests per minute) batch or real-time use cases. At scale, the cost per request on resources can be many times lower than with pay-per-token.
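To make the cost comparison concrete, here is a minimal sketch of the break-even calculation. The prices used are purely hypothetical placeholders; substitute your model's actual hourly resource rate and pay-per-token price from the rates page.

```python
def breakeven_tokens_per_hour(hourly_rate, price_per_1k_tokens):
    """Tokens per hour above which a flat-rate resource becomes cheaper
    than pay-per-token. Prices here are illustrative, not Forefront's
    actual rates."""
    return hourly_rate / price_per_1k_tokens * 1000

# With a hypothetical $1.00/hr resource and $0.002 per 1k tokens,
# the resource wins once you process more than 500k tokens per hour.
threshold = breakeven_tokens_per_hour(1.00, 0.002)
```

Past that threshold, every additional token on the resource is effectively free for the hour, which is where the large savings at scale come from.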
At a high level, using models on a resource involves the following steps:
1. Add a resource
2. Set scaling settings
3. Use models on a resource

Add a resource

You can add a resource in your dashboard. To begin, navigate to "Resources -> Add resource". Then complete the following steps:
1. Select the model you'd like to host. Each resource can host all models of a matching type. For example, a GPT-J resource will host all of your GPT-J models.
2. Select the performance option. Depending on the model, you will have 1-3 performance options to choose from. Each performance option represents a hardware configuration to host and run inference on your models.
Click "Add resource" and you will now have a resource added to your "Resources" tab.

Set scaling settings

Scaling settings for your resources work similarly to Amazon EC2: you can set a fixed number of GPUs, or autoscale to ensure the correct number of GPUs is available to handle your application's load.
The following scaling settings can be configured:
1. Turn autoscaling on or off
2. If autoscaling is off, set a fixed number of GPUs
3. If autoscaling is on, set a minimum and maximum number of GPUs
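The min/max bounds matter because they cap what autoscaling can do. A rough sketch of the capacity math behind choosing them, where the per-GPU throughput figure is an assumption you would measure for your own model, resource, and request size:

```python
import math

def replicas_needed(requests_per_min, req_per_min_per_gpu, min_gpus, max_gpus):
    """Estimate how many GPUs a given load requires, clamped to the
    autoscaling bounds configured on the resource. The per-GPU
    throughput is something you benchmark, not a platform constant."""
    needed = math.ceil(requests_per_min / req_per_min_per_gpu)
    return max(min_gpus, min(needed, max_gpus))

# e.g. 250 req/min at a measured 60 req/min per GPU, bounded to 1-8 GPUs
gpus = replicas_needed(250, 60, min_gpus=1, max_gpus=8)
```

If the estimate regularly hits your maximum, requests will queue and response times will rise, which is the signal to raise the bound.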
The scaling settings you should set depend on whether you have a batch or real-time use case.
Here's a script to benchmark latency and throughput based on your chosen model, resource, and request size.
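In the same spirit as that script, here is a minimal benchmarking sketch. `send_request` is a placeholder for a zero-argument callable that performs one real call to your resource's endpoint; the timing and reporting logic is what matters here.

```python
import time
import statistics

def benchmark(send_request, n_requests=20):
    """Time n_requests sequential calls and report median latency and
    overall throughput. `send_request` is any zero-argument callable
    that performs one API call against your resource."""
    latencies = []
    start = time.perf_counter()
    for _ in range(n_requests):
        t0 = time.perf_counter()
        send_request()
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    return {
        "p50_ms": statistics.median(latencies) * 1000,
        "throughput_rps": n_requests / elapsed,
    }
```

Run it with a few representative request sizes; the results tell you how many requests per minute one GPU can absorb, which feeds directly into the scaling bounds above.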

Use models on a resource

To start using models on your resource, click "Resources -> Select resource -> Turn on". It can take a few minutes for the GPU(s) to turn on. At first, you'll see a gray offline indicator that will switch to a green online indicator when the resource is live.
Once the green online indicator is visible, you can use any of the models in the playground or via API. A resource will automatically host all models of the same model type from your "Models" list.
To use the API of any model hosted on the resource, click the model in "Resources -> Select resource -> Models" to view its API URL and an example code snippet.
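As a minimal sketch of calling a hosted model using only the standard library: the URL, header names, and body fields below are placeholders, so copy the real URL and example snippet from your dashboard before sending anything.

```python
import json
from urllib import request

# Placeholders -- replace with the API URL and key shown for your model
# under "Resources -> Select resource -> Models".
RESOURCE_URL = "https://example-resource-url/completions"
API_KEY = "your-api-key"

def build_request(prompt, max_tokens=64, temperature=0.7):
    """Assemble (but do not send) an HTTP request for a model hosted on
    a resource. Body field names are illustrative assumptions."""
    body = json.dumps({
        "text": prompt,
        "length": max_tokens,
        "temperature": temperature,
    }).encode("utf-8")
    return request.Request(
        RESOURCE_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_request("Once upon a time")
# request.urlopen(req) would send it once the resource shows the green
# online indicator; here we only assemble the request.
```

Because the resource hosts every model of the matching type, the same pattern works for each model in your list, with only the URL changing.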

Metrics

While your resource is turned on, you can navigate to "Resources -> Select resource -> Metrics" to view resource metrics. The following metrics are provided:
1. Request volume: the number of requests processed per second
2. Replica count: the number of live GPUs
3. Response time: the 50th and 95th percentile of response speed (in milliseconds)
4. Error count: the number of requests resulting in an error
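To make the response-time metric concrete, here is a small sketch of a nearest-rank percentile, the kind of statistic a p50/p95 chart reports (the exact method the dashboard uses is not specified here, so treat this as illustrative):

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile of a list of latency samples (ms).
    percentile(samples, 50) is the median; percentile(samples, 95)
    is the latency 95% of requests beat."""
    ordered = sorted(samples)
    k = max(0, math.ceil(pct / 100 * len(ordered)) - 1)
    return ordered[k]
```

A large gap between p50 and p95 usually means intermittent queueing, which is a hint to revisit your scaling settings.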