How to run batch jobs on the Forefront platform


Batch jobs are used to automate tasks that need to be performed on a regular basis. Some example uses of a batch job include:
  • Analyze the sentiment of Twitter comments on an hourly basis
  • Summarize papers published to arXiv on a daily basis
  • Extract addresses from invoices processed on a weekly basis
Batching is optimal whenever you have a large number of requests that you'd like to process on a one-time or recurring basis. To run a batch job with the Forefront API, follow a few best practices:
  1. Use a resource to host the models that will be processing batch jobs
  2. Set scaling settings for a batch use case
  3. Send requests in parallel equal to the batch size of the resource
  4. Include the Resources API in your workflow to turn your resource on and off

Use a resource

Resources are dedicated GPUs that host and serve your models. Resources are ideal for batch jobs for a few reasons:
  1. Resources are typically more cost-efficient than pay-per-token pricing
  2. You can control your resource's scaling settings to process as many requests per minute as needed
Check out our Resources guide to learn how to add a resource and begin using your models on one.

Set scaling settings

To run a batch job, you should set the following scaling settings:
  1. Turn autoscaling off
  2. Set a fixed number of GPUs based on the number of requests you want to process per minute
The number of requests that a single GPU can process depends on your model size, resource type, and request size (prompt + completion tokens). Here's a script to benchmark requests per minute for your use case.
If a GPU processes 100 requests per minute and you need to process 1,000 requests per minute, you should set the number of GPUs to 10.
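The sizing rule above is simple ceiling division. As a minimal sketch (the throughput numbers are illustrative and should come from your own benchmark):

```python
import math

def gpus_needed(target_rpm: int, rpm_per_gpu: int) -> int:
    """Fixed GPU count needed to sustain a target requests-per-minute rate."""
    return math.ceil(target_rpm / rpm_per_gpu)

# The example above: 1,000 requests/min at 100 requests/min per GPU.
print(gpus_needed(1000, 100))  # -> 10
```

Rounding up matters: at 1,050 requests per minute you'd still need 11 GPUs, not 10.5.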

Run a batch job

Each resource has a set batch size to optimize throughput. The batch size is the number of requests that can be processed in parallel on a single GPU. When running a batch job, send requests in parallel to match the total batch size of the resource (batch size per GPU × number of GPUs). The following code snippets show how to do this:
Resource options for models can change over time, and so will the associated batch sizes. Email us your model and resource type and we'll get in touch with the current batch size.
Python:
import math
import asyncio
from typing import List, Dict, Optional, Any, Callable

import requests
from typing_extensions import TypedDict, Literal


class CompletionRequestBody(TypedDict, total=False):
    prompt: Optional[str]
    max_tokens: Optional[int]
    temperature: Optional[float]
    top_p: Optional[float]
    top_k: Optional[int]
    tfs: Optional[float]
    repetition_penalty: Optional[float]
    stop: Optional[List[str]]
    logit_bias: Optional[Dict[str, float]]
    bad_words: Optional[List[str]]
    logprobs: Optional[int]
    n: Optional[int]
    echo: Optional[bool]
    stream: Optional[bool]


CompletionResult = str


class RawCompletionResponse(TypedDict):
    result: List[Dict[Literal['completion'], str]]
    timestamp: int
    logprobs: Any
    model: str


class ForefrontClient:
    def __init__(self, api_key: str, endpoint: str, num_replicas: int, batch_size_per_replica: int):
        self.api_key = api_key
        self.endpoint = endpoint
        # Total number of requests that can be in flight at once.
        self.batch_size = num_replicas * batch_size_per_replica

    async def make_request(self, data: CompletionRequestBody) -> CompletionResult:
        # requests is synchronous, so run it in a thread to avoid blocking the event loop.
        response = await asyncio.to_thread(
            requests.post,
            self.endpoint,
            json=data,
            headers={
                'content-type': 'application/json',
                'authorization': f'Bearer {self.api_key}',
            },
        )
        if response.status_code == 401:
            raise RuntimeError(
                'Your authentication is invalid. Go to your settings page on '
                'the Forefront dashboard to retrieve your API key')
        if response.status_code != 200:
            raise RuntimeError(
                f'API request failed with status code: {response.status_code}')
        body: RawCompletionResponse = response.json()
        return body['result'][0]['completion']

    async def make_many_requests(self, data: List[CompletionRequestBody]) -> List[CompletionResult]:
        # Fire off one task per request so they run concurrently.
        tasks = [asyncio.create_task(self.make_request(d)) for d in data]
        return await asyncio.gather(*tasks)

    async def batch_requests(self, data: List[CompletionRequestBody], result_handler: Optional[Callable[[CompletionResult], None]] = None) -> List[CompletionResult]:
        total = len(data)
        all_results: List[CompletionResult] = []
        for i in range(math.ceil(total / self.batch_size)):
            end_idx = min((i + 1) * self.batch_size, total)
            items = data[i * self.batch_size:end_idx]
            results = await self.make_many_requests(items)
            for r in results:
                if result_handler is not None:
                    result_handler(r)
                all_results.append(r)
        return all_results
TypeScript:
interface CompletionRequestParams {
  prompt: string;
  max_tokens: number;
  temperature?: number;
  top_p?: number;
  top_k?: number;
  tfs?: number;
  repetition_penalty?: number;
  stop?: string[];
  bad_words?: string[];
  n?: number;
  logit_bias?: {
    [key: string]: number;
  };
  stream?: boolean;
  logprobs?: number;
  echo?: boolean;
}

type CompletionsResult = string;

interface RawResponseBody {
  result: Array<{completion: string}>;
  timestamp: number;
  logprobs: any;
  model: string;
}

class ForefrontClient {
  apiKey: string;
  endpoint: string;
  batchSize: number;

  constructor(apiKey: string, endpoint: string, numReplicas: number, batchSizePerReplica: number) {
    this.apiKey = apiKey;
    this.endpoint = endpoint;
    // Total number of requests that can be in flight at once.
    this.batchSize = Math.round(numReplicas * batchSizePerReplica);
  }

  async makeRequest(data: CompletionRequestParams): Promise<CompletionsResult> {
    const response = await fetch(this.endpoint, {
      method: 'POST',
      body: JSON.stringify(data),
      headers: {
        'content-type': 'application/json',
        'authorization': `Bearer ${this.apiKey}`
      }
    });
    if (response.status !== 200) {
      throw new Error(response.statusText);
    }
    const body: RawResponseBody = await response.json();
    return body?.result[0]?.completion;
  }

  async makeMultipleRequests(data: CompletionRequestParams[]): Promise<Array<CompletionsResult>> {
    // Fire off all requests in the batch concurrently.
    const results: CompletionsResult[] = await Promise.all(
      data.map((d) => this.makeRequest(d))
    );
    return results;
  }

  async batchRequests(data: CompletionRequestParams[], resultHandler: ((result: CompletionsResult) => void) | null = null): Promise<Array<CompletionsResult>> {
    const numData = data.length;
    const out: CompletionsResult[] = [];
    for (let i = 0; i < Math.ceil(numData / this.batchSize); i++) {
      const endIdx = Math.min((i + 1) * this.batchSize, numData);
      const thisBatch = data.slice(i * this.batchSize, endIdx);
      const results = await this.makeMultipleRequests(thisBatch);
      for (const result of results) {
        if (resultHandler != null) {
          resultHandler(result);
        }
        out.push(result);
      }
    }
    return out;
  }
}
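To get a feel for how the client above divides work, here is a quick back-of-the-envelope calculation of how many sequential rounds a job takes (the replica and batch-size numbers are illustrative placeholders, not values for any specific model):

```python
import math

# Illustrative numbers -- substitute your resource's actual configuration.
num_replicas = 10             # fixed GPU count from your scaling settings
batch_size_per_replica = 8    # batch size for your model/resource type
total_requests = 1000

in_flight = num_replicas * batch_size_per_replica    # requests sent in parallel
num_batches = math.ceil(total_requests / in_flight)  # sequential round trips
print(in_flight, num_batches)  # -> 80 13
```

In this sketch, each round sends 80 requests concurrently, so 1,000 requests complete in 13 sequential rounds.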
Finally, you can use the Resources API to turn your resource on before a batch job and off after it completes, so you only pay for GPU time while the job is running.
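The guide doesn't show the Resources API calls themselves, so the sketch below assumes hypothetical on/off routes and a placeholder base URL; verify the real paths and parameters against the Forefront docs before using it.

```python
import requests  # same HTTP client as the Python example above

# NOTE: these routes are hypothetical placeholders, not the documented
# Resources API -- check the Forefront docs for the real paths.
def resource_action_url(base_url: str, resource_id: str, on: bool) -> str:
    """Build the (hypothetical) URL for turning a resource on or off."""
    action = 'on' if on else 'off'
    return f'{base_url}/resource/{resource_id}/{action}'

def set_resource_state(api_key: str, base_url: str, resource_id: str, on: bool) -> None:
    """Turn a dedicated resource on before a batch job, or off afterwards."""
    response = requests.post(
        resource_action_url(base_url, resource_id, on),
        headers={'authorization': f'Bearer {api_key}'},
    )
    response.raise_for_status()
```

A typical workflow would call `set_resource_state(..., on=True)`, wait for the resource to report ready, run `batch_requests`, then call `set_resource_state(..., on=False)`.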