Understand how pricing works on Forefront.
The fundamental cost of hosting large language models comes down to the infrastructure (GPU or TPU) that the model runs on. While paying per token is the most common way to use large language models, for most use cases, particularly high-volume production or batch workloads, the more cost-efficient strategy is to pay for dedicated GPUs at a fixed hourly rate with concurrency-based autoscaling.
On the Forefront platform, you can pay per token for base models or pay for dedicated GPUs to host base or fine-tuned models. The models and infrastructure on our platform include several optimizations for better latency and throughput, saving you 20% on infrastructure costs and delivering 40% faster response times, regardless of hardware. View our GPU pricing, latency, and throughput.
For a custom quote based on your use case, please email our team with:
  1. Average requests per minute
  2. Peak requests per minute
  3. Average input tokens per request
  4. Average output tokens per request
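The metrics above can be used to roughly compare the two pricing models for your workload. A minimal sketch, assuming hypothetical placeholder prices (the per-token rate, GPU hourly rate, and GPU count below are illustrative, not Forefront's actual quotes):

```python
# Estimate monthly cost under pay-per-token vs dedicated-GPU pricing.
# All rates below are hypothetical placeholders, not actual Forefront pricing.

def monthly_cost_per_token(avg_rpm, avg_in_tokens, avg_out_tokens,
                           price_per_1k_tokens):
    """Monthly pay-per-token cost from average usage metrics."""
    minutes_per_month = 60 * 24 * 30
    tokens = avg_rpm * (avg_in_tokens + avg_out_tokens) * minutes_per_month
    return tokens / 1000 * price_per_1k_tokens

def monthly_cost_dedicated(gpu_hourly_rate, num_gpus):
    """Monthly cost of dedicated GPUs at a fixed hourly rate."""
    hours_per_month = 24 * 30
    return gpu_hourly_rate * num_gpus * hours_per_month

# Example workload: 100 avg requests/min, 500 input + 200 output tokens each,
# at a hypothetical $0.002 per 1K tokens vs two GPUs at $2.50/hour.
per_token = monthly_cost_per_token(100, 500, 200, 0.002)
dedicated = monthly_cost_dedicated(2.50, 2)
print(f"pay-per-token: ${per_token:,.2f}/mo")
print(f"dedicated GPUs: ${dedicated:,.2f}/mo")
```

At sustained high volume, the fixed hourly cost of dedicated GPUs stays flat while per-token costs scale linearly with traffic, which is why the dedicated option tends to win for production workloads.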