- 1 Wafer152.1 t/s
- 2 FriendliAI138.3 t/s
- 3 Fireworks97.5 t/s
- 4 Together.ai68.6 t/s
- 5 CoreWeave56.6 t/s
The fastest open source LLMs for enterprise
Serverless and dedicated inference for the world’s fastest open-source LLMs
Access Open Models easily and Pay As You Go
Serverless inference for top open models — no infrastructure, no deployment overhead, just fast APIs
Get Started with Serverless- GLM-5.1
General Language Model 5.1 with strong coding and reasoning capabilities
Input $1.20Output $3.60Cache $0.12per M tokens - Kimi-K2.6
Kimi K2.6 sparse MoE model with a 262K context window.
Input $0.88Output $3.84Cache $0.09per M tokens - 397B-A17BQwen 3.5
Mixture-of-experts model with 397B total parameters and 17B active. Strong general-purpose reasoning and instruction following.
Input $0.48Output $2.88Cache $0.05per M tokens

Analysis of API providers across performance metrics including latency, output speed, price and others
Actual Benchmarks Get Started with Serverless- 1 Wafer288.5 t/s
- 2 Nebius Fast276.7 t/s
- 3 Eigen AI267.6 t/s
- 4 Together.ai219.3 t/s
- 5 Nebius (Base, FP4)95.9 t/s
Dedicated endpoints for mission-critical AI workloads
Get set up with the best performance for any custom model, with inference optimization tailored to your hardware, workloads, and production constraints, in less than 24 hours
Low Latency
Experience lightning-fast, real-time responses tailored for voice agents, intelligent copilots, and interactive AI products
High Throughput
Scale coding agents, batch workloads, and parallel generations without bottlenecks
Reliability at Scale
Dedicated endpoints for production workloads that need predictable uptime and stable performance
Workload-Specific Optimization
Tune inference around your model, hardware, traffic patterns, and production constraints
-
Workload-specific inference optimization on top of custom GPU kernels. We profile each model on each accelerator family (AMD, NVIDIA), shard experts to fit the cache hierarchy, and tune the continuous-batching scheduler for the model’s KV-cache footprint. The result on Qwen 3.5 397B-A17B is roughly 25% faster than the next-closest provider in our public benchmarks.
-
Dedicated endpoints isolate your traffic from shared inference pools, with zero data retention available for compliance-bound workloads. We sign DPAs, provide SLA-backed uptime, and provision custom-tuned deployments in under 24 hours. See /sla and /dpa for the full terms.
-
Yes. Wafer’s Serverless endpoints follow the OpenAI Chat Completions schema, so existing clients — the OpenAI SDK, LangChain, LiteLLM, agent harnesses like Claude Code or Cline — work by swapping the base URL and API key. Streaming, tool use, and JSON mode are supported across every Serverless model.
-
Repeated prompt prefixes hit a server-side cache and are billed at the Cache rate shown on each model card — typically around 10× cheaper than Input. Cache hits are automatic: no header, no flag. The savings are largest on long system prompts, multi-turn conversations, and document-heavy RAG, where most of the prompt repeats across requests.
-
Three models on Serverless today: GLM-5.1 (strong on coding and reasoning), Kimi-K2.6 (sparse MoE, 262K context window), and Qwen 3.5 397B-A17B (flagship MoE, 397B total / 17B active). More are rolling out — the “and more” row under the model cards previews what’s next.