The fastest open source LLMs for enterprise

Serverless and dedicated inference for the world’s fastest open-source LLMs

Open App For Enterprise

Evergrove Labs

Wafer Serverless

Access Open Models easily and Pay As You Go

Serverless inference for top open models — no infrastructure, no deployment overhead, just fast APIs

Get Started with Serverless

GLM-5.1

General Language Model 5.1 with strong coding and reasoning capabilities

Input $1.00

Output $3.20

Cache $0.10

per M tokens
Kimi-K2.6

Kimi K2.6 sparse MoE model with a 262K context window.

Input $0.68

Output $3.15

Cache $0.07

per M tokens
Qwen 3.5
397B-A17B

Mixture-of-experts model with 397B total parameters and 17B active. Strong general-purpose reasoning and instruction following.

Input $0.43

Output $2.60

Cache $0.04

per M tokens

and more

Approved by Benchmarks

Analysis of API providers across performance metrics including latency, output speed, price and others

Actual Benchmarks Get Started with Serverless

Fastest

API providers output speed

for GLM-5.1 (Reasoning)

1 Wafer
152.1 t/s
2 FriendliAI
138.3 t/s
3 Fireworks
97.5 t/s
4 Together.ai
68.6 t/s
5 CoreWeave
56.6 t/s

Fastest

API providers output speed

for Qwen 3.5 397B-A17B

1 Wafer
288.5 t/s
2 Nebius Fast
276.7 t/s
3 Eigen AI
267.6 t/s
4 Together.ai
219.3 t/s
5 Nebius (Base, FP4)
95.9 t/s

Wafer for Sensitive Workloads

Dedicated endpoints for mission-critical AI workloads

Get set up with the best performance for any custom model, with inference optimization tailored to your hardware, workloads, and production constraints, in less than 24 hours

Low Latency

Experience lightning-fast, real-time responses tailored for voice agents, intelligent copilots, and interactive AI products

High Throughput

Scale coding agents, batch workloads, and parallel generations without bottlenecks

Reliability at Scale

Dedicated endpoints for production workloads that need predictable uptime and stable performance

Workload-Specific Optimization

Tune inference around your model, hardware, traffic patterns, and production constraints

Frequently Asked Questions

Workload-specific inference optimization on top of custom GPU kernels. We profile each model on each accelerator family (AMD, NVIDIA), shard experts to fit the cache hierarchy, and tune the continuous-batching scheduler for the model’s KV-cache footprint. The result on Qwen 3.5 397B-A17B is roughly 25% faster than the next-closest provider in our public benchmarks.
Dedicated endpoints isolate your traffic from shared inference pools, with zero data retention available for compliance-bound workloads. We sign DPAs, provide SLA-backed uptime, and provision custom-tuned deployments in under 24 hours. See /sla and /dpa for the full terms.
Yes. Wafer’s Serverless endpoints follow the OpenAI Chat Completions schema, so existing clients — the OpenAI SDK, LangChain, LiteLLM, agent harnesses like Claude Code or Cline — work by swapping the base URL and API key. Streaming, tool use, and JSON mode are supported across every Serverless model.
Repeated prompt prefixes hit a server-side cache and are billed at the Cache rate shown on each model card — typically around 10× cheaper than Input. Cache hits are automatic: no header, no flag. The savings are largest on long system prompts, multi-turn conversations, and document-heavy RAG, where most of the prompt repeats across requests.
Three models on Serverless today: GLM-5.1 (strong on coding and reasoning), Kimi-K2.6 (sparse MoE, 262K context window), and Qwen 3.5 397B-A17B (flagship MoE, 397B total / 17B active). More are rolling out — the “and more” row under the model cards previews what’s next.