The fastest open source LLMs for enterprise

Serverless and dedicated inference for the world’s fastest open-source LLMs

  • AWS
  • DigitalOcean
  • Parasail
  • pipeshift
  • Ollama
  • Sponsor 7
  • Neon
  • Sponsor 9
  • Evergrove Labs
  • Sponsor 11
  • Sponsor 12
  • Magnitude
  • Sponsor 14
Wafer Serverless

Access Open Models easily and Pay As You Go

Serverless inference for top open models — no infrastructure, no deployment overhead, just fast APIs

Get Started with Serverless
  • GLM-5.1 logo
    GLM-5.1

    General Language Model 5.1 with strong coding and reasoning capabilities

    Input $1.20
    Output $3.60
    Cache $0.12
    per M tokens
  • Kimi-K2.6 logo
    Kimi-K2.6

    Kimi K2.6 sparse MoE model with a 262K context window.

    Input $0.88
    Output $3.84
    Cache $0.09
    per M tokens
  • Qwen 3.5 logo
    Qwen 3.5
    397B-A17B

    Mixture-of-experts model with 397B total parameters and 17B active. Strong general-purpose reasoning and instruction following.

    Input $0.48
    Output $2.88
    Cache $0.05
    per M tokens
and more
Approved by Benchmarks

Analysis of API providers across performance metrics including latency, output speed, price and others

Actual Benchmarks Get Started with Serverless
Fastest
API providers output speed
for GLM-5.1 (Reasoning)
  • 1 Wafer
    152.1 t/s
  • 2 FriendliAI
    138.3 t/s
  • 3 Fireworks
    97.5 t/s
  • 4 Together.ai
    68.6 t/s
  • 5 CoreWeave
    56.6 t/s
Fastest
API providers output speed
for Qwen 3.5 397B-A17B
  • 1 Wafer
    288.5 t/s
  • 2 Nebius Fast
    276.7 t/s
  • 3 Eigen AI
    267.6 t/s
  • 4 Together.ai
    219.3 t/s
  • 5 Nebius (Base, FP4)
    95.9 t/s
Wafer for Sensitive Workloads

Dedicated endpoints for mission-critical AI workloads

Get set up with the best performance for any custom model, with inference optimization tailored to your hardware, workloads, and production constraints, in less than 24 hours

Low Latency

Experience lightning-fast, real-time responses tailored for voice agents, intelligent copilots, and interactive AI products

High Throughput

Scale coding agents, batch workloads, and parallel generations without bottlenecks

Reliability at Scale

Dedicated endpoints for production workloads that need predictable uptime and stable performance

Workload-Specific Optimization

Tune inference around your model, hardware, traffic patterns, and production constraints

Frequently Asked Questions
  • Workload-specific inference optimization on top of custom GPU kernels. We profile each model on each accelerator family (AMD, NVIDIA), shard experts to fit the cache hierarchy, and tune the continuous-batching scheduler for the model’s KV-cache footprint. The result on Qwen 3.5 397B-A17B is roughly 25% faster than the next-closest provider in our public benchmarks.

  • Dedicated endpoints isolate your traffic from shared inference pools, with zero data retention available for compliance-bound workloads. We sign DPAs, provide SLA-backed uptime, and provision custom-tuned deployments in under 24 hours. See /sla and /dpa for the full terms.

  • Yes. Wafer’s Serverless endpoints follow the OpenAI Chat Completions schema, so existing clients — the OpenAI SDK, LangChain, LiteLLM, agent harnesses like Claude Code or Cline — work by swapping the base URL and API key. Streaming, tool use, and JSON mode are supported across every Serverless model.

  • Repeated prompt prefixes hit a server-side cache and are billed at the Cache rate shown on each model card — typically around 10× cheaper than Input. Cache hits are automatic: no header, no flag. The savings are largest on long system prompts, multi-turn conversations, and document-heavy RAG, where most of the prompt repeats across requests.

  • Three models on Serverless today: GLM-5.1 (strong on coding and reasoning), Kimi-K2.6 (sparse MoE, 262K context window), and Qwen 3.5 397B-A17B (flagship MoE, 397B total / 17B active). More are rolling out — the “and more” row under the model cards previews what’s next.