How Neon Health Cut Voice-Agent TTFT From 800 ms to 550 ms on Wafer

GLM-5.1: Served on a dedicated Wafer endpoint
~550 ms: Client-observed p50 TTFT, down from the prior provider's 800 ms
-30%: p50 TTFT at ~25% higher peak load

Dedicated endpoint
TTFT-based SLA
Signed BAA · US-only

Overview

Neon Health runs HIPAA-compliant voice agents for healthcare. By moving GLM-5.1 inference to a dedicated Wafer endpoint, the team cut client-observed p50 time-to-first-token by roughly 30% — at ~25% higher peak load than its previous dedicated deployment — with that target now written into the SLA.

At a glance

Industry: Healthcare — AI-powered patient access for pharma programs
Use case: Real-time voice agents for payer, provider, and patient phone calls
Model: GLM-5.1
Wafer product: Dedicated inference endpoint with a TTFT-based SLA
Compliance: Signed BAA, US-only data residency, zero data retention

About Neon Health

Roughly 25% of US healthcare spend goes to administrative activities such as phone calls, faxes, portal lookups, and status checks. None of these tasks directly benefits the patient.

Neon Health builds AI workers that take this work off humans for pharmaceutical patient-access programs, automating benefit verification, prior authorization, and financial-assistance enrollment end to end. Neon’s voice agents place calls to payers, providers, and patients, and automate up to 98% of them without human intervention. Neon made millions of phone calls last year.

What makes Neon’s serving problem difficult is that every response has to arrive fast enough to support natural, multi-turn interaction with the person on the line. Some requests are heavy and require information gathering, so every layer of the stack has to be optimized for real-time latency.

The challenge: a phone call is a real-time deadline

Neon budgets 800 ms from the end of user speech to first agent audio. Past that, the call stops feeling like a conversation — and because speech recognition and text-to-speech also consume the budget, the LLM only gets what’s left. The target Neon set for Wafer: TTFT at or below 600 ms, as observed from Neon’s own infrastructure.

Turn-budget diagram: 200 ms for speech-to-text and text-to-speech, 550 ms for Wafer's LLM time-to-first-token, and the 800 ms threshold past which the conversation stops feeling human.
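
The budget split can be checked with simple arithmetic. The sketch below uses only the figures quoted in this case study (800 ms turn budget, 200 ms for speech-to-text plus text-to-speech); the function and constant names are illustrative, not part of any Neon or Wafer system.

```python
# Illustrative turn-budget arithmetic using figures quoted in this case study.
TURN_BUDGET_MS = 800   # end of user speech -> first agent audio
SPEECH_STACK_MS = 200  # speech-to-text + text-to-speech, per the diagram


def llm_budget_ms(turn_budget_ms: int, speech_stack_ms: int) -> int:
    """Time left for the LLM's first token after speech processing."""
    return turn_budget_ms - speech_stack_ms


def turn_slack_ms(turn_budget_ms: int, speech_stack_ms: int, ttft_ms: int) -> int:
    """Remaining slack in the turn; negative means the turn runs over budget."""
    return turn_budget_ms - speech_stack_ms - ttft_ms


print(llm_budget_ms(TURN_BUDGET_MS, SPEECH_STACK_MS))       # 600
print(turn_slack_ms(TURN_BUDGET_MS, SPEECH_STACK_MS, 550))  # 50
print(turn_slack_ms(TURN_BUDGET_MS, SPEECH_STACK_MS, 800))  # -200
```

The 600 ms left for the LLM matches the TTFT target Neon set, and an 800 ms TTFT would overrun the turn by 200 ms before any audio plays.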

“Any more than that and you start breaking the flow of the conversation. You never have a delay like that when talking to a human, and we are fundamentally social creatures — so we pick up on things like that very quickly.”

Harry Bleyan — Co-founder & CTO, Neon Health

Compliance

Every call touches PHI. The first question Neon asked Wafer was: “Before we move any further, do you have a BAA you can sign?” US-only data residency was equally non-negotiable.

Reliability

A previous inference provider gave Neon a verbal capacity commitment, then pulled it to serve larger customers right before a launch. Neon was shopping for a partner it could trust long-term.

What Neon ran before

Neon built its MVP on closed-source frontier APIs. Model intelligence met the bar, but latency didn’t. Neon maxed out the rate limits on the providers’ highest tiers, and shared-pool variance meant a system that benchmarked well at midnight ran up to 4× slower during business hours, when most calls come in.

A managed deployment on a hyperscaler’s accelerators followed, but it couldn’t guarantee latency or dedicated capacity. Neon then moved to open weights on a dedicated deployment with another inference provider. That was production-viable — about 800 ms p50 / 1,300 ms p90 client-observed TTFT at ~400 requests per minute — but the LLM alone was consuming the entire conversational-turn budget.
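
For readers unfamiliar with the p50 / p90 notation, here is a minimal sketch of how such figures can be derived from raw client-side TTFT samples. It uses the nearest-rank definition of a percentile, which is one common convention and an assumption here; the source does not say which method Neon uses. The sample values are synthetic and chosen only to illustrate the calculation.

```python
import math


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic TTFT samples in milliseconds (made up for illustration only).
ttft_ms = [620, 700, 750, 780, 800, 820, 900, 1100, 1300, 1500]

print(percentile(ttft_ms, 50))  # 800
print(percentile(ttft_ms, 90))  # 1300
```

The p90 figure is the one that callers feel on a bad turn, which is why it is reported alongside the median.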

The solution

Client-observed p50 TTFT cut from 800 ms to ~550 ms

Neon Health’s own measurement after moving GLM-5.1 to a dedicated Wafer endpoint — at ~25% higher peak load.

Bar chart comparing the previous provider's 800 ms time-to-first-token with Wafer's 550 ms, sitting below Neon's 600 ms SLA target.

What changed

Same model, faster serving — GLM-5.1 on a dedicated, TTFT-tuned Wafer stack

Headroom under the wall — ~250 ms of the 800 ms turn budget freed up

Inside the SLA — TTFT now sits below the 600 ms target Neon set
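
The headline numbers above follow directly from the two p50 figures. A short sketch of that arithmetic, with illustrative names:

```python
PREVIOUS_P50_MS = 800  # previous dedicated provider, client-observed
WAFER_P50_MS = 550     # dedicated Wafer endpoint, client-observed
SLA_TARGET_MS = 600    # TTFT target Neon set


def relative_reduction(before_ms: float, after_ms: float) -> float:
    """Fractional improvement of after_ms relative to before_ms."""
    return (before_ms - after_ms) / before_ms


freed_ms = PREVIOUS_P50_MS - WAFER_P50_MS
reduction = relative_reduction(PREVIOUS_P50_MS, WAFER_P50_MS)

print(freed_ms)                       # 250
print(f"{reduction:.1%}")             # 31.2%
print(WAFER_P50_MS <= SLA_TARGET_MS)  # True
```

The exact reduction is 31.25%, which the case study rounds to roughly 30%.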

The fix: a TTFT-first serving stack

Wafer benchmarked Neon’s exact traffic shape — long cached prompts, short outputs, business-hours bursts. From there, Wafer’s forward-deployed engineers, alongside the autonomous performance-engineering agent, swept configs and implemented custom kernels. The team also quantized GLM-5.1 while passing Neon’s quality evals — in Neon’s words, with better results than other quantized models they’d seen.

Cache-aware routing

With a cache-hit rate above 95%, routing is the unassuming backbone. Each call’s growing context lands back on the replica that already holds its KV cache, so the cost of each turn is mostly the new incoming tokens.
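
A minimal sketch of the idea, assuming the simplest possible policy: pin each call to one replica for its lifetime, and place new calls on the replica with the fewest active calls. This is an illustration of session affinity, not Wafer's actual router, which the source does not describe. The class and method names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CacheAwareRouter:
    """Toy router: a call always returns to the replica holding its KV cache."""
    replicas: list[str]
    _owner: dict[str, str] = field(default_factory=dict)   # call_id -> replica
    _active: dict[str, int] = field(default_factory=dict)  # replica -> live calls

    def __post_init__(self) -> None:
        if not self.replicas:
            raise ValueError("need at least one replica")
        self._active = {replica: 0 for replica in self.replicas}

    def route(self, call_id: str) -> tuple[str, bool]:
        """Return (replica, cache_hit) for the next turn of this call."""
        if call_id in self._owner:
            return self._owner[call_id], True  # context is already cached there
        # New call: pick the least-loaded replica (first one wins ties).
        replica = min(self.replicas, key=lambda r: self._active[r])
        self._owner[call_id] = replica
        self._active[replica] += 1
        return replica, False

    def end_call(self, call_id: str) -> None:
        """Release the call so its replica can take new work."""
        replica = self._owner.pop(call_id, None)
        if replica is not None:
            self._active[replica] -= 1


router = CacheAwareRouter(["replica-a", "replica-b"])
print(router.route("call-1"))  # ('replica-a', False)
print(router.route("call-2"))  # ('replica-b', False)
print(router.route("call-1"))  # ('replica-a', True)
```

Only the first turn of a call misses the cache in this toy model; every later turn lands where its context already lives.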

Chunked prefill + custom kernels

Chunked prefill keeps one long prompt from stalling admission for everyone else, and custom kernels cover the GEMM and MoE paths.
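
To see why chunking helps, consider a toy scheduler that processes prefill in fixed-size chunks and rotates between waiting requests. This is a simplified sketch under stated assumptions: real serving engines typically mix prefill chunks and decode tokens in one batch under a token budget, and the chunk size and prompt lengths below are made up for illustration.

```python
from collections import deque


def prefill_schedule(prompt_lengths: dict[str, int], chunk_size: int) -> list[tuple[str, int]]:
    """Round-robin prefill in chunks; returns (request_id, tokens) per step."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    queue = deque(prompt_lengths.items())
    steps: list[tuple[str, int]] = []
    while queue:
        request_id, remaining = queue.popleft()
        take = min(chunk_size, remaining)
        steps.append((request_id, take))
        if remaining > take:
            queue.append((request_id, remaining - take))  # back of the line
    return steps


def tokens_before_done(steps: list[tuple[str, int]], request_id: str) -> int:
    """Total tokens processed by the time request_id finishes its prefill."""
    last = max(i for i, (rid, _) in enumerate(steps) if rid == request_id)
    return sum(tokens for _, tokens in steps[: last + 1])


prompts = {"long": 20_000, "short": 500}  # hypothetical prompt lengths

chunked = prefill_schedule(prompts, chunk_size=8_192)
unchunked = prefill_schedule(prompts, chunk_size=20_000)

print(tokens_before_done(chunked, "short"))    # 8692
print(tokens_before_done(unchunked, "short"))  # 20500
```

With chunking, the short request waits behind one chunk of the long prompt instead of all of it.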

Stepped decode over speculative decoding

Speculative decoding was tested and rejected — TTFT became unstable under bursts. Short decode steps won instead: between steps the scheduler admits newly arrived prefills, which holds TTFT down when twenty requests land in a 100 ms window from an auto-dialer.
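
A toy model shows why shorter steps help under bursts. Assume, as the paragraph above describes, that new prefills are admitted only at step boundaries; then the queueing delay for an arriving request is bounded by the step length. The step durations and the even spacing of arrivals below are hypothetical, and the model ignores prefill compute time entirely.

```python
import math


def admission_wait_ms(arrival_ms: int, step_ms: int) -> int:
    """Wait from arrival until the next step boundary, where admission happens."""
    if step_ms <= 0:
        raise ValueError("step_ms must be positive")
    next_boundary = math.ceil(arrival_ms / step_ms) * step_ms
    return next_boundary - arrival_ms


def worst_wait_ms(arrivals_ms: list[int], step_ms: int) -> int:
    """Longest admission wait across a burst of arrivals."""
    return max(admission_wait_ms(t, step_ms) for t in arrivals_ms)


# Twenty requests landing within a 100 ms window, one every 5 ms.
burst = [i * 5 for i in range(20)]

print(worst_wait_ms(burst, step_ms=10))  # 5
print(worst_wait_ms(burst, step_ms=80))  # 75
```

In this model the worst-case wait scales with the step length, so short steps keep the admission delay small even when a burst arrives mid-step.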

Throughput left on the table

Per-stream decode speed is tuned below what the hardware can do, in exchange for stable TTFT and low inter-token jitter — smooth text-to-speech matters more than tokens per second nobody hears.
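
One way to quantify the tradeoff is to measure jitter as the standard deviation of the gaps between token arrival times. That definition is an assumption for illustration; the source does not specify how jitter is measured. The timestamps below are synthetic: both streams deliver five tokens in 80 ms, but only one does so evenly.

```python
import statistics


def inter_token_jitter_ms(timestamps_ms: list[float]) -> float:
    """Population standard deviation of gaps between consecutive tokens."""
    if len(timestamps_ms) < 2:
        raise ValueError("need at least two token timestamps")
    gaps = [later - earlier for earlier, later in zip(timestamps_ms, timestamps_ms[1:])]
    return statistics.pstdev(gaps)


smooth = [0, 20, 40, 60, 80]  # steady 20 ms between tokens
bursty = [0, 5, 45, 50, 80]   # same average rate, uneven gaps

print(inter_token_jitter_ms(smooth))            # 0.0
print(round(inter_token_jitter_ms(bursty), 1))  # 15.4
```

Both streams have identical tokens-per-second, yet the second would feed a text-to-speech engine in fits and starts.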

Latency that doesn’t fall off a cliff

p50 TTFT barely moves as sustained load climbs — where the previous dedicated provider and shared-pool APIs curve up and off the chart.

Line chart of p50 TTFT versus sustained requests per second for the Wafer dedicated endpoint, the previous provider, and a shared-pool closed API. Wafer stays flat and lowest as load increases.

Wafer dedicated endpoint
Previous provider (dedicated)
Shared-pool / closed API

Measured from US-West clients: 237 ms p50 / 502 ms p90 TTFT at a sustained 6.5 requests per second on ~20k-token prompts.
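
Client-observed TTFT is measured from the caller's side of the network, so it includes transit time as well as server work. A minimal sketch of such a measurement, assuming the client receives the response as an iterator of text chunks: `measure_ttft_ms` and `fake_stream` are hypothetical names, and the fake stream stands in for a real streaming API call.

```python
import time
from typing import Callable, Iterator


def measure_ttft_ms(start_stream: Callable[[], Iterator[str]]) -> tuple[float, str]:
    """Time from issuing the request to receiving the first chunk, in ms."""
    start = time.perf_counter()
    stream = start_stream()  # issue the request
    first_chunk = next(stream)  # block until the first token arrives
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return elapsed_ms, first_chunk


def fake_stream(first_token_delay_s: float = 0.05) -> Iterator[str]:
    """Stand-in for a streaming response; sleeps before the first chunk."""
    time.sleep(first_token_delay_s)
    yield "Hello"
    yield " there"


ttft_ms, first = measure_ttft_ms(lambda: fake_stream(0.05))
print(f"first chunk {first!r} after {ttft_ms:.0f} ms")
```

Collecting many such samples during business hours, rather than at midnight, gives the p50 and p90 figures that matter for live calls.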

Forward-deployed inference engineering

The deployment model matters as much as the serving config. From the first call, Neon has worked in a shared Slack channel with the entire Wafer team — benchmark ladders, serving configs, and root-cause analyses get posted there directly. When the node began approaching its serviceable headroom during peak hours, Wafer flagged it and proposed the scaling path.

“I really appreciate when the vendors we work with are proactive in their outreach, their suggestions, and their recommendations — rather than being reactive, or not reactive at all. It feels like a collaboration, rather than: here’s the menu, don’t talk to us.”

Harry Bleyan — Co-founder & CTO, Neon Health

Because of Wafer’s white-glove, forward-deployed approach, Neon’s endpoint went from benchmark kickoff to production traffic in under two weeks — at the best performance in the market for their needs.

“The performance we’re getting on the models we benchmarked was way better on Wafer than on other inference providers. You have the lowest latency we’ve seen from any provider we’ve tried, and it doesn’t go off a cliff when you increase the requests per minute. We get very consistent performance across our expected utilization range.”

Harry Bleyan — Co-founder & CTO, Neon Health

“It feels like a collaboration. It’s great to see a dedicated group of people pushing hard to support our needs, all while delivering exceptional performance on all of the hard engineering metrics that we require.”

Harry Bleyan

Co-founder & CTO, Neon Health

Why this matters

Few LLM-serving workloads are as demanding as voice. The 800 ms limit for natural conversation is a fixed constraint that forces optimization across the full serving stack. Hitting it took a series of practical, high-impact choices:

Selecting an open-weights model and quantization that satisfied rigorous internal evals
Strategic node placement
Cache-aware routing that achieved a 95%+ hit rate
Custom kernel and low-level optimization work at the runtime level
Prioritizing variance reduction over peak throughput in decode scheduling
Maintaining a transparent view of the entire network path between the server and the end user

Your voice agents live or die on latency

Wafer measures your traffic, optimizes a dedicated endpoint for your TTFT SLA, and gets you to production in under two weeks — without crashing during call-volume spikes.


Signed BAA · US-only data residency · TTFT written into the SLA
