The 500ms Threshold: Why Latency Kills Voice AI
The Conversation Contract
Text chat is forgiving. Users wait seconds for a response—they're typing, thinking, context-switching. Voice is not forgiving. Voice is a conversation, and conversations have a contract: respond within 500 milliseconds or the illusion breaks.
This isn't arbitrary. It's psychoacoustic. Human conversation operates on turn-taking rhythms measured in hundreds of milliseconds. When you ask someone a question, the gap before their response signals whether they're thinking, confused, or absent. A 300ms pause feels natural. A 700ms pause feels like something's wrong. A 1.5-second pause makes you wonder if the connection dropped.
Voice AI inherits this contract. You can't negotiate with user expectations shaped by millions of years of evolution.
The Latency Budget
Every voice AI interaction runs a latency gauntlet:
| Stage | Best Case | Typical | Notes |
|---|---|---|---|
| User audio capture | 0ms | 50-100ms | VAD detection delay |
| Network (up) | 20ms | 50-150ms | Highly variable |
| ASR (Speech-to-Text) | 75ms | 150-300ms | Scribe: ~150ms streaming |
| LLM inference | 100ms | 200-1000ms+ | Model dependent |
| TTS (Text-to-Speech) | 75ms | 100-500ms | Flash v2.5: ~75ms |
| Network (down) | 20ms | 50-150ms | Highly variable |
| Audio playback start | 0ms | 50ms | Buffer loading |
- Best case total: ~300ms
- Typical total: 600-1500ms
- Acceptable threshold: <500ms
The math is brutal. Even with perfectly optimized components, network variability alone can push you over the threshold.
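One way to keep the budget honest is to make it executable: sum per-stage estimates and flag the overrun. The stage values in this sketch mirror the "typical" column above; they are placeholders, not measurements, so swap in your own numbers.

```python
# Back-of-envelope latency budget check. Stage values mirror the "typical"
# column above (milliseconds); replace them with your own measurements.
PIPELINE_MS = {
    "audio_capture": 75,     # VAD detection delay
    "network_up": 100,
    "asr": 200,
    "llm_first_token": 400,
    "tts_first_chunk": 200,
    "network_down": 100,
    "playback_start": 50,
}

THRESHOLD_MS = 500

total = sum(PIPELINE_MS.values())
print(f"estimated end-to-end latency: {total}ms")
for stage, ms in sorted(PIPELINE_MS.items(), key=lambda kv: -kv[1]):
    print(f"  {stage:16s} {ms:5d}ms  ({ms / total:5.1%} of total)")
if total > THRESHOLD_MS:
    print(f"over budget by {total - THRESHOLD_MS}ms -- overlap stages or cut them")
```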
Why Traditional TTS Fails
The original sin of voice AI was building TTS for quality, not speed.
Traditional high-fidelity models like ElevenLabs Multilingual v2 produce beautiful audio—nuanced prosody, emotional range, breath simulation. They also take 1-2 seconds to generate a sentence. In a dubbing workflow where you're producing an audiobook, this is fine. In a conversation, it's death.
This created a forced choice: robot voices (fast) or human-like voices (slow).
ElevenLabs' Flash v2.5 model exists specifically to break this trade-off. Through aggressive knowledge distillation—training a smaller, faster "student" model from the larger "teacher" model—they achieved ~75ms latency while preserving acceptable quality.
The latency hierarchy:
| Model | Latency | Quality | Use Case |
|---|---|---|---|
| Multilingual v2 | 1000ms+ | Excellent | Audiobooks, dubbing |
| Turbo v2.5 | ~250ms | Good | Interactive apps |
| Flash v2.5 | ~75ms | Acceptable | Conversation |
Flash sacrifices prosodic range for raw speed. The voice sounds slightly less "alive"—but it responds fast enough to feel present.
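If the pipeline tracks its budget programmatically, model choice can fall out of the milliseconds left after ASR and LLM. A minimal sketch, assuming the model identifiers follow ElevenLabs' public naming; verify them against your own account before relying on the mapping.

```python
# Pick the highest-quality TTS model that still fits the remaining budget.
# Model IDs are assumptions based on ElevenLabs' public naming; verify them
# against your own account before relying on this mapping.
TTS_MODELS = [
    # (model_id, approx_first_chunk_latency_ms), ordered best quality first
    ("eleven_multilingual_v2", 1000),
    ("eleven_turbo_v2_5", 250),
    ("eleven_flash_v2_5", 75),
]


def pick_tts_model(remaining_budget_ms: int) -> str:
    for model_id, latency_ms in TTS_MODELS:
        if latency_ms <= remaining_budget_ms:
            return model_id
    return TTS_MODELS[-1][0]  # nothing fits: take the fastest, accept the overrun


print(pick_tts_model(remaining_budget_ms=120))  # -> eleven_flash_v2_5
```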
The Barge-In Problem
Here's where things get architecturally complex.
In text chat, you wait for the model to finish before responding. In voice, users interrupt. They ask a question, the agent starts answering, and three words in the user says "wait, actually—" and changes the question.
This is called "barge-in," and it breaks everything if you don't handle it correctly.
The Naive Approach (Fails)
- Agent generates complete response
- Agent plays audio
- User interrupts mid-playback
- System... ignores the interruption? Plays over the user? Crashes?
The Correct Architecture
- Agent streams response in chunks
- VAD continuously monitors for user speech
- On user speech detection → `interruption` event fires
- Client immediately flushes audio buffer (agent stops talking)
- Server generates `agent_response_correction` event
That last step is critical. The correction event contains what the agent actually said before being cut off.
Why does this matter? Context integrity.
If the agent intended to say "The weather is 75 degrees and sunny" but was cut off after "The weather is 75," the LLM's context must reflect that the user only heard "The weather is 75." Without this correction, the model might assume the user knows it's sunny—a false premise that compounds into hallucination.
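A minimal sketch of applying that correction, assuming conversation history is kept as a plain list of role/content messages rather than any particular SDK's schema:

```python
# Truncate the last assistant turn to what the user actually heard.
# The role/content message shape is illustrative, not a specific SDK's schema.
def apply_response_correction(history: list[dict], heard_text: str) -> None:
    for message in reversed(history):
        if message["role"] == "assistant":
            message["content"] = heard_text
            break


history = [
    {"role": "user", "content": "What's the weather?"},
    {"role": "assistant", "content": "The weather is 75 degrees and sunny"},
]
# agent_response_correction arrives: the reply was cut off mid-sentence.
apply_response_correction(history, heard_text="The weather is 75")
assert history[-1]["content"] == "The weather is 75"
```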
The Protocol
```
Server → Client: audio chunk 1 ("The weather is...")
Server → Client: audio chunk 2 ("seventy-five...")
Client: [playing audio]
VAD: [detects user speech]
Server → Client: interruption event
Client: [flushes audio buffer, stops playback]
Server → Client: agent_response_correction { text: "The weather is 75" }
VAD → Server: user audio stream begins
```
This requires persistent WebSocket connections. REST APIs cannot handle bidirectional real-time state.
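A client-side sketch of that loop using the websockets library. The event names follow the trace above, but the payload fields (type, audio_base64, text) and the BufferedPlayer stub are assumptions, not a documented schema.

```python
# Client-side handling of the protocol above. Event names follow the trace;
# payload fields ("type", "audio_base64", "text") are assumptions.
import asyncio
import json

import websockets  # pip install websockets


class BufferedPlayer:
    """Minimal stand-in for a real audio playback queue."""

    def __init__(self) -> None:
        self.queue: list[str] = []

    def enqueue(self, chunk_b64: str) -> None:
        self.queue.append(chunk_b64)

    def flush(self) -> None:
        self.queue.clear()


async def run_client(url: str, player: BufferedPlayer) -> None:
    async with websockets.connect(url) as ws:
        async for raw in ws:
            event = json.loads(raw)
            kind = event.get("type")

            if kind == "audio":
                # Playback starts with the first chunk, not the full response.
                player.enqueue(event["audio_base64"])
            elif kind == "interruption":
                # Barge-in: drop everything queued so the agent stops talking now.
                player.flush()
            elif kind == "agent_response_correction":
                # What the user actually heard; feed it back into the history.
                print("heard only:", event["text"])


# asyncio.run(run_client("wss://example.com/agent", BufferedPlayer()))
```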
Voice Activity Detection: The Sensitivity Paradox
VAD (Voice Activity Detection) is the gatekeeper. It decides when the user is speaking and when they've stopped.
Get it wrong in either direction and you destroy the experience:
Too Sensitive:
- Background noise triggers transcription
- Coughs interpreted as speech
- Agent interrupts itself constantly
- User learns to be unnaturally quiet
Too Insensitive:
- User has to shout
- Soft-spoken users can't interact
- Agent talks over user
- Turn-taking breaks down
Most systems expose a vad_sensitivity parameter (0.0 to 1.0); a toy illustration of what that knob controls follows the list below. There's no universal "correct" setting—it depends on:
- Expected ambient noise (call center vs. quiet room)
- User population (varying voice volumes)
- Microphone quality
- Use case (quick queries vs. extended conversation)
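As a toy illustration, here is an energy-gate VAD where higher sensitivity lowers the level required to call a frame speech. Production VADs are model-based; the threshold mapping here is invented purely for illustration.

```python
# Toy energy-gate VAD: higher sensitivity lowers the level required to call
# a frame "speech". Production VADs are model-based; the threshold mapping
# below is invented purely for illustration.
import numpy as np


def frame_is_speech(frame: np.ndarray, sensitivity: float) -> bool:
    """frame: mono PCM samples in [-1, 1]; sensitivity: 0.0 (strict) to 1.0 (eager)."""
    rms = float(np.sqrt(np.mean(frame ** 2)))
    threshold = 0.1 - 0.095 * sensitivity  # maps sensitivity onto RMS 0.1 .. 0.005
    return rms > threshold


quiet = np.random.normal(0, 0.002, 480)  # background hiss (30ms @ 16kHz)
voiced = np.random.normal(0, 0.1, 480)   # rough stand-in for voiced audio

print(frame_is_speech(quiet, sensitivity=0.9))   # usually False
print(frame_is_speech(voiced, sensitivity=0.5))  # usually True
```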
The Turn-Taking Logic
Once VAD detects speech, the system needs to know when the user has finished speaking. This is turn-taking.
The mechanism: silence threshold. When the user stops producing audio for N milliseconds (typically 500ms), the system commits the audio buffer to ASR.
- Too short: Commits mid-thought, user gets interrupted
- Too long: Awkward delays, user waiting for response
Some systems use "Turn Eagerness"—semantic analysis of partial transcripts to predict turn completion. If the ASR output ends with a complete sentence, commit sooner. If it ends mid-phrase, wait longer.
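A sketch of that commit logic, with the silence thresholds and the sentence-completeness heuristic both invented for illustration:

```python
# Turn-commit logic: commit the buffered audio to ASR after enough silence,
# shortening the wait when the partial transcript already reads as a finished
# sentence. Both thresholds are invented for illustration.
BASE_SILENCE_MS = 500
EAGER_SILENCE_MS = 250


def should_commit_turn(silence_ms: int, partial_transcript: str) -> bool:
    looks_complete = partial_transcript.rstrip().endswith((".", "?", "!"))
    threshold = EAGER_SILENCE_MS if looks_complete else BASE_SILENCE_MS
    return silence_ms >= threshold


print(should_commit_turn(300, "What's the weather in Boston?"))  # True
print(should_commit_turn(300, "What's the weather in"))          # False
```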
Streaming: The Only Option
The latency budget makes one thing clear: you cannot wait for complete responses.
| Approach | User Experience |
|---|---|
| Generate complete → Send complete → Play complete | 2-5 second wait, then response |
| Stream generation → Buffer → Play buffered | 1-2 second wait, then response |
| Stream generation → Stream play | 200-400ms wait, then continuous audio |
The only viable architecture for conversational voice AI is full streaming:
- LLM streams text tokens as generated
- TTS processes chunks in parallel (sentence or clause boundaries)
- Audio streams to client as each chunk completes
- Playback begins with first chunk while later chunks generate
This overlaps latency. While chunk N plays, chunk N+1 is synthesizing, and chunk N+2 is generating. First-byte latency drops dramatically.
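A skeleton of that overlap, with llm_stream, tts_stream, and enqueue_audio standing in for your own providers rather than any specific SDK:

```python
# Skeleton of the overlapped pipeline. llm_stream, tts_stream, and
# enqueue_audio stand in for your own providers, not any specific SDK.
import asyncio
import re

SENTENCE_END = re.compile(r"[.!?]\s")


async def sentence_chunks(llm_stream):
    """Group streamed LLM tokens into sentence-sized TTS inputs."""
    buffer = ""
    async for token in llm_stream:
        buffer += token
        match = SENTENCE_END.search(buffer)
        if match:
            yield buffer[: match.end()].strip()
            buffer = buffer[match.end():]
    if buffer.strip():
        yield buffer.strip()


async def speak(llm_stream, tts_stream, enqueue_audio):
    async for sentence in sentence_chunks(llm_stream):
        async for audio_chunk in tts_stream(sentence):
            # enqueue_audio should only queue bytes and return, so synthesis
            # of the next sentence overlaps playback of the current one.
            await enqueue_audio(audio_chunk)


async def demo() -> None:
    async def fake_llm():
        for token in ["The ", "weather ", "is ", "75. ", "Skies ", "are ", "clear."]:
            yield token

    async def fake_tts(sentence):
        yield f"<audio:{sentence}>".encode()

    async def enqueue(chunk):
        print("queued", chunk)

    await speak(fake_llm(), fake_tts, enqueue)


asyncio.run(demo())
```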
The trade-off: you commit to words before you see the complete response. Corrections become impossible—the user already heard it.
The Network Problem
The hardest latency to control is network. You can optimize models. You can't optimize the internet.
Typical network latency:
- Same region: 20-50ms
- Cross-region: 50-150ms
- Cross-continent: 150-300ms
- Poor mobile connection: 200-500ms+
- With jitter: +0-200ms variance
A voice AI system optimized for 300ms total latency in US-West will fail in Singapore. A system that works on WiFi will fail on 4G in an elevator.
Mitigation Strategies
Edge deployment: Run inference closer to users. ElevenLabs competitors like Vapi use edge infrastructure (Cloudflare Workers) specifically to minimize network hops.
Connection persistence: WebSockets avoid TCP handshake overhead on each message. Keep connections alive.
Predictive buffering: Pre-generate likely responses (greetings, confirmations) and cache at edge.
Graceful degradation: Monitor latency. If exceeding threshold, switch to text or acknowledge delay ("Let me think about that...").
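A sketch of the last strategy, assuming respond() and say_filler() are async callables you already have for response generation and filler playback:

```python
# Graceful-degradation sketch: wait only as long as the budget allows, then
# acknowledge the delay. respond() and say_filler() are stand-ins for your
# own response generation and filler playback.
import asyncio

BUDGET_S = 0.5


async def answer_turn(respond, say_filler):
    task = asyncio.create_task(respond())
    try:
        return await asyncio.wait_for(asyncio.shield(task), timeout=BUDGET_S)
    except asyncio.TimeoutError:
        # Budget blown: fill the silence, then deliver the real answer.
        await say_filler("Let me think about that...")
        return await task
```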
The Quality-Latency Trade-Off
Everything in voice AI is a trade-off against the 500ms threshold:
| Want | Cost |
|---|---|
| Better voice quality | Higher TTS latency |
| More nuanced responses | Higher LLM latency |
| Better transcription | Higher ASR latency |
| Global availability | Higher network latency |
| Cheaper inference | Smaller models = lower quality |
The successful voice AI products are the ones that make these trade-offs consciously rather than discovering them in production.
Measurement and Monitoring
You can't improve what you don't measure. Critical metrics:
Time to First Byte (TTFB): Time from user speech end to first audio byte received. Target: <400ms.
End-to-End Latency: Time from user speech end to agent speech start. Target: <500ms.
Latency by Percentile: P50 latency is vanity. P95 latency is reality. If your P95 is 2 seconds, 5% of users are having a terrible experience.
Jitter: Variance in latency. Consistent 600ms is better than alternating 300ms/900ms.
Interruption Success Rate: How often barge-in correctly stops the agent and captures user intent.
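A minimal report over recorded end-to-end latencies, using nearest-rank percentiles and standard deviation as the jitter proxy:

```python
# Percentile and jitter report over recorded end-to-end latencies (ms),
# using nearest-rank percentiles and standard deviation as the jitter proxy.
import math
import statistics


def latency_report(samples_ms: list[float]) -> dict:
    ordered = sorted(samples_ms)

    def pct(p: float) -> float:
        rank = max(1, math.ceil(p * len(ordered)))  # classic nearest-rank
        return ordered[rank - 1]

    return {
        "p50_ms": pct(0.50),
        "p95_ms": pct(0.95),  # the number your worst-served users actually feel
        "jitter_ms": round(statistics.pstdev(ordered), 1),
    }


print(latency_report([310, 290, 450, 380, 1900, 420, 300, 350, 330, 2100]))
```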
The Architectural Takeaway
Voice AI is a latency discipline. Every component choice, every architectural decision, every deployment configuration flows from one constraint: the 500ms threshold.
The companies winning in voice AI aren't necessarily the ones with the best models. They're the ones who've optimized every millisecond in the pipeline—streaming everything, deploying at edge, handling interrupts cleanly, measuring obsessively.
Build for sub-500ms or don't build voice AI.
See also: ElevenLabs Infrastructure for the platform powering most voice AI, and Probabilistic Stack for engineering non-deterministic systems.