The 500ms Threshold: Why Latency Kills Voice AI
The Conversation Contract
Text chat is forgiving. Users wait seconds for a response—they're typing, thinking, context-switching. Voice is not forgiving. Voice is a conversation, and conversations have a contract: respond within 500 milliseconds or the illusion breaks.
This isn't arbitrary. It's psychoacoustic. Human conversation operates on turn-taking rhythms measured in hundreds of milliseconds. When you ask someone a question, the gap before their response signals whether they're thinking, confused, or absent. A 300ms pause feels natural. A 700ms pause feels like something's wrong. A 1.5-second pause makes you wonder if the connection dropped.
Voice AI inherits this contract. You can't negotiate with user expectations shaped by millions of years of evolution.
The Latency Budget
Every voice AI interaction runs a latency gauntlet:
| Stage | Best Case | Typical | Notes |
|---|---|---|---|
| User audio capture | 0ms | 50-100ms | VAD detection delay |
| Network (up) | 20ms | 50-150ms | Highly variable |
| ASR (Speech-to-Text) | 75ms | 150-300ms | Scribe: ~150ms streaming |
| LLM inference | 100ms | 200-1000ms+ | Model dependent |
| TTS (Text-to-Speech) | 75ms | 100-500ms | Flash v2.5: ~75ms |
| Network (down) | 20ms | 50-150ms | Highly variable |
| Audio playback start | 0ms | 50ms | Buffer loading |
- Best case total: ~300ms
- Typical total: 600-1500ms
- Acceptable threshold: <500ms
The math is brutal. Even with perfectly optimized components, network variability alone can push you over the threshold.
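One way to keep the budget honest is to make it executable: sum per-stage estimates and flag the overrun. The stage values in this sketch mirror the "typical" column above; they are placeholders, not measurements, so swap in your own numbers.

```python
# Back-of-envelope latency budget check. Stage values mirror the "typical"
# column above (milliseconds); replace them with your own measurements.
PIPELINE_MS = {
    "audio_capture": 75,     # VAD detection delay
    "network_up": 100,
    "asr": 200,
    "llm_first_token": 400,
    "tts_first_chunk": 200,
    "network_down": 100,
    "playback_start": 50,
}

THRESHOLD_MS = 500

total = sum(PIPELINE_MS.values())
print(f"estimated end-to-end latency: {total}ms")
for stage, ms in sorted(PIPELINE_MS.items(), key=lambda kv: -kv[1]):
    print(f"  {stage:16s} {ms:5d}ms  ({ms / total:5.1%} of total)")
if total > THRESHOLD_MS:
    print(f"over budget by {total - THRESHOLD_MS}ms -- overlap stages or cut them")
```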
Why Traditional TTS Fails
The original sin of voice AI was building TTS for quality, not speed.
Traditional high-fidelity models like ElevenLabs Multilingual v2 produce beautiful audio—nuanced prosody, emotional range, breath simulation. They also take 1-2 seconds to generate a sentence. In a dubbing workflow where you're producing an audiobook, this is fine. In a conversation, it's death.
This created a forced choice: robot voices (fast) or human-like voices (slow).
ElevenLabs' Flash v2.5 model exists specifically to break this trade-off. Through aggressive knowledge distillation—training a smaller, faster "student" model from the larger "teacher" model—they achieved ~75ms latency while preserving acceptable quality.
The latency hierarchy:
| Model | Latency | Quality | Use Case |
|---|---|---|---|
| Multilingual v2 | 1000ms+ | Excellent | Audiobooks, dubbing |
| Turbo v2.5 | ~250ms | Good | Interactive apps |
| Flash v2.5 | ~75ms | Acceptable | Conversation |
Flash sacrifices prosodic range for raw speed. The voice sounds slightly less "alive"—but it responds fast enough to feel present.
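If the pipeline tracks its budget programmatically, model choice can fall out of the milliseconds left after ASR and LLM. A minimal sketch, assuming the model identifiers follow ElevenLabs' public naming; verify them against your own account before relying on the mapping.

```python
# Pick the highest-quality TTS model that still fits the remaining budget.
# Model IDs are assumptions based on ElevenLabs' public naming; verify them
# against your own account before relying on this mapping.
TTS_MODELS = [
    # (model_id, approx_first_chunk_latency_ms), ordered best quality first
    ("eleven_multilingual_v2", 1000),
    ("eleven_turbo_v2_5", 250),
    ("eleven_flash_v2_5", 75),
]


def pick_tts_model(remaining_budget_ms: int) -> str:
    for model_id, latency_ms in TTS_MODELS:
        if latency_ms <= remaining_budget_ms:
            return model_id
    return TTS_MODELS[-1][0]  # nothing fits: take the fastest, accept the overrun


print(pick_tts_model(remaining_budget_ms=120))  # -> eleven_flash_v2_5
```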
The Barge-In Problem
Here's where things get architecturally complex.
In text chat, you wait for the model to finish before responding. In voice, users interrupt. They ask a question, the agent starts answering, and three words in the user says "wait, actually—" and changes the question.
This is called "barge-in," and it breaks everything if you don't handle it correctly.
The Naive Approach (Fails)
- Agent generates complete response
- Agent plays audio
- User interrupts mid-playback
- System... ignores the interruption? Plays over the user? Crashes?
The Correct Architecture
- Agent streams response in chunks
- VAD continuously monitors for user speech
- On user speech detection → `interruption` event fires
- Client immediately flushes audio buffer (agent stops talking)
- Server generates `agent_response_correction` event
That last step is critical. The correction event contains what the agent actually said before being cut off.
Why does this matter? Context integrity.
If the agent intended to say "The weather is 75 degrees and sunny" but was cut off after "The weather is 75," the LLM's context must reflect that the user only heard "The weather is 75." Without this correction, the model might assume the user knows it's sunny—a false premise that compounds into hallucination.
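A minimal sketch of applying that correction, assuming conversation history is kept as a plain list of role/content messages rather than any particular SDK's schema:

```python
# Truncate the last assistant turn to what the user actually heard.
# The role/content message shape is illustrative, not a specific SDK's schema.
def apply_response_correction(history: list[dict], heard_text: str) -> None:
    for message in reversed(history):
        if message["role"] == "assistant":
            message["content"] = heard_text
            break


history = [
    {"role": "user", "content": "What's the weather?"},
    {"role": "assistant", "content": "The weather is 75 degrees and sunny"},
]
# agent_response_correction arrives: the reply was cut off mid-sentence.
apply_response_correction(history, heard_text="The weather is 75")
assert history[-1]["content"] == "The weather is 75"
```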
The Protocol
```
Server → Client: audio chunk 1 ("The weather is...")
Server → Client: audio chunk 2 ("seventy-five...")
Client: [playing audio]
VAD: [detects user speech]
Server → Client: interruption event
Client: [flushes audio buffer, stops playback]
Server → Client: agent_response_correction { text: "The weather is 75" }
VAD → Server: user audio stream begins
```
This requires persistent WebSocket connections. REST APIs cannot handle bidirectional real-time state.
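A client-side sketch of that loop using the websockets library. The event names follow the trace above, but the payload fields (type, audio_base64, text) and the BufferedPlayer stub are assumptions, not a documented schema.

```python
# Client-side handling of the protocol above. Event names follow the trace;
# payload fields ("type", "audio_base64", "text") are assumptions.
import asyncio
import json

import websockets  # pip install websockets


class BufferedPlayer:
    """Minimal stand-in for a real audio playback queue."""

    def __init__(self) -> None:
        self.queue: list[str] = []

    def enqueue(self, chunk_b64: str) -> None:
        self.queue.append(chunk_b64)

    def flush(self) -> None:
        self.queue.clear()


async def run_client(url: str, player: BufferedPlayer) -> None:
    async with websockets.connect(url) as ws:
        async for raw in ws:
            event = json.loads(raw)
            kind = event.get("type")

            if kind == "audio":
                # Playback starts with the first chunk, not the full response.
                player.enqueue(event["audio_base64"])
            elif kind == "interruption":
                # Barge-in: drop everything queued so the agent stops talking now.
                player.flush()
            elif kind == "agent_response_correction":
                # What the user actually heard; feed it back into the history.
                print("heard only:", event["text"])


# asyncio.run(run_client("wss://example.com/agent", BufferedPlayer()))
```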
Voice Activity Detection: The Sensitivity Paradox
VAD (Voice Activity Detection) is the gatekeeper. It decides when the user is speaking and when they've stopped.
Get it wrong in either direction and you destroy the experience:
Too Sensitive:
- Background noise triggers transcription
- Coughs interpreted as speech
- Agent interrupts itself constantly
- User learns to be unnaturally quiet
Too Insensitive:
- User has to shout
- Soft-spoken users can't interact
- Agent talks over user
- Turn-taking breaks down
Most systems expose a vad_sensitivity parameter (0.0 to 1.0); a toy illustration of what that knob controls follows the list below. There's no universal "correct" setting—it depends on:
- Expected ambient noise (call center vs. quiet room)
- User population (varying voice volumes)
- Microphone quality
- Use case (quick queries vs. extended conversation)
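As a toy illustration, here is an energy-gate VAD where higher sensitivity lowers the level required to call a frame speech. Production VADs are model-based; the threshold mapping here is invented purely for illustration.

```python
# Toy energy-gate VAD: higher sensitivity lowers the level required to call
# a frame "speech". Production VADs are model-based; the threshold mapping
# below is invented purely for illustration.
import numpy as np


def frame_is_speech(frame: np.ndarray, sensitivity: float) -> bool:
    """frame: mono PCM samples in [-1, 1]; sensitivity: 0.0 (strict) to 1.0 (eager)."""
    rms = float(np.sqrt(np.mean(frame ** 2)))
    threshold = 0.1 - 0.095 * sensitivity  # maps sensitivity onto RMS 0.1 .. 0.005
    return rms > threshold


quiet = np.random.normal(0, 0.002, 480)  # background hiss (30ms @ 16kHz)
voiced = np.random.normal(0, 0.1, 480)   # rough stand-in for voiced audio

print(frame_is_speech(quiet, sensitivity=0.9))   # usually False
print(frame_is_speech(voiced, sensitivity=0.5))  # usually True
```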
The Turn-Taking Logic
Once VAD detects speech, the system needs to know when the user has finished speaking. This is turn-taking.
The mechanism: silence threshold. When the user stops producing audio for N milliseconds (typically 500ms), the system commits the audio buffer to ASR.
- Too short: Commits mid-thought, user gets interrupted
- Too long: Awkward delays, user waiting for response
Some systems use "Turn Eagerness"—semantic analysis of partial transcripts to predict turn completion. If the ASR output ends with a complete sentence, commit sooner. If it ends mid-phrase, wait longer.
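A sketch of that commit logic, with the silence thresholds and the sentence-completeness heuristic both invented for illustration:

```python
# Turn-commit logic: commit the buffered audio to ASR after enough silence,
# shortening the wait when the partial transcript already reads as a finished
# sentence. Both thresholds are invented for illustration.
BASE_SILENCE_MS = 500
EAGER_SILENCE_MS = 250


def should_commit_turn(silence_ms: int, partial_transcript: str) -> bool:
    looks_complete = partial_transcript.rstrip().endswith((".", "?", "!"))
    threshold = EAGER_SILENCE_MS if looks_complete else BASE_SILENCE_MS
    return silence_ms >= threshold


print(should_commit_turn(300, "What's the weather in Boston?"))  # True
print(should_commit_turn(300, "What's the weather in"))          # False
```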
Streaming: The Only Option
The latency budget makes one thing clear: you cannot wait for complete responses.
| Approach | User Experience |
|---|---|
| Generate complete → Send complete → Play complete | 2-5 second wait, then response |
| Stream generation → Buffer → Play buffered | 1-2 second wait, then response |
| Stream generation → Stream play | 200-400ms wait, then continuous audio |
The only viable architecture for conversational voice AI is full streaming:
- LLM streams text tokens as generated
- TTS processes chunks in parallel (sentence or clause boundaries)
- Audio streams to client as each chunk completes
- Playback begins with first chunk while later chunks generate
This overlaps latency. While chunk N plays, chunk N+1 is synthesizing, and chunk N+2 is generating. First-byte latency drops dramatically.
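A skeleton of that overlap, with llm_stream, tts_stream, and enqueue_audio standing in for your own providers rather than any specific SDK:

```python
# Skeleton of the overlapped pipeline. llm_stream, tts_stream, and
# enqueue_audio stand in for your own providers, not any specific SDK.
import asyncio
import re

SENTENCE_END = re.compile(r"[.!?]\s")


async def sentence_chunks(llm_stream):
    """Group streamed LLM tokens into sentence-sized TTS inputs."""
    buffer = ""
    async for token in llm_stream:
        buffer += token
        match = SENTENCE_END.search(buffer)
        if match:
            yield buffer[: match.end()].strip()
            buffer = buffer[match.end():]
    if buffer.strip():
        yield buffer.strip()


async def speak(llm_stream, tts_stream, enqueue_audio):
    async for sentence in sentence_chunks(llm_stream):
        async for audio_chunk in tts_stream(sentence):
            # enqueue_audio should only queue bytes and return, so synthesis
            # of the next sentence overlaps playback of the current one.
            await enqueue_audio(audio_chunk)


async def demo() -> None:
    async def fake_llm():
        for token in ["The ", "weather ", "is ", "75. ", "Skies ", "are ", "clear."]:
            yield token

    async def fake_tts(sentence):
        yield f"<audio:{sentence}>".encode()

    async def enqueue(chunk):
        print("queued", chunk)

    await speak(fake_llm(), fake_tts, enqueue)


asyncio.run(demo())
```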
The trade-off: you commit to words before you see the complete response. Corrections become impossible—the user already heard it.
The Network Problem
The hardest latency to control is network. You can optimize models. You can't optimize the internet.
Typical network latency:
- Same region: 20-50ms
- Cross-region: 50-150ms
- Cross-continent: 150-300ms
- Poor mobile connection: 200-500ms+
- With jitter: +0-200ms variance
A voice AI system optimized for 300ms total latency in US-West will fail in Singapore. A system that works on WiFi will fail on 4G in an elevator.
Mitigation Strategies
Edge deployment: Run inference closer to users. ElevenLabs competitors like Vapi use edge infrastructure (Cloudflare Workers) specifically to minimize network hops.
Connection persistence: WebSockets avoid TCP handshake overhead on each message. Keep connections alive.
Predictive buffering: Pre-generate likely responses (greetings, confirmations) and cache at edge.
Graceful degradation: Monitor latency. If exceeding threshold, switch to text or acknowledge delay ("Let me think about that...").
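A sketch of the last strategy, assuming respond() and say_filler() are async callables you already have for response generation and filler playback:

```python
# Graceful-degradation sketch: wait only as long as the budget allows, then
# acknowledge the delay. respond() and say_filler() are stand-ins for your
# own response generation and filler playback.
import asyncio

BUDGET_S = 0.5


async def answer_turn(respond, say_filler):
    task = asyncio.create_task(respond())
    try:
        return await asyncio.wait_for(asyncio.shield(task), timeout=BUDGET_S)
    except asyncio.TimeoutError:
        # Budget blown: fill the silence, then deliver the real answer.
        await say_filler("Let me think about that...")
        return await task
```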
The Quality-Latency Trade-Off
Everything in voice AI is a trade-off against the 500ms threshold:
| Want | Cost |
|---|---|
| Better voice quality | Higher TTS latency |
| More nuanced responses | Higher LLM latency |
| Better transcription | Higher ASR latency |
| Global availability | Higher network latency |
| Cheaper inference | Smaller models = lower quality |
The successful voice AI products are the ones that make these trade-offs consciously rather than discovering them in production.
Measurement and Monitoring
You can't improve what you don't measure. Critical metrics:
Time to First Byte (TTFB): Time from user speech end to first audio byte received. Target: <400ms.
End-to-End Latency: Time from user speech end to agent speech start. Target: <500ms.
Latency by Percentile: P50 latency is vanity. P95 latency is reality. If your P95 is 2 seconds, 5% of users are having a terrible experience.
Jitter: Variance in latency. Consistent 600ms is better than alternating 300ms/900ms.
Interruption Success Rate: How often barge-in correctly stops the agent and captures user intent.
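A minimal report over recorded end-to-end latencies, using nearest-rank percentiles and standard deviation as the jitter proxy:

```python
# Percentile and jitter report over recorded end-to-end latencies (ms),
# using nearest-rank percentiles and standard deviation as the jitter proxy.
import math
import statistics


def latency_report(samples_ms: list[float]) -> dict:
    ordered = sorted(samples_ms)

    def pct(p: float) -> float:
        rank = max(1, math.ceil(p * len(ordered)))  # classic nearest-rank
        return ordered[rank - 1]

    return {
        "p50_ms": pct(0.50),
        "p95_ms": pct(0.95),  # the number your worst-served users actually feel
        "jitter_ms": round(statistics.pstdev(ordered), 1),
    }


print(latency_report([310, 290, 450, 380, 1900, 420, 300, 350, 330, 2100]))
```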
The Architectural Takeaway
Voice AI is a latency discipline. Every component choice, every architectural decision, every deployment configuration flows from one constraint: the 500ms threshold.
The companies winning in voice AI aren't necessarily the ones with the best models. They're the ones who've optimized every millisecond in the pipeline—streaming everything, deploying at edge, handling interrupts cleanly, measuring obsessively.
Build for sub-500ms or don't build voice AI.
See also: ElevenLabs Infrastructure for the platform powering most voice AI, and Probabilistic Stack for engineering non-deterministic systems.