Frontier models advertise sub-200ms text-to-speech. Voice synthesis is indistinguishable from a human. Yet voice assistants still feel robotic.
The problem isn't the voice. It's the timing.
The uncanny valley of voice AI has shifted. It's no longer about synthetic intonation or flat affect. It's about the awkward pauses, the accidental interruptions, the inability to seamlessly trade the "floor" of conversation. We've solved generation. We haven't solved turn-taking.
The Psychophysics of Conversation
Research in Conversation Analysis reveals something counterintuitive: the average gap between human turns is approximately 200 milliseconds. This is startlingly short—significantly shorter than the 600-1500ms required to cognitively plan a response.
The implication is profound. Humans don't wait for silence to begin formulating replies. We engage in projective turn-taking: parsing the incoming speech stream in real-time, predicting when the sentence will end, and preparing our response in parallel.
We detect "Transition Relevance Places" (TRPs)—points where a turn could legitimately end—using a fusion of signals:
- Syntactic completeness: Recognizing a grammatical structure is resolving
- Prosodic contours: Pitch declination, syllable stretching, volume drop
- Pragmatic cues: Eye gaze, rising pitch for questions, social context
Traditional voice AI systems can't do this. They operate reactively: wait for silence, wait for a timeout to confirm the turn is over, transcribe, generate, synthesize. This serial pipeline creates latencies of 800ms to over 3 seconds.
The user experience shifts from "conversing with an agent" to "operating a machine."
The Three Failure Modes
Effective turn management isn't about detecting silence. It's about managing the "floor"—the acknowledged right to speak. Voice AI fails in three distinct ways:
1. False Cut-off (Early Endpointing)
The user pauses to think, and the system interprets this as a yield. It interrupts the user's thought process. This is the most destructive error for trust—it signals the system isn't listening, merely waiting for a gap.
2. Delayed Response (Late Endpointing)
The system waits too long after the user finishes. In human conversation, a gap longer than 1 second signals trouble—disagreement, confusion, transmission failure. Users respond by asking "Are you there?" or repeating themselves, which collides with the system's eventual response.
3. Barge-in Failure
The user attempts to interrupt ("No, I meant Tuesday") but the system continues speaking. Usually caused by lack of full-duplex listening or ineffective echo cancellation. The system ignores the user's urgent control signals.
Voice Activity Detection: The Foundation
Voice Activity Detection (VAD) distinguishes speech from silence and noise. Simple in concept; complex in practice.
Classical VAD
Energy thresholding: If amplitude exceeds a threshold, classify as speech. Fails catastrophically in cafés, cars, or anywhere with non-stationary noise.
Zero-crossing rate: Counts how often the signal crosses zero. Speech (especially fricatives) has high ZCR; periodic noise has low ZCR. Better, but still fragile.
Spectral features + GMMs: WebRTC VAD analyzes frequency characteristics. More robust, but struggles with "voice-like" noise—music, TV, background chatter.
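To make the fragility concrete, here is a minimal sketch of frame-level energy and zero-crossing classifiers in Python (NumPy only); the threshold values are arbitrary, and the point is that any fixed threshold drifts out of calibration as soon as the noise floor changes.

```python
import numpy as np

def energy_vad(frame: np.ndarray, threshold_db: float = -35.0) -> bool:
    """Flag a float32 frame (-1..1) as speech when its RMS energy exceeds a
    fixed threshold. A static threshold breaks the moment the noise floor
    shifts (cafe, car, open office)."""
    rms = np.sqrt(np.mean(frame ** 2))
    return 20 * np.log10(rms + 1e-12) > threshold_db

def zero_crossing_rate(frame: np.ndarray) -> float:
    """Fraction of adjacent samples whose sign flips: fricatives push this
    high, low-frequency hum keeps it low, but music and chatter confuse it."""
    return float(np.mean(np.abs(np.diff(np.sign(frame))) > 0))
```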
Neural VAD
Deep learning changed the game. Modern VADs like Silero VAD learn to distinguish speech from noise by training on thousands of hours of diverse audio.
At 5% False Positive Rate:
- WebRTC VAD: ~50% True Positive Rate
- Silero VAD: 87.7% True Positive Rate
- Picovoice Cobra: 98.9% True Positive Rate
The difference matters in production. WebRTC might miss soft speech or trigger constantly on typing sounds. Silero runs on a single CPU thread with sub-millisecond inference—viable for real-time applications without GPU.
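As a concrete sketch, Silero VAD can be loaded straight from torch.hub and run over a recorded clip. The snippet below follows the published snakers4/silero-vad usage; the filename is a placeholder and helper names may shift between releases.

```python
import torch

# Model is ~2 MB and runs in real time on a single CPU thread.
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, *_ = utils

wav = read_audio("caller_audio.wav", sampling_rate=16000)  # placeholder file
# Returns e.g. [{'start': 4800, 'end': 29600}, ...] in samples.
segments = get_speech_timestamps(wav, model, sampling_rate=16000)
print(segments)
```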
The State Machine Problem
VAD outputs continuous probabilities. Converting these to discrete events—SPEECH_START and SPEECH_END—requires a state machine with two critical parameters:
Prefix padding (pre-roll): To catch initial consonants that might be low-energy (the /h/ in "Hello"), systems maintain a rolling buffer. When speech is detected, this buffer is prepended to the audio.
Hangover time: How long to wait after speech energy drops before declaring SPEECH_END.
- Short hangover (200ms): Snappy, but cuts off users mid-thought
- Long hangover (1000ms): Safe, but adds unavoidable latency to every turn
This is the fundamental tradeoff. Every millisecond of hangover adds directly to response latency—unless you use smarter approaches.
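A minimal sketch of such a state machine, with illustrative frame size, pre-roll, hangover, and threshold values:

```python
from collections import deque
from enum import Enum

class TurnState(Enum):
    SILENCE = 0
    SPEECH = 1

class VadStateMachine:
    """Converts per-frame speech probabilities into SPEECH_START / SPEECH_END
    events. All timings here are illustrative, not tuned values."""

    def __init__(self, frame_ms=30, prefix_ms=300, hangover_ms=500, threshold=0.5):
        self.state = TurnState.SILENCE
        self.threshold = threshold
        self.preroll = deque(maxlen=max(1, prefix_ms // frame_ms))  # rolling pre-roll
        self.hangover_frames = hangover_ms // frame_ms
        self.silence_run = 0

    def push(self, frame_audio, speech_prob):
        events = []
        if self.state is TurnState.SILENCE:
            self.preroll.append(frame_audio)
            if speech_prob >= self.threshold:
                # Prepend the buffered pre-roll so low-energy onsets
                # (the /h/ in "Hello") make it into the transcript.
                events.append(("SPEECH_START", list(self.preroll)))
                self.state = TurnState.SPEECH
                self.silence_run = 0
        else:
            if speech_prob < self.threshold:
                self.silence_run += 1
                # Only declare the turn over after a full hangover of silence.
                if self.silence_run >= self.hangover_frames:
                    events.append(("SPEECH_END", None))
                    self.state = TurnState.SILENCE
                    self.preroll.clear()
            else:
                self.silence_run = 0
        return events
```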
The Evolution of Endpointing
Endpointing decides whether a pause signifies turn completion. It's distinct from VAD: VAD detects silence; endpointing detects completion.
Generation 1: Fixed Timeout
`if silence_duration > T: endpoint()`
Typical values: 700-1500ms. Context-blind. Treats a pause after "I want to book a flight." exactly like a pause in "I want to book a..." Result: walkie-talkie interaction.
Generation 2: Adaptive Timeout
Adjust T based on transcript content:
- Short transcript ("Yes", "No") → lower threshold (300ms)
- Ends in connector ("and", "but", "because") → extend threshold (1500ms)
Better, but still reactive. Struggles with variable speech rates.
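A sketch of the Generation 2 idea; the word lists and thresholds are illustrative, not tuned values:

```python
# Hypothetical heuristic: adjust the silence timeout T from the live transcript.
CONNECTORS = {"and", "but", "because", "so", "or"}
SHORT_REPLIES = {"yes", "no", "yeah", "nope", "okay"}

def adaptive_timeout_ms(partial_transcript: str, default_ms: int = 700) -> int:
    words = partial_transcript.lower().strip(" .,?!").split()
    if not words:
        return default_ms
    if len(words) <= 2 and words[-1] in SHORT_REPLIES:
        return 300          # snappy endpoint for yes/no answers
    if words[-1] in CONNECTORS:
        return 1500         # the user almost certainly isn't done
    return default_ms
```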
Generation 3: Acoustic & Prosodic
Analyze how things are said, not just what:
- Pitch declination: F0 drops at declarative sentence endings
- Phrase-final lengthening: Last syllable stretches
- Intensity drop: Volume decreases
RNN-Transducers trained on acoustic frames can detect End-of-Utterance (EOU) based on these features. Amazon Alexa and Google Assistant shaved hundreds of milliseconds by differentiating "thinking pauses" (flat pitch, sustained vowel) from "final pauses" (dropped pitch).
Generation 4: Semantic Endpointing (Current SOTA)
Integrate real-time ASR with language understanding to determine if the utterance is grammatically and semantically complete.
- "I want to order a..." → Incomplete → Extend timeout
- "I want to order a pizza." → Complete → Endpoint immediately, even with short silence
Commercial implementations:
- Deepgram UtteranceEnd: Joint acoustic-text analysis triggers when semantic completeness is detected
- OpenAI Realtime API: `semantic_vad` mode with configurable "eagerness"
- AssemblyAI Universal-Streaming: Markets semantic endpointing as a replacement for silence-based methods
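One way to prototype the idea is to ask a small model whether the interim transcript reads as a finished thought. Production systems use dedicated low-latency classifiers rather than an extra LLM round trip; the model name and prompt below are placeholders that just show the shape of the decision.

```python
from openai import OpenAI  # placeholder judge; any fast classifier works here

client = OpenAI()

def utterance_seems_complete(interim_transcript: str) -> bool:
    """Ask a small model whether the interim transcript is a finished thought."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; a tiny local model is more realistic
        messages=[
            {"role": "system",
             "content": "Reply with exactly one word, 'complete' or 'incomplete': "
                        "is the following spoken utterance a finished thought?"},
            {"role": "user", "content": interim_transcript},
        ],
        max_tokens=3,
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower().startswith("complete")

def endpoint_delay_ms(interim_transcript: str) -> int:
    # Endpoint quickly when the text reads as finished; otherwise hold the floor.
    return 150 if utterance_seems_complete(interim_transcript) else 1200
```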
Generation 5: Predictive Turn-Taking (Frontier)
Instead of detecting ends, predict them.
Voice Activity Projection (VAP): Project speech probability into the future (next 2 seconds). Identify upcoming TRPs and prepare responses before the user stops—enabling "latching" with no-gap transitions.
Multimodal fusion: In video-enabled contexts, gaze aversion strongly signals "thinking" while gaze return signals yield. Audio-visual models reduce false cut-offs significantly.
Full-duplex models (Moshi): Audio-in, audio-out without text transcription. Models user and system as parallel streams. Handles interruptions and backchannels organically—no explicit endpointing logic required.
The Latency Budget
Latency defines the quality of turn management. A system with perfect understanding but 3-second latency is unusable.
Target: Sub-800ms for natural feel; sub-500ms for "interruptible" conversation.
Anatomy of Delay
| Stage | Typical Latency | Optimization |
|---|---|---|
| VAD + Ingest | 200-500ms | Reduce hangover; semantic VAD |
| Network (up) | 20-100ms | Edge servers; WebRTC rather than WebSocket |
| ASR | 100-300ms | Streaming ASR; start LLM on interim results |
| LLM TTFT (time to first token) | 200-600ms | Smaller models; speculative decoding |
| TTS | 100-300ms | Streaming TTS; start on first sentence |
| Network (down) | 20-100ms | Edge caching; WebRTC data channels |
| Client buffer | 20-50ms | Adaptive jitter buffering |
| Total | 660-1950ms | Target: <800ms |
Provider Benchmarks (2025)
- OpenAI GPT-4o Realtime: ~320ms average. Massive reduction from 5.4s of previous pipelines. Single integrated model eliminates serialization overhead.
- Retell AI: ~420ms average, 380ms best case. Optimized for telephony.
- Deepgram Flux: 200-600ms savings via "Eager" mode—speculative endpointing 150-250ms before confirmation.
- Twilio Voice: ~950ms. PSTN latency floor plus WebSocket overhead.
- ElevenLabs: Sub-200ms TTS time-to-first-byte. End-to-end depends on ASR and LLM.
Masking Perceived Latency
Physical latency has hard limits. Psychological tricks mask the delay:
Filler injection: Immediately play "Hmm," "Let me see," or a thinking sound upon endpoint detection. Acknowledges the turn instantly (<200ms) while LLM generates substance. Keeps the floor.
Backchanneling: Insert "Yeah," "Right," "Uh-huh" while user speaks. Signals active listening.
Speculative execution: Trigger LLM generation before VAD makes a final decision. If VAD later decides the turn isn't over, discard the speculative response. Burns compute, saves critical milliseconds.
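A sketch of filler injection inside an async agent loop; `llm_generate`, `tts_stream`, and `play_audio` are assumed helpers provided by the surrounding agent, not library APIs.

```python
import asyncio
import random

FILLERS = ["mm_hmm.wav", "let_me_see.wav", "one_moment.wav"]  # pre-synthesized clips

async def respond(transcript, llm_generate, tts_stream, play_audio):
    """Grab the floor within ~100-200ms of the endpoint, then stream the real
    answer once the LLM produces it."""
    filler = asyncio.create_task(play_audio(random.choice(FILLERS)))
    reply_text = await llm_generate(transcript)   # 200-600ms time to first token
    await filler                                  # don't talk over the filler clip
    async for chunk in tts_stream(reply_text):    # streaming TTS, sentence by sentence
        await play_audio(chunk)
```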
Transport Protocols
The choice between WebSocket and WebRTC is the most consequential architectural decision.
| Feature | WebSocket | WebRTC |
|---|---|---|
| Protocol | TCP | UDP |
| Lost packets | Retransmit (head-of-line blocking) | Skip (stream continues) |
| Latency behavior | Variable jitter, audio glitches | Consistent timing, minor artifacts |
| Media features | None (raw bytes) | Built-in AEC, noise suppression, AGC |
Verdict: WebRTC for client-side voice AI. Built-in Acoustic Echo Cancellation is what makes barge-in possible in browsers. WebSockets work for server-to-server telephony where networks are stable.
LiveKit Architecture
LiveKit has emerged as the dominant infrastructure layer, abstracting WebRTC complexity:
- Client (WebRTC) → LiveKit SFU → Agent Framework
- Native Silero VAD integration
- Handles packet loss, jitter buffering, echo cancellation
- Developers focus on agent logic, not media transport
Telephony (Twilio)
Twilio Media Streams use WebSockets carrying μ-law (mulaw) audio. PSTN overhead creates a ~500ms latency floor. Barge-in is harder because the server lacks a clean reference signal for echo cancellation.
Interruption Handling (Barge-In)
The ability to interrupt—"Wait, stop" or "I meant Tuesday"—separates voice bots from conversational agents.
Acoustic Echo Cancellation
If the AI is speaking, its audio plays from the user's speakers and re-enters the microphone. Without AEC, the AI "hears itself" and interrupts itself.
Browser/WebRTC: Excellent built-in AEC. The client knows exactly when audio played and can subtract with millisecond precision.
Telephony: Extremely difficult. The server knows when it sent audio, not when it played. Network jitter makes precise subtraction impossible. Telephony bots often use half-duplex behavior (mute mic while speaking) or aggressive noise gating.
Clear Buffer Logic
When barge-in is detected:
- Stop synthesis: Halt TTS immediately
- Clear output buffer: Discard audio generated but not yet played
- State rewind: Determine what the user actually heard. If the AI generated 10 seconds of speech but was interrupted at second 2, the context for the next turn must include only those 2 seconds.
This requires precise timestamping of every word sent.
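A minimal sketch of that bookkeeping, assuming the TTS engine supplies word-level alignments; the barge-in hooks in the trailing comments are hypothetical.

```python
import time

class PlaybackTracker:
    """Tracks which synthesized words the user has actually heard, so the
    conversation history can be truncated correctly on barge-in."""

    def __init__(self):
        self.words = []                 # [("Sure,", 0.00, 0.18), ...] from TTS alignment
        self.playback_started_at = None

    def start_playback(self, aligned_words):
        self.words = aligned_words
        self.playback_started_at = time.monotonic()

    def heard_so_far(self) -> str:
        if self.playback_started_at is None:
            return ""
        elapsed = time.monotonic() - self.playback_started_at
        return " ".join(w for w, start, _ in self.words if start <= elapsed)

# On barge-in (hypothetical hooks):
#   tts.stop(); output_buffer.clear()
#   history[-1]["content"] = tracker.heard_so_far()   # rewind the assistant turn
```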
Semantic Filtering
Not all speech during AI output is interruption. Users often backchannel ("uh-huh", "right") to signal agreement without wanting the floor.
Heuristic: If speech_duration < 500ms AND transcript in ["yes", "ok", "mhm"] → Continue speaking. Otherwise → Yield floor.
Backchannels
Backchannels are short, often non-lexical utterances ("mhm", "uh-huh") that signal attention without demanding a turn.
Handling Incoming Backchannels
Challenge: Don't mistake them for interruptions.
Solution: Semantic filter on VAD output. If transcript matches backchannel words and duration is short, suppress the interruption event.
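A sketch of such a filter; the word list and thresholds mirror the heuristic above and should be tuned, and `agent.interrupt()` is an assumed hook rather than a specific library API.

```python
BACKCHANNELS = {"yes", "yeah", "ok", "okay", "mhm", "uh-huh", "right", "sure"}

def is_backchannel(transcript: str, speech_duration_ms: float) -> bool:
    """Short, content-free acknowledgements are not bids for the floor."""
    words = transcript.lower().strip(" .,!?").split()
    return (
        speech_duration_ms < 500
        and 0 < len(words) <= 2
        and all(w in BACKCHANNELS for w in words)
    )

def on_user_speech(transcript, duration_ms, agent):
    if is_backchannel(transcript, duration_ms):
        return               # keep speaking; suppress the interruption event
    agent.interrupt()        # assumed hook: stop TTS, clear buffers, rewind state
```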
Generating Backchannels
To sound human, the AI should occasionally backchannel while the user speaks at length. This is rare in production because it requires:
- Extremely low latency
- Precise timing (during brief pause or pitch valley)
- Risk management (wrong timing sounds like interruption)
End-to-end models like Moshi generate backchannels organically by modeling the "listener" state as an active process.
Production Implementations
OpenAI Realtime API
Integrated ASR+LLM+TTS in a single stateful WebSocket.
- VAD options: `server_vad` (energy/silence) or `semantic_vad` (model-based)
- Interruption: Automatic truncation with the `response.cancel` event
- Configuration: `prefix_padding_ms`, `silence_duration_ms`, `eagerness` levels
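A hedged sketch of the corresponding `session.update` payload; the field names follow OpenAI's published Realtime documentation at the time of writing and may change.

```python
import json

# turn_detection settings sent over the Realtime WebSocket as a session.update event.
session_update = {
    "type": "session.update",
    "session": {
        "turn_detection": {
            "type": "semantic_vad",     # or "server_vad"
            "eagerness": "medium",      # low | medium | high | auto
            # server_vad-only knobs:
            # "threshold": 0.5,
            # "prefix_padding_ms": 300,
            # "silence_duration_ms": 500,
        }
    },
}
# ws.send(json.dumps(session_update))  # ws: an already-open Realtime WebSocket
```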
Deepgram Flux
Specialized for voice agents, decoupled from LLM.
- UtteranceEnd: Semantic completeness, not just silence
- Eager mode: Probabilistic early signal to start LLM 150-250ms before confirmation
- Claims 200-600ms latency reduction
LiveKit Agents
Orchestration framework—bring your own models.
- Plug in Deepgram (STT), OpenAI (LLM), ElevenLabs (TTS)
- Silero VAD runs locally on agent server
- Configurable `min_endpointing_delay`
- Custom turn detector models supported
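A hedged sketch of wiring this together with the Python SDK; class and parameter names follow LiveKit's 1.x agent examples and may differ across framework versions.

```python
from livekit.agents import Agent, AgentSession
from livekit.plugins import deepgram, elevenlabs, openai, silero

async def entrypoint(ctx):
    # Worker registration (cli.run_app / WorkerOptions) omitted for brevity.
    session = AgentSession(
        vad=silero.VAD.load(),              # Silero runs locally on the agent server
        stt=deepgram.STT(),
        llm=openai.LLM(model="gpt-4o-mini"),
        tts=elevenlabs.TTS(),
        min_endpointing_delay=0.5,          # seconds of silence before endpointing
    )
    await session.start(
        room=ctx.room,
        agent=Agent(instructions="You are a concise voice assistant."),
    )
```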
Hume EVI
Empathic Voice Interface with emotion-aware turn-taking.
- Prosody model: Analyzes tone, rhythm, timbre
- Adaptive behavior: Longer timeout for angry users (let them vent); quicker response to energetic users
The Frontier
End-to-End Speech Models
The traditional cascade (VAD → ASR → LLM → TTS) accumulates latency at every step. End-to-end models collapse the pipeline.
Moshi (Kyutai) and GPT-4o Audio: Input raw audio tokens, output raw audio tokens. No text transcription in the middle. Can learn non-verbal cues (breath intake, intonation) directly from audio. True full-duplex operation.
Predictive Turn-Taking
Voice Activity Projection predicts where turns will end, not where they did. Prepare responses proactively. Synthesize before the user has completely stopped.
Visual Turn Cues
For video-enabled agents, gaze direction is a stronger turn predictor than prosody in many contexts. Integrated audio-visual VADs significantly reduce false cut-offs by detecting whether the user is still "engaged" during silence.
The Key Takeaways
For Engineers Building Voice AI:
- Don't rely on silence alone. Semantic endpointing differentiates thinking pauses from completion pauses.
- Optimize for barge-in. WebRTC for client-side AEC. Implement state rewinding for telephony.
- Tune VAD dynamically. Short timeout for yes/no questions; long timeout for open-ended.
- Monitor latency distributions. A single 2-second delay breaks the illusion of presence.
- Use speculative generation. Trade compute for lower perceived latency.
The uncanny valley of voice AI is closing—not through better voices, but through better timing. Silence is a fundamentally flawed proxy for turn completion. The systems winning today understand not just what users say, but when they're done saying it.