Technical Deep Dive

The Turn-Taking Problem: Why Voice AI Still Feels Robotic

The engineering behind making machines talk in conversation—beyond TTS quality to the temporal dynamics that make or break natural voice interaction.

MMNTM Research
15 min read
#voice #infrastructure #latency #production #real-time

Frontier models advertise sub-200ms text-to-speech. Synthesis quality is nearly indistinguishable from human speech. Yet voice assistants still feel robotic.

The problem isn't the voice. It's the timing.

The uncanny valley of voice AI has shifted. It's no longer about synthetic intonation or flat affect. It's about the awkward pauses, the accidental interruptions, the inability to seamlessly trade the "floor" of conversation. We've solved generation. We haven't solved turn-taking.


The Psychophysics of Conversation

Research in Conversation Analysis reveals something counterintuitive: the average gap between human turns is approximately 200 milliseconds. This is startlingly short—significantly shorter than the 600-1500ms required to cognitively plan a response.

The implication is profound. Humans don't wait for silence to begin formulating replies. We engage in projective turn-taking: parsing the incoming speech stream in real-time, predicting when the sentence will end, and preparing our response in parallel.

We detect "Transition Relevance Places" (TRPs)—points where a turn could legitimately end—using a fusion of signals:

  • Syntactic completeness: Recognizing a grammatical structure is resolving
  • Prosodic contours: Pitch declination, syllable stretching, volume drop
  • Pragmatic cues: Eye gaze, rising pitch for questions, social context

Traditional voice AI systems can't do this. They operate reactively: wait for silence, wait for a timeout to confirm the turn is over, transcribe, generate, synthesize. This serial pipeline creates latencies of 800ms to over 3 seconds.

The user experience shifts from "conversing with an agent" to "operating a machine."


The Three Failure Modes

Effective turn management isn't about detecting silence. It's about managing the "floor"—the acknowledged right to speak. Voice AI fails in three distinct ways:

1. False Cut-off (Early Endpointing)

The user pauses to think, and the system interprets this as a yield. It interrupts the user's thought process. This is the most destructive error for trust—it signals the system isn't listening, merely waiting for a gap.

2. Delayed Response (Late Endpointing)

The system waits too long after the user finishes. In human conversation, a gap longer than 1 second signals trouble—disagreement, confusion, transmission failure. Users respond by asking "Are you there?" or repeating themselves, which collides with the system's eventual response.

3. Barge-in Failure

The user attempts to interrupt ("No, I meant Tuesday") but the system continues speaking. Usually caused by lack of full-duplex listening or ineffective echo cancellation. The system ignores the user's urgent control signals.


Voice Activity Detection: The Foundation

Voice Activity Detection (VAD) distinguishes speech from silence and noise. Simple in concept; complex in practice.

Classical VAD

Energy thresholding: If amplitude exceeds a threshold, classify as speech. Fails catastrophically in cafés, cars, or anywhere with non-stationary noise.

Zero-crossing rate: Counts how often the signal crosses zero. Speech (especially fricatives) has high ZCR; periodic noise has low ZCR. Better, but still fragile.

Spectral features + GMMs: WebRTC VAD analyzes frequency characteristics. More robust, but struggles with "voice-like" noise—music, TV, background chatter.
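
A minimal sketch of the first two classical features on a single frame of 16-bit PCM; the fixed thresholds are illustrative and are exactly what breaks down in non-stationary noise:

```python
import numpy as np

def classical_vad(frame: np.ndarray, energy_thresh: float = 0.01,
                  zcr_thresh: float = 0.1) -> bool:
    x = frame.astype(np.float32) / 32768.0                 # normalize int16 samples
    energy = float(np.mean(x ** 2))                        # short-time energy
    zcr = float(np.mean(np.abs(np.diff(np.sign(x)))) / 2)  # zero-crossing rate
    return energy > energy_thresh or zcr > zcr_thresh      # "speech" if either fires
```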

Neural VAD

Deep learning changed the game. Modern VADs like Silero VAD learn to distinguish speech from noise by training on thousands of hours of diverse audio.

At 5% False Positive Rate:

  • WebRTC VAD: ~50% True Positive Rate
  • Silero VAD: 87.7% True Positive Rate
  • Picovoice Cobra: 98.9% True Positive Rate

The difference matters in production. WebRTC might miss soft speech or trigger constantly on typing sounds. Silero runs on a single CPU thread with sub-millisecond inference—viable for real-time applications without GPU.
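
For reference, a sketch of frame-level streaming inference with Silero VAD loaded via torch.hub. The chunk size (512 samples at 16 kHz) and call signature follow the published Silero examples but vary between releases, so treat them as assumptions:

```python
import torch

model, _utils = torch.hub.load("snakers4/silero-vad", "silero_vad")

SAMPLE_RATE = 16_000
CHUNK = 512  # roughly 32 ms per decision at 16 kHz

def speech_probabilities(pcm: torch.Tensor):
    """Yield one speech probability per chunk of mono float32 audio."""
    for start in range(0, pcm.shape[0] - CHUNK + 1, CHUNK):
        yield model(pcm[start:start + CHUNK], SAMPLE_RATE).item()
```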

The State Machine Problem

VAD outputs continuous probabilities. Converting these to discrete events—SPEECH_START and SPEECH_END—requires a state machine with two critical parameters:

Prefix padding (pre-roll): To catch initial consonants that might be low-energy (the /h/ in "Hello"), systems maintain a rolling buffer. When speech is detected, this buffer is prepended to the audio.

Hangover time: How long to wait after speech energy drops before declaring SPEECH_END.

  • Short hangover (200ms): Snappy, but cuts off users mid-thought
  • Long hangover (1000ms): Safe, but adds unavoidable latency to every turn

This is the fundamental tradeoff. Every millisecond of hangover adds directly to response latency—unless you use smarter approaches.
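
A minimal sketch of that state machine, assuming one VAD probability per 32ms frame; the threshold, pre-roll length, and hangover length are illustrative tuning knobs, not recommendations:

```python
from collections import deque

PREROLL_FRAMES = 10      # ~320 ms rolling buffer to recover soft onsets
HANGOVER_FRAMES = 15     # ~480 ms of silence before declaring SPEECH_END
THRESHOLD = 0.5          # speech-probability cutoff

class TurnSegmenter:
    def __init__(self):
        self.preroll = deque(maxlen=PREROLL_FRAMES)
        self.in_speech = False
        self.silent_frames = 0
        self.utterance = []

    def push(self, frame_audio, speech_prob):
        """Feed one frame; returns the finished utterance on SPEECH_END, else None."""
        if not self.in_speech:
            self.preroll.append(frame_audio)
            if speech_prob >= THRESHOLD:
                # SPEECH_START: prepend the pre-roll so low-energy onsets survive
                self.in_speech = True
                self.utterance = list(self.preroll)
                self.silent_frames = 0
            return None

        self.utterance.append(frame_audio)
        self.silent_frames = 0 if speech_prob >= THRESHOLD else self.silent_frames + 1
        if self.silent_frames >= HANGOVER_FRAMES:
            # SPEECH_END: hangover elapsed with no new speech
            self.in_speech = False
            finished, self.utterance = self.utterance, []
            return finished
        return None
```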


The Evolution of Endpointing

Endpointing decides if a pause signifies turn completion. It's distinct from VAD: VAD detects silence; endpointing detects completion.

Generation 1: Fixed Timeout

if (silence_duration > T) then Endpoint

Typical values: 700-1500ms. Context-blind. Treats a pause after "I want to book a flight." exactly like a pause in "I want to book a..." Result: walkie-talkie interaction.

Generation 2: Adaptive Timeout

Adjust T based on transcript content:

  • Short transcript ("Yes", "No") → lower threshold (300ms)
  • Ends in connector ("and", "but", "because") → extend threshold (1500ms)

Better, but still reactive. Struggles with variable speech rates.
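
As a sketch, the difference between the two generations fits in a few lines; the thresholds mirror the values quoted above and are illustrative:

```python
CONNECTORS = {"and", "but", "because", "so", "or"}
SHORT_REPLIES = {"yes", "no", "yeah", "nope", "ok"}

def fixed_timeout(silence_ms: float, T: float = 1000.0) -> bool:
    """Generation 1: endpoint on silence duration alone."""
    return silence_ms > T

def adaptive_timeout(silence_ms: float, transcript: str) -> bool:
    """Generation 2: adjust the threshold from the interim transcript."""
    words = transcript.lower().split()
    if not words:
        threshold = 1000.0
    elif len(words) <= 2 and words[-1] in SHORT_REPLIES:
        threshold = 300.0        # quick confirmations: endpoint fast
    elif words[-1] in CONNECTORS:
        threshold = 1500.0       # trailing connector: the user isn't done
    else:
        threshold = 700.0
    return silence_ms > threshold
```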

Generation 3: Acoustic & Prosodic

Analyze how things are said, not just what:

  • Pitch declination: F0 drops at declarative sentence endings
  • Phrase-final lengthening: Last syllable stretches
  • Intensity drop: Volume decreases

RNN-Transducers trained on acoustic frames can detect End-of-Utterance (EOU) based on these features. Amazon Alexa and Google Assistant shaved hundreds of milliseconds by differentiating "thinking pauses" (flat pitch, sustained vowel) from "final pauses" (dropped pitch).
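
To illustrate the acoustic signal involved, one crude way to separate a "thinking pause" from a "final pause" is to check whether F0 declines over the last half-second before the silence. This sketch uses librosa's pyin pitch tracker; the slope cutoff is arbitrary, and this is not what Alexa or Google ship:

```python
import librosa
import numpy as np

def pitch_is_declining(audio: np.ndarray, sr: int, tail_s: float = 0.5) -> bool:
    tail = audio[-int(tail_s * sr):]
    f0, voiced, _ = librosa.pyin(tail, fmin=60, fmax=400, sr=sr)
    f0 = f0[voiced]                                      # keep voiced frames only
    if f0.size < 5:
        return False                                     # too little evidence to endpoint early
    slope = np.polyfit(np.arange(f0.size), f0, deg=1)[0]
    return slope < -0.5                                  # Hz per frame; arbitrary cutoff
```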

Generation 4: Semantic Endpointing (Current SOTA)

Integrate real-time ASR with language understanding to determine if the utterance is grammatically and semantically complete.

  • "I want to order a..." → Incomplete → Extend timeout
  • "I want to order a pizza." → Complete → Endpoint immediately, even with short silence

Commercial implementations:

  • Deepgram UtteranceEnd: Joint acoustic-text analysis that triggers when semantic completeness is detected
  • OpenAI Realtime API: semantic_vad mode with configurable "eagerness"
  • AssemblyAI Universal-Streaming: Markets semantic endpointing as replacement for silence-based methods

Generation 5: Predictive Turn-Taking (Frontier)

Instead of detecting ends, predict them.

Voice Activity Projection (VAP): Project speech probability into the future (next 2 seconds). Identify upcoming TRPs and prepare responses before the user stops—enabling "latching" with no-gap transitions.

Multimodal fusion: In video-enabled contexts, gaze aversion strongly signals "thinking" while gaze return signals yield. Audio-visual models reduce false cut-offs significantly.

Full-duplex models (Moshi): Audio-in, audio-out without text transcription. Models user and system as parallel streams. Handles interruptions and backchannels organically—no explicit endpointing logic required.


The Latency Budget

Latency defines the quality of turn management. A system with perfect understanding but 3-second latency is unusable.

Target: Sub-800ms for natural feel; sub-500ms for "interruptible" conversation.

Anatomy of Delay

| Stage          | Typical Latency | Optimization                                |
| -------------- | --------------- | ------------------------------------------- |
| VAD + Ingest   | 200-500ms       | Reduce hangover; semantic VAD               |
| Network (up)   | 20-100ms        | Edge servers; WebRTC over WebSocket         |
| ASR            | 100-300ms       | Streaming ASR; start LLM on interim results |
| LLM TTFT       | 200-600ms       | Smaller models; speculative decoding        |
| TTS            | 100-300ms       | Streaming TTS; start on first sentence      |
| Network (down) | 20-100ms        | Edge caching; WebRTC data channels          |
| Client buffer  | 20-50ms         | Adaptive jitter buffering                   |
| Total          | 660-1950ms      | Target: <800ms                              |

Provider Benchmarks (2025)

  • OpenAI GPT-4o Realtime: ~320ms average, a massive reduction from the 5.4s of previous cascaded pipelines. A single integrated model eliminates serialization overhead.
  • Retell AI: ~420ms average, 380ms best case. Optimized for telephony.
  • Deepgram Flux: 200-600ms savings via "Eager" mode—speculative endpointing 150-250ms before confirmation.
  • Twilio Voice: ~950ms. PSTN latency floor plus WebSocket overhead.
  • ElevenLabs: Sub-200ms TTS time-to-first-byte. End-to-end depends on ASR and LLM.

Masking Perceived Latency

Physical latency has hard limits. Psychological tricks mask the delay:

Filler injection: Immediately play "Hmm," "Let me see," or a thinking sound upon endpoint detection. Acknowledges the turn instantly (<200ms) while LLM generates substance. Keeps the floor.

Backchanneling: Insert "Yeah," "Right," "Uh-huh" while user speaks. Signals active listening.

Speculative execution: Trigger LLM generation before VAD makes a final decision. If VAD later decides the turn isn't over, discard the speculative response. Burns compute, saves critical milliseconds.
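
A sketch of speculative execution with asyncio task cancellation; generate_reply stands in for whatever streaming LLM call the agent actually uses:

```python
import asyncio

class SpeculativeResponder:
    def __init__(self, generate_reply):
        self.generate_reply = generate_reply   # async callable: transcript -> reply text
        self.task: asyncio.Task | None = None

    def on_provisional_endpoint(self, transcript: str) -> None:
        """Early, unconfirmed endpoint: start generating immediately."""
        self.task = asyncio.create_task(self.generate_reply(transcript))

    def on_speech_resumed(self) -> None:
        """The user kept talking: throw the speculative response away."""
        if self.task is not None and not self.task.done():
            self.task.cancel()
        self.task = None

    async def on_confirmed_endpoint(self) -> str:
        """Endpoint confirmed: the reply is already partly or fully generated."""
        assert self.task is not None
        return await self.task
```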


Transport Protocols

The choice between WebSocket and WebRTC is the most consequential architectural decision.

| Feature          | WebSocket                          | WebRTC                               |
| ---------------- | ---------------------------------- | ------------------------------------ |
| Protocol         | TCP                                | UDP                                  |
| Lost packets     | Retransmit (head-of-line blocking) | Skip (stream continues)              |
| Latency behavior | Variable jitter, audio glitches    | Consistent timing, minor artifacts   |
| Media features   | None (raw bytes)                   | Built-in AEC, noise suppression, AGC |

Verdict: WebRTC for client-side voice AI. Built-in Acoustic Echo Cancellation is what makes barge-in possible in browsers. WebSockets work for server-to-server telephony where networks are stable.

LiveKit Architecture

LiveKit has emerged as the dominant infrastructure layer, abstracting WebRTC complexity:

  • Client (WebRTC) → LiveKit SFU → Agent Framework
  • Native Silero VAD integration
  • Handles packet loss, jitter buffering, echo cancellation
  • Developers focus on agent logic, not media transport

Telephony (Twilio)

Twilio Media Streams deliver audio over WebSockets as 8 kHz μ-law. PSTN overhead creates a ~500ms latency floor. Barge-in is harder because the server doesn't have a clean reference signal for echo cancellation.


Interruption Handling (Barge-In)

The ability to interrupt—"Wait, stop" or "I meant Tuesday"—separates voice bots from conversational agents.

Acoustic Echo Cancellation

If the AI is speaking, its audio plays from the user's speakers and re-enters the microphone. Without AEC, the AI "hears itself" and interrupts itself.

Browser/WebRTC: Excellent built-in AEC. The client knows exactly when audio played and can subtract with millisecond precision.

Telephony: Extremely difficult. The server knows when it sent audio, not when it played. Network jitter makes precise subtraction impossible. Telephony bots often use half-duplex behavior (mute mic while speaking) or aggressive noise gating.

Clear Buffer Logic

When barge-in is detected:

  1. Stop synthesis: Halt TTS immediately
  2. Clear output buffer: Discard audio generated but not yet played
  3. State rewind: Determine what the user actually heard. If AI generated 10 seconds but was interrupted at second 2, context for the next turn must only include those 2 seconds.

This requires precise timestamping of every word sent.
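
A sketch of that bookkeeping: record when each word is scheduled to play, and on barge-in rewind the dialogue history to what was actually heard. The structure is illustrative and not tied to any particular framework; halting TTS and clearing the buffer (steps 1 and 2) happen in the media layer:

```python
import time
from dataclasses import dataclass, field

@dataclass
class PlaybackLedger:
    # (word, timestamp at which its audio is scheduled to play on the client)
    words: list[tuple[str, float]] = field(default_factory=list)

    def on_word_sent(self, word: str, play_at_ts: float) -> None:
        self.words.append((word, play_at_ts))

    def heard_text(self, interrupt_ts: float) -> str:
        """Only the words whose playback started before the interruption."""
        return " ".join(w for w, ts in self.words if ts <= interrupt_ts)

def handle_barge_in(ledger: PlaybackLedger, history: list[dict]) -> None:
    """Step 3 (state rewind): keep only what the user actually heard."""
    interrupt_ts = time.time()
    history[-1] = {"role": "assistant", "content": ledger.heard_text(interrupt_ts)}
```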

Semantic Filtering

Not all speech during AI output is interruption. Users often backchannel ("uh-huh", "right") to signal agreement without wanting the floor.

Heuristic: If speech_duration < 500ms AND transcript in ["yes", "ok", "mhm"] → Continue speaking. Otherwise → Yield floor.
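
That heuristic, written out as a guard in the interruption handler; the word list, duration cutoff, and agent hook are illustrative:

```python
BACKCHANNELS = {"yes", "ok", "okay", "mhm", "uh-huh", "right", "yeah"}

def is_backchannel(transcript: str, speech_duration_ms: float) -> bool:
    return speech_duration_ms < 500 and transcript.strip().lower() in BACKCHANNELS

def on_user_speech_during_tts(transcript: str, duration_ms: float, agent) -> None:
    if is_backchannel(transcript, duration_ms):
        return                   # acknowledgement, not a turn claim: keep speaking
    agent.stop_speaking()        # hypothetical hook: yield the floor
```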


Backchannels

Backchannels are short, often non-lexical utterances ("mhm", "uh-huh") that signal attention without demanding a turn.

Handling Incoming Backchannels

Challenge: Don't mistake them for interruptions.

Solution: Semantic filter on VAD output. If transcript matches backchannel words and duration is short, suppress the interruption event.

Generating Backchannels

To sound human, the AI should occasionally backchannel while the user speaks at length. This is rare in production because it requires:

  • Extremely low latency
  • Precise timing (during brief pause or pitch valley)
  • Risk management (wrong timing sounds like interruption)

End-to-end models like Moshi generate backchannels organically by modeling the "listener" state as an active process.


Production Implementations

OpenAI Realtime API

Integrated ASR+LLM+TTS in a single stateful WebSocket.

  • VAD options: server_vad (energy/silence) or semantic_vad (model-based)
  • Interruption: Automatic truncation with response.cancel event
  • Configuration: prefix_padding_ms, silence_duration_ms, eagerness levels
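
A sketch of what those turn_detection settings might look like in a session configuration payload; the field names follow the parameters listed above, but the exact event schema varies by API version, so verify against the current Realtime API reference:

```python
# turn_detection payloads for a session update, built from the parameters above.
server_vad_config = {
    "turn_detection": {
        "type": "server_vad",
        "prefix_padding_ms": 300,     # pre-roll kept ahead of detected speech
        "silence_duration_ms": 500,   # hangover before the turn is closed
    }
}

semantic_vad_config = {
    "turn_detection": {
        "type": "semantic_vad",
        "eagerness": "high",          # how aggressively the model takes its turn
    }
}
```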

Deepgram Flux

Specialized for voice agents, decoupled from LLM.

  • UtteranceEnd: Semantic completeness, not just silence
  • Eager mode: Probabilistic early signal to start LLM 150-250ms before confirmation
  • Claims 200-600ms latency reduction

LiveKit Agents

Orchestration framework—bring your own models.

  • Plug in Deepgram (STT), OpenAI (LLM), ElevenLabs (TTS)
  • Silero VAD runs locally on agent server
  • Configurable min_endpointing_delay
  • Custom turn detector models supported
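
A rough sketch of wiring those pieces together in LiveKit Agents' plugin style; class names, parameters, and the worker scaffolding change across SDK versions, so every identifier here is an assumption to verify against the current docs:

```python
from livekit.agents import Agent, AgentSession, JobContext, WorkerOptions, cli
from livekit.plugins import deepgram, elevenlabs, openai, silero

async def entrypoint(ctx: JobContext):
    await ctx.connect()
    session = AgentSession(
        vad=silero.VAD.load(),          # local Silero VAD on the agent server
        stt=deepgram.STT(),
        llm=openai.LLM(),
        tts=elevenlabs.TTS(),
        min_endpointing_delay=0.5,      # seconds of grace before endpointing
    )
    await session.start(agent=Agent(instructions="You are a voice assistant."),
                        room=ctx.room)

if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))
```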

Hume EVI

Empathic Voice Interface with emotion-aware turn-taking.

  • Prosody model: Analyzes tone, rhythm, timbre
  • Adaptive behavior: Longer timeout for angry users (let them vent); quicker response to energetic users

The Frontier

End-to-End Speech Models

The traditional cascade (VAD → ASR → LLM → TTS) accumulates latency at every step. End-to-end models collapse the pipeline.

Moshi (Kyutai) and GPT-4o Audio: Input raw audio tokens, output raw audio tokens. No text transcription in the middle. Can learn non-verbal cues (breath intake, intonation) directly from audio. True full-duplex operation.

Predictive Turn-Taking

Voice Activity Projection predicts where turns will end, not where they did. Prepare responses proactively. Synthesize before the user has completely stopped.

Visual Turn Cues

For video-enabled agents, gaze direction is a stronger turn predictor than prosody in many contexts. Integrated audio-visual VADs significantly reduce false cut-offs by detecting whether the user is still "engaged" during silence.


The Key Takeaways

For Engineers Building Voice AI:

  1. Don't rely on silence alone. Semantic endpointing differentiates thinking pauses from completion pauses.
  2. Optimize for barge-in. WebRTC for client-side AEC. Implement state rewinding for telephony.
  3. Tune VAD dynamically. Short timeout for yes/no questions; long timeout for open-ended.
  4. Monitor latency distributions. A single 2-second delay breaks the illusion of presence.
  5. Use speculative generation. Trade compute for lower perceived latency.

The uncanny valley of voice AI is closing—not through better voices, but through better timing. Silence is a fundamentally flawed proxy for turn completion. The systems winning today understand not just what users say, but when they're done saying it.
