We Tried to Break Four Voice Agents with a Cough. We Failed.

We ran the same 400 ms cough into four production voice-agent stacks — OpenAI Realtime default and tuned, cascaded Deepgram Nova-3, cascaded ElevenLabs Scribe v2. Four for four absorbed it without yielding. Here are the clips, and the engineering that explains why.

The canonical voice-agent-is-broken anecdote goes like this: a user coughs, the agent thinks they barged in, it stops mid-sentence, apologizes, and the thread is lost. This is the story everyone tells at conferences. So we tried to reproduce it. We fed the same 400-millisecond cough, attenuated 6 dB and fired 1.2 seconds into the agent’s response, into four production stacks at once, in parallel LiveKit rooms, and recorded the listener’s side of each call.

Nothing broke. Four for four absorbed the cough and kept answering. Listen:

OpenAI Realtime — stock VAD

gpt-realtime with default server VAD (threshold 0.5, silence 500 ms). Cough fires ~1.2 s into the agent’s response. Result: the agent absorbs the cough and keeps speaking.

OpenAI Realtime — tuned VAD

Same model, VAD bumped to threshold 0.65, silence 800 ms. Result: no observable difference from stock at this stimulus level.

Cascaded — Deepgram Nova-3

STT: Deepgram Nova-3 · LLM: GPT-4o-mini · TTS: ElevenLabs Flash v2.5. Result: a cascaded stack with a different STT also holds together.

Cascaded — ElevenLabs Scribe v2

Same pipeline, STT swapped to ElevenLabs Scribe v2 Realtime. Result: four for four. Nothing broke at a 400 ms cough, −6 dB, 1.2 s into the response.

Same caller audio, same cough stimulus, same timing. Four different stacks. Zero false interrupts.
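For reference, the stock and tuned OpenAI Realtime runs differ only in two server VAD settings. The sketch below builds the two `session.update` payloads we mean. The helper name `server_vad_update` is ours, and the nesting of `turn_detection` inside the session object has differed between Realtime API versions, so treat the shape as an assumption and check the current API reference before sending it.

```python
import json


def server_vad_update(threshold: float, silence_duration_ms: int) -> str:
    """Build a session.update event that sets server VAD parameters.

    Assumption: turn_detection sits directly under "session". Some API
    versions nest it deeper, so verify against the current reference.
    """
    event = {
        "type": "session.update",
        "session": {
            "turn_detection": {
                "type": "server_vad",
                "threshold": threshold,
                "silence_duration_ms": silence_duration_ms,
            }
        },
    }
    return json.dumps(event)


# The two configurations from the clips above.
stock = server_vad_update(threshold=0.5, silence_duration_ms=500)
tuned = server_vad_update(threshold=0.65, silence_duration_ms=800)
```

Either string would be sent as a text frame on the Realtime connection once the session is open.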

This doesn’t mean the anecdote was never true. Our result is an artifact of how aggressively modern voice-agent platforms have engineered around it. The reason four stacks with very different internals all shrugged off the same cough is that every one of them ships with the same set of defenses: hysteresis on the voice-activity detector, an acoustic echo canceller with a double-talk detector, and a minimum-duration gate (a debounce) before an interrupt fires. Strip those out and false interrupts come back, whether from a cough or from the agent’s own echo. Leave them in, which everyone does now, and you have to work much harder than a single cough to cause a false interrupt.

The three defenses

VAD with hysteresis. The voice-activity detector doesn’t flip to “user is speaking” on a single frame. Silero (or OpenAI’s server VAD) requires ~80 ms of continuous high-confidence speech frames before declaring speech. A 400 ms cough is longer than that window, but it has the wrong envelope: a sharp attack and immediate decay, so the speech probability doesn’t stay above threshold for the full onset window. WebRTC’s 2011 GMM VAD would have fired on it. Modern DNN VADs don’t.
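The hysteresis logic itself is small. Here is a minimal sketch, assuming a VAD that emits one speech probability per 20 ms frame. The frame size, the class name `HysteresisGate`, and the example scores are our assumptions for illustration, not Silero’s or OpenAI’s actual internals.

```python
from dataclasses import dataclass, field


@dataclass
class HysteresisGate:
    """Flip state only after a sustained run of frames that disagree with it."""

    threshold: float = 0.5
    frame_ms: int = 20     # assumed frame duration
    onset_ms: int = 80     # speech must persist this long to flip to speaking
    offset_ms: int = 400   # silence must persist this long to flip back
    speaking: bool = False
    run_ms: int = field(default=0, init=False)

    def update(self, speech_prob: float) -> bool:
        is_speech = speech_prob >= self.threshold
        if is_speech == self.speaking:
            # Frame agrees with the current state: reset the run counter.
            self.run_ms = 0
            return self.speaking
        # Frame disagrees: extend the run and flip only once it is long enough.
        self.run_ms += self.frame_ms
        needed = self.offset_ms if self.speaking else self.onset_ms
        if self.run_ms >= needed:
            self.speaking = is_speech
            self.run_ms = 0
        return self.speaking


# Made-up scores: a cough spikes for two frames (40 ms), then decays.
cough_scores = [0.9, 0.8, 0.3, 0.2] + [0.1] * 16
gate = HysteresisGate()
cough_fired = any([gate.update(p) for p in cough_scores])
```

With these numbers the spike only lasts 40 ms above threshold, so `cough_fired` stays false, while four consecutive high-confidence frames would flip the gate.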

Acoustic echo cancellation with double-talk detection. The agent’s own TTS coming through the caller’s speaker and looping back through the mic would trigger a self-interrupt every few seconds without AEC. Every real voice-agent platform ships this on by default: WebRTC’s AEC3 in the browser, server-side AEC in LiveKit/Daily, proprietary implementations in Retell and Vapi. AEC did nothing to the cough itself, because the cough came from the mic, not the reference signal, so there was nothing to subtract. Its job is to keep the agent’s own voice from registering as a barge-in.
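To make “double-talk detection” concrete, here is a sketch of the classic Geigel detector, a textbook method that flags near-end audio whenever the mic sample is large relative to the recent peak of the reference signal. We are not claiming AEC3 or any vendor uses this exact rule. The window length and the 0.5 threshold are conventional illustrative values, and the class name is ours.

```python
from collections import deque


class GeigelDetector:
    """Flag near-end activity when |mic| exceeds threshold * recent max |ref|."""

    def __init__(self, window: int = 512, threshold: float = 0.5):
        # Recent absolute values of the reference (TTS) signal.
        self.ref_history = deque(maxlen=window)
        self.threshold = threshold

    def is_double_talk(self, mic_sample: float, ref_sample: float) -> bool:
        self.ref_history.append(abs(ref_sample))
        # Pure echo is an attenuated copy of the reference, so it stays
        # below the scaled peak. Extra near-end energy pushes it above.
        return abs(mic_sample) > self.threshold * max(self.ref_history)


detector = GeigelDetector(window=4)
reference = [1.0, -1.0, 1.0, -1.0]
# Echo only: the mic hears the reference at 30% amplitude.
echo_flags = [detector.is_double_talk(0.3 * r, r) for r in reference]
# Echo plus a near-end cough on top of it.
cough_flag = detector.is_double_talk(0.3 + 0.6, 1.0)
```

An echo canceller typically freezes its filter adaptation while the detector is raised, which is why near-end sounds like the cough pass through to the VAD untouched.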

Interrupt debounce before the TTS cancels. Even if the VAD fires, cascaded platforms wait for the STT to return a non-empty token for the caller’s audio before canceling TTS. That round-trip through STT plus the provider’s turn-taking model adds another 200–400 ms buffer. A cough transcribes to silence or garbage, so the interrupt doesn’t actually land.
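A minimal sketch of that debounce, assuming the pipeline delivers VAD start/end events and partial transcripts as separate callbacks. The class and method names are hypothetical, not any platform’s API.

```python
from dataclasses import dataclass


@dataclass
class InterruptDebouncer:
    """Confirm a barge-in only when STT produces text during a VAD segment."""

    pending: bool = False

    def on_vad_start(self) -> None:
        # The VAD thinks the caller started talking; do not cancel TTS yet.
        self.pending = True

    def on_vad_end(self) -> None:
        # Segment ended with no transcript: drop the candidate interrupt.
        self.pending = False

    def on_partial_transcript(self, text: str) -> bool:
        """Return True when TTS should be canceled."""
        if not self.pending or not text.strip():
            return False
        self.pending = False
        return True


debouncer = InterruptDebouncer()
debouncer.on_vad_start()
cough_cancels = debouncer.on_partial_transcript("")          # cough: empty text
debouncer.on_vad_end()
debouncer.on_vad_start()
speech_cancels = debouncer.on_partial_transcript("wait, stop")
```

The design choice here is that the VAD only nominates an interrupt and the transcript confirms it, at the cost of the extra STT latency described above.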

Each defense handles a different category of false positive. Remove the VAD hysteresis and every sharp noise triggers an interrupt. Remove the AEC and the agent interrupts itself. Remove the debounce and you get interrupts on throat-clears. Everyone building voice agents in 2026 ships all three, which is why you have to work much harder than a single cough to actually break the system.

The minimum viable stack

If you’re building on top of OpenAI Realtime, Gemini Live, Retell, Vapi, or any LiveKit-based stack, you inherit these defenses. If you’re rolling raw voice infrastructure, you need all of them yourself:

  • Silero VAD with ~80 ms onset hysteresis, ~400 ms offset hysteresis
  • AEC with double-talk detection — WebRTC AEC3 in browser, LiveKit/Daily server-side, or a telephony SDK at the media layer
  • Interrupt debounce: wait for a non-empty STT token before canceling TTS
  • TTS cancellation within one audio frame of the interrupt decision; preserve the unspoken tail in conversation state
  • Labeled replay fixture you run on every deploy: coughs, throat-clears, keyboard noise, TV audio, and deliberate interruptions by a confederate
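The replay fixture in the last item can start very small. Below is one possible shape, assuming you can wrap your stack in a function that plays a clip during an agent response and reports whether the agent yielded. The clip paths, the `ReplayCase` type, and the `agent_interrupted` callback are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class ReplayCase:
    name: str
    clip_path: str          # hypothetical path to a labeled audio clip
    should_interrupt: bool  # ground-truth label


def run_fixture(
    cases: List[ReplayCase],
    agent_interrupted: Callable[[str], bool],
) -> List[str]:
    """Replay each clip and return a description of every mismatch."""
    failures = []
    for case in cases:
        got = agent_interrupted(case.clip_path)
        if got != case.should_interrupt:
            kind = "false interrupt" if got else "missed interrupt"
            failures.append(f"{case.name}: {kind}")
    return failures


CASES = [
    ReplayCase("cough", "fixtures/cough_400ms.wav", False),
    ReplayCase("throat-clear", "fixtures/throat_clear.wav", False),
    ReplayCase("keyboard", "fixtures/keyboard.wav", False),
    ReplayCase("tv", "fixtures/tv_background.wav", False),
    ReplayCase("real barge-in", "fixtures/confederate.wav", True),
]
```

In a deploy check you would call `run_fixture(CASES, your_harness)` and fail the build if the returned list is non-empty.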

The signal chain:

Mic 16 kHz → AEC + double-talk detection → noise suppression → Silero VAD (80 / 400 ms)
  → STT → Agent state → TTS → Speaker

  · TTS feeds back into AEC as the reference signal
  · Agent interrupt cancels TTS within one audio frame, preserving the unspoken tail

AEC with double-talk detection removes echo before noise suppression, Silero VAD gates the mic, and the interrupt path cancels TTS within one audio frame while preserving the unspoken tail for the next turn.
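“Preserving the unspoken tail” comes down to knowing how far playback got when the cancel landed. A rough sketch, assuming the TTS provider returns word-level start times (many do, but the alignment format here is invented for illustration), and treating any word that has started playing as spoken:

```python
from typing import List, Tuple

# Each entry is (word, start time in ms relative to the start of playback).
Alignment = List[Tuple[str, int]]


def split_at_cancel(alignment: Alignment, cancel_ms: int) -> Tuple[str, str]:
    """Split a response into what the caller heard and what they did not.

    Assumption: a word counts as spoken once it has started playing.
    """
    spoken = [word for word, start in alignment if start < cancel_ms]
    unspoken = [word for word, start in alignment if start >= cancel_ms]
    return " ".join(spoken), " ".join(unspoken)


alignment = [
    ("Your", 0), ("order", 300), ("ships", 700),
    ("on", 1100), ("Friday", 1300), ("morning", 1700),
]
# Interrupt lands 1.2 s into the response.
heard, tail = split_at_cancel(alignment, cancel_ms=1200)
```

Storing `heard` as the agent’s actual turn and keeping `tail` in conversation state lets the next turn resume or rephrase instead of assuming the caller heard everything.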

We benchmark these behaviors across the major providers at speko.ai — latency, false-trigger rates, and recovery quality by use case. If you are shipping voice and getting burned by false interrupts, that is where to start.