May 18, 2026

Speech-to-Speech Got Smart. It Still Can't Replace the Cascade.

Speech-to-speech models closed the reasoning gap with text LLMs in 2026. The gap that's left — observability, cost predictability, component swap — is the one that actually decides production architecture.

One model or three? Every voice-agent team hits this fork. The cascaded stack runs three models in sequence — speech-to-text, an LLM, then text-to-speech. A speech-to-speech (S2S) model collapses all three into one: audio in, audio out. S2S is faster, it sounds more human, and — the part that changed in 2026 — it is no longer dumber.

So we went looking for the reason not to switch. We found one. It just isn’t the reason most people would guess.

The reasoning gap closed — that’s the real news

For most of 2024 and 2025 the case against S2S was simple: it was dumber. Sending audio straight to audio skipped the text reasoning layer, and it showed on anything that required multi-step thinking.

That argument is now out of date. The top S2S models score 93–98% on Big Bench Audio, the standard audio-reasoning benchmark — level with frontier text LLMs. GPT-Realtime-2 ships GPT-5-class reasoning with adjustable effort. Grok Voice and Gemini’s native-audio models sit in the same band. If you are still telling people “S2S can’t reason,” you are working from old information.

Which raises the obvious question. If reasoning is solved, what is actually keeping cascaded alive?

The gap that didn’t close

Three things, and none of them are about intelligence.

Observability. A cascaded stack exposes exact text at every hop. You know precisely what the agent heard, what the LLM decided, and what it said, and you can inspect or block a response before the customer hears it. S2S is a black box. Audio never becomes text inside the pipeline, so the only transcript you get is one you generate after the call is over, by running STT on a recording. When an agent says something wrong, the first question every engineer and every compliance officer asks is “what did it actually say?” Cascaded answers instantly. S2S cannot. In healthcare, finance, and legal, that single fact ends the discussion.
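To make the inspect-or-block point concrete, here is a minimal Python sketch of one cascaded turn. Everything in it is hypothetical: `run_turn`, `TurnRecord`, and the stub `fake_stt`, `fake_llm`, `fake_tts`, and `no_medical_advice` callables stand in for whatever components and policy checks a real stack would use.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class TurnRecord:
    """Text captured at each hop of one cascaded turn (hypothetical structure)."""
    heard: str             # STT output: what the agent heard
    decided: str           # LLM output: what the agent wanted to say
    spoken: Optional[str]  # text sent on to TTS, or None if it was blocked


def run_turn(
    audio: bytes,
    stt: Callable[[bytes], str],
    llm: Callable[[str], str],
    tts: Callable[[str], bytes],
    is_allowed: Callable[[str], bool],
    log: List[TurnRecord],
) -> Optional[bytes]:
    """Run one turn, logging text at every hop and blocking disallowed replies."""
    heard = stt(audio)
    decided = llm(heard)
    if not is_allowed(decided):
        log.append(TurnRecord(heard, decided, None))
        return None  # the reply is dropped before any audio is synthesized
    log.append(TurnRecord(heard, decided, decided))
    return tts(decided)


# Stub components, for illustration only.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return "You should double your dosage." if "dose" in text else "Happy to help."


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


def no_medical_advice(text: str) -> bool:
    return "dosage" not in text.lower()


audit_log: List[TurnRecord] = []
blocked = run_turn(b"what dose should I take", fake_stt, fake_llm, fake_tts,
                   no_medical_advice, audit_log)
allowed = run_turn(b"hello", fake_stt, fake_llm, fake_tts,
                   no_medical_advice, audit_log)
```

The check sits between the LLM and TTS because that is the last point where the reply exists as text. An S2S pipeline, as described above, has no equivalent point to hook into.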

Cost predictability. Cascaded cost is flat per turn and scales linearly with call length — roughly $0.0095–$0.17 per minute depending on the stack. S2S cost is harder to forecast: the full audio context re-accumulates every turn, so cost grows super-linearly as the call gets longer. OpenAI Realtime runs about $0.18–$0.46 per minute uncached, dropping to $0.05–$0.10 with prompt caching.

Here is the shape of it. Take a cascaded stack at $0.05 a minute. A 2-minute call costs about $0.10; a 12-minute call costs about $0.60. Six times the minutes, six times the bill. Predictable. Now run those same two calls on S2S. The short one is cheap, because there is barely any conversation history to re-process. The long one is the trap: every turn feeds the audio of all earlier turns back into the model as input and pays for it again, so the fortieth exchange re-processes the previous thirty-nine. The back half of a long call costs far more per minute than the front half, so the total lands well above any straight-line estimate from the short call. That $0.18–$0.46 figure is an average hiding a rising curve: budget off a two-minute demo and a twelve-minute real call will surprise you.
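The two cost shapes can be sketched as a toy model in Python. Only the $0.05-per-minute cascaded rate comes from the example above. The S2S parameters (`turns_per_minute`, `new_audio_cost`, `history_cost_per_prior_turn`) are made-up illustration values, not any vendor's pricing, and the model ignores caching entirely.

```python
def cascaded_cost(minutes: float, rate_per_minute: float = 0.05) -> float:
    """Flat per-minute pricing: cost scales linearly with call length."""
    return minutes * rate_per_minute


def s2s_cost(
    minutes: float,
    turns_per_minute: int = 4,                  # hypothetical turn rate
    new_audio_cost: float = 0.01,               # hypothetical cost of each turn's new audio
    history_cost_per_prior_turn: float = 0.001, # hypothetical cost to re-process one earlier turn
) -> float:
    """Toy model: every turn pays for its own audio plus all earlier turns as input."""
    turns = int(minutes * turns_per_minute)
    total = 0.0
    for turn in range(1, turns + 1):
        prior_turns = turn - 1
        total += new_audio_cost + history_cost_per_prior_turn * prior_turns
    return total


for minutes in (2, 12):
    print(
        f"{minutes:>2} min  cascaded ${cascaded_cost(minutes):.2f}  "
        f"s2s (toy) ${s2s_cost(minutes):.2f}"
    )

cascaded_ratio = cascaded_cost(12) / cascaded_cost(2)
s2s_ratio = s2s_cost(12) / s2s_cost(2)
```

The history term sums to `turns * (turns - 1) / 2` re-processed turns, which is where the growth beyond linear comes from. Under any positive history cost, `s2s_ratio` comes out larger than the cascaded ratio of 6. How much larger depends entirely on the assumed parameters.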

For a high-volume support line, predictable beats cheap-on-a-good-day.

Component swap. Cascaded lets you pick the best STT, the best LLM, and the best TTS independently, and swap any one of them when something better ships. S2S is all-or-nothing: one vendor for the entire pipeline. The lock-in is real, and the field of S2S vendors is short.
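As a rough sketch of what independent swapping looks like in code, assume each stage is just a callable with a fixed input and output type. The `Cascade` class and the `stt_a`, `llm_a`, `tts_a`, and `tts_b` stubs below are hypothetical names for illustration, not any vendor's SDK.

```python
from dataclasses import dataclass, replace
from typing import Callable


@dataclass(frozen=True)
class Cascade:
    """Three independently replaceable stages (hypothetical structure)."""
    stt: Callable[[bytes], str]
    llm: Callable[[str], str]
    tts: Callable[[str], bytes]

    def respond(self, audio: bytes) -> bytes:
        # Audio -> text -> text -> audio, one stage at a time.
        return self.tts(self.llm(self.stt(audio)))


# Stub stages, for illustration only.
def stt_a(audio: bytes) -> str:
    return audio.decode("utf-8")


def llm_a(text: str) -> str:
    return f"echo: {text}"


def tts_a(text: str) -> bytes:
    return b"voice-a:" + text.encode("utf-8")


def tts_b(text: str) -> bytes:
    return b"voice-b:" + text.encode("utf-8")


stack = Cascade(stt=stt_a, llm=llm_a, tts=tts_a)

# Swap only the TTS stage; STT and LLM are carried over untouched.
upgraded = replace(stack, tts=tts_b)
```

The swap works because the stages only meet at plain text. An S2S model has no such seam, so there is nothing to replace short of the whole model.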

Where S2S genuinely wins

This is not a one-sided story. S2S wins three things outright.

Latency. S2S removes two network hops and two model invocations, and it shows: roughly 250–500 ms end-to-end versus 750–900 ms for a well-built streamed cascade. One honest note — vendors love to quote a 7× advantage, but they get there by comparing S2S against a non-streamed cascade, which is a strawman. A competent streamed cascade is sub-second. The real gap is closer to 2×. Still a clear win, just not the one on the slide.
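The arithmetic behind "closer to 2×" can be checked directly from the two ranges quoted above. Comparing midpoints is an assumption made here for illustration; the ranges themselves are the only inputs taken from the text.

```python
S2S_MS = (250, 500)      # end-to-end range quoted for S2S
CASCADE_MS = (750, 900)  # end-to-end range quoted for a streamed cascade


def midpoint(bounds: tuple) -> float:
    """Middle of a (low, high) range."""
    return sum(bounds) / 2


# Smallest gap: fastest cascade against slowest S2S.
smallest_ratio = CASCADE_MS[0] / S2S_MS[1]

# Largest gap: slowest cascade against fastest S2S.
largest_ratio = CASCADE_MS[1] / S2S_MS[0]

# Midpoint against midpoint (an assumed way to summarize the two ranges).
midpoint_ratio = midpoint(CASCADE_MS) / midpoint(S2S_MS)

print(f"ratio range: {smallest_ratio:.1f}x to {largest_ratio:.1f}x, "
      f"midpoints: {midpoint_ratio:.1f}x")
```

Even the largest ratio these ranges allow, 3.6×, is well short of 7×, and the midpoint comparison lands at about 2.2×.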

Naturalness. Because audio never round-trips through text, S2S preserves prosody, emphasis, timing, and emotion. A cascade loses all of that at the STT boundary — the LLM only ever sees flat text. For coaching, companionship, language tutoring, and premium concierge support, that is a genuine moat.

Barge-in. S2S handles interruptions natively through server-side duplex turn detection. A cascade can match it, but only by bolting on an explicitly engineered VAD and turn-detection layer — the machinery we pulled apart in our barge-in test. S2S gets it for free.
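For a sense of what "explicitly engineered" means on the cascade side, here is a deliberately simple Python sketch of energy-based barge-in detection. It is a toy: the threshold values, the `detect_barge_in` name, and the frame format are all assumptions, and production stacks typically use a trained VAD model rather than a raw energy threshold.

```python
import math
from typing import Optional, Sequence


def rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one frame of samples in [-1.0, 1.0]."""
    if not frame:
        return 0.0
    return math.sqrt(sum(sample * sample for sample in frame) / len(frame))


def detect_barge_in(
    frames: Sequence[Sequence[float]],
    agent_speaking: bool,
    energy_threshold: float = 0.1,  # hypothetical loudness cutoff
    min_speech_frames: int = 3,     # hypothetical debounce: frames in a row
) -> Optional[int]:
    """Return the index of the frame where a barge-in is confirmed, else None.

    A barge-in only counts while the agent is speaking, and only after
    `min_speech_frames` consecutive frames exceed the energy threshold, so
    that a single click or cough does not cut the agent off.
    """
    if not agent_speaking:
        return None
    consecutive = 0
    for index, frame in enumerate(frames):
        if rms(frame) >= energy_threshold:
            consecutive += 1
            if consecutive >= min_speech_frames:
                return index
        else:
            consecutive = 0  # silence resets the run
    return None


quiet = [0.0] * 160
loud = [0.5] * 160

# One isolated loud frame is ignored; three in a row trigger the interrupt.
caller_audio = [quiet, loud, quiet, loud, loud, loud]
interrupt_at = detect_barge_in(caller_audio, agent_speaking=True)
```

Once `detect_barge_in` fires, a real cascade still has to stop TTS playback, cancel the in-flight LLM response, and hand the floor back to STT. That coordination is the part S2S vendors handle on the server side.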

When each wins

Dimension	S2S	Cascaded
End-to-end latency	~250–500 ms	~750–900 ms streamed
Naturalness / prosody	Wins — paralinguistics preserved	Flattened to text at STT
Reasoning	At parity (93–98% Big Bench Audio)	At parity
Barge-in / turn-taking	Native duplex	Engineered VAD + turn detection
Observability	Black box — post-hoc transcript only	Exact text at every stage
Cost on long calls	Super-linear growth	Flat per turn
Component swap	Locked to one vendor	Any STT / LLM / TTS
Production maturity	<15% adoption, rising	The 2026 default

The verdict

No. S2S cannot yet replace the cascaded stack for production voice agents as a whole.

The 2024 reasoning objection is dead. The operational objections are not: observability and the compliance-grade auditability that depends on it, cost predictability on long calls, and component swap all still favor cascaded. The pattern winning in production right now is hybrid: S2S for the fast, simple turns, a cascaded path for the complex, regulated, or tool-heavy ones.
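The hybrid pattern reduces to a routing decision per turn or per call. The following Python sketch shows one way to express it. The `TurnContext` fields and the `choose_path` function are hypothetical, and how a real system would detect "complex" or "regulated" is left open.

```python
from dataclasses import dataclass


@dataclass
class TurnContext:
    """Signals a router might have about the next turn (hypothetical fields)."""
    regulated: bool = False        # e.g. the topic falls under compliance rules
    needs_tools: bool = False      # the turn requires tool or API calls
    complex_request: bool = False  # multi-step or high-stakes request


def choose_path(ctx: TurnContext) -> str:
    """Send anything complex, regulated, or tool-heavy down the cascaded path."""
    if ctx.regulated or ctx.needs_tools or ctx.complex_request:
        return "cascaded"
    return "s2s"


small_talk = choose_path(TurnContext())
refund = choose_path(TurnContext(needs_tools=True))
medical = choose_path(TurnContext(regulated=True, complex_request=True))
```

Defaulting to S2S and escalating to the cascade keeps the fast path fast. A stricter deployment could invert the default and send a turn to S2S only when it is known to be safe.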

Full replacement is a 12-to-24-month question. And what gates it is not model quality anymore. It is tooling — evaluation, debugging, and compliance infrastructure for a pipeline you cannot currently see inside.

We benchmark both architectures at speko.ai — latency, cost, and reliability by use case. If you are choosing between one model and three, start there.