Speko Benchmarks · Blog

Field notes from the bench — what breaks when you measure the whole call, not the clip.

July 8, 2026

gpt-realtime Is the Fastest S2S Model. It Took a Full Conversation to See It.

We retired the old two-turn latency probe and rebuilt the Speech-to-Speech board as a single live ten-turn conversation — real audio, a tool call, a mid-utterance barge-in, and a late memory callback, graded on latency and LLM-judged behavior over n=15 sessions from us-east4. The ranking moved. gpt-realtime now leads on speed at 494ms p50; gpt-realtime-mini matches its quality for a third of the price; and gpt-realtime-2.1-mini — the model this post first crowned — is now mid-pack at ~1011ms. Latency was the old story. Instruction-following is the new one.

July 7, 2026

Pulse: Fast to the First Word, Fast to the Last One

We ran Smallest AI's Pulse through the Speko harness and wired it into the gateway. It's the rare streaming STT that's fast on both clocks a live agent actually feels — ~64 ms to the first partial, ~180 ms to the final — while holding 5.1% WER on English.

July 7, 2026

One Attribute From a Stranger: <prosody pitch> Swaps the Speaker on Speechify

Speechify simba-3.2 tops the board on clarity and steadiness. But reach for the one SSML tag everyone assumes is safe — <prosody pitch> — and you are no longer talking to the same person. A +15% nudge halves speaker similarity; ±40% is a different speaker. This isn't drift: the utterance is one second long. rate and volume don't do it. Only pitch does.

July 5, 2026

LiveKit Inference for Voice Agents: Fast First Token, and the Goodbye Most Models Skip

LiveKit shipped an inference host. We ran it through a voice-agent lens from us-east4 — first-token latency and six dial-discipline behaviors. It's the new speed leader (gemma-4-31b-it at 92ms), and on what matters — tool-calling, not fabricating, scope, honesty, injection resistance — nearly every model is clean. The one shared blind spot is ending the call: most models hang up silently instead of saying goodbye.

July 5, 2026

The Pause Is the Hard Part: How We Stopped Cutting Callers Off

The single thing that makes a voice agent feel robotic is getting cut off mid-thought. Here is exactly how Speko runs Smart Turn on live calls to decide when a caller is actually done — the threshold, the context-conditioning, the guards — and the measured proof that it works.

July 5, 2026

Everyone Measures the Clip. Nobody Measures the Call.

WER, MOS and time-to-first-token each score one component doing one thing once. The number that actually decides a voice agent is what one solved call costs — end to end, with the caller talking back.

June 22, 2026

Your Voice Agent Booked the Wrong Name. WER Said It Was 95% Accurate.

A voice agent only acts on a few words in any sentence — the ones that become tool-call arguments. We measured how often real speech-to-text corrupts them, why Word Error Rate hides it completely, and the verification layer that stops an agent from silently booking the wrong name.

June 19, 2026

Fast or Natural? Cartesia Sonic-3.5 Refuses to Pick.

In our TTS reliability probe, Cartesia Sonic-3.5 posted the lowest median first-audio of anything we measured — 126ms, network-free — while sitting at the top of the Artificial Analysis naturalness arena, above ElevenLabs v3. The rare part isn't the speed. It's that the speed comes with the quality. Here are the numbers, the caveats, and three clips so you can judge for yourself.

June 10, 2026

Your Voice Agent's LLM Speaks Spanish. That Doesn't Mean It Follows the Rules in Spanish.

Tool-calling and grounding are treated as language-agnostic — the model emits the same JSON whether the caller speaks English or Indonesian. The format is language-agnostic. The error rate is not. Published evidence and a preliminary look at where voice-agent LLMs quietly stop following the rules once you leave English.

June 8, 2026

How We Benchmark TTS: Gate, Profile, Rank

A single naturalness score hides the two failures that matter, so we don't publish one. Our TTS benchmark is three stages — a hard intelligibility gate, a multi-axis acoustic profile, and a nativeness ranking that now agrees with a native speaker at 0.99 Spearman. Here's the whole pipeline, including the stage that used to require a human.

June 8, 2026

Ranking TTS Nativeness with Content-Matched FAD

For years the only honest answer to 'which native-sounding voice is most native' was a human ear. This is the metric that changed that: a content-matched Fréchet Audio Distance over a commercial encoder ensemble that agrees with a native Thai speaker at 0.988 Spearman — the deep dive on Stage 3 of our TTS benchmark.

June 3, 2026

How Speko Benchmarks TTS: A Gate, Then a Profile

No single number ranks a synthetic voice, so we don't publish one. We run a two-stage pipeline — first prove the speech is intelligible with Whisper-large-v3, then profile how it actually sounds: harmonics, micro-stability, and prosody. Here's the method, in waveforms and spectrograms.

May 29, 2026

The audio context isn't really 128K tokens

It's about 5 minutes of conversation. Past that, the model accepts your audio and returns nothing. We pushed gpt-realtime-2 through a 60-turn English protocol; it crashed at turn 32, 5 minutes of accumulated assistant audio in. gpt-realtime ran the same prompts comfortably.

May 29, 2026

Semantic Was Supposed to Be the Smart One. It Lost.

OpenAI's Realtime API gives you two turn-detection modes. server_vad is the energy detector from the 1990s. semantic_vad is the word-aware classifier the docs recommend for natural conversation. We ran 158 trials against gpt-realtime-2 across both modes and seven stimulus types. server_vad with threshold 0.8 absorbs every quiet backchannel we threw at it. semantic_vad never absorbs anything. The recommended mode is worse, at every eagerness setting, in the only acoustic regime where the choice matters.

May 26, 2026

How Anglicized Is Your TTS? Measuring Phonological Authenticity Across 7 Providers

The anglicization index measures how often a TTS model substitutes a target-language phoneme with its nearest English neighbor. We applied it to 7 providers across 4 Southeast Asian languages and report the per-language rankings, the largest spread we measured, and what the numbers sound like.

May 25, 2026

The Cartesia Drift: 10% of Voices Hold the Line for a Minute. The Other 90% Don't.

Cartesia Sonic 3.5 ships 378 English voices. We sampled 50 and ran a long-form drift probe. Only 5 of them stay above the perceptual same-speaker threshold at 60 seconds. The default voice that the Speko gateway pins is not one of them.

May 23, 2026

Artificial Analysis Ranks Gemini 3.1 Flash TTS #2. We Asked It for Ten Minutes.

Google's Gemini 3.1 Flash TTS sits at #2 on the Artificial Analysis Speech Arena — Elo 1209, behind only Cartesia Sonic 3.5 (1218) and ahead of ElevenLabs Eleven v3 (1184). The Arena scores blind 30-second clips. We ran a ten-minute take. At length, the model ranked second is the only one of the three that breaks.

May 21, 2026

Vendors Say They Support 99 Languages. They Don't.

Voice-AI vendors compete on language counts the way phone makers used to compete on megapixels. The published benchmarks, vendor docs, and developer forums tell a quieter story: 'supported' is a marketing word, and the floor it hides is lower than anyone says out loud.

May 18, 2026

Speech-to-Speech Got Smart. It Still Can't Replace the Cascade.

Speech-to-speech models closed the reasoning gap with text LLMs in 2026. The gap that's left — observability, cost predictability, component swap — is the one that actually decides production architecture.

April 23, 2026

We Tried to Break Four Voice Agents with a Cough. We Failed.

We ran the same 400 ms cough into four production voice-agent stacks — OpenAI Realtime default and tuned, cascaded Deepgram Nova-3, cascaded ElevenLabs Scribe v2. Four for four absorbed it without yielding. Here are the clips, and the engineering that explains why.
