How We Benchmark TTS: Gate, Profile, Rank
A single naturalness score hides the two failures that matter, so we don't publish one. Our TTS benchmark is three stages — a hard intelligibility gate, a multi-axis acoustic profile, and a nativeness ranking that now agrees with a native speaker at 0.99 Spearman. Here's the whole pipeline, including the stage that used to require a human.
Every TTS leaderboard wants to hand you one number: a MOS, a naturalness percentage, a star rating. One digit fits on a landing page. It also hides the two failures that actually matter: a voice can sound buttery smooth and say the wrong words, or be perfectly accurate and sound like a 2009 GPS unit. One score can’t tell those apart, which means it can’t tell you which voice to ship.
So we don’t publish one. Our TTS benchmark is a pipeline with three stages — gate, profile, rank — and each one answers a question the previous can’t. The first two we’ve written about before; this post walks the whole thing end to end, because the third stage finally closed the gap that used to need a human in the loop.
Stage 1: the intelligibility gate
Before we measure quality, we measure correctness, because an acoustic score computed on speech that says the wrong thing is garbage with a decimal point.
The gate is a round-trip. We synthesize a fixed prompt set, transcribe the audio back with Whisper-large-v3, compute character error rate against the reference, and check the language Whisper detected. A provider clears the gate only if its CER is low and the audio is actually in the target language.
That second condition catches the trick vendors quietly play in languages they “support”: the model falls back to an English voice reading the foreign text phonetically. Fluent, confident, completely wrong. In our Thai set, the engine with the cleanest signal on the board scored a zero from a native rater — it wasn’t speaking Thai at all, and the gate caught it. Anything that fails here is excluded from every downstream chart. We never score an English-fallback signal against a non-English reference and call the noise a result.
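A minimal sketch of the two gate conditions, assuming the transcript and detected language have already come back from the recognizer. The function names, the `GateResult` type, and the `max_cer=0.10` threshold are illustrative placeholders, not values from the pipeline described here.

```python
from dataclasses import dataclass


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference char
                            curr[j - 1] + 1,      # insert a hypothesis char
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


@dataclass
class GateResult:
    cer: float
    language_ok: bool
    passed: bool


def intelligibility_gate(reference, transcript, detected_lang, target_lang,
                         max_cer=0.10):  # hypothetical threshold
    """Pass only if the audio is in the target language AND the CER is low."""
    score = cer(reference, transcript)
    language_ok = detected_lang == target_lang
    return GateResult(score, language_ok, language_ok and score <= max_cer)
```

Keeping the language check separate from the CER matters: an utterance with a clean transcript still fails if the detected language is not the target one.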
Stage 2: the acoustic profile
Voices that clear the gate get profiled — a multi-axis description of how the audio actually behaves, extracted from the raw signal.
The counter-intuitive one is over-cleanliness. “Robotic” usually means too little variation, not too much: real vocal folds are never perfectly periodic, so strip out the micro-variation and you land in the uncanny valley. We measure jitter, shimmer, and harmonics-to-noise ratio against a natural band recomputed from 50 utterances across 24 human audiobook speakers. A voice whose jitter and shimmer sit below the human band, or whose harmonics-to-noise ratio sits above it, isn’t better; it’s too clean to sound alive. Prosody is the axis listeners feel first: we track the F0 contour and quantify how much the pitch actually moves. For tonal languages we add %V, the vocalic proportion, validated against native ratings at ρ = +0.70. It acts as a rhythm floor, below which the language is being worn as a costume.
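To make the band idea concrete, here is a small sketch using the common "local" definitions of jitter and shimmer (mean absolute cycle-to-cycle difference divided by the mean) and %V as the vocalic share of total duration. It assumes glottal periods, per-cycle amplitudes, and vowel/consonant intervals have already been extracted by some upstream tool. The band limits passed to `band_position` are placeholders, not the human band described above.

```python
def local_perturbation(values):
    """Mean absolute difference between consecutive values, relative to their mean.

    On glottal periods this is local jitter; on per-cycle peak
    amplitudes it is local shimmer.
    """
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return (sum(diffs) / len(diffs)) / (sum(values) / len(values))


def percent_v(intervals):
    """%V: percentage of total duration spent in vocalic intervals.

    `intervals` is a list of (label, duration_seconds) with label "V" or "C".
    """
    total = sum(duration for _, duration in intervals)
    if total <= 0:
        raise ValueError("total duration must be positive")
    vocalic = sum(duration for label, duration in intervals if label == "V")
    return 100.0 * vocalic / total


def band_position(value, low, high):
    """Where a measurement sits relative to a reference band."""
    if value < low:
        return "below"
    if value > high:
        return "above"
    return "within"


# Hypothetical periods (seconds) for a perfectly regular synthetic voice.
periods = [0.0100, 0.0100, 0.0100, 0.0100]
jitter = local_perturbation(periods)
# Placeholder band limits, chosen only for illustration.
print(band_position(jitter, low=0.003, high=0.010))
```

A perfectly periodic signal has zero jitter, so it lands below any non-trivial band; that is the "too clean" case rather than a win.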
One hard-won rule governs this whole stage: these axes profile, they do not rank the top. We checked. The pure signal-quality detectors — codec haze, spectral flatness, clipping — anti-correlate with native quality ratings: the cleanest, most over-regular audio is what a human flags as synthetic. So we ship them as a defect alarm, never a quality score. Where the math can’t crown a winner, we used to say so and hand the verdict to a native listener.
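One way to keep those detectors in the "alarm, not score" role is to have them return a list of tripped flags and nothing that could be sorted into a leaderboard. The sketch below is an assumption-laden illustration: it covers only clipping and spectral flatness, takes a precomputed power spectrum as input, and uses made-up thresholds.

```python
import math


def clipping_ratio(samples, full_scale=1.0, tolerance=1e-3):
    """Fraction of samples at or within `tolerance` of full scale."""
    if not samples:
        raise ValueError("samples must be non-empty")
    limit = full_scale - tolerance
    return sum(1 for s in samples if abs(s) >= limit) / len(samples)


def spectral_flatness(power_spectrum, eps=1e-12):
    """Geometric mean over arithmetic mean of a power spectrum.

    Close to 1 for noise-like spectra, close to 0 for peaky, tonal ones.
    """
    if not power_spectrum:
        raise ValueError("power_spectrum must be non-empty")
    values = [p + eps for p in power_spectrum]
    log_mean = sum(math.log(v) for v in values) / len(values)
    return math.exp(log_mean) / (sum(values) / len(values))


def defect_alarms(samples, power_spectrum,
                  max_clipping=0.001,   # hypothetical threshold
                  max_flatness=0.5):    # hypothetical threshold
    """Return the names of tripped alarms; deliberately no numeric score."""
    alarms = []
    if clipping_ratio(samples) > max_clipping:
        alarms.append("clipping")
    if spectral_flatness(power_spectrum) > max_flatness:
        alarms.append("noise_like_spectrum")
    return alarms
```

An empty list means "no defect detected", which is a weaker statement than "good audio", and the return type makes that hard to forget.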
Stage 3: ranking nativeness
That hand-off was the missing piece. Profiling tells you a voice is intelligible, a little too clean, slightly flat — it does not tell you which native-sounding engine is most native. For a long time the only honest answer to that was a human ear, which doesn’t scale to eight providers across eight languages.
Stage 3 is the metric that closed it. Nativeness isn’t a property of a clip on its own — it’s a distance to how a real native says the same sentence. We have every provider synthesize the same FLEURS sentences a native speaker also recorded, embed both in a speech-encoder space, and measure each provider’s distance to the native version of that exact sentence. Match the content, and the residual is accent and tone, not vocabulary.
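The content-matched comparison can be sketched as follows. The post does not say which distance or pooling is used, so cosine distance over one fixed-size vector per clip, averaged across sentences, is an assumption made for illustration. The embeddings are expected to come from a speech encoder upstream; here they are plain lists of floats keyed by sentence id.

```python
import math


def cosine_distance(a, b):
    """1 minus cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("zero-norm embedding")
    return 1.0 - dot / (norm_a * norm_b)


def nativeness_distance(provider_embeddings, native_embeddings):
    """Mean distance to the native recording of the SAME sentence.

    Both arguments map sentence id -> embedding. Only sentences present
    on both sides are compared, so content is always matched.
    """
    shared = sorted(set(provider_embeddings) & set(native_embeddings))
    if not shared:
        raise ValueError("no shared sentences to compare")
    total = sum(cosine_distance(provider_embeddings[s], native_embeddings[s])
                for s in shared)
    return total / len(shared)


def rank_providers(embeddings_by_provider, native_embeddings):
    """Provider names ordered from closest to native to farthest."""
    distances = {name: nativeness_distance(emb, native_embeddings)
                 for name, emb in embeddings_by_provider.items()}
    return sorted(distances, key=distances.get)
```

Because each clip is compared only with the native reading of the same sentence, differences in wording never enter the distance.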
The whole thing turns on the encoder. We settled on a rank-averaged ensemble of two backbones whose licenses permit commercial use:
| Backbone | License | ρ vs native (Thai) |
|---|---|---|
| Whisper-large-v3 | MIT | 0.964 |
| XLS-R-1B | Apache-2.0 | 0.988 |
| Ensemble (rank-average) | MIT + Apache | 0.988 |
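For readers unfamiliar with rank averaging, a minimal sketch: each encoder's distances are converted to ranks, and the ranks (not the raw distances, which live on different scales) are averaged per provider. The function names and the alphabetical tie-break are illustrative choices, and the example distances are invented.

```python
def ranks_from_distances(distances):
    """Map provider -> rank (1 = closest to native); ties share a mean rank."""
    ordered = sorted(distances.items(), key=lambda kv: kv[1])
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over any run of identical distances.
        while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[ordered[k][0]] = mean_rank
        i = j + 1
    return ranks


def rank_average(*per_encoder_distances):
    """Average each provider's rank across encoders, best first."""
    all_ranks = [ranks_from_distances(d) for d in per_encoder_distances]
    providers = set(all_ranks[0])
    if any(set(r) != providers for r in all_ranks):
        raise ValueError("every encoder must score the same providers")
    mean_rank = {p: sum(r[p] for r in all_ranks) / len(all_ranks)
                 for p in providers}
    # Ties in mean rank are broken alphabetically, purely for determinism.
    return sorted(mean_rank.items(), key=lambda kv: (kv[1], kv[0]))


# Invented distances for four hypothetical providers under two encoders.
encoder_a = {"a": 0.1, "b": 0.2, "c": 0.3, "d": 0.4}
encoder_b = {"a": 0.1, "c": 0.2, "b": 0.3, "d": 0.4}
print(rank_average(encoder_a, encoder_b))
```

Working in rank space means neither encoder can dominate the ensemble just because its distances happen to be numerically larger.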
The two encoders make different mistakes; averaging their rankings cancels them out, and the pair is fully shippable. Here’s the ensemble’s predicted order next to the native speaker’s scores on Thai:
| Provider | Native score | Model rank |
|---|---|---|
| MiniMax Speech 2.6 HD | 8.5 | 1 |
| Grok TTS | 8.5 | 2 |
| ElevenLabs v3 | 8.0 | 3 |
| GPT-4o Mini TTS | 4.0 | 4 |
| Inworld TTS-2 | 4.0 | 5 |
| Qwen3 TTS Flash | 2.0 | 6 |
| Cartesia Sonic 3.5 | 3.0 | 7 |
| Hume Octave 2 | 1.0 | 8 |
Spearman 0.988, stable to 0.981 under a leave-one-out jackknife. The native tiers are reproduced, apart from one adjacent swap at the bottom between the providers scored a 2 and a 3. On Indonesian, scored by a different native rater, the same recipe lands at 0.897. For the first time we have an automated signal that ranks the top of the board, not just eliminates the bottom.
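For reference, a dependency-free sketch of Spearman correlation with tie handling (Pearson correlation on average ranks) and a leave-one-out jackknife. The input numbers in the example are invented and do not correspond to any table in this post. Reading "stable to" as the minimum over the leave-one-out runs is an assumption.

```python
import math


def average_ranks(values):
    """Rank values ascending from 1; tied values receive their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rho: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def jackknife_spearman(x, y):
    """Spearman rho recomputed with each item left out in turn."""
    return [spearman(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
            for i in range(len(x))]


# Invented data: human scores (higher is better) and negated model
# distances, so that "higher is better" holds on both sides.
human = [9.0, 7.0, 7.0, 4.0, 2.0]
model = [-0.10, -0.20, -0.25, -0.50, -0.40]
rho = spearman(human, model)
worst_case = min(jackknife_spearman(human, model))
```

With only a handful of providers, a single item can move the coefficient noticeably, which is why the leave-one-out spread is worth reporting alongside the headline value.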
The fairness axis
A benchmark that only measures quality misses how a model behaves when conditions get ugly. We run robustness as its own axis, and it’s where the most surprising result came from: an STT model that returned empty transcripts on every quiet female clip and on none of the loud male ones. It looked like bias; it was an input-gain threshold. The same discipline applies to TTS: we test the conditions a landing-page demo never shows.
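One way to tell a level effect from a speaker effect is to replay the same clips at several gains and watch where the empty transcripts appear. The sketch below assumes audio as floats in [-1, 1] and takes the recognizer as a callable. `toy_transcribe` and its -40 dBFS cutoff are stand-ins invented for the example, not a description of any real model.

```python
import math


def rms_dbfs(samples):
    """RMS level in dB relative to full scale (1.0)."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return -math.inf if rms == 0 else 20 * math.log10(rms)


def apply_gain_db(samples, gain_db):
    """Scale samples by a gain in dB, hard-limiting to [-1, 1]."""
    factor = 10 ** (gain_db / 20)
    return [max(-1.0, min(1.0, s * factor)) for s in samples]


def empty_rate_by_gain(clips, transcribe, gains_db):
    """For each gain, the fraction of clips whose transcript comes back empty."""
    rates = {}
    for gain in gains_db:
        empties = sum(1 for clip in clips
                      if not transcribe(apply_gain_db(clip, gain)).strip())
        rates[gain] = empties / len(clips)
    return rates


def toy_transcribe(samples):
    """Hypothetical recognizer that drops anything quieter than -40 dBFS."""
    return "" if rms_dbfs(samples) < -40 else "ok"


# A quiet synthetic clip at roughly -60 dBFS.
quiet_clip = [0.001, -0.001] * 50
print(empty_rate_by_gain([quiet_clip], toy_transcribe, gains_db=[0, 30]))
```

If the empty transcripts vanish once the gain is raised, the failure tracks input level rather than who is speaking.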
The honest caveats
Two, and they’re load-bearing. The nativeness ranking rests on one native rater per language so far, across eight providers; a public ranking deserves two or three raters and an inter-rater agreement number before anyone treats it as final. And the metric ranks accent, not prosody. It can’t yet hear what our rater flagged on the top engine:
“Uniform energy and pace from start to finish, rhythmically predictable — experienced listeners will identify it as AI.”
That distinction is rhythmic, not phonetic, and it’s the next axis (the distribution of phoneme durations) — the thing that separates the eights from the eight-and-a-halves. We won’t pretend the accent ranker is a naturalness ranker.
What this is really about
Gate, then profile, then rank. A single score tells you a voice is “8.4.” This pipeline tells you it’s intelligible, a touch too clean to feel alive, slightly flat in prosody, native in accent but rhythmically a little robotic — and shows you the gate result, the spectrum, and the ranking that prove each claim. The first is a marketing asset. The second is something you can build a product on.
No single number ranks a voice. The work is admitting that, and then measuring everything the easy number skips.