Ranking TTS Nativeness with Content-Matched FAD

For years the only honest answer to “which native-sounding voice is most native” was a human ear. This is the metric that changed that: a content-matched Fréchet Audio Distance over an ensemble of commercially licensed encoders that agrees with a native Thai speaker at 0.988 Spearman. What follows is the deep dive on Stage 3 of our TTS benchmark.

Our TTS benchmark runs in three stages — an intelligibility gate, an acoustic profile, and a nativeness ranking. The first two we can explain in signal terms: transcribe it back, measure the spectrum. The third was, for a long time, the stage where the math gave up and we handed the verdict to a native speaker. A profile can tell you a voice is intelligible, a little too clean, slightly flat — it cannot tell you which native-sounding engine is the most native. That requires an ear that grew up in the language.

This post is about the metric that finally automated that ear. It’s a content-matched Fréchet Audio Distance, and on Thai it reproduces a native speaker’s ranking at 0.988 Spearman.

Why nativeness resists the usual metrics

Start with why this is hard. Outside English, the failure that matters isn’t noise or clipping — it’s accent. A voice can be spectrally pristine and still read Thai with an English mouth: wrong tones, wrong vowel lengths, phonemes borrowed from the nearest English neighbor. Every signal-quality number you have will call that audio clean. A native listener calls it foreign in one syllable.

And the learned naturalness scorers don’t save you. They were trained on English read speech, and on other languages they don’t just miss, they point backwards: they reward the cleanest, most over-regular synthesis, which is exactly what a human flags as robotic. So the easy metrics are silent on accent, and the convenient ones are wrong about it. That’s why the ranking was human-only.

Fréchet Audio Distance, content-matched

Fréchet Audio Distance is the standard tool for “how close is generated audio to real audio.” The classic version fits a Gaussian to the embeddings of a generic pile of real speech and a Gaussian to your generated speech, and measures the Fréchet distance between them. We tried that first. It detects broken audio and is blind to accent — because a generic native distribution is so broad that a fluent foreign accent sits comfortably inside it.
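For reference, the classic distribution-to-distribution version can be sketched in a few lines. This is a minimal illustration rather than our pipeline: it assumes each clip has already been reduced to one embedding vector, and the function name `frechet_distance` is ours for this example.

```python
import numpy as np

def frechet_distance(real: np.ndarray, generated: np.ndarray) -> float:
    """Frechet distance between Gaussians fitted to two embedding sets.

    Both inputs are (n_clips, dim) arrays, one embedding per clip
    (assumed precomputed by whatever encoder you use).
    """
    mu_r, mu_g = real.mean(axis=0), generated.mean(axis=0)
    cov_r = np.atleast_2d(np.cov(real, rowvar=False))
    cov_g = np.atleast_2d(np.cov(generated, rowvar=False))

    # Tr((cov_r cov_g)^(1/2)) is the sum of the square roots of the
    # eigenvalues of the product. Clip tiny negative values from round-off.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * trace_sqrt)
```

Notice that nothing in this computation knows which sentence a clip contains. Both sides are pooled into a single Gaussian, which is why a broad reference set leaves room for a fluent foreign accent.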

The fix is content matching, and it’s the whole trick:

  1. Pull a fixed set of native sentences from FLEURS. Have every provider synthesize the exact same sentences a native speaker also recorded.
  2. For each sentence, embed the real native recordings and build a native centroid for that specific sentence.
  3. For each provider, embed its synthesis of that sentence and measure the cosine distance to the native centroid of the same sentence. Average over sentences. Lower is more native.
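The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions: embeddings are precomputed with one vector per clip, the dictionary layout keyed by sentence ID is hypothetical, and the function names are ours for this example.

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """1 - cosine similarity between two vectors."""
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def content_matched_distance(native: dict, synth: dict) -> float:
    """Mean cosine distance from each synthesized sentence to the
    native centroid of the same sentence. Lower is more native.

    native: sentence_id -> (n_recordings, dim) array of native embeddings
    synth:  sentence_id -> (dim,) embedding of one provider's synthesis
    """
    distances = []
    for sentence_id, native_embeddings in native.items():
        centroid = native_embeddings.mean(axis=0)   # step 2: per-sentence centroid
        distances.append(cosine_distance(synth[sentence_id], centroid))  # step 3
    return float(np.mean(distances))                # average over sentences
```

Run it once per provider and sort ascending to get that backbone's ranking.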

Matching the content removes what was said as a confound, so the residual distance is how it was said: tone, vowel length, phoneme realization. That single change turned a broken-audio detector into an accent ranker. It’s still FAD in spirit, a distance from generated speech to real native speech in an embedding space, but strictly it is a per-sentence cosine distance to a native centroid rather than a Fréchet distance between two fitted Gaussians: paired sentence-by-sentence instead of distribution-to-distribution.

The backbone is the whole ballgame

A distance is only as good as the space you measure it in. FAD’s embeddings have to be phonetically rich in the target language, or the distance can’t see accent at all. We benchmarked seven speech encoders against a native rater and the spread was enormous — the worst was barely better than chance, the best near-perfect. Two facts decided the shipped design:

  • The obvious choice fails. EnCodec — a neural audio codec, the natural pick if you think of FAD as an audio-quality metric — was the worst backbone for nativeness. Codec embeddings capture signal fidelity and almost no phonetics.
  • The best single model is unshippable. MMS-1B, pretrained on 1,000+ languages, was the most accurate — and CC-BY-NC, research-only.

So we ship a rank-averaged ensemble of two commercially-licensed encoders that, together, match the model we can’t use:

Backbone                 | License      | ρ vs native (Thai)
Whisper-large-v3 encoder | MIT          | 0.964
XLS-R-1B                 | Apache-2.0   | 0.988
Ensemble (rank-average)  | MIT + Apache | 0.988

The two encoders make different mistakes. Averaging their rankings — not their raw distances — cancels the errors and is more robust than either alone. (Rank-average beat z-score-average in every run; with eight providers, ranks shrug off the outlier distances that drag a mean around.)
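Rank-averaging is simple enough to show in full. The sketch below assumes each backbone has already produced one distance per provider, in the same provider order. The backbone keys and the numbers in the usage note are made up for illustration.

```python
import numpy as np

def average_ranks(values) -> np.ndarray:
    """Rank values ascending (1 = smallest); ties share their mean rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="stable")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks

def rank_average_ensemble(distances_by_backbone: dict) -> np.ndarray:
    """Convert each backbone's distances to ranks, then average the ranks.

    distances_by_backbone: backbone name -> distances, one per provider.
    Returns the mean rank per provider; lower is more native.
    """
    per_backbone = [average_ranks(d) for d in distances_by_backbone.values()]
    return np.mean(per_backbone, axis=0)
```

With hypothetical distances `{"a": [0.1, 0.5, 0.3], "b": [0.2, 9.0, 0.1]}`, the outlier 9.0 is simply "last place" for backbone `b`. It contributes a rank of 3 and nothing more, whereas averaging raw or z-scored distances would let its magnitude dominate.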

0.988 against a native ear

Here is the ensemble’s predicted order next to the native speaker’s scores on Thai:

Provider              | Native score | Model rank
MiniMax Speech 2.6 HD | 8.5          | 1
Grok TTS              | 8.5          | 2
ElevenLabs v3         | 8.0          | 3
GPT-4o Mini TTS       | 4.0          | 4
Inworld TTS-2         | 4.0          | 5
Qwen3 TTS Flash       | 2.0          | 6
Cartesia Sonic 3.5    | 3.0          | 7
Hume Octave 2         | 1.0          | 8

Spearman 0.988, holding to 0.981 under a leave-one-out jackknife. The native tiers reproduced exactly — the three native-sounding engines on top, the broken ones at the bottom — with a single adjacent swap between two providers the human scored a 2 and a 3. On Indonesian, with a different native rater, the same recipe lands at 0.897 and is more stable than any single backbone.
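For readers who want to run the same kind of check on their own ratings, here is a minimal sketch of Spearman correlation with a leave-one-out jackknife. It makes two assumptions of its own: tied scores get average ranks, and the jackknife returns every leave-one-out value rather than a single summary number. The function names are ours.

```python
import numpy as np

def average_ranks(values) -> np.ndarray:
    """Rank values ascending (1 = smallest); ties share their mean rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="stable")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman(x, y) -> float:
    """Spearman correlation: Pearson correlation of the average ranks."""
    return float(np.corrcoef(average_ranks(x), average_ranks(y))[0, 1])

def jackknife_spearman(x, y) -> list:
    """Recompute Spearman with each provider left out in turn."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    keep = np.ones(len(x), dtype=bool)
    results = []
    for i in range(len(x)):
        keep[i] = False
        results.append(spearman(x[keep], y[keep]))
        keep[i] = True
    return results
```

Human scores go up with nativeness while distances go down, so negate the distances before correlating, for example `spearman(human_scores, -np.asarray(distances))`. If any single left-out provider moves the correlation a lot, the headline number is leaning on that provider.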

One result we didn’t expect: on the real Thai ratings, the commercially licensed XLS-R-1B beat the unshippable MMS-1B, 0.988 to 0.964. The model we couldn’t legally use wasn’t even the most accurate one. The thing we can build and publish was the best in the room.

The discipline note

A 0.99 is exactly the kind of number that should make you suspicious, so here’s the unglamorous part. An early Thai run scored too well — and when we checked, the human-ratings file had been copy-pasted from Indonesian. The ground truth was wrong. We dropped in the real native ratings, re-ran, and the metric scored better against the corrected, harder data. We now version-control the human vectors per language and treat them like code, because a benchmark validated against the wrong answer key isn’t neutral — it’s a confident way to be wrong.
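A cheap guard against this specific mistake is to fingerprint each language’s ratings and refuse to run when two languages carry identical vectors. The sketch below is illustrative only: the data layout (language code to provider-score mapping) and the function names are hypothetical, not our actual tooling.

```python
import hashlib
import json

def ratings_fingerprint(ratings: dict) -> str:
    """Stable hash of one language's provider -> score mapping."""
    canonical = json.dumps(ratings, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def find_duplicate_ratings(ratings_by_language: dict) -> list:
    """Return (first_language, duplicate_language) pairs whose ratings
    are byte-for-byte identical after canonicalization."""
    seen = {}
    duplicates = []
    for language, ratings in ratings_by_language.items():
        fingerprint = ratings_fingerprint(ratings)
        if fingerprint in seen:
            duplicates.append((seen[fingerprint], language))
        else:
            seen[fingerprint] = language
    return duplicates
```

Two native raters in different languages producing identical scores for every provider is possible but unlikely, so a non-empty result is worth a human look before any correlation gets reported.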

What FAD still can’t hear

The metric ranks accent, and only accent. It cannot hear the thing the native rater flagged about the engine that scored an 8 instead of higher:

“Uniform energy and pace from start to finish, rhythmically predictable — experienced listeners will identify it as AI.”

A competitor scored higher for the opposite reason: “avoids monotony.” That distinction is prosodic, not phonetic — the phonemes are native, the rhythm gives it away. A distance metric living in phoneme-shaped embedding space is structurally blind to it; it will rate a metronome-perfect native accent as native. The tell lives in the distribution of phoneme durations — real speech varies its rhythm, generated speech at a fixed token rate flattens it — and that’s the next axis, the one that separates the eights from the eight-and-a-halves. Content-matched FAD got us an accent ranker we trust. It is not a naturalness ranker, and we won’t ship it as one.

The limits, stated

One native rater per language so far, across eight providers — a public ranking deserves two or three raters and an inter-rater agreement number before it’s final. Validated on Thai and Indonesian; other languages are a hypothesis until a native ear confirms them. And it depends on a clean native FLEURS reference, which varies in quality by language. Within those limits, it does something no automatic metric did before: it ranks the top of the board the way a native speaker does — and it tells you, honestly, the one thing it still can’t hear.