TTS · Multilingual

Does it actually speak the language?

Per-language fidelity, gate-verified. We only publish a language once it has a deterministic intelligibility gate that excludes providers whose output isn't actually in-language — so no provider is ranked on a language we haven't confirmed it speaks. Thai (tonal: five-tone shape plus the %V nativeness floor) and Filipino (deterministic checks — intelligibility, pacing, hygiene) are live. Other languages return as their gates land.

Thai is the only tonal language in the set: five lexical tones (mid, low, falling, high, rising) carry word-distinguishing pitch contours. The intelligibility gate excludes any provider that falls back to English instead of speaking Thai.
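One way a deterministic "did it fall back to English?" check could look is sketched below. This is an assumption, not the published gate: it supposes each provider's audio has already been transcribed, and it measures how much of the transcript is written in Thai script (Unicode block U+0E00 to U+0E7F). The function names and the 0.9 threshold are hypothetical.

```python
def thai_script_ratio(text: str) -> float:
    """Fraction of letter-like characters that fall in the Thai Unicode block."""
    def is_thai(ch: str) -> bool:
        return "\u0e00" <= ch <= "\u0e7f"

    # Count alphabetic characters plus Thai combining marks (vowel and tone
    # signs), which str.isalpha() does not report as letters.
    letters = [ch for ch in text if ch.isalpha() or is_thai(ch)]
    if not letters:
        return 0.0
    return sum(1 for ch in letters if is_thai(ch)) / len(letters)


def passes_language_gate(transcript: str, min_ratio: float = 0.9) -> bool:
    """Hypothetical gate: exclude output whose transcript is mostly non-Thai."""
    return thai_script_ratio(transcript) >= min_ratio


print(passes_language_gate("สวัสดีครับ"))          # Thai greeting
print(passes_language_gate("Hello, how are you"))  # English fallback
```

A script-ratio check only catches wholesale fallback to another language; it says nothing about whether the Thai itself is intelligible.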

Nativeness floor — %V

validated · ρ = +0.70 vs native rating (n=11)

Vocalic proportion (voiced duration ÷ speech duration). Syllable-timed Thai sits high; stress-timed, English-style delivery compresses vowels, so %V drops. A bar below the floor reads as non-native rhythm.
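The ratio can be sketched as follows, assuming the audio has already been segmented into labelled intervals. The interval format, the label names ("voiced", "unvoiced", "sil") and the function name are assumptions for illustration; the actual floor value is not stated here, so none is hard-coded.

```python
from typing import Iterable, Tuple

Interval = Tuple[float, float, str]  # (start_s, end_s, label), hypothetical format


def vocalic_proportion(intervals: Iterable[Interval]) -> float:
    """%V = voiced duration / speech duration, with silence excluded from both."""
    voiced = 0.0
    speech = 0.0
    for start, end, label in intervals:
        duration = end - start
        if duration < 0:
            raise ValueError("interval ends before it starts")
        if label == "sil":
            continue  # silence counts toward neither total
        speech += duration
        if label == "voiced":
            voiced += duration
    if speech == 0:
        raise ValueError("no speech intervals found")
    return voiced / speech


example = [
    (0.0, 0.2, "sil"),
    (0.2, 0.5, "voiced"),
    (0.5, 0.6, "unvoiced"),
    (0.6, 1.0, "voiced"),
]
print(round(vocalic_proportion(example), 3))  # 0.7 s voiced / 0.8 s speech = 0.875
```

Comparing the result against the floor is then a single `pv < floor` check per provider.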

Tone-shape fidelity — per tone

diagnostic · does not rank

Per-tone match between each provider's F0 contour (semitones above its own median) and the canonical Thai five-tone template, scored separately for each of the five lexical tones. Higher = closer to the canonical pitch trajectory in both shape and register.
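A minimal sketch of such a per-tone score, under stated assumptions: F0 is converted to semitones relative to the speaker's own median, each syllable contour is resampled to the template length, and the distance is an RMSE mapped into (0, 1]. RMSE is used here because it penalizes both shape and register offsets; the actual scoring function is not specified. The template values below are illustrative placeholders, not the canonical Thai templates, and `scale_st` is a hypothetical parameter.

```python
import math
from statistics import median

# Placeholder five-point contours in semitones re: speaker median.
# Illustrative only; NOT the canonical templates used by the benchmark.
TEMPLATES = {
    "mid": [0.0, 0.0, 0.0, 0.0, 0.0],
    "low": [-1.0, -2.0, -3.0, -3.5, -4.0],
    "falling": [3.0, 4.0, 3.0, 0.0, -3.0],
    "high": [1.0, 1.5, 2.5, 3.5, 4.0],
    "rising": [-2.0, -3.0, -3.0, -1.0, 2.0],
}


def to_semitones(f0_hz, ref_hz):
    """Convert F0 values in Hz to semitones above the reference frequency."""
    return [12.0 * math.log2(f / ref_hz) for f in f0_hz]


def resample(contour, n):
    """Linearly interpolate a contour to n points (n >= 2)."""
    if len(contour) == 1:
        return [contour[0]] * n
    out = []
    for i in range(n):
        pos = i * (len(contour) - 1) / (n - 1)
        lo = int(math.floor(pos))
        hi = min(lo + 1, len(contour) - 1)
        frac = pos - lo
        out.append(contour[lo] * (1 - frac) + contour[hi] * frac)
    return out


def tone_match(syllable_st, template_st, scale_st=4.0):
    """Score in (0, 1]; 1.0 means the contour equals the template."""
    n = len(template_st)
    contour = resample(syllable_st, n)
    rmse = math.sqrt(sum((a - b) ** 2 for a, b in zip(contour, template_st)) / n)
    return 1.0 / (1.0 + rmse / scale_st)


utterance_f0 = [100.0, 120.0, 140.0]   # all voiced frames of one provider
ref = median(utterance_f0)             # the provider's own median F0
flat = to_semitones([120.0, 120.0, 120.0], ref)
print(tone_match(flat, TEMPLATES["mid"]))     # flat at the median matches "mid"
print(tone_match(flat, TEMPLATES["rising"]))  # lower score against "rising"
```

Because the distance is taken on absolute semitone values rather than on a correlation, a correctly shaped contour sung in the wrong register still loses score.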

Anglicization index

diagnostic · AI · inverts

How often Thai-marked phones (aspirated pʰ tʰ kʰ, palatal ɲ) are detected as their plain English-like equivalents instead. Lower = more Thai-marked phones preserved.
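As a rough sketch, the index can be read as a simple proportion over aligned phone pairs. The alignment format, the function name and the mapping from each Thai-marked phone to its plain counterpart (including ɲ to n) are assumptions made for illustration.

```python
# Thai-marked phone -> plain, English-like equivalent (assumed mapping).
PLAIN_EQUIVALENT = {"pʰ": "p", "tʰ": "t", "kʰ": "k", "ɲ": "n"}


def anglicization_index(pairs):
    """pairs: iterable of (expected_phone, detected_phone), already aligned.

    Returns the share of Thai-marked phones that were detected as their plain
    equivalent. Lower means more Thai-marked phones were preserved.
    """
    marked = 0
    flattened = 0
    for expected, detected in pairs:
        plain = PLAIN_EQUIVALENT.get(expected)
        if plain is None:
            continue  # not a Thai-marked phone, so it does not count
        marked += 1
        if detected == plain:
            flattened += 1
    if marked == 0:
        raise ValueError("no Thai-marked phones in the alignment")
    return flattened / marked


aligned = [("pʰ", "pʰ"), ("tʰ", "t"), ("k", "k"), ("ɲ", "n"), ("kʰ", "kʰ")]
print(anglicization_index(aligned))  # 2 of 4 marked phones flattened -> 0.5
```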

Signal artifacts — Thai

language-independent · diagnostic · not a ranking

The same detector code runs on every language: these metrics measure the engine/codec, not the words. A higher gap floor or gap HF means the silences aren’t silent (streaming/codec haze that reads as “crackle at word ends”). The table flags broken or degraded audio and release-to-release regressions; it does not rank quality (artifact level doesn’t predict the native score).

Provider	Gap floor (dBFS, ↓ = clean)	Gap HF (dBFS)	HNR (dB)	Flatness
ElevenLabs v3	-69.8	-88.1	16.8	0.010
Cartesia Sonic 3.5	-34.1	-58.0	14.9	0.009
Inworld TTS 2	-172.1	-172.1	15.0	0.136
xAI / Grok TTS	-45.5	-74.8	19.7	0.004
Hume Octave 2	-55.8	-79.6	13.6	0.012
GPT Realtime (s2s)	-43.4	-72.1	19.7	0.003
Qwen3 TTS Flash	-63.5	-83.0	20.5	0.016
GPT-4o mini TTS	-62.4	-96.9	16.9	0.032
MiniMax Speech 2.6 HD	-50.2	-67.1	19.7	0.014
Deepgram Aura 2	-82.7	-93.8	15.8	0.072
GPT Realtime v2 (s2s)	-48.8	-69.9	18.5	0.009
Cartesia Sonic 3	-53.2	-81.2	13.7	0.007

detectors: gap-haze · gap-HF · HNR · spectral flatness · clipping — identical code across all languages. click-rate omitted (ground-truthed as noise-floor/onset false alarms).
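Two of these measures have simple textbook definitions, sketched below for orientation. This is not the benchmark's detector code: it assumes samples are floats normalized to a full scale of 1.0, that the silent gaps have already been located, and that a power spectrum is supplied by the caller. The function names and the pooling of all gaps into one RMS value are assumptions.

```python
import math


def rms_dbfs(samples, silence_db=-200.0):
    """RMS level in dB relative to full scale (1.0); digital silence maps to silence_db."""
    if not samples:
        raise ValueError("no samples")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0:
        return silence_db
    return 10.0 * math.log10(mean_square)


def gap_floor_dbfs(samples, gaps, sample_rate):
    """Pooled RMS level inside the given (start_s, end_s) silence gaps."""
    pooled = []
    for start, end in gaps:
        pooled.extend(samples[int(start * sample_rate):int(end * sample_rate)])
    return rms_dbfs(pooled)


def spectral_flatness(power_spectrum):
    """Geometric mean / arithmetic mean of power bins: near 1 is noise-like, near 0 is tonal."""
    eps = 1e-20  # avoids log(0) on empty bins
    n = len(power_spectrum)
    log_mean = sum(math.log(p + eps) for p in power_spectrum) / n
    arithmetic_mean = sum(power_spectrum) / n
    return math.exp(log_mean) / (arithmetic_mean + eps)


sr = 1000
audio = [0.5] * 500 + [0.001] * 500            # loud half, then a hazy "silence"
print(round(gap_floor_dbfs(audio, [(0.5, 1.0)], sr), 1))  # amplitude 0.001 -> -60.0 dBFS
print(spectral_flatness([1.0, 1.0, 1.0, 1.0]))            # flat spectrum -> 1.0
```

A gap floor well above the level of true digital silence is what the description above calls silences that aren't silent.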