TTS · Multilingual

Does it actually speak the language?

Per-language fidelity, gate-verified. We only publish a language once it has a deterministic intelligibility gate that excludes providers whose output isn't actually in-language — so no provider is ranked on a language we haven't confirmed it speaks. Thai (tonal: five-tone shape plus the %V nativeness floor) and Filipino (deterministic checks — intelligibility, pacing, hygiene) are live. Other languages return as their gates land.

Thai is the only tonal language in the set: five lexical tones (mid, low, falling, high, rising) carry word-distinguishing pitch contours. The intelligibility gate excludes any provider that falls back to English instead of speaking Thai.

Nativeness floor — %V

validated · ρ = +0.70 vs native rating (n=11)

Vocalic proportion (voiced duration ÷ speech duration). Syllable-timed Thai sits high; English-timed delivery compresses vowels and drops. A bar below the floor reads as non-native rhythm.

Tone-shape fidelity — per tone

diagnostic · does not rank

Per-tone match between each provider's F0 contour (semitones above its own median) and the canonical Thai five-tone template. Five lexical tones; higher = closer to the canonical pitch trajectory in both shape and register.

Anglicization index

diagnostic · AI · inverts

How often Thai-marked phones (aspirated pʰ tʰ kʰ, palatal ɲ) are detected as their plain English-like equivalents instead. Lower = more Thai-marked phones preserved.

Signal artifacts — Thai

language-independentdiagnostic · not a ranking

The same detector code runs on every language — these measure the engine/codec, not the words. Higher gap floor / gap HF mean the silences aren’t silent (streaming/codec haze that reads as “crackle at word ends”). It flags broken or degraded audio and release-to-release regressions — it does not rank quality (artifact level doesn’t predict the native score).

ProviderGap floor
dBFS ↓=clean
Gap HF
dBFS
HNR
dB
FlatnessClip %
ElevenLabs v3-69.8-88.116.80.0100.00
Cartesia Sonic 3.5-34.1-58.014.90.0090.00
Inworld TTS 2-172.1-172.115.00.1360.00
xAI / Grok TTS-45.5-74.819.70.0040.00
Hume Octave 2-55.8-79.613.60.0120.00
GPT Realtimes2s-43.4-72.119.70.0030.00
Qwen3 TTS Flash-63.5-83.020.50.0160.00
GPT-4o mini TTS-62.4-96.916.90.0320.00
MiniMax Speech 2.6 HD-50.2-67.119.70.0140.00
Deepgram Aura 2-82.7-93.815.80.0720.00
GPT Realtime v2s2s-48.8-69.918.50.0090.00
Cartesia Sonic 3-53.2-81.213.70.0070.00

detectors: gap-haze · gap-HF · HNR · spectral flatness · clipping — identical code across all languages. click-rate omitted (ground-truthed as noise-floor/onset false alarms).