Does it actually speak the language?
Per-language fidelity, gate-verified. We only publish a language once it has a deterministic intelligibility gate that excludes providers whose output isn't actually in-language — so no provider is ranked on a language we haven't confirmed it speaks. Thai (tonal: five-tone shape plus the %V nativeness floor) and Filipino (deterministic checks — intelligibility, pacing, hygiene) are live. Other languages return as their gates land.
Thai is the only tonal language in the set: five lexical tones (mid, low, falling, high, rising) carry word-distinguishing pitch contours. The intelligibility gate excludes any provider that falls back to English instead of speaking Thai.
Nativeness floor — %V
Vocalic proportion (voiced duration ÷ speech duration). Syllable-timed Thai sits high; English-style, stress-timed delivery compresses vowels, so %V drops. A bar below the floor reads as non-native rhythm.
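As a rough illustration of the %V definition above, the sketch below divides voiced duration by speech duration over labeled intervals. The interval format, the function names, and the 45.0 floor in the usage lines are assumptions made for this example. The page does not publish the floor value or say how voicing is detected.

```python
from typing import List, Tuple

# (start_s, end_s, is_voiced): hypothetical output of a voicing detector
Interval = Tuple[float, float, bool]


def percent_v(intervals: List[Interval]) -> float:
    """Voiced duration as a percentage of total speech duration."""
    speech = sum(end - start for start, end, _ in intervals)
    if speech <= 0:
        raise ValueError("no speech to measure")
    voiced = sum(end - start for start, end, is_voiced in intervals if is_voiced)
    return 100.0 * voiced / speech


def passes_floor(intervals: List[Interval], floor_pct: float) -> bool:
    """True when %V is at or above the floor; anything below it fails."""
    return percent_v(intervals) >= floor_pct


if __name__ == "__main__":
    clip = [(0.0, 0.3, True), (0.3, 0.4, False), (0.4, 0.9, True), (0.9, 1.0, False)]
    print(round(percent_v(clip), 1))            # 80.0
    print(passes_floor(clip, floor_pct=45.0))   # 45.0 is a placeholder floor
```

Keeping the floor as a parameter rather than a constant reflects that the threshold is a per-language calibration choice, not something this sketch can supply.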
Tone-shape fidelity — per tone
Per-tone match between each provider's F0 contour (in semitones above its own median) and the canonical Thai five-tone template, scored separately for each of the five lexical tones. Higher = closer to the canonical pitch trajectory in both shape and register.
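The normalization described above (semitones relative to the speaker's own median F0) can be sketched as follows. The template values, the three-point resampling, and the `1 / (1 + RMSE)` score are all placeholders chosen for illustration. They are not the benchmark's canonical templates or its actual scoring function.

```python
import math
from typing import Dict, List

# Placeholder contours in semitones re: speaker median (start, middle, end).
# Illustrative only; not measured Thai data.
TEMPLATES: Dict[str, List[float]] = {
    "mid": [0.0, 0.0, 0.0],
    "low": [-2.0, -3.0, -4.0],
    "falling": [3.0, 4.0, -3.0],
    "high": [2.0, 3.0, 4.0],
    "rising": [-3.0, -4.0, 3.0],
}


def to_semitones(f0_hz: List[float], median_hz: float) -> List[float]:
    """Convert voiced F0 samples to semitones above the speaker's median."""
    return [12.0 * math.log2(f / median_hz) for f in f0_hz if f > 0]


def resample(contour: List[float], n: int) -> List[float]:
    """Linearly interpolate a contour to n points (n >= 2)."""
    if not contour or n < 2:
        raise ValueError("need a non-empty contour and n >= 2")
    out = []
    for i in range(n):
        pos = i * (len(contour) - 1) / (n - 1)
        lo = int(pos)
        hi = min(lo + 1, len(contour) - 1)
        frac = pos - lo
        out.append(contour[lo] * (1 - frac) + contour[hi] * frac)
    return out


def tone_match(contour_st: List[float], template_st: List[float]) -> float:
    """Score in (0, 1]; RMSE without mean removal penalizes shape and register."""
    pts = resample(contour_st, len(template_st))
    rmse = math.sqrt(sum((a - b) ** 2 for a, b in zip(pts, template_st)) / len(pts))
    return 1.0 / (1.0 + rmse)


def best_tone(contour_st: List[float]) -> str:
    """Name of the placeholder template that the contour matches most closely."""
    return max(TEMPLATES, key=lambda name: tone_match(contour_st, TEMPLATES[name]))
```

Because the error is computed without subtracting each contour's mean, a correctly shaped tone delivered in the wrong register still loses score, which is one simple way to reflect "both shape and register".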
Anglicization index
How often Thai-marked phones (aspirated pʰ tʰ kʰ, palatal ɲ) are detected as their plain English-like equivalents instead. Lower = more Thai-marked phones preserved.
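One way to picture the index is as a ratio over aligned phone pairs: of the Thai-marked phones the script expects, how many came back as the plain counterpart. The sketch below assumes a hypothetical aligner that yields `(expected, detected)` pairs. The aspirated-stop mappings follow the text, while mapping ɲ to n is an assumption of this example.

```python
from typing import Iterable, Tuple

# Marked phone -> plain counterpart. The ɲ -> n entry is an assumption.
PLAIN_EQUIVALENT = {"pʰ": "p", "tʰ": "t", "kʰ": "k", "ɲ": "n"}


def anglicization_index(aligned: Iterable[Tuple[str, str]]) -> float:
    """Fraction of expected marked phones detected as their plain counterpart."""
    marked = 0
    flattened = 0
    for expected, detected in aligned:
        plain = PLAIN_EQUIVALENT.get(expected)
        if plain is None:
            continue  # unmarked phones do not count toward the index
        marked += 1
        if detected == plain:
            flattened += 1
    if marked == 0:
        raise ValueError("no Thai-marked phones in the alignment")
    return flattened / marked


if __name__ == "__main__":
    pairs = [("pʰ", "p"), ("tʰ", "tʰ"), ("kʰ", "k"), ("ɲ", "ɲ"), ("a", "a")]
    print(anglicization_index(pairs))  # 0.5
```

Only substitutions toward the plain counterpart are counted here; other recognition errors on marked phones are treated as neither preserved nor anglicized, which is a design choice of this sketch.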
Signal artifacts — Thai
language-independent · diagnostic · not a ranking
The same detector code runs on every language; these measure the engine/codec, not the words. Higher gap floor / gap HF mean the silences aren’t silent (streaming/codec haze that reads as “crackle at word ends”). It flags broken or degraded audio and release-to-release regressions; it does not rank quality (artifact level doesn’t predict the native score).
| Provider | Gap floor (dBFS, lower = cleaner) | Gap HF (dBFS) | HNR (dB) | Flatness | Clip (%) |
|---|---|---|---|---|---|
| ElevenLabs v3 | -69.8 | -88.1 | 16.8 | 0.010 | 0.00 |
| Cartesia Sonic 3.5 | -34.1 | -58.0 | 14.9 | 0.009 | 0.00 |
| Inworld TTS 2 | -172.1 | -172.1 | 15.0 | 0.136 | 0.00 |
| xAI / Grok TTS | -45.5 | -74.8 | 19.7 | 0.004 | 0.00 |
| Hume Octave 2 | -55.8 | -79.6 | 13.6 | 0.012 | 0.00 |
| GPT Realtime (s2s) | -43.4 | -72.1 | 19.7 | 0.003 | 0.00 |
| Qwen3 TTS Flash | -63.5 | -83.0 | 20.5 | 0.016 | 0.00 |
| GPT-4o mini TTS | -62.4 | -96.9 | 16.9 | 0.032 | 0.00 |
| MiniMax Speech 2.6 HD | -50.2 | -67.1 | 19.7 | 0.014 | 0.00 |
| Deepgram Aura 2 | -82.7 | -93.8 | 15.8 | 0.072 | 0.00 |
| GPT Realtime v2 (s2s) | -48.8 | -69.9 | 18.5 | 0.009 | 0.00 |
| Cartesia Sonic 3 | -53.2 | -81.2 | 13.7 | 0.007 | 0.00 |
detectors: gap-haze · gap-HF · HNR · spectral flatness · clipping — identical code across all languages. click-rate omitted (ground-truthed as noise-floor/onset false alarms).
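To make two of the columns concrete, the sketch below shows one plausible reading of "gap floor" (RMS level inside known silences, in dBFS) and "clip %" (share of samples at full scale). It assumes float samples in [-1.0, 1.0] and gap boundaries supplied by some upstream alignment. The function names, the 1e-9 level clamp, and the 0.999 clipping threshold are assumptions, not the benchmark's detector code.

```python
import math
from typing import List, Sequence, Tuple


def rms_dbfs(samples: Sequence[float], eps: float = 1e-9) -> float:
    """RMS level in dBFS (full scale = 1.0), clamped so digital silence is finite."""
    if not samples:
        raise ValueError("no samples")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(max(rms, eps))


def gap_floor_dbfs(samples: Sequence[float], gaps: List[Tuple[float, float]],
                   sample_rate: int) -> float:
    """Level of the audio inside the given silence gaps (start_s, end_s)."""
    in_gaps: List[float] = []
    for start_s, end_s in gaps:
        lo = round(start_s * sample_rate)
        hi = round(end_s * sample_rate)
        in_gaps.extend(samples[lo:hi])
    return rms_dbfs(in_gaps)


def clip_percent(samples: Sequence[float], threshold: float = 0.999) -> float:
    """Percentage of samples whose magnitude reaches the clipping threshold."""
    if not samples:
        raise ValueError("no samples")
    clipped = sum(1 for s in samples if abs(s) >= threshold)
    return 100.0 * clipped / len(samples)


if __name__ == "__main__":
    # 1 s of "speech" with two clipped samples, then 1 s of low-level haze.
    audio = [0.5] * 98 + [1.0] * 2 + [0.001] * 100
    print(round(gap_floor_dbfs(audio, [(1.0, 2.0)], sample_rate=100), 1))  # -60.0
    print(clip_percent(audio))                                             # 1.0
```

Nothing in these functions depends on the words being spoken, which is the sense in which such detectors can run unchanged across languages.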