Does it actually speak the language?
Before any voice-quality score means anything, the model has to actually speak the language. We transcribe each model's spoken reply with Whisper and check it is intelligible and in the target language — three realtime models across five Southeast Asian languages. Pick a tab for that language's gate; Thai also carries a tone-shape probe, and cross-lingual voice-hold is pending an S2S run. n=1 per cell — directional.
Thai is the only tonal language in this set — five lexical tones (mid, low, falling, high, rising) carry word-distinguishing pitch contours, and it is the hardest language here.
Nativeness floor — %V
Vocalic proportion (voiced duration ÷ speech duration). Syllable-timed Thai sits high; English-timed delivery compresses vowels and drops. A bar below the floor reads as non-native rhythm.
Tone-shape fidelity — per tone
Per-tone match between each provider's F0 contour (semitones above its own median) and the canonical Thai five-tone template. Five lexical tones; higher = closer to the canonical pitch trajectory in both shape and register.
Anglicization index
How often Thai-marked phones (aspirated pʰ tʰ kʰ, palatal ɲ) are detected as their plain English-like equivalents instead. Lower = more Thai-marked phones preserved.
The tone tab's per-tone radar — the model's pitch contour scored against
the canonical Thai five-tone template, per syllable — ports from the TTS
tone_shape analyzer and is
pending an S2S run. Until then the tab shows a "coming soon"
state; we do not synthesize per-tone numbers here.