S2S · Multilingual

Does it actually speak the language?

Before any voice-quality score means anything, the model has to speak the target language. We transcribe each model's spoken reply with Whisper and check that it is intelligible and in the target language, for three realtime models across five Southeast Asian languages. Pick a tab to see that language's gate. Thai also carries a tone-shape probe; cross-lingual voice-hold is pending an S2S run. Each cell is a single sample (n=1), so read the results as directional.
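
A minimal sketch of the gate decision, assuming the transcription step has already returned a transcript and a detected language code (for example, from Whisper). The function name `passes_language_gate`, the `min_chars` threshold, and the use of transcript length as a stand-in for intelligibility are illustrative assumptions, not the rule this page actually applies.

```python
def passes_language_gate(transcript: str, detected_lang: str,
                         target_lang: str, min_chars: int = 1) -> bool:
    """Return True if the reply looks like usable speech in the target language.

    Assumptions (hypothetical, for illustration only):
    - detected_lang and target_lang are short language codes such as "th".
    - "Intelligible" is approximated by a non-trivial transcript length.
    """
    # Drop all whitespace so a transcript of spaces or newlines counts as empty.
    text = "".join(transcript.split())
    same_language = detected_lang.strip().lower() == target_lang.strip().lower()
    return same_language and len(text) >= min_chars


# A Thai reply detected as Thai passes; the same gate fails an English reply.
thai_ok = passes_language_gate("สวัสดีครับ", "th", "th")
english_reply = passes_language_gate("Hello there", "en", "th")
empty_reply = passes_language_gate("   ", "th", "th")
```

With n=1 per cell, a single pass or fail from a check like this is a direction, not a rate.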

Thai is the only tonal language in this set: its five lexical tones (mid, low, falling, high, rising) distinguish words by pitch contour, and it is the hardest language here.

Nativeness floor — %V

validated · ρ = +0.70 vs native rating (n=11)

Vocalic proportion: voiced duration ÷ speech duration. Syllable-timed Thai sits high; delivery with English-like stress timing compresses vowels, so %V drops. A bar below the floor reads as non-native rhythm.
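
The ratio can be sketched as follows, assuming the speech has already been segmented into labelled intervals with pauses excluded. The interval format and the function name `vocalic_proportion` are hypothetical; the floor value is not restated here and would be supplied by the caller.

```python
from typing import Iterable, Tuple

# (start_seconds, end_seconds, is_voiced); pauses are assumed to be excluded.
Interval = Tuple[float, float, bool]


def vocalic_proportion(intervals: Iterable[Interval]) -> float:
    """Return voiced duration divided by total speech duration (0.0 to 1.0)."""
    voiced = 0.0
    speech = 0.0
    for start, end, is_voiced in intervals:
        if end < start:
            raise ValueError("interval ends before it starts")
        duration = end - start
        speech += duration
        if is_voiced:
            voiced += duration
    if speech == 0.0:
        raise ValueError("no speech in the input")
    return voiced / speech


# One second of speech with 0.8 s voiced.
example = [(0.0, 0.1, False), (0.1, 0.4, True), (0.4, 0.5, False), (0.5, 1.0, True)]
pv = vocalic_proportion(example)
```

Comparing `pv` against the floor is then a single inequality; the sketch deliberately leaves the threshold out.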

Tone-shape fidelity — per tone

diagnostic · does not rank

Per-tone match between each provider's F0 contour (in semitones above its own median) and the canonical Thai five-tone template, one score for each of the five lexical tones. Higher = closer to the canonical pitch trajectory in both shape and register.
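
A rough sketch of the normalization and comparison, under stated assumptions. Converting Hz to semitones relative to a reference is standard (12 · log2(f / ref)); everything else here is illustrative: the function names, the linear resampling, and the 1 / (1 + RMSE) score are stand-ins, not the tone_shape analyzer's actual scoring. The template is passed in by the caller, and no canonical template values are reproduced.

```python
import math
from typing import List, Sequence


def hz_to_semitones(f0_hz: Sequence[float], ref_hz: float) -> List[float]:
    """Express voiced F0 samples in semitones relative to ref_hz."""
    if ref_hz <= 0 or any(f <= 0 for f in f0_hz):
        raise ValueError("F0 values must be positive (voiced frames only)")
    return [12.0 * math.log2(f / ref_hz) for f in f0_hz]


def resample(values: Sequence[float], n: int) -> List[float]:
    """Linearly interpolate values onto n evenly spaced points."""
    if len(values) < 2 or n < 2:
        raise ValueError("need at least two samples and two output points")
    out = []
    for i in range(n):
        pos = i * (len(values) - 1) / (n - 1)
        lo = int(math.floor(pos))
        hi = min(lo + 1, len(values) - 1)
        frac = pos - lo
        out.append(values[lo] * (1.0 - frac) + values[hi] * frac)
    return out


def tone_match(f0_hz: Sequence[float], speaker_median_hz: float,
               template_st: Sequence[float]) -> float:
    """Score in (0, 1]; 1.0 means the contour equals the template.

    RMSE in semitones penalizes both a wrong shape and a wrong register
    (a constant offset from the template), which is why it is used here.
    """
    contour = resample(hz_to_semitones(f0_hz, speaker_median_hz), len(template_st))
    squared = sum((c - t) ** 2 for c, t in zip(contour, template_st))
    rmse = math.sqrt(squared / len(template_st))
    return 1.0 / (1.0 + rmse)


# Made-up flat template (not a canonical Thai tone) to exercise the functions.
flat_template = [0.0, 0.0, 0.0]
flat_score = tone_match([200.0, 200.0, 200.0], 200.0, flat_template)
rising_score = tone_match([200.0, 400.0], 200.0, flat_template)
```

A correlation-only score would capture shape but ignore register, which is why this sketch uses a distance instead.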

Anglicization index

diagnostic · AI · inverts

How often Thai-marked phones (aspirated pʰ tʰ kʰ, palatal ɲ) are detected as their plain English-like equivalents instead. Lower = more Thai-marked phones preserved.
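
The index can be sketched as a simple ratio over aligned phone pairs, assuming an upstream aligner supplies (expected, detected) pairs. The pairing format, the function name, and the plain-equivalent mapping below (in particular ɲ → n) are assumptions made for illustration.

```python
from typing import Dict, Iterable, Tuple

# Thai-marked phone -> plain equivalent (assumed mapping, for illustration).
PLAIN_EQUIVALENT: Dict[str, str] = {"pʰ": "p", "tʰ": "t", "kʰ": "k", "ɲ": "n"}


def anglicization_index(pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of Thai-marked phones detected as their plain equivalent.

    pairs holds (expected_phone, detected_phone); phones outside the marked
    set are ignored. Lower means more Thai-marked phones were preserved.
    """
    marked = 0
    flattened = 0
    for expected, detected in pairs:
        if expected in PLAIN_EQUIVALENT:
            marked += 1
            if detected == PLAIN_EQUIVALENT[expected]:
                flattened += 1
    if marked == 0:
        raise ValueError("no Thai-marked phones in the input")
    return flattened / marked


# Two of the four marked phones lost their Thai marking; "a" is ignored.
aligned = [("pʰ", "p"), ("tʰ", "tʰ"), ("kʰ", "k"), ("ɲ", "ɲ"), ("a", "a")]
ai = anglicization_index(aligned)
```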

The tone tab's per-tone radar scores the model's pitch contour against the canonical Thai five-tone template, per syllable. It carries over from the TTS tone_shape analyzer and is pending an S2S run. Until then the tab shows a "coming soon" state; we do not synthesize per-tone numbers here.