
Vendors Say They Support 99 Languages. They Don't.

Voice-AI vendors compete on language counts the way phone makers used to compete on megapixels. The published benchmarks, vendor docs, and developer forums tell a quieter story: 'supported' is a marketing word, and the floor it hides is lower than anyone says out loud.

Every voice-AI landing page leads with a number. 99 languages. 100+. 147. 70+. Pick a modality — STT, TTS, S2S — and the headline is the same. So we went looking for what that number actually means in production. The answer, drawn from vendor docs, public benchmarks, and a few hundred developer threads, is that “supported” is not a quality claim. It is a coverage claim — and the floor is lower than anyone wants to say out loud.

The most candid single sentence in this whole industry sits inside a Microsoft Azure documentation page describing OpenAI’s gpt-realtime:

“While the underlying model was trained on 98 languages, OpenAI only lists the languages that exceeded <50% word error rate (WER) which is an industry standard benchmark for speech to text model accuracy. The model returns results for languages not listed but the quality will be low.”

Azure Voice Live language support

Read that twice. The bar for “supported” is a word error rate below 50%. Getting nearly one word in two wrong is, by the vendor’s own admission, good enough for inclusion on the list. Languages that miss even that bar are not switched off. The model still “returns results for languages not listed but the quality will be low,” so they remain callable, billed, and demoable.

That sentence is the entire article. The rest is evidence.

The asterisk on the headline number

Here is the gap between marketing copy and documentation, vendor by vendor, all pulled directly from current pages.

| Vendor | Headline claim | What the docs actually say |
| --- | --- | --- |
| AssemblyAI | “99 languages” (blog) | Universal-3 Pro supports 6 languages. The other 93 silently fall back to Universal-2, the older model. (docs) |
| Deepgram Nova-3 | “Multilingual” | 10 languages in the multilingual model. 40+ more sit in monolingual mode. (docs) |
| ElevenLabs Scribe v2 Realtime | “90+ languages” | 6 at the <150ms latency tier; 30 benchmarked at 93.5% aggregate accuracy; everything else is “supported.” (blog) |
| OpenAI Whisper | “Multilingual” | Trained on 98 languages. Model card says “strong ASR results in ~10 languages.” (model card) |
| OpenAI gpt-realtime | “Multilingual” | 57 listed (the ones above the <50% WER bar), from 98 trained. |
| Azure Voice Live | “100+ locales” | The multilingual model serves 15 locales. Belgian Dutch, Greek, Tanzanian Swahili, and Cantonese Simplified explicitly lack fast transcription. |
| AWS Transcribe | “Over 100 languages” (announcement) | Streaming disabled in 5 regions for many languages; custom language models work in roughly 5; Call Analytics is English-only. (feature matrix) |
| Grok Voice Agent | “100+ languages” (news post) | The developer docs say “20+ languages with native-quality accents” with the rest having “varying degrees of accuracy.” (docs) |
| Sesame CSM | “Conversational voice” | English only. The README states: “The model has some capacity for non-English languages due to data contamination in the training data, but it likely won’t do well.” (repo) |
| Resemble AI | “142+ languages” (platform) | The same vendor cites 23 (Chatterbox open source), 100 (Localize), and 142+ (platform) on three different pages. (Localize announcement) |
| ElevenLabs Eleven v3 | “70+ languages” | No per-language quality numbers published. v3 is alpha. |
| Google Gemini Live | “97 languages” (docs) | Honest count, no per-language quality. Native-audio output models accept no explicit language pin; the model picks. |

Of the 24 vendors we catalogued across STT, TTS, and S2S, three publish per-language WER: OpenAI (Whisper model card), Deepgram (the 10-language Nova-3 multilingual blog), and Speechmatics (partial). The other 21 publish a single aggregate number or nothing. The structural reason is obvious — publishing the Spearman correlation between training hours and WER would make “we support 99 languages” indefensible.

So we went looking for that data anyway. It exists, just not on the vendors’ sites.

The cliff, in numbers

Public academic benchmarks have stopped pretending. They do not report a single multilingual score; they report the worst-language CER, because that is the number that determines whether a stack is shippable in a given market.

Whisper large-v3 on FLEURS drops below 10% WER on roughly 30 languages and then falls off a cliff. By the time you reach Khmer, Lao, Burmese, Sinhala, and Amharic, you are looking at 40–55% WER. (source) The Whisper paper itself states it plainly: “the amount of pre-training speech recognition data for a given language is very predictive of zero-shot performance on that language in Fleurs.” (ar5iv Whisper paper)

Pashto is the smoking gun. Whisper’s zero-shot WER on Common Voice ~24 Pashto runs from 90% to 297%, and the medium model collapses to 461%. WER counts substitutions, deletions, and insertions against the number of reference words, so a score of 461% means the inserted words alone outnumber the words actually spoken several times over, which is a polite way of saying the model is hallucinating more than it is transcribing. The strongest publicly reported number, SeamlessM4T-Large at 39.7%, is still a rough draft at best. (ACL Anthology)
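For readers who have not computed WER by hand, here is a minimal Python sketch of the metric that shows how a score can pass 100%. The function name `wer` and the example sentences are ours, not taken from any of the cited benchmarks, and real evaluations usually normalize text first, which this sketch skips.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, computed one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative strings (hypothetical): two spoken words, five invented ones.
reference = "hello world"
hypothesis = "hello world please subscribe to my channel"
print(f"{wer(reference, hypothesis):.0%}")  # prints 250%
```

Because the denominator is the reference length and nothing caps the number of insertions, a model that keeps talking after the speaker stops can score arbitrarily high.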

Urdu swings 2.6× across datasets — CER 21.8% on Common Voice, 56.9% on FLEURS, same model, same language, different recordings. (ML-SUPERB 2.0) The dataset is doing more work than the model.

ML-SUPERB 2.0 sweeps 142 languages across 15 datasets and finds the standard deviation of CER across languages is 10% to 22%. The worst-language CER reaches 337.6%. The 2025 Interspeech challenge replaced “average CER” with “average CER across the worst-performing 15 languages” — the field has formally admitted the long tail is the benchmark. (Interspeech 2025 challenge)
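The difference between the two scoring rules is easy to see in a few lines. The sketch below is our own illustration, not the challenge’s scoring code; the function names and the CER values are hypothetical.

```python
def mean_cer(cer_by_language: dict[str, float]) -> float:
    """Plain average CER across all languages."""
    if not cer_by_language:
        raise ValueError("need at least one language")
    return sum(cer_by_language.values()) / len(cer_by_language)


def worst_k_mean_cer(cer_by_language: dict[str, float], k: int = 15) -> float:
    """Average CER over the k worst-performing languages only."""
    if not cer_by_language or k < 1:
        raise ValueError("need at least one language and k >= 1")
    worst = sorted(cer_by_language.values(), reverse=True)[:k]
    return sum(worst) / len(worst)


# Hypothetical per-language CER values in percent, invented for illustration.
scores = {"lang_a": 5.0, "lang_b": 10.0, "lang_c": 60.0, "lang_d": 100.0}
print(mean_cer(scores))               # 43.75
print(worst_k_mean_cer(scores, k=2))  # 80.0
```

The plain average lets two strong languages dilute two broken ones. The worst-k average cannot be improved by adding more easy languages, which is the point of the change.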

Meta’s MMS paper quantified what nobody else would. Scaling from 61 to 1,107 languages on the same dense model adds +5.1 CER on the languages the model already knew. The breadth tax is real; the only mitigation is per-language adapters — i.e. per-language compute. (JMLR MMS)

And finally — the benchmarks vendors do point at are curated. The Hugging Face Open ASR Leaderboard’s multilingual track is five European languages: German, French, Italian, Spanish, Portuguese. (leaderboard paper) The languages most likely to fail are absent by design. ElevenLabs Scribe v2 averages 2.67% WER on that slice. Outside that slice, we have no idea.

The honest reading: there are roughly 20–30 languages where modern ASR is production-ready, another ~30 where it is workable with cleanup, and a long tail where “supported” means a model will return characters when you send it audio.

What it actually looks like in production

The public benchmarks tell you the shape of the cliff. To see what the cliff feels like at the bottom, you have to read GitHub issues.

Script collapse. Whisper outputs Hindi audio in Urdu, Persian, or Arabic script, with the language ID reported correctly and the text rendered wrong. Discussion #1662 has been open since 2023:

“I call model.transcribe(‘filename’, language=‘hi’) the returned text is in urdu/persian.”

Dialect collapse. Whisper’s Arabic is MSA-only. A native speaker on Discussion #466 put it directly: “The AI seems to have no understanding of Arabic dialects whatsoever — no one actually speaks traditional Arabic. Every Arabic country has its own spin on the language.” Deepgram, separately, said Arabic “isn’t currently on our roadmap. Right-to-left languages raise some unique challenges.” (Deepgram discussion #787)

Hallucination as ad copy. Whisper trained on YouTube subtitles, and the subtitles in some languages were not subtitles. Discussion #961 — Vietnamese:

“A large part of the resulting subtitle file returned to me is not subtitle lines, but a call for viewers to subscribe to the youtube channel.”

The same thread reports the same failure mode in Indonesian. Whisper large-v3 also generates subscribe-spam in silent regions in Mandarin and English. (Discussion #1783)

Code-switching collapses to translation. Whisper does not handle bilingual audio. Discussion #2009:

“I have audio with a single speaker switching between native (Marathi) language and English. Whisper often just translates these English words into Marathi instead of transcribing them.”

Deepgram has its own version of the problem: Nova-2 swaps the English/Spanish speaker label at code-switch boundaries. Deepgram engineering acknowledged the issue (Deepgram #874), and eight months later another developer commented that it was still broken.

Silent production regressions. Deepgram Nova-2 Chinese broke overnight in March 2025 — “it worked previously and suddenly stopped working.” (Discussion #1137) Whisper API silently broke Malayalam, Nepali, and Telugu transcription in October 2024. (OpenAI community thread) Both teams found out from their own users.

Phonetic-based language ID. A native Finnish speaker reported on Discussion #475 that Whisper labels every language he speaks as Finnish — the model picks up his accent, not the words. A Whisper maintainer confirmed: “the language detection might be focusing too much on the phonetics and less on the linguistic content.”

These aren’t edge cases. They are the default behavior for anyone shipping outside English.

TTS: where “supported” means English with a costume

The TTS failure mode is different. The model speaks the right language, but it speaks it like an American.

The single most canonical bug: in Spanish, the number 11 is read as “eleven” instead of “once” — because numbers route through English text normalization before the voice ever sees them. (Deepgram analysis) The exact issue is filed against the ElevenLabs Python client as #253: “elevenlabs translates numbers and dates to audio in English, regardless of the target language.”

Same vendor, Issue #149: “when I use other languages, it reads the sentence with English pronunciation if the sentence is long.” It’s not specific to ElevenLabs. OpenAI’s TTS gets the same complaint, across Dutch, Italian, Hebrew, Armenian, French, Spanish, Portuguese, Russian, German, Chinese, Esperanto, Belarusian. One quote, German: “In German the output sounds like an American that does speak German really well.”
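The number bug, at least, can be worked around on the caller’s side by spelling numbers out in the target language before the string reaches the TTS. Below is a deliberately tiny sketch for Spanish integers from 0 to 99. Every name in it is ours, it ignores gender agreement (“uno” versus “un” or “una”), and a production normalizer would also need dates, currency, ordinals, and larger numbers.

```python
import re

# Spanish cardinals 0-29 are single words, so a lookup table is simplest.
ES_0_TO_29 = [
    "cero", "uno", "dos", "tres", "cuatro", "cinco", "seis", "siete",
    "ocho", "nueve", "diez", "once", "doce", "trece", "catorce", "quince",
    "dieciséis", "diecisiete", "dieciocho", "diecinueve", "veinte",
    "veintiuno", "veintidós", "veintitrés", "veinticuatro", "veinticinco",
    "veintiséis", "veintisiete", "veintiocho", "veintinueve",
]
ES_TENS = {3: "treinta", 4: "cuarenta", 5: "cincuenta", 6: "sesenta",
           7: "setenta", 8: "ochenta", 9: "noventa"}

# Match standalone one- or two-digit integers only. Decimals, years, and
# longer numbers are left untouched for a fuller normalizer to handle.
_SMALL_INT = re.compile(r"(?<![\d.,])\d{1,2}(?!\d|[.,]\d)")


def spell_es(n: int) -> str:
    """Spell an integer from 0 to 99 in Spanish."""
    if not 0 <= n <= 99:
        raise ValueError("this sketch only covers 0-99")
    if n < 30:
        return ES_0_TO_29[n]
    tens, unit = divmod(n, 10)
    return ES_TENS[tens] if unit == 0 else f"{ES_TENS[tens]} y {ES_0_TO_29[unit]}"


def normalize_numbers_es(text: str) -> str:
    """Replace small standalone integers with Spanish words."""
    return _SMALL_INT.sub(lambda m: spell_es(int(m.group())), text)


print(normalize_numbers_es("Tengo 11 gatos y 45 perros."))
# Tengo once gatos y cuarenta y cinco perros.
```

Doing this per language is tedious, which is presumably why it so often gets routed through a single English normalizer instead.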

The HN thread for the ElevenLabs v3 launch is the unvarnished version. Native speakers, listening to the launch demos in real time:

  • Russian: “absolute murder of the Russian language”
  • French: “the French one sounded like an Alabaman who took a semester of college French”
  • French again: “one of them starts reading French like a native English speaker, then mid-sentence switches to a proper accent”
  • Norwegian: “literally just Danish, it’s incredibly bad”
  • Romanian: “sounds awful too, like the TTSes from 15 years ago”

(HN thread)

The listening-test literature confirms what HN already hears. A 2024 ICLR Tiny Paper measured naturalness MOS across six languages with 35 listeners (arXiv 2406.00022):

| Language | Naive multilingual MOS | Best transfer-learning MOS |
| --- | --- | --- |
| German | 3.01 | 4.35 |
| French | 2.85 | 4.26 |
| Spanish | 2.73 | 4.11 |
| Dutch | 2.44 | 4.15 |
| Hindi | 2.32 | 4.01 |
| Tamil | 2.12 | 3.85 |

Tamil — a language vendors routinely list under “supported” — sits at MOS 2.12 in default multilingual mode. That is below “fair.” The best published cross-lingual voice-conversion result reaches 3.28 for naturalness and 2.77 for speaker similarity. (arXiv 2012.14039) When you clone a voice across languages, you lose both naturalness and identity.

Then there are the language-specific failures the listening tests cannot capture. Mandarin tone sandhi (the rule that turns the first of two consecutive Tone-3 syllables into Tone-2) is not consistently applied even by BERT-augmented Tacotron 2. (arXiv 1912.10915) Get it wrong and you change the word, not the prosody. Turkish vowel harmony, Hindi schwa deletion, French liaison, and the split between European and Brazilian Portuguese each “needs explicit handling” per Speechmatics, and the vendor’s headline language count does not tell you which ones get it.

A bilingual Spanish-English listener study at −3 dB SNR found that “listeners performed worse in both the Spanish target condition… they also performed worse in the code-switched contexts overall.” (Frontiers 2025) Code-switching costs intelligibility independent of listener proficiency.

S2S: where the failures get weird

Speech-to-speech models do not just degrade in non-English. They behave in ways the cascaded stack never would.

Name-triggered code-switching. OpenAI Realtime infers a language from English-spelled proper nouns mid-utterance. From community.openai.com/t/realtime-api-language-switching:

“Hello Can I Speak to Alina? GPT Realtime answers me in Italian. I also notice it changes depending on the name. If I call Anastasia, it changed to Russian. Amir, arabic… Can I send a voicemail message to Amir? The model unexpectedly switches to Hebrew.”

Explicit “ignore false signals from names” system prompts were bypassed.

Mid-call drift. Same model, in French: “after just three messages, it gets completely lost, starts speaking in other languages, or responds to old questions.” (thread) Same model again, in German, from the realtime-api-beta repo: user asks “Was ist Deine Lieblingsfußballmanschaft?” — the agent answers “Guten Appetit. Was gibt es denn?” It’s the right language. It’s an unrelated reply.

Content-filter false positives on non-English only. Multiple developers report {"type":"incomplete","reason":"content_filter"} triggering specifically when the input audio is not English. (thread) One developer reports merely mentioning the input language triggers the filter.
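If you log the status object of every response alongside the language of the call, checking whether the filter fires disproportionately on non-English audio takes a few lines. This is a sketch under assumptions: the `(language, raw_json)` log format and the function name are hypothetical, and the only payload shape taken from the reports above is `{"type":"incomplete","reason":"content_filter"}`. The `{"type":"completed"}` records in the example are placeholders.

```python
import json
from collections import Counter


def content_filter_rate(logged_statuses):
    """Fraction of responses per language cut off by the content filter.

    logged_statuses: iterable of (language, raw_json) pairs, where raw_json
    is the status object logged for one response (hypothetical log format).
    """
    totals, filtered = Counter(), Counter()
    for language, raw in logged_statuses:
        status = json.loads(raw)
        totals[language] += 1
        if (status.get("type") == "incomplete"
                and status.get("reason") == "content_filter"):
            filtered[language] += 1
    return {lang: filtered[lang] / totals[lang] for lang in totals}


# Invented log entries for illustration only.
log = [
    ("en", '{"type":"completed"}'),
    ("en", '{"type":"completed"}'),
    ("ro", '{"type":"incomplete","reason":"content_filter"}'),
    ("ro", '{"type":"completed"}'),
]
print(content_filter_rate(log))  # {'en': 0.0, 'ro': 0.5}
```

A per-language rate that is far above the English rate on comparable content is the signal worth escalating to the vendor.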

Production regression on snapshot rollover. A clinic running gpt-realtime-mini in Romanian reported the October 2025 snapshot started “hallucinating non-existing departments, services, and operational details that were not present in the database/context” — inventing an entire neurology department. The same operator: “quality has been quite poor for Portuguese voice agents. We are seeing issues with context loss, weaker understanding.” (thread)

Gemini Live’s hard-pin gap. Google publishes a 97-language list — among the most honest in the industry — but the docs also note that native-audio output models “can switch between languages naturally during conversation” and accept no explicit BCP-47 language pin. A GitHub Cookbook issue reports the API switching to a German voice on English input. A LiveKit Agents issue reports transcription randomly outputting non-English when the speaker is clearly speaking English. Both closed without a fix.

Sesame’s honest README. “The model has some capacity for non-English languages due to data contamination in the training data, but it likely won’t do well.” (CSM repo) Broader coverage is promised only in the future tense: “we intend to expand language support to over 20 languages.” This is the only vendor we found whose docs match their behavior.

There is no public production-grade S2S benchmark with per-language breakdowns. Big Bench Audio is generated from English voices. AudioBench is speech-language understanding, not dialog quality. WildSpeech-Bench is English-heavy by construction. The training data does not exist at scale either — the ACL 2023 corpus survey notes “existing speech translation parallel corpora being orders of magnitude smaller than those available for automatic speech recognition and machine translation.” (paper) S2S models need aligned speech-to-speech pairs. Outside English-to-English and English-to-X translation, those pairs barely exist.

What “supported” should mean

If we were rebuilding the vendor language-support table from scratch, “supported” would mean five things, none of which the headline number captures:

  1. Per-language WER published, on a non-curated test set (Common Voice 17 or FLEURS, not the vendor’s own audio).
  2. Per-language MOS published, with a documented listener panel.
  3. Behavior under code-switching documented — does the model transcribe, translate, or collapse?
  4. Script and dialect coverage explicit — Arabic is not a language, it is a family. Mandarin is not a language, it is a family.
  5. No silent English fallback in numbers, dates, proper nouns, or out-of-distribution audio.

Until vendors publish these — three of twenty-four do, and only partially — the supported-language count tells you what the model has seen in training, not what it will do for your users.

If you are shipping voice today, the actual workflow:

  • Pick the 5–10 languages you actually need.
  • Benchmark them yourself on your own audio.
  • Budget for per-language text normalization (numbers, dates, currency, addresses, abbreviations) before the TTS sees the string.
  • Plan for the silent regression. Vendors break languages without telling you. Run a regression suite on a fixed audio bank weekly.
  • For S2S in any non-English language, pin language hard in the prompt, watch for mid-call drift, and verify the content filter does not fire false positives.
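The weekly regression run in the list above can be as simple as comparing this week’s per-language WER on the fixed audio bank against a stored baseline. Here is a minimal sketch. The function name, the 0.02 tolerance, and the WER values are all assumptions for illustration, and the scores themselves would come from whatever transcription and scoring pipeline you already have.

```python
def find_regressions(baseline, current, tolerance=0.02):
    """Return languages whose WER got worse by more than `tolerance`.

    baseline, current: dicts mapping language code to WER as a fraction.
    A language missing from `current` is reported with None, since a
    silently dropped language is also a regression.
    """
    regressions = {}
    for language, base_wer in baseline.items():
        now_wer = current.get(language)
        if now_wer is None:
            regressions[language] = (base_wer, None)
        elif now_wer - base_wer > tolerance:
            regressions[language] = (base_wer, now_wer)
    return regressions


# Hypothetical numbers, invented for illustration.
baseline = {"ml": 0.20, "ne": 0.25, "te": 0.22}
this_week = {"ml": 0.61, "ne": 0.26}

for language, (before, after) in find_regressions(baseline, this_week).items():
    print(language, before, after)
```

Keeping the audio bank fixed matters more than the threshold: if the recordings change between runs, you are measuring the dataset again rather than the vendor.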

The verdict

The voice-AI industry is in the megapixel phase. Vendors compete on a single big number because there is no shared benchmark forcing per-language disclosure, and no per-language disclosure means there is no penalty for the gap between marketing and reality. The Azure documentation accident — “OpenAI only lists the languages that exceeded <50% word error rate” — is the entire industry in one sentence. It is just usually buried better.

Headline language counts are noise. The signal is per-language WER for STT, per-language MOS for TTS, and per-language behavior documentation for S2S. Almost nobody publishes any of it. Until they do, we’ll keep running our own benchmarks — and so should you.