Your Voice Agent's LLM Speaks Spanish. That Doesn't Mean It Follows the Rules in Spanish.
Tool-calling and grounding are treated as language-agnostic — the model emits the same JSON whether the caller speaks English or Indonesian. The format is language-agnostic. The error rate is not. Published evidence and a preliminary look at where voice-agent LLMs quietly stop following the rules once you leave English.
Ask anyone whether an LLM’s tool-calling is multilingual and you’ll get a shrug: of course it is. The model reads the conversation, decides to call issue_refund, and emits the same structured JSON — {"name":"issue_refund","args":{"order_id":"B5678"}} — whether the customer spoke English, Spanish, or Thai. The function-calling format does not care about language.
That’s true, and it’s the wrong thing to measure. The format is language-agnostic. Whether the model calls the right tool, with the right arguments, at the right moment — and whether it refuses to invent a policy it was never given — is not. Comprehension, argument extraction, and grounding all lean on the model’s competence in that language, and that competence falls off a cliff outside the handful of languages it was heavily trained on. We’ve documented that cliff for speech recognition and speech synthesis — “supported” is a coverage word, not a quality one. The LLM in the middle of the pipeline has the same cliff. It’s just harder to see, because a wrong tool call looks exactly like a right one until it fires.
This is measured, and the gap is large
Start with instruction-following, the floor everything else stands on. BenchMAX evaluated six capabilities across 17 languages and found the floor is much lower than the English number implies: on rule-based instruction-following, Llama-3.1-70B scored 76.2% in English and 11.1% across non-English languages on average; Qwen2.5-72B, 80.8% versus 34.1%. (BenchMAX, arXiv 2502.07346) These are flagship models losing most of their instruction-following the moment you leave English. And crucially, the paper found that “larger models do not consistently have a smaller gap” — scaling up does not reliably close it.
Tool-calling specifically has the same problem. Lost in Execution tested tool calling in Chinese, Hindi, and the low-resource language Igbo and identified “parameter value language mismatch” as a dominant failure mode — the model understands the intent and picks the right tool, then generates the argument values in the user’s language, violating the execution convention the API expects. The paper’s blunt conclusion: inference-time mitigations help, but “none of them can fully recover English-level performance.” (Lost in Execution, arXiv 2601.05366)
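To make "parameter value language mismatch" concrete, here is a minimal sketch of a check for it. It assumes the API publishes the set of values each enum-style argument accepts. The function name, the `reason` argument, and its allowed values are hypothetical illustrations and are not taken from the paper.

```python
def find_language_mismatches(args, enum_values):
    """Return names of arguments whose values break the API's convention.

    args: dict of argument name -> value emitted by the model.
    enum_values: dict of argument name -> set of values the API accepts
                 (hypothetical schema; free-text arguments are left out).
    """
    mismatched = []
    for name, value in args.items():
        allowed = enum_values.get(name)
        # Arguments without a declared value set are not checked here.
        if allowed is not None and value not in allowed:
            mismatched.append(name)
    return sorted(mismatched)


# Hypothetical schema: the API only accepts English enum values for `reason`.
ENUMS = {"reason": {"damaged", "late", "wrong_item"}}

english_call = {"order_id": "B5678", "reason": "damaged"}
spanish_call = {"order_id": "B5678", "reason": "dañado"}  # right intent, wrong language

print(find_language_mismatches(english_call, ENUMS))  # []
print(find_language_mismatches(spanish_call, ENUMS))  # ['reason']
```

The tool name and the intent are identical in both calls. Only the value differs, which is why this failure is easy to miss when you check only that the model picked the right tool.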
And grounding — staying factual, not inventing — varies by language too. A study across 30 languages and six open-source model families found hallucination is real and measurable everywhere, and surfaced two findings that should make a voice-agent buyer pause: smaller models hallucinate more, and models with broader language support display higher hallucination rates. (Hallucination across languages, arXiv 2502.12769) The “we support 100 languages” badge is not free.
What none of these papers cover is the case that matters for voice agents. Retell put real numbers on two failures for English contact-center calls: hallucination and tool-calling errors, where even the best model misses roughly 1 in 8 of the hardest turns (Retell vCX-Hard). Nobody has measured those same two failures per language, on the kind of input a voice pipeline actually produces. That’s the gap we started filling.
What we measured (and what we’re not claiming)
We give a model a customer-support policy, a toolset, and a customer turn with a known-correct outcome, then render the same scenario across languages and grade two axes: tool-calling deterministically (right tool, right args, right order; no judge needed), and grounding with an LLM judge against the policy. The results are preliminary, but the failures are real:
- A frontier model invented a policy that didn’t exist, in two languages. Given a policy that said nothing about price matching, GPT-5 confidently described a price-match policy with conditions, in both English and Spanish. The grounding judge caught it both times. A smaller sibling model hedged and refused instead, which is the safer behavior. “Bigger model” did not mean “more grounded,” which is exactly the non-monotonic behavior the benchmarks above keep finding.
- The best tool-caller still fell off a language cliff. xAI’s Grok 4.3 led our board — 100% tool-calling correctness in English and Spanish — then dropped to 50% overall on Indonesian. Same scenarios, same tools. The capability didn’t change; the reliability did.
- The ranking is language-dependent. A model that grounds well in English is not guaranteed to ground well in Thai, and the order reshuffles by language. One English leaderboard would have hidden all of it.
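The deterministic half of that grading (right tool, right args, right order) can be sketched in a few lines. This is an illustration under assumptions, not the harness behind the results above: the call shape mirrors the JSON example earlier in the article, and `lookup_order` is a hypothetical second tool added so that order matters.

```python
def grade_tool_calls(expected, actual):
    """Grade one turn: same tools, same arguments, same order.

    Each call is a dict like {"name": "issue_refund", "args": {...}}.
    Returns three booleans plus an overall `passed` flag.
    """
    exp_names = [c["name"] for c in expected]
    act_names = [c["name"] for c in actual]
    right_tools = sorted(exp_names) == sorted(act_names)
    right_order = exp_names == act_names
    # Arguments are compared position by position, so they only count
    # as right when the call sequence itself matches.
    right_args = right_order and all(
        e["args"] == a["args"] for e, a in zip(expected, actual)
    )
    return {
        "right_tools": right_tools,
        "right_order": right_order,
        "right_args": right_args,
        "passed": right_tools and right_order and right_args,
    }


expected = [
    {"name": "lookup_order", "args": {"order_id": "B5678"}},  # hypothetical tool
    {"name": "issue_refund", "args": {"order_id": "B5678"}},
]
swapped = list(reversed(expected))  # right tools, wrong order

print(grade_tool_calls(expected, expected)["passed"])  # True
print(grade_tool_calls(expected, swapped)["passed"])   # False
```

Exact equality on arguments is deliberately strict. A real harness might normalize case or whitespace first, but every normalization is a place where a genuine mismatch can slip through.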
We are deliberately not turning this into a leaderboard. Six scenarios per language is noise-level; the per-language deltas can flip sign between runs; the non-English scenarios are model-authored translations pending a native pass; grounding is graded by a single judge, not the multi-model council that method deserves. What survives that noise is the shape, and it’s the same shape the published literature reports: the format is uniform, the error rate is not, and the gap is invisible until you measure per language.
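To see why six scenarios is noise-level, it helps to put an interval around a score. The sketch below uses the Wilson score interval for a binomial proportion. Reading a 50% score as 3 correct out of 6 is an assumption made for illustration, as is treating scenarios as independent trials.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


for correct in (3, 6):
    low, high = wilson_interval(correct, 6)
    print(f"{correct}/6 correct: {low:.2f} to {high:.2f}")
# 3/6 correct: 0.19 to 0.81
# 6/6 correct: 0.61 to 1.00
```

Under those assumptions the two intervals overlap, so a perfect score and a 50% score on six scenarios cannot be cleanly separated. That is the arithmetic behind reporting the shape and not a ranking.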
The part only we can measure
There’s a second reason this lives at Speko rather than as another text benchmark. A voice-agent LLM does not read clean text — it reads the output of a speech-to-text model, errors and all. And we run the STT board. So we can do the thing a text-only benchmark structurally can’t: feed the LLM the actual ASR hypothesis our gateway produces, in each language, and measure how much grounding and tool-calling degrade when the input is real transcription instead of a clean string. Then the uncomfortable follow-up — does upstream STT unfairness become downstream LLM unfairness? We’ve already found one ASR provider that returned empty transcripts on female-voice clips in one language; if the model downstream is reasoning over a blank, the failure compounds. Nobody is measuring that propagation. We intend to.
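The measurement described here is a paired comparison: the same scenario, once as clean text and once as the ASR hypothesis, graded against the same expected call. A minimal sketch follows, assuming each scenario carries both strings. The field names are hypothetical and `toy_agent` is a stand-in rule so the example runs offline; it is not a model call or the gateway's API.

```python
def pass_rate(scenarios, run_agent, text_key):
    """Fraction of scenarios where the agent emits the expected tool call."""
    passed = sum(run_agent(s[text_key]) == s["expected_call"] for s in scenarios)
    return passed / len(scenarios)


def stt_degradation(scenarios, run_agent):
    """Compare pass rates on clean text and on ASR output for the same scenarios."""
    clean = pass_rate(scenarios, run_agent, "clean_text")
    post_stt = pass_rate(scenarios, run_agent, "asr_hypothesis")
    return {"clean": clean, "post_stt": post_stt, "drop": clean - post_stt}


def toy_agent(text):
    """Stand-in for a real LLM call: a toy rule, for illustration only."""
    if "refund" in text and "B5678" in text:
        return {"name": "issue_refund", "args": {"order_id": "B5678"}}
    return None


REFUND = {"name": "issue_refund", "args": {"order_id": "B5678"}}
scenarios = [
    {
        "clean_text": "I want a refund for order B5678",
        "asr_hypothesis": "i want a refund for order B5678",
        "expected_call": REFUND,
    },
    {
        "clean_text": "please refund order B5678",
        "asr_hypothesis": "",  # empty transcript: the agent reasons over a blank
        "expected_call": REFUND,
    },
]

print(stt_degradation(scenarios, toy_agent))
# {'clean': 1.0, 'post_stt': 0.5, 'drop': 0.5}
```

Splitting the same comparison by speaker group as well as by language is what would show whether an upstream transcription gap turns into a downstream decision gap.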
What “supported” should mean for the LLM
Same discipline we apply to STT and TTS:
- Grounding and tool-calling measured per language, not as one English aggregate.
- Tool-calling graded against ground truth (expected tool + args) — it’s the one axis that doesn’t need a judge, and “parameter value language mismatch” is exactly what ground-truth grading catches.
- The score reported on real pipeline input (post-STT), not just clean text.
- Fairness propagation tracked — if the transcript is worse for some speakers, is the decision worse too?
- No silent ranking — publish the sample size and the method, and don’t call six scenarios a verdict.
The verdict
“The model supports tool-calling in 50 languages” is a megapixel number — true, and almost meaningless. The function call is language-agnostic; the judgment behind it is not, and the published evidence is consistent and uncomfortable: instruction-following collapses outside English, tool arguments come back in the wrong language, scaling up doesn’t reliably fix it, and broader language support can mean more hallucination, not less. Retell put a real number on the English version of this; the multilingual version, measured on real pipeline output with fairness in view, is open ground. Our early signal says the cliff is there and the leaderboard reshuffles once you cross it. We’ll keep measuring it in the open — and we’ll show our sample size while we do.