The capability is language-agnostic. The error rate isn't.
A model emits the same structured tool call whether the caller speaks English or Spanish, so the output format is language-agnostic. Comprehension, argument extraction, and staying grounded can all degrade outside English, however, so the model's error rate shifts by language. We run the same scenarios in each language and report the delta against English. This is the axis Retell's English-only benchmark does not cover, and it is measured on the output of Speko's own STT pipeline.
Reliability delta vs English
Pick a language to compare against the English baseline. Four languages are covered so far; new languages are added to the dropdown rather than as extra table columns, which is how this scales.
| Model | Spanish tool-call | Spanish grounding | Spanish overall | English overall | Δ (selected − English) |
|---|---|---|---|---|---|
| OpenAI gpt-5 | 67% | 33% | 50% | 50% | +0pt |
| OpenAI gpt-5-mini | 33% | 33% | 33% | 50% | -17pt |
| OpenAI gpt-5-nano | 67% | 67% | 67% | 50% | +17pt |
| xAI Grok 4.3 | 100% | 67% | 83% | 83% | +0pt |
overall pass rate, both axes · 6 scenarios per language · 2026-06-10 · non-English scenarios are model-authored translations (v0)
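As a rough illustration of how one row's delta could be derived, the sketch below pools the two axes into an overall pass rate and subtracts the English baseline. The names `AxisResult`, `overall_rate`, and `delta_points` are hypothetical, the pooling rule is an assumption that happens to reproduce the table, and the per-axis counts are made up to match the gpt-5-mini row rather than taken from the actual run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AxisResult:
    """Pass count for one axis (tool-call or grounding) in one language."""
    passed: int
    total: int


def overall_rate(tool: AxisResult, ground: AxisResult) -> float:
    """Pool both axes: passes over scored scenarios (assumed aggregation)."""
    return (tool.passed + ground.passed) / (tool.total + ground.total)


def delta_points(selected: float, english: float) -> int:
    """Selected-language rate minus English rate, in whole percentage points."""
    return round(selected * 100) - round(english * 100)


# Hypothetical counts chosen to match the gpt-5-mini row (33% / 33% / 33% vs 50%).
spanish = overall_rate(AxisResult(1, 3), AxisResult(1, 3))
english = overall_rate(AxisResult(2, 3), AxisResult(1, 3))
print(f"{delta_points(spanish, english):+d}pt")  # -17pt
```

Rounding each rate before subtracting keeps the delta consistent with the percentages shown in the table.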
Preliminary v0: with 6 scenarios per language, a single scenario moves the overall rate by about 17 points, so the deltas above are within noise and can flip sign between runs. What scales is the structure: the same scenario run across languages, summarized as one delta. The numbers will firm up as the case set grows and a native-speaker translation pass lands.
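To put a number on "within noise", one option is a confidence interval on the pass rate. The sketch below uses the Wilson score interval, a common choice for small samples; the source does not say which method, if any, Speko applies, and `wilson_interval` is a hypothetical helper.

```python
import math


def wilson_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate; z=1.96 gives roughly 95% coverage."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


low, high = wilson_interval(3, 6)  # a 50% overall rate on 6 scenarios
print(f"{low:.0%} to {high:.0%}")
```

For 3 passes out of 6 the interval spans roughly 19% to 81%, far wider than any delta in the table, which is why a 17-point swing at this sample size should not be read as a real difference.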
English contact-center axis — cited prior art
Retell AI, vCX-Hard Benchmark: real production contact-center calls (English), scored by an LLM council on non-hallucination and tool-call correctness. Speko cites this for the English production axis rather than cloning it; Speko’s differentiated axis is multilingual reliability measured on the output of its own STT pipeline.