LLM · Multilingual

The capability is language-agnostic. The error rate isn't.

A model emits the same structured tool call whether the caller speaks English or Spanish; the output format does not depend on the language. Comprehension, argument extraction, and staying grounded all degrade outside English, though, so the model's error rate shifts by language. We run the same scenarios in each language and report the delta against English. Retell's English-only benchmark does not cover this axis, and we measure it on the output of Speko's own STT pipeline.
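
As a rough illustration of "same scenario, different language", the sketch below pairs one expected tool call with an English and a Spanish utterance and checks a model's emitted call against it. Everything here is hypothetical: the `ToolCall` and `Scenario` types, the `book_appointment` tool, its arguments, and the exact-match pass rule are assumptions made for the example, not Speko's actual harness.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """A structured tool call: a tool name plus keyword arguments."""
    name: str
    arguments: dict = field(default_factory=dict)


@dataclass
class Scenario:
    """One test case; the same scenario_id is reused across languages."""
    scenario_id: str
    language: str
    utterance: str
    expected: ToolCall


def tool_call_passes(expected: ToolCall, actual: ToolCall) -> bool:
    """Assumed pass rule: tool name and every argument must match exactly."""
    return actual.name == expected.name and actual.arguments == expected.arguments


# Hypothetical scenario: the expected call is identical in both languages,
# only the caller's utterance changes.
expected_call = ToolCall("book_appointment", {"date": "2026-06-12", "time": "15:00"})
scenarios = [
    Scenario("appt-01", "en", "Book me in for June 12th at 3 pm.", expected_call),
    Scenario("appt-01", "es", "Resérvame una cita para el 12 de junio a las 3 de la tarde.", expected_call),
]

# A typical extraction slip: "3 de la tarde" read as 03:00 instead of 15:00.
wrong_call = ToolCall("book_appointment", {"date": "2026-06-12", "time": "03:00"})
```

Under this rule `tool_call_passes(expected_call, wrong_call)` is false even though the tool name and the call format are correct, which is the kind of language-dependent error the delta is meant to surface.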

Reliability delta vs English

Pick a language to compare against the English baseline. Four languages are covered so far; as more are added, the table scales through the dropdown rather than through extra columns.

Compare English vs Spanish

Model	Spanish tool	Spanish ground	Spanish overall	English overall	Δ (selected − English)
OpenAI gpt-5	67%	33%	50%	50%	+0pt
OpenAI gpt-5-mini	33%	33%	33%	50%	−17pt
OpenAI gpt-5-nano	67%	67%	67%	50%	+17pt
xAI Grok 4.3	100%	67%	83%	83%	+0pt

overall pass rate, both axes · 6 scenarios per language · 2026-06-10 · non-English scenarios are model-authored translations (v0)
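
The arithmetic behind the table can be sketched as follows. This is a minimal reconstruction under stated assumptions, not the actual scoring code: it assumes the 6 scenarios split into 3 tool-call and 3 grounding scenarios (inferred from the 33%/67%/100% steps in the table), that "overall" pools passes across both axes, and that the delta is the difference of the rounded percentages. The names `AxisResult`, `overall_rate`, and `delta_points` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AxisResult:
    """Pass count on one axis (tool-call or grounding) for one language."""
    passed: int
    total: int

    @property
    def rate(self) -> float:
        return self.passed / self.total


def overall_rate(tool: AxisResult, ground: AxisResult) -> float:
    """Pooled pass rate across both axes (assumed definition of 'overall')."""
    return (tool.passed + ground.passed) / (tool.total + ground.total)


def delta_points(selected: float, english: float) -> int:
    """Delta in whole percentage points: selected language minus English."""
    return round(selected * 100) - round(english * 100)


# Assumed per-axis counts that reproduce the gpt-5-mini row: 33% tool, 33% ground.
mini_es = overall_rate(AxisResult(1, 3), AxisResult(1, 3))
mini_en = 3 / 6  # English overall of 50%, taken from the table
mini_delta = delta_points(mini_es, mini_en)
```

With these counts `mini_delta` comes out at -17, matching the gpt-5-mini row; 3/3 tool and 2/3 grounding gives the 83% shown for Grok 4.3.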

Preliminary v0: with 6 scenarios per language, a single scenario is worth about 17 points, so the delta is within noise and can flip sign between runs. What scales is the structure: the same scenario across languages, reduced to one delta. The numbers firm up as the case set grows and a native-speaker translation pass lands.
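
To put a number on "within noise", one option is a confidence interval on the pass rate. The sketch below uses the Wilson score interval at roughly 95% (z = 1.96); that choice is an assumption made for illustration, not a method this page reports using, and the function names are hypothetical.

```python
import math


def wilson_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial pass rate (z=1.96 is about 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


def points_per_scenario(total: int) -> float:
    """How many percentage points a single scenario flip moves the pass rate."""
    return 100 / total


low, high = wilson_interval(3, 6)   # a 50% pass rate on 6 scenarios
step = points_per_scenario(6)       # size of one scenario flip, in points
```

For 3 passes out of 6 the interval runs from roughly 19% to 81%, and one scenario flipping moves the rate by about 17 points, so a ±17pt delta at this sample size is a single scenario's worth of difference.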

English contact-center axis — cited prior art

Retell AI, vCX-Hard Benchmark: real production contact-center calls (English), scored by an LLM council on non-hallucination and tool-call correctness. Speko cites this for the English production axis rather than cloning it; Speko’s differentiated axis is multilingual reliability measured on the output of its own STT pipeline.