LLM · Tool-calling

Right tool, right args, right moment.

Tool-calling has a ground truth, so we grade it deterministically — no judge: did the model call the expected tool, with the expected arguments, and respect ordering rules (e.g. look up an order before refunding it)? The failure modes mirror Retell's: missing a necessary call, an unnecessary call, or the wrong tool.

Tool-call correctness

Model	English	Spanish	Thai	Tagalog	Bahasa Indonesia
OpenAI gpt-5	67%	67%	67%	67%	67%
OpenAI gpt-5-mini	67%	33%	67%	67%	67%
OpenAI gpt-5-nano	33%	67%	33%	33%	33%
xAI Grok 4.3	100%	100%	67%	67%	67%

3 tool-calling scenarios per language · deterministic grading (tool name + args + ordering) · 2026-06-10

What the harness caught

On the refund scenario, the correct first move is to look the order up before refunding. Smaller models sometimes skip the tool call entirely (it varies by language), which is exactly the “missing necessary call” failure the deterministic grader flags.

Preliminary v0 — 3 scenarios per language. Direction of the English↔Spanish gap varies between runs at this sample size; the per-language split is the point, not the exact numbers.