LLM · Grounding

Does it stay in policy — or make one up?

The single most damaging voice-agent failure is confidently stating a policy the company never gave it. We score whether the model stays grounded — declines, escalates, or admits it doesn't know — instead of fabricating. Graded by an LLM judge against the policy and a pass rubric (an LLM-council is the planned method).

Non-hallucination pass rate

Model	English	Spanish	Thai	Tagalog	Bahasa Indonesia
OpenAI gpt-5	33%	33%	33%	33%	33%
OpenAI gpt-5-mini	33%	33%	67%	67%	67%
OpenAI gpt-5-nano	67%	67%	33%	67%	33%
xAI Grok 4.3	67%	67%	67%	67%	33%

3 grounding scenarios per language · judged by openai / gpt-5 · 2026-06-10

What the harness caught

gpt-5 fabricated a non-existent “price-match” policy — in BOTH English and Spanish — when the system policy said nothing about price matching. The grounder caught it; gpt-5-nano instead hedged and refused.

Preliminary v0 — 3 scenarios is noise-level; counterintuitive orderings (a bigger model grounding worse) are real behaviors but not yet statistically separable.