LLM · Grounding

Does it stay in policy — or make one up?

The single most damaging voice-agent failure is confidently stating a policy the company never gave it. We score whether the model stays grounded — declines, escalates, or admits it doesn't know — instead of fabricating. Graded by an LLM judge against the policy and a pass rubric (an LLM-council is the planned method).

Non-hallucination pass rate

Model EnglishSpanishThaiTagalogBahasa Indonesia
OpenAI gpt-5 33%33%33%33%33%
OpenAI gpt-5-mini 33%33%67%67%67%
OpenAI gpt-5-nano 67%67%33%67%33%
xAI Grok 4.3 67%67%67%67%33%

3 grounding scenarios per language · judged by openai / gpt-5 · 2026-06-10

What the harness caught

gpt-5 fabricated a non-existent “price-match” policy — in BOTH English and Spanish — when the system policy said nothing about price matching. The grounder caught it; gpt-5-nano instead hedged and refused.

Preliminary v0 — 3 scenarios is noise-level; counterintuitive orderings (a bigger model grounding worse) are real behaviors but not yet statistically separable.