Does it stay in policy — or make one up?
The single most damaging voice-agent failure is confidently stating a policy the company never gave it. We score whether the model stays grounded — declines, escalates, or admits it doesn't know — instead of fabricating. Graded by an LLM judge against the policy and a pass rubric (an LLM-council is the planned method).
Non-hallucination pass rate
| Model | English | Spanish | Thai | Tagalog | Bahasa Indonesia |
|---|---|---|---|---|---|
| OpenAI gpt-5 | 33% | 33% | 33% | 33% | 33% |
| OpenAI gpt-5-mini | 33% | 33% | 67% | 67% | 67% |
| OpenAI gpt-5-nano | 67% | 67% | 33% | 67% | 33% |
| xAI Grok 4.3 | 67% | 67% | 67% | 67% | 33% |
3 grounding scenarios per language · judged by openai / gpt-5 · 2026-06-10
What the harness caught
gpt-5 fabricated a non-existent “price-match” policy — in BOTH English and Spanish — when the system policy said nothing about price matching. The grounder caught it; gpt-5-nano instead hedged and refused.
Preliminary v0 — 3 scenarios is noise-level; counterintuitive orderings (a bigger model grounding worse) are real behaviors but not yet statistically separable.