S2S · Audio Quality

Which voice sounds natural — and which lands in the uncanny zone?

Reference-free, waveform-only micro-acoustics (jitter, shimmer, harmonics-to-noise ratio) plus loudness steadiness, all measured on each model's own output. No MOS-prediction model and no listening panel: this is the same Praat / openSMILE spine Speko runs on its text-to-speech benchmark.
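For readers who want to see what these four numbers mean arithmetically, here is a minimal sketch using the textbook definitions. It is not the Praat / openSMILE pipeline itself. It assumes "local" jitter and shimmer (mean absolute cycle-to-cycle difference over the mean), HNR derived from the normalized autocorrelation peak at the pitch period, and loudness CV taken as standard deviation over mean of per-frame levels. The page does not say which variants Speko uses. The function names and the toy inputs are hypothetical, and the inputs are made-up numbers, not measurements.

```python
from math import log10
from statistics import mean, pstdev


def local_perturbation_pct(values):
    """Mean absolute difference between consecutive values, divided by
    the mean value, in percent. Applied to glottal cycle durations this
    is local jitter; applied to cycle peak amplitudes, local shimmer."""
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return 100.0 * mean(diffs) / mean(values)


def hnr_db(r):
    """Harmonics-to-noise ratio in dB, where r (0 < r < 1) is the
    normalized autocorrelation peak, i.e. the periodic energy share."""
    if not 0.0 < r < 1.0:
        raise ValueError("r must be strictly between 0 and 1")
    return 10.0 * log10(r / (1.0 - r))


def loudness_cv(frame_levels):
    """Coefficient of variation of per-frame loudness (lower = steadier)."""
    return pstdev(frame_levels) / mean(frame_levels)


# Toy inputs, invented purely for illustration.
periods_s = [0.0100, 0.0102, 0.0099, 0.0101]  # cycle durations, seconds
peak_amps = [0.50, 0.52, 0.49, 0.51]          # cycle peak amplitudes
frame_rms = [0.2, 0.4, 0.3, 0.3]              # per-frame RMS levels

jitter_pct = local_perturbation_pct(periods_s)
shimmer_pct = local_perturbation_pct(peak_amps)
cv = loudness_cv(frame_rms)
```

In a real run the cycle durations and amplitudes would come from a pitch tracker, which is the hard part that Praat and openSMILE handle; the arithmetic above only shows how the summary numbers are formed once those cycles are known.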

PRELIMINARY production · 3 realtime models · n=6 clips each · single run

What we found

xAI Grok produces the cleanest voice: harmonics-to-noise 17.5 dB, the only model above the 15 dB clean-voice threshold. The original gpt-realtime is the roughest (highest shimmer, lowest HNR), although it has the lowest jitter and the steadiest loudness. gpt-realtime-2 lands between them on HNR, with the lowest shimmer but the highest jitter. All three run hotter than natural human speech on both jitter and shimmer.

Model	Jitter (%), lower is better (natural: 0.4–1.0%)	Shimmer (%), lower is better (natural: 3–6%)	Harmonics-to-noise (dB), higher is better (clean: > 15 dB)	Loudness CV, lower is better (lower = steadier)
OpenAI gpt-realtime	1.74 ★	8.77	10.83	0.66 ★
OpenAI gpt-realtime-2	2.09	6.4 ★	13.61	0.67
xAI Grok Voice (think-fast)	1.86	6.87	17.53 ★	0.72

★ = best on that metric. Lower jitter/shimmer/loudness-CV is better; higher harmonics-to-noise is better.
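The stars and the natural-range comparisons can be checked mechanically. The sketch below copies the four values per model from the table and takes the natural ranges from the column headers; the dictionary layout and helper names are illustrative assumptions, not the structure of Speko's results file.

```python
# Values copied from the table above.
RESULTS = {
    "OpenAI gpt-realtime": {
        "jitter": 1.74, "shimmer": 8.77, "hnr": 10.83, "loudness_cv": 0.66},
    "OpenAI gpt-realtime-2": {
        "jitter": 2.09, "shimmer": 6.40, "hnr": 13.61, "loudness_cv": 0.67},
    "xAI Grok Voice (think-fast)": {
        "jitter": 1.86, "shimmer": 6.87, "hnr": 17.53, "loudness_cv": 0.72},
}

# Only harmonics-to-noise is higher-is-better; the rest are lower-is-better.
HIGHER_IS_BETTER = {"hnr"}

# Natural-speech ranges as stated in the column headers.
NATURAL_RANGE_PCT = {"jitter": (0.4, 1.0), "shimmer": (3.0, 6.0)}
HNR_CLEAN_DB = 15.0


def best_model(metric):
    """Return the model that earns the star on the given metric."""
    pick = max if metric in HIGHER_IS_BETTER else min
    return pick(RESULTS, key=lambda model: RESULTS[model][metric])


def within_natural(metric):
    """Models whose value falls inside the natural range for the metric."""
    low, high = NATURAL_RANGE_PCT[metric]
    return [m for m, row in RESULTS.items() if low <= row[metric] <= high]


stars = {metric: best_model(metric)
         for metric in ("jitter", "shimmer", "hnr", "loudness_cv")}
clean_voices = [m for m, row in RESULTS.items() if row["hnr"] > HNR_CLEAN_DB]
```

With n=6 clips and a single run, some of the margins here are small (for example 0.66 versus 0.67 on loudness CV), so a star marks the best observed value rather than a confirmed ranking.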

Preliminary: 6 varied-prompt clips per model from a single production capture, with no confidence interval yet. The numbers are real and reference-free; they move to MEASURED once we have ≥3 runs with a CI and per-language splits. Reproduce from ~/speko-realtime-test/s2s-audio-quality/MULTIMODEL.json.
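As one illustration of what "≥3 runs with a CI" could look like, the sketch below computes a 95% Student-t interval around the mean of per-run values. The choice of a t-interval is an assumption, since the page does not say how the CI will be computed, and a bootstrap would be an equally reasonable option. The function name is hypothetical, and the example run values are placeholders, not measurements.

```python
from math import sqrt
from statistics import mean, stdev

# Two-sided 95% Student-t critical values, keyed by degrees of freedom.
T_CRIT_95 = {2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571, 6: 2.447}


def mean_ci95(run_values):
    """Return (mean, low, high) for a 95% t-interval over per-run values.

    Requires at least 3 runs, mirroring the >=3-run rule stated above.
    """
    n = len(run_values)
    if n < 3:
        raise ValueError("need at least 3 runs for a confidence interval")
    df = n - 1
    if df not in T_CRIT_95:
        raise ValueError("extend T_CRIT_95 for this many runs")
    centre = mean(run_values)
    half_width = T_CRIT_95[df] * stdev(run_values) / sqrt(n)
    return centre, centre - half_width, centre + half_width


# Placeholder per-run HNR values in dB (NOT real data).
example_runs = [17.0, 17.5, 18.0]
centre, low, high = mean_ci95(example_runs)
```

With only three runs the critical value is large (4.303), so the interval stays wide unless run-to-run spread is small; that is one reason a single-run figure is labelled preliminary.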