S2S · Audio Quality

Which voice sounds natural — and which lands in the uncanny zone?

Reference-free, waveform-only micro-acoustics — jitter, shimmer, harmonics-to-noise — plus loudness steadiness, measured on each model's own output. No MOS-prediction model, no listening panel; the same Praat / openSMILE spine Speko runs on its text-to-speech benchmark.
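As a rough guide to what these metrics measure, here is a minimal sketch of the standard textbook definitions in plain Python. It is not Speko's pipeline: Praat or openSMILE would extract the glottal periods, peak amplitudes and autocorrelation from the waveform, and the functions below assume those values are already in hand. All function names are hypothetical, and treating loudness CV as the standard deviation over the mean of per-frame loudness is an assumption, since the page does not define it.

```python
import math
from statistics import mean, pstdev


def local_jitter_pct(periods):
    """Local jitter: mean absolute difference between consecutive
    glottal periods, divided by the mean period, in percent."""
    diffs = [abs(a - b) for a, b in zip(periods, periods[1:])]
    return 100.0 * mean(diffs) / mean(periods)


def local_shimmer_pct(amplitudes):
    """Local shimmer: the same ratio, computed on per-period peak amplitudes."""
    diffs = [abs(a - b) for a, b in zip(amplitudes, amplitudes[1:])]
    return 100.0 * mean(diffs) / mean(amplitudes)


def hnr_db(r):
    """Harmonics-to-noise ratio in dB from the normalized autocorrelation
    peak r (0 < r < 1) at the pitch period: 10 * log10(r / (1 - r))."""
    return 10.0 * math.log10(r / (1.0 - r))


def loudness_cv(frame_loudness):
    """ASSUMPTION: coefficient of variation (population std / mean)
    of per-frame loudness values."""
    return pstdev(frame_loudness) / mean(frame_loudness)
```

Under these definitions a perfectly periodic, constant-amplitude signal has zero jitter and shimmer, and an autocorrelation peak of 0.5 corresponds to 0 dB HNR (equal harmonic and noise energy).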

PRELIMINARY · production · 3 realtime models · n=6 clips each · single run

What we found

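xAI Grok produces the cleanest voice: its harmonics-to-noise ratio (HNR) of 17.5 dB is the only one above the 15 dB clean-voice threshold. The original gpt-realtime is the roughest on shimmer (8.77%, the highest) and HNR (10.83 dB, the lowest), even though it scores best on jitter and loudness steadiness. gpt-realtime-2 lands between the two on HNR and has the lowest shimmer. All three run hotter on jitter (1.74–2.09%) than the 0.4–1.0% typical of natural human speech.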

Model                        Jitter (%)        Shimmer (%)       Harmonics-to-noise (dB)  Loudness CV
                             Lower is better   Lower is better   Higher is better         Lower is better
                             natural 0.4–1.0%  natural 3–6%      natural > 15 dB (clean)  lower = steadier
OpenAI gpt-realtime          1.74 ★            8.77              10.83                    0.66 ★
OpenAI gpt-realtime-2        2.09              6.4 ★             13.61                    0.67
xAI Grok Voice (think-fast)  1.86              6.87              17.53 ★                  0.72

★ = best on that metric. Lower jitter/shimmer/loudness-CV is better; higher harmonics-to-noise is better.
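The starring rule can be checked mechanically. The sketch below copies the values from the table above and applies the stated direction for each metric; the dictionary layout and function names are illustrative, not the schema of the published JSON file.

```python
# Values copied from the table above; key names are illustrative only.
RESULTS = {
    "OpenAI gpt-realtime": {
        "jitter": 1.74, "shimmer": 8.77, "hnr": 10.83, "loudness_cv": 0.66},
    "OpenAI gpt-realtime-2": {
        "jitter": 2.09, "shimmer": 6.4, "hnr": 13.61, "loudness_cv": 0.67},
    "xAI Grok Voice (think-fast)": {
        "jitter": 1.86, "shimmer": 6.87, "hnr": 17.53, "loudness_cv": 0.72},
}

# Direction of "better" for each metric, as stated in the legend.
HIGHER_IS_BETTER = {
    "jitter": False, "shimmer": False, "hnr": True, "loudness_cv": False}


def best_per_metric(results):
    """Return {metric: model} for the model that would earn the star."""
    best = {}
    for metric, higher in HIGHER_IS_BETTER.items():
        pick = max if higher else min
        best[metric] = pick(results, key=lambda m: results[m][metric])
    return best


def above_clean_hnr(results, threshold_db=15.0):
    """Models whose HNR clears the 15 dB clean-voice threshold."""
    return [m for m, r in results.items() if r["hnr"] > threshold_db]
```

Calling `best_per_metric(RESULTS)` assigns jitter and loudness CV to gpt-realtime, shimmer to gpt-realtime-2 and HNR to Grok Voice, matching the stars; `above_clean_hnr(RESULTS)` returns only Grok Voice.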

Preliminary: 6 varied-prompt clips per model from a single production capture, no confidence interval yet. The numbers are real and reference-free — they move to MEASURED once we have ≥3 runs with a CI and per-language splits. Reproduce from ~/speko-realtime-test/s2s-audio-quality/MULTIMODEL.json.
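For the confidence interval that the MEASURED status requires, one plausible approach is a Student-t interval over per-run means. The page does not say which interval Speko will use, so the sketch below is an assumption throughout; `mean_ci95` is a hypothetical helper that takes one metric value per run and enforces the three-run minimum.

```python
from statistics import mean, stdev

# Two-sided 95% Student-t critical values, keyed by degrees of freedom (n - 1).
T95 = {2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
       6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262}


def mean_ci95(run_values):
    """Return (mean, lower, upper) for a 95% t-interval over per-run values.

    ASSUMPTION: each entry is one run's mean for a single metric and model.
    """
    n = len(run_values)
    if n < 3:
        raise ValueError("need at least 3 runs for a confidence interval")
    df = n - 1
    if df not in T95:
        raise ValueError("t-table in this sketch covers 3 to 10 runs only")
    m = mean(run_values)
    half_width = T95[df] * stdev(run_values) / n ** 0.5
    return m, m - half_width, m + half_width
```

With only three runs the critical value is large (4.303), so the interval stays wide unless run-to-run variation is small; that is one reason a single-run table reads as preliminary.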