Which voice sounds natural — and which lands in the uncanny zone?
Reference-free, waveform-only micro-acoustics — jitter, shimmer, harmonics-to-noise — plus loudness steadiness, measured on each model's own output. No MOS-prediction model, no listening panel; the same Praat / openSMILE spine Speko runs on its text-to-speech benchmark.
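For readers unfamiliar with these measures, the sketch below shows the textbook definitions behind them. It is an illustration only, not Speko's pipeline: it assumes glottal cycle periods, per-cycle peak amplitudes, and per-frame loudness values have already been extracted (for example by Praat), and every function name here is hypothetical. Local jitter and local shimmer are the mean absolute difference between consecutive cycles as a percentage of the mean; HNR follows the usual autocorrelation form; loudness CV is taken to be standard deviation over mean, with the population standard deviation chosen as an assumption.

```python
import math
from statistics import mean, pstdev


def local_perturbation(values):
    """Mean absolute difference between consecutive values, as a % of their mean."""
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return 100.0 * mean(diffs) / mean(values)


def jitter_local(periods_s):
    """Cycle-to-cycle variation in pitch period (percent)."""
    return local_perturbation(periods_s)


def shimmer_local(peak_amplitudes):
    """Cycle-to-cycle variation in peak amplitude (percent)."""
    return local_perturbation(peak_amplitudes)


def hnr_db(r):
    """HNR in dB from r, the normalized autocorrelation peak at the pitch period (0 < r < 1)."""
    if not 0.0 < r < 1.0:
        raise ValueError("r must be strictly between 0 and 1")
    return 10.0 * math.log10(r / (1.0 - r))


def loudness_cv(frame_loudness):
    """Coefficient of variation of per-frame loudness (assumption: population std / mean)."""
    return pstdev(frame_loudness) / mean(frame_loudness)
```

A perfectly periodic, constant-amplitude voice would score 0% jitter and 0% shimmer, which is why very low values can sound as unnatural as very high ones; the natural ranges in the table bracket human speech from both sides.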
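xAI Grok Voice produces the cleanest voice: its harmonics-to-noise ratio (HNR) of 17.5 dB makes it the only model above the 15 dB clean-voice threshold. The original gpt-realtime is the roughest, with the highest shimmer and the lowest HNR. gpt-realtime-2 lands between them on HNR and has the lowest shimmer. All three run above the natural human range on jitter (0.4–1.0%), and all three also sit above the natural 3–6% shimmer range.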
| Model | Jitter (%), lower is better (natural: 0.4–1.0%) | Shimmer (%), lower is better (natural: 3–6%) | Harmonics-to-noise (dB), higher is better (clean: > 15 dB) | Loudness CV, lower is better (lower = steadier) |
|---|---|---|---|---|
| OpenAI gpt-realtime | 1.74 ★ | 8.77 | 10.83 | 0.66 ★ |
| OpenAI gpt-realtime-2 | 2.09 | 6.4 ★ | 13.61 | 0.67 |
| xAI Grok Voice (think-fast) | 1.86 | 6.87 | 17.53 ★ | 0.72 |
★ = best on that metric. Lower jitter/shimmer/loudness-CV is better; higher harmonics-to-noise is better.
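The stars and the natural-range comparisons can be re-derived directly from the table. The snippet below is a minimal sketch that hard-codes the table values and the ranges quoted in the column headers; the dictionary layout and the function names are illustrative assumptions, not the schema of MULTIMODEL.json.

```python
# Values copied from the table above.
ROWS = {
    "OpenAI gpt-realtime": {"jitter": 1.74, "shimmer": 8.77, "hnr": 10.83, "loudness_cv": 0.66},
    "OpenAI gpt-realtime-2": {"jitter": 2.09, "shimmer": 6.4, "hnr": 13.61, "loudness_cv": 0.67},
    "xAI Grok Voice (think-fast)": {"jitter": 1.86, "shimmer": 6.87, "hnr": 17.53, "loudness_cv": 0.72},
}

# Only harmonics-to-noise is "higher is better"; the rest are "lower is better".
HIGHER_IS_BETTER = {"hnr"}

# Natural / clean ranges as quoted in the column headers (none is given for loudness CV).
NATURAL = {
    "jitter": lambda v: 0.4 <= v <= 1.0,
    "shimmer": lambda v: 3.0 <= v <= 6.0,
    "hnr": lambda v: v > 15.0,
}


def best_model(metric):
    """Return the model that earns the star on the given metric."""
    pick = max if metric in HIGHER_IS_BETTER else min
    return pick(ROWS, key=lambda model: ROWS[model][metric])


def in_natural_range(metric):
    """Return the models whose value falls inside the quoted natural range."""
    return [model for model, row in ROWS.items() if NATURAL[metric](row[metric])]


if __name__ == "__main__":
    for metric in ("jitter", "shimmer", "hnr", "loudness_cv"):
        print(f"{metric}: best = {best_model(metric)}")
    for metric in NATURAL:
        print(f"{metric}: in natural range = {in_natural_range(metric)}")
```

With only six clips per model and no confidence interval, small gaps such as 0.66 versus 0.67 on loudness CV should be read as ties rather than rankings.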
Preliminary: 6 varied-prompt clips per model from a single production capture,
with no confidence interval yet. The numbers are real and reference-free; they
move to MEASURED once we have ≥3 runs with a CI and per-language splits.
Reproduce from ~/speko-realtime-test/s2s-audio-quality/MULTIMODEL.json.