TTS · Audio Quality

Loud, clean, and natural — or too clean?

Three pure-DSP measurements on the same audio the stability probe collected: integrated loudness consistency, vocal micro-imperfections (the uncanny-zone scatter), and basic mastering hygiene. Repetition-loop detection ships once the analyzer threshold is tuned.

LUFS Variance

Integrated loudness (EBU R128) across 50 short utterances per provider. Tells you whether downstream playback needs per-provider normalization.

15.6 dB spread between the loudest and quietest providers — every production app integrating these will need a per-provider gain stage.
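A per-provider gain stage can be as small as a lookup of offsets from a target loudness. The sketch below is illustrative only: the provider names and LUFS readings are hypothetical (picked so the spread matches the 15.6 dB figure), and the -16 LUFS target is an assumption, not something this benchmark prescribes.

```python
# Hypothetical integrated-loudness readings in LUFS, one per provider.
# Placeholder values, not the measured results from this page.
measured_lufs = {
    "provider_a": -14.2,
    "provider_b": -23.0,
    "provider_c": -29.8,
}

TARGET_LUFS = -16.0  # assumed playback target; choose your own


def loudness_spread(readings):
    """Difference in dB between the loudest and quietest reading."""
    return max(readings.values()) - min(readings.values())


def gain_stage(readings, target_lufs):
    """Map each provider to (gain in dB, linear sample multiplier)."""
    gains = {}
    for name, lufs in readings.items():
        gain_db = target_lufs - lufs
        gains[name] = (gain_db, 10 ** (gain_db / 20))
    return gains


if __name__ == "__main__":
    print(f"spread: {loudness_spread(measured_lufs):.1f} dB")
    for name, (db, mult) in gain_stage(measured_lufs, TARGET_LUFS).items():
        print(f"{name}: {db:+.1f} dB (x{mult:.3f})")
```

Positive gains on quiet providers raise peaks too, so a real gain stage would normally be followed by a limiter or a true-peak check.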

Uncanny Zone — Jitter vs Shimmer

Vocal micro-perturbations on voiced segments. The reference box is the 5th–95th percentile of 50 real human audiobook recordings (LibriSpeech dev-clean-2) measured with this identical pipeline — jitter 1.37–3.81%, shimmer 0.71–1.27 dB.

8 of 12 providers sit inside the human reference band. The human mean (green diamond at jitter 2.29% × shimmer 0.96 dB) is the visual anchor for "typical human" on this metric.
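For readers who want to place their own audio on this scatter, here is a minimal sketch of the two metrics and the band test. It assumes the common "local" definitions of jitter and shimmer and takes already-extracted glottal periods and per-cycle peak amplitudes as input; the page does not say which variants or which pitch tracker its pipeline uses, so numbers from this sketch may not line up exactly with the chart.

```python
import math

# Reference band from the text: 5th to 95th percentile of the human recordings.
JITTER_BAND_PCT = (1.37, 3.81)
SHIMMER_BAND_DB = (0.71, 1.27)


def local_jitter_pct(periods):
    """Mean absolute difference of consecutive periods, as % of the mean period."""
    diffs = [abs(b - a) for a, b in zip(periods, periods[1:])]
    return 100.0 * (sum(diffs) / len(diffs)) / (sum(periods) / len(periods))


def local_shimmer_db(amplitudes):
    """Mean absolute level change between consecutive cycle amplitudes, in dB."""
    steps = [abs(20.0 * math.log10(b / a))
             for a, b in zip(amplitudes, amplitudes[1:])]
    return sum(steps) / len(steps)


def in_human_band(jitter_pct, shimmer_db):
    """True when the point falls inside the human reference box."""
    return (JITTER_BAND_PCT[0] <= jitter_pct <= JITTER_BAND_PCT[1]
            and SHIMMER_BAND_DB[0] <= shimmer_db <= SHIMMER_BAND_DB[1])


if __name__ == "__main__":
    # The human mean quoted in the text sits inside the box.
    print(in_human_band(2.29, 0.96))
```

A provider landing below both lower bounds is the "too clean" case: less cycle-to-cycle variation than the human recordings show.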

Codec Hygiene

Basic mastering health on the long-form drift audio — clipping rate, DC offset, intersample true peak (dBTP), crest factor (peak/RMS in dB).

Provider               Clipping %  DC offset  True peak (dBTP)  Crest (dB)
ElevenLabs v3          0.0098      0.00077    -0.09             13.54
Cartesia Sonic         0.0000      0.00005    -0.62             16.39
MiniMax Speech 2.6 HD  0.0000      0.00057    -1.93             14.95
Deepgram Aura 2        0.0000      0.00000    -3.21             22.41
Inworld                0.0000      0.00003    -3.73             17.29
xAI TTS                0.0000      0.00000    -3.91             18.25
AWS Polly Generative   0.0000      0.00004    -5.89             17.86
Gradium                0.0000      0.00011    -7.24             18.47
OpenAI 4o-mini TTS     0.0000      0.00009    -10.77            19.41
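Three of the four columns are simple per-sample statistics. The sketch below shows one way to compute them on float samples in [-1, 1]; the 0.999 clipping threshold is an assumption (the page does not state how a clipped sample is defined), and true peak is left out because it needs an oversampled signal, as described in ITU-R BS.1770.

```python
import math


def hygiene(samples, clip_level=0.999):
    """Clipping %, DC offset, and crest factor for float samples in [-1, 1].

    Assumes a non-empty, non-silent signal (RMS must be above zero).
    """
    n = len(samples)
    clipped = sum(1 for s in samples if abs(s) >= clip_level)
    peak = max(abs(s) for s in samples)
    rms = math.sqrt(sum(s * s for s in samples) / n)
    return {
        "clipping_pct": 100.0 * clipped / n,
        "dc_offset": sum(samples) / n,
        "crest_db": 20.0 * math.log10(peak / rms),
    }


if __name__ == "__main__":
    # One second of a 1 kHz sine at half scale, 48 kHz sample rate.
    tone = [0.5 * math.sin(2 * math.pi * 1000 * i / 48000) for i in range(48000)]
    print(hygiene(tone))
```

A pure sine has a crest factor of about 3 dB, so the 13 to 22 dB values in the table reflect speech, where short peaks sit well above the average level.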

2 of 9 providers (ElevenLabs v3 and Cartesia Sonic) ship audio above the −1 dBTP safety margin, so those mixes risk clipping under Apple or Spotify loudness normalization.
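The margin check is easy to reproduce from the true-peak column. This sketch uses the table values as-is and also reports how much attenuation would bring each hot provider back to the margin; treating a fixed trim as the remedy is an assumption here, and a limiter would be the other usual option.

```python
# True-peak values in dBTP, copied from the Codec Hygiene table.
TRUE_PEAK_DBTP = {
    "ElevenLabs v3": -0.09,
    "Cartesia Sonic": -0.62,
    "MiniMax Speech 2.6 HD": -1.93,
    "Deepgram Aura 2": -3.21,
    "Inworld": -3.73,
    "xAI TTS": -3.91,
    "AWS Polly Generative": -5.89,
    "Gradium": -7.24,
    "OpenAI 4o-mini TTS": -10.77,
}

SAFETY_MARGIN_DBTP = -1.0


def over_margin(peaks, margin=SAFETY_MARGIN_DBTP):
    """Providers whose true peak exceeds the margin, hottest first."""
    hot = [name for name, peak in peaks.items() if peak > margin]
    return sorted(hot, key=lambda name: peaks[name], reverse=True)


def trim_db(peak, margin=SAFETY_MARGIN_DBTP):
    """Gain change in dB (zero or negative) that brings a peak down to the margin."""
    return min(0.0, margin - peak)


if __name__ == "__main__":
    for name in over_margin(TRUE_PEAK_DBTP):
        print(f"{name}: trim {trim_db(TRUE_PEAK_DBTP[name]):.2f} dB")
```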