TTS · Stability

Does the voice stay the same?

Identity drift over a sustained single take (~2 minutes), run-to-run reproducibility on a fixed prompt, and the prosodic vocabulary the model actually uses across a real corpus.

v1 note: each provider is measured on one default voice. Numbers are per-voice, not whole-catalog — multi-voice sampling lands in v1.1.

Identity Drift Slope

Slope, per minute, of the cosine similarity between the speaker embedding and the embedding at the start of a long take.

Cartesia Sonic drifts hardest in English: β = -0.303 ± 0.043/min over 5 takes (mean R² 0.81).
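A minimal sketch of how a drift slope β and its R² could be computed for one take, assuming speaker embeddings are sampled at known times and compared to the first one by cosine similarity, with an ordinary least-squares line fitted through the result. The function names, the sampling scheme, and reading the ± as a sample standard deviation across takes are all assumptions for illustration, not the benchmark's actual pipeline.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def drift_slope(embeddings, times_min):
    """Fit similarity-to-start against time; return (slope per minute, R^2).

    embeddings: speaker embeddings sampled along one take (hypothetical input).
    times_min:  sample times in minutes, same length as embeddings.
    """
    ref = embeddings[0]
    sims = np.array([cosine(ref, e) for e in embeddings])
    t = np.asarray(times_min, dtype=float)
    slope, intercept = np.polyfit(t, sims, 1)  # least-squares line
    pred = slope * t + intercept
    ss_res = float(np.sum((sims - pred) ** 2))
    ss_tot = float(np.sum((sims - sims.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0
    return float(slope), r2

def summarize_takes(slopes):
    """Mean slope and sample standard deviation across takes (assumed meaning of the ±)."""
    s = np.asarray(slopes, dtype=float)
    return float(s.mean()), float(s.std(ddof=1))

# Synthetic take: a 2-D "embedding" that rotates slowly away from its start.
times = [0.0, 0.5, 1.0, 1.5, 2.0]
take = [np.array([np.cos(0.1 * t), np.sin(0.1 * t)]) for t in times]
beta, r2 = drift_slope(take, times)
```

A negative slope means similarity to the opening of the take falls as the take goes on; R² indicates how well a straight line describes that fall.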

Run-to-Run Variance

Standard deviation of pairwise speaker-embedding similarity across 20 runs of the same prompt.

1 of 12 shows σ < 0.001 (AWS Polly Generative), which points to caching or deterministic decoding rather than measured variability. AWS Polly Generative is the most consistent at σ = 0.000; OpenAI 4o-mini TTS is the most variable at σ = 0.073. σ is taken over the C(20,2) = 190 pairs of runs, which are not independent, so read it as a relative ordering, not an absolute per-take spread.
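The pairwise calculation can be sketched as follows, assuming σ is the population standard deviation of cosine similarity over every unordered pair of runs. The function name and the 0.001 flag threshold as a parameter are illustrative; the similarity measure is an assumption, since the text above only states that pairs are used.

```python
from itertools import combinations
import numpy as np

def run_to_run_sigma(embeddings, deterministic_below=0.001):
    """Std of cosine similarity over all unordered pairs of runs.

    Returns (sigma, number_of_pairs, looks_deterministic).
    Each run appears in n-1 pairs, so the pairs are dependent.
    """
    sims = []
    for a, b in combinations(embeddings, 2):
        sims.append(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    sigma = float(np.std(sims))  # population std (ddof=0), an assumption
    return sigma, len(sims), sigma < deterministic_below

# 20 identical runs, as a cached or deterministic system would produce.
same = [np.array([1.0, 2.0, 3.0]) for _ in range(20)]
sigma_same, n_pairs, flagged = run_to_run_sigma(same)

# 20 noisy runs around the same vector (fixed seed for repeatability).
rng = np.random.default_rng(0)
noisy = [np.array([1.0, 2.0, 3.0]) + rng.normal(0, 0.5, 3) for _ in range(20)]
sigma_noisy, _, _ = run_to_run_sigma(noisy)
```

Because every run contributes to 19 of the 190 pairs, one outlier run shifts many similarities at once, which is why the value is better used to rank systems than as a per-take spread.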

DS-WED (Prosody Diversity)

Pairwise weighted edit distance between the 50 prosodic-token sequences per voice (0..1, higher = more variety).

ElevenLabs v3 reshapes prosody most (score 0.75); Gradium is the most templated (0.65). That is a gap of 0.10, or about 15% relative to Gradium, on the same 50 sentences.
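To illustrate the shape of such a score, here is a generic sketch: a weighted Levenshtein distance between token sequences, normalized to 0..1 and averaged over all pairs. The edit costs, the normalization by the longer sequence, and the integer token IDs are assumptions for illustration; the actual DS-WED weighting and tokenizer are not specified here and may differ.

```python
from itertools import combinations

def weighted_edit_distance(a, b, sub=1.0, ins=1.0, dele=1.0):
    """Levenshtein distance with configurable per-operation costs."""
    prev = [j * ins for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * dele]
        for j in range(1, len(b) + 1):
            cost = 0.0 if a[i - 1] == b[j - 1] else sub
            cur.append(min(prev[j] + dele,       # delete a[i-1]
                           cur[j - 1] + ins,     # insert b[j-1]
                           prev[j - 1] + cost))  # match or substitute
        prev = cur
    return prev[-1]

def diversity_score(seqs, sub=1.0, ins=1.0, dele=1.0):
    """Mean pairwise distance, scaled to 0..1 (higher = more variety)."""
    worst = max(sub, ins, dele)
    total, n = 0.0, 0
    for a, b in combinations(seqs, 2):
        longest = max(len(a), len(b))
        if longest:
            d = weighted_edit_distance(a, b, sub, ins, dele)
            total += d / (worst * longest)  # upper bound on the distance
        n += 1
    return total / n if n else 0.0

# Hypothetical prosodic-token sequences as integer IDs.
templated = diversity_score([[1, 2, 3], [1, 2, 3]])
varied = diversity_score([[1, 2, 3], [4, 5, 6]])
partial = diversity_score([[1, 2, 3], [1, 2, 4]])
```

Identical sequences score 0 and sequences with no tokens in common score 1, so a voice that reuses the same contour on every sentence lands near the bottom of the range.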