Does the voice stay the same?
Identity drift over a sustained single take (~2 minutes), run-to-run reproducibility on a fixed prompt, and the prosodic vocabulary the model actually uses across a real corpus.
v1 note: each provider is measured on one default voice. Numbers are per-voice, not whole-catalog — multi-voice sampling lands in v1.1.
Identity Drift Slope
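Slope β (per minute) of the cosine similarity between the speaker embedding at the start of a long take and the embedding at later points in the same take. A negative β means the voice moves away from its starting identity.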
Cartesia Sonic drifts hardest in English: β = -0.303 ± 0.043/min over 5 takes (mean R² 0.81).
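As a rough illustration of how a slope like β and its R² can be obtained, the sketch below regresses cosine-similarity-to-start on elapsed minutes with ordinary least squares. The function names (`cosine`, `drift_slope`), the use of fixed time windows, and the plain OLS fit are assumptions made for this example; the benchmark's actual windowing and fitting procedure is not specified here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def drift_slope(times_min, embeddings):
    """Hypothetical helper: fit similarity-to-start against time.

    times_min  -- window start times in minutes (first entry is the reference)
    embeddings -- one speaker embedding per window, same order as times_min
    Returns (beta, r2): OLS slope per minute and coefficient of determination.
    """
    ref = embeddings[0]
    sims = [cosine(ref, e) for e in embeddings]
    n = len(sims)
    mean_t = sum(times_min) / n
    mean_s = sum(sims) / n
    sxx = sum((t - mean_t) ** 2 for t in times_min)
    sxy = sum((t - mean_t) * (s - mean_s) for t, s in zip(times_min, sims))
    beta = sxy / sxx
    intercept = mean_s - beta * mean_t
    ss_res = sum((s - (intercept + beta * t)) ** 2
                 for t, s in zip(times_min, sims))
    ss_tot = sum((s - mean_s) ** 2 for s in sims)
    # R² is undefined when the similarities do not vary at all.
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return beta, r2
```

Under this reading, the reported figure would be the mean of `beta` across the 5 takes, with the ± term describing spread between takes.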
Run-to-Run Variance
Standard deviation (σ) of pairwise speaker-embedding comparisons across 20 runs of the same prompt.
1 of 12 shows σ < 0.001 (AWS Polly Generative), which points to caching or deterministic decoding rather than measured variability. AWS Polly Generative is the most consistent at σ = 0.000; OpenAI 4o-mini TTS is the most variable at σ = 0.073. σ is taken over the C(20,2) = 190 dependent pairs, so read it as a relative ordering, not an absolute per-take spread.
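The pair count and the zero-variance case can be made concrete with a small sketch. It assumes σ is the population standard deviation of cosine similarities over all unordered pairs of runs; that reading, and the helper name `run_to_run_sigma`, are assumptions for illustration rather than the benchmark's documented procedure.

```python
import math
from itertools import combinations
from statistics import pstdev


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def run_to_run_sigma(embeddings):
    """Hypothetical helper: spread of pairwise similarity across runs.

    embeddings -- one speaker embedding per run of the same prompt
    Returns (sigma, n_pairs). Every run appears in n - 1 pairs, so the
    pairs are statistically dependent; sigma is best used for ranking.
    """
    sims = [cosine(a, b) for a, b in combinations(embeddings, 2)]
    return pstdev(sims), len(sims)
```

With 20 runs this yields `math.comb(20, 2) == 190` pairs. If a provider returns cached or deterministically decoded audio, every embedding is identical, all pairwise similarities coincide, and σ collapses to zero regardless of how variable the model would otherwise be.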
DS-WED (Prosody Diversity)
Pairwise weighted edit distance between the 50 prosodic-token sequences per voice (0..1, higher = more variety).
ElevenLabs v3 reshapes prosody most (score 0.75); Gradium is the most templated (0.65). That is a gap of about 15% relative to Gradium's score, on the same 50 sentences.
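To show how a pairwise edit-distance score can land in the 0..1 range, here is a minimal sketch. It assumes a Levenshtein-style distance with a pluggable substitution cost, normalized by the longer sequence and averaged over all pairs. The actual DS-WED token weights and normalization are not given here, so `weighted_edit_distance`, `diversity_score`, and the default unit costs are illustrative assumptions only.

```python
from itertools import combinations


def unit_sub_cost(x, y):
    """Default substitution cost: 0 for a match, 1 for a mismatch."""
    return 0.0 if x == y else 1.0


def weighted_edit_distance(a, b, sub_cost=unit_sub_cost, indel_cost=1.0):
    """Edit distance between two token sequences with configurable costs."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        cur = [i * indel_cost]
        for j, y in enumerate(b, start=1):
            cur.append(min(
                prev[j] + indel_cost,          # delete x
                cur[j - 1] + indel_cost,       # insert y
                prev[j - 1] + sub_cost(x, y),  # substitute or match
            ))
        prev = cur
    return prev[-1]


def diversity_score(sequences, **costs):
    """Mean length-normalized distance over all unordered pairs.

    Stays within 0..1 as long as indel_cost is 1 and sub_cost never exceeds 1.
    """
    if len(sequences) < 2:
        raise ValueError("need at least two sequences")
    dists = []
    for a, b in combinations(sequences, 2):
        longest = max(len(a), len(b))
        d = weighted_edit_distance(a, b, **costs)
        dists.append(d / longest if longest else 0.0)
    return sum(dists) / len(dists)
```

Under these assumptions, 50 sequences per voice give 1,225 pairs; a voice that renders every sentence with the same prosodic token pattern scores near 0, while one that never repeats a pattern scores near 1.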