TTS · Behavioral

What does the model actually do with text?

Two metrics queued for this territory: voice-clustering across a provider's catalog ("eleven voices, four personalities") and punctuation fidelity (pause length after a comma vs em-dash). Both need probe work that isn't in v1.0 yet.

coming soon

Both metrics are spec'd and the analyzer code is already on Modal. They unlock once the upstream gaps below are closed.

Voice clustering

Catalog every voice each provider offers, render the same 50-prompt set in each, cluster on prosody features. Tells you whether eleven named voices are eleven personalities or four wigs on one model.

Activates with v1.1 multi-voice probe (scrapes /v1/voices)
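
As a rough illustration of the clustering step described above, here is a minimal sketch, not the analyzer that lives on Modal. It assumes each voice has already been reduced to one prosody feature vector averaged over the prompt set. The feature choice (mean F0, F0 range, speaking rate), the voice names, the numbers, and the distance threshold are all hypothetical. It z-scores each feature and then groups voices with simple single-linkage thresholding.

```python
from math import sqrt
from statistics import mean, pstdev


def zscore_columns(rows):
    """Standardize each feature column so no single unit dominates the distance."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # constant column: avoid dividing by zero
    return [[(x - m) / s for x, m, s in zip(row, mus, sds)] for row in rows]


def cluster_voices(features, threshold):
    """Single-linkage grouping: voices closer than `threshold` share a cluster.

    `features` maps a voice name to its prosody vector (hypothetical layout:
    [mean F0 in Hz, F0 range in semitones, speaking rate in syllables/s]).
    """
    names = list(features)
    z = zscore_columns([features[n] for n in names])
    parent = list(range(len(names)))  # union-find forest over voice indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            dist = sqrt(sum((a - b) ** 2 for a, b in zip(z[i], z[j])))
            if dist < threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i, name in enumerate(names):
        groups.setdefault(find(i), []).append(name)
    return sorted(groups.values())


# Made-up numbers for five hypothetical voices.
example = {
    "voice_a": [120.0, 4.0, 4.5],
    "voice_b": [121.0, 4.1, 4.6],
    "voice_c": [210.0, 7.0, 5.5],
    "voice_d": [209.0, 6.9, 5.4],
    "voice_e": [160.0, 2.0, 3.0],
}
print(cluster_voices(example, threshold=0.5))
```

With these invented inputs, five named voices collapse into three groups. A real run would need a defensible threshold or a proper clustering method, and both are open choices here.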

Punctuation fidelity

Same words, three punctuation variants: period vs comma+! vs em-dash. Measure the pause length and the terminal pitch slope at each mark. Reveals which providers ignore commas entirely.

Activates with v1.0.1 once whisperx + pyannote stop fighting
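
The two measurements named above can be sketched as follows. This is an illustration under stated assumptions, not the queued analyzer. It assumes word timings arrive as dicts with `word`, `start`, and `end` keys in seconds (an assumption about the aligner output), that the pitch track is a list of frame times plus F0 values in Hz, and that unvoiced frames are marked as 0. The 0.3 s window is an arbitrary choice.

```python
def pause_after(words, index):
    """Silence in seconds between the word at `index` and the next word."""
    return words[index + 1]["start"] - words[index]["end"]


def terminal_pitch_slope(times, f0_hz, window_s=0.3):
    """Least-squares slope (Hz per second) of voiced F0 over the last `window_s` seconds.

    `times` and `f0_hz` cover the clause that ends at the punctuation mark.
    Frames with F0 <= 0 are treated as unvoiced and skipped (assumed convention).
    """
    t_end = times[-1]
    pts = [(t, f) for t, f in zip(times, f0_hz) if f > 0 and t >= t_end - window_s]
    if len(pts) < 2:
        raise ValueError("need at least two voiced frames in the window")
    t_mean = sum(t for t, _ in pts) / len(pts)
    f_mean = sum(f for _, f in pts) / len(pts)
    num = sum((t - t_mean) * (f - f_mean) for t, f in pts)
    den = sum((t - t_mean) ** 2 for t, _ in pts)
    if den == 0:
        raise ValueError("voiced frames share one timestamp")
    return num / den


# Made-up timings for one rendering of a two-word prompt.
words = [
    {"word": "Wait", "start": 0.0, "end": 0.4},
    {"word": "now", "start": 0.65, "end": 1.0},
]
times = [0.0, 0.1, 0.2, 0.3]
f0 = [200.0, 190.0, 180.0, 170.0]

print(round(pause_after(words, 0), 3))
print(round(terminal_pitch_slope(times, f0), 1))
```

Running the same pair of functions on the period, comma, and em-dash renderings of a prompt would give one pause and one slope per variant. A provider whose comma pause is indistinguishable from no pause at all would be a candidate for "ignores commas".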
