What does the model actually do with text?
Two metrics queued for this territory: voice-clustering across a provider's catalog ("eleven voices, four personalities") and punctuation fidelity (pause length after a comma vs em-dash). Both need probe work that isn't in v1.0 yet.
Both metrics are spec'd and the analyzer code is already on Modal. They unlock once the upstream gaps below are closed.
-
Voice clustering
Catalog every voice each provider offers, render the same 50-prompt set in each, cluster on prosody features. Tells you whether eleven named voices are eleven personalities or four wigs on one model.
Activates with v1.1 multi-voice probe (scrapes /v1/voices)
-
Punctuation fidelity
Same words, three punctuation marks: period vs comma+! vs em-dash. Measure pause length and terminal pitch slope. Reveals which providers ignore commas entirely.
Activates with v1.0.1 once whisperx + pyannote stop fighting