STT · Stability

Does the model crack on long audio?

Production buyers feed STT long streams, not 30-second clips. The first question on long audio isn't accuracy — it's whether the provider transcribes the whole clip or silently drops the tail.

Long-form handling

Provider	WER	Truncated clips
ElevenLabs Scribe v2	8.6%	0/7
AssemblyAI Universal-3 Pro	8.7%	0/7
Deepgram Nova-3	11.4%	0/7
Cartesia Ink-2	14.1%	0/7
OpenAI GPT-4o Transcribe	34.9%	⚠ 3/7

trial-long · 7 English clips up to ~49s · "truncated" = output < 70% of reference length · measured 2026-06-01
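
The "truncated" rule in the note above can be sketched as a simple length-ratio check. This is a minimal sketch, not the benchmark's actual harness: it assumes "length" means whitespace-separated word count (the note does not say whether words or characters are used), and the names `is_truncated` and `count_truncated` are hypothetical.

```python
def is_truncated(hypothesis: str, reference: str, threshold: float = 0.70) -> bool:
    """Flag a transcript whose length falls below `threshold` of the reference.

    Assumption: length is measured in whitespace-separated words.
    """
    ref_words = len(reference.split())
    if ref_words == 0:
        return False  # empty reference: nothing to compare against
    return len(hypothesis.split()) / ref_words < threshold


def count_truncated(pairs) -> int:
    """Count (hypothesis, reference) pairs flagged as truncated."""
    return sum(is_truncated(hyp, ref) for hyp, ref in pairs)
```

A ratio check like this catches a dropped tail even when the words that were transcribed are all correct, which is why it is reported separately from WER.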

Four of five providers transcribe long clips cleanly. OpenAI GPT-4o Transcribe silently truncates: it dropped the tail on 3 of 7 clips, which is what drives its 34.9% WER (the others sit between 8.6% and 14.1%). A median WER can hide this entirely, because with only 3 of 7 clips affected the median clip is still an intact one.
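
To see why a median can mask a minority of truncated clips, here is a small illustration. The per-clip values are invented for the example and are not the benchmark's per-clip results. How the page aggregates its 34.9% figure is not stated either, so the mean is used here only as one plausible aggregate.

```python
from statistics import mean, median

# Hypothetical per-clip WERs for one provider across 7 clips (NOT measured data):
# four intact clips followed by three clips with a dropped tail.
per_clip_wer = [0.05, 0.06, 0.07, 0.08, 0.60, 0.62, 0.65]

median_wer = median(per_clip_wer)  # 4th-ranked clip of 7, which is an intact one
mean_wer = mean(per_clip_wer)      # pulled up by the three truncated clips

print(f"median WER: {median_wer:.1%}")
print(f"mean WER:   {mean_wer:.1%}")
```

With fewer than half the clips affected, the median lands on an intact clip, so only an aggregate over all clips or an explicit truncation count exposes the failure.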