STT · Stability

Does the model crack on long audio?

Production buyers feed STT long streams, not 30-second clips. On long audio the first question isn't accuracy; it's whether the provider transcribes the whole clip or silently drops the tail.

Long-form handling

Provider                      WER      Truncated clips
ElevenLabs Scribe v2          8.6%     0/7
AssemblyAI Universal-3 Pro    8.7%     0/7
Deepgram Nova-3               11.4%    0/7
Cartesia Ink-2                14.1%    0/7
OpenAI GPT-4o Transcribe      34.9%    ⚠ 3/7

trial-long · 7 English clips up to ~49s · "truncated" = output < 70% of reference length · measured 2026-06-01
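
The truncation rule in the note above can be sketched in a few lines. This is a minimal illustration and not the benchmark's actual harness: the function names are hypothetical, and measuring length in whitespace-separated words is an assumption, since the note does not say whether length means words, characters, or tokens.

```python
def is_truncated(hypothesis: str, reference: str, threshold: float = 0.70) -> bool:
    """Flag a transcript shorter than `threshold` of the reference length.

    Assumption: length is counted in whitespace-separated words.
    The benchmark may count characters or tokens instead.
    """
    ref_len = len(reference.split())
    if ref_len == 0:
        return False  # no reference to compare against
    return len(hypothesis.split()) / ref_len < threshold


def count_truncated(pairs) -> int:
    """Count truncated transcripts in an iterable of (hypothesis, reference) pairs."""
    return sum(is_truncated(hyp, ref) for hyp, ref in pairs)


# Synthetic 10-word reference, not benchmark data.
reference = "one two three four five six seven eight nine ten"
full = reference
tail_dropped = "one two three four five six"  # 6 of 10 words survive

truncated_count = count_truncated([(full, reference), (tail_dropped, reference)])
```

A check like this runs independently of WER, so a dropped tail is reported as a truncation count rather than being folded into an accuracy number.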

Four of five providers transcribe the long clips in full. OpenAI GPT-4o Transcribe silently truncates: it dropped the tail on 3 of 7 clips, which is what drives its 34.9% WER (the others sit between 8.6% and 14.1%). A median WER can hide this entirely, because more than half of the clips are still transcribed in full.
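
To see why a dropped tail inflates an averaged WER while a median can stay clean, here is a small sketch using synthetic transcripts. The `wer` helper is a plain word-level edit distance written for illustration, and the per-clip numbers come from made-up text, not from the measurements in the table.

```python
from statistics import mean, median


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # reference word missing from hypothesis
                curr[j - 1] + 1,         # extra word in hypothesis
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1] / len(ref)


# Synthetic 100-word reference; a "truncated" output keeps only the first 40 words.
reference = " ".join(f"w{i}" for i in range(100))
intact = reference
tail_dropped = " ".join(reference.split()[:40])

# Seven hypothetical clips: four intact, three with the tail dropped.
per_clip = [wer(reference, intact)] * 4 + [wer(reference, tail_dropped)] * 3

mean_wer = mean(per_clip)      # pulled up by the three truncated clips
median_wer = median(per_clip)  # the middle clip is an intact one
```

Every word in a dropped tail counts as an error, so three truncated clips out of seven dominate the mean, while the median lands on an intact clip and reports nothing wrong.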