Does the model crack on long audio?
Production buyers feed STT long streams, not 30-second clips. The first question on long audio isn't accuracy — it's whether the provider transcribes the whole clip or silently drops the tail.
Long-form handling
| Provider | WER | Truncated clips |
|---|---|---|
| ElevenLabs Scribe v2 | 8.6% | 0/7 |
| AssemblyAI Universal-3 Pro | 8.7% | 0/7 |
| Deepgram Nova-3 | 11.4% | 0/7 |
| Cartesia Ink-2 | 14.1% | 0/7 |
| OpenAI GPT-4o Transcribe | 34.9% | ⚠ 3/7 |
trial-long · 7 English clips up to ~49s · "truncated" = output < 70% of reference length · measured 2026-06-01
Four of five providers transcribe long clips cleanly. OpenAI GPT-4o Transcribe silently truncates: it dropped the tail on 3 of 7 clips, which is what drives its 34.9% WER (the others sit at 8–14%). A median WER hides this entirely, since most clips (4 of 7) were still transcribed in full.
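The "output < 70% of reference length" rule from the table note can be sketched as a simple length-ratio check. This is a minimal illustration, not the benchmark's actual harness: it assumes "length" means whitespace-separated word count (the note does not say whether words or characters are counted), and the function names are hypothetical.

```python
def length_ratio(hypothesis: str, reference: str) -> float:
    """Word-count ratio of a provider's transcript to the reference transcript.

    Assumption: length is measured in whitespace-separated words.
    """
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference transcript is empty")
    return len(hypothesis.split()) / len(ref_words)


def is_truncated(hypothesis: str, reference: str, threshold: float = 0.70) -> bool:
    """Flag a transcript whose length falls below `threshold` of the reference."""
    return length_ratio(hypothesis, reference) < threshold


def count_truncated(pairs, threshold: float = 0.70) -> int:
    """Count truncated transcripts in an iterable of (hypothesis, reference) pairs."""
    return sum(is_truncated(hyp, ref, threshold) for hyp, ref in pairs)
```

Reporting this count next to WER, as the table does, keeps a dropped tail visible even when the aggregate accuracy figure looks acceptable.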