STT Benchmark
Speech-to-Text
Accuracy, speed, and cost on clean English audio.
| Provider | Word error rate | p50 latency | Cost |
|---|---:|---:|---:|
| ElevenLabs Scribe v2 Realtime | 3.4% | 1.0 s | $0.0067/min |
| Alibaba qwen3-asr-flash | 3.5% | 0.6 s | $0.0021/min |
| AssemblyAI Universal-3 Pro | 5.1% | 4.2 s | $0.0067/min |
| Google Cloud Chirp 2 | 5.4% | 4.5 s | $0.024/min |
| ElevenLabs Scribe v1 | 5.4% | 1.1 s | $0.0067/min |
| Google Gemini 2.5 Flash (STT) | 6.0% | 2.5 s | $0.0002/min |
| AssemblyAI Universal-2 | 6.2% | 3.7 s | $0.0062/min |
| Deepgram Nova-3 | 7.8% | 2.4 s | $0.0043/min |
| OpenAI gpt-4o-mini-transcribe | 8.2% | 2.0 s | $0.0030/min |
| OpenAI whisper-1 | 11.6% | 2.6 s | $0.0060/min |
| xAI Grok STT | 16.8% | 0.9 s | $0.0017/min |
| OpenAI gpt-4o-transcribe | 19.4% | 1.9 s | $0.0060/min |
Clean English audio · word error rate, p50 latency, cost per minute · lower is better
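The page does not say exactly how the three metrics were computed. As a rough guide to reading the table, here is a minimal Python sketch of the usual definitions: word error rate as word-level Levenshtein distance divided by the number of reference words, p50 latency as the median of per-request latencies, and cost as the per-minute rate times audio duration. It assumes transcripts are only lowercased and split on whitespace; the benchmark's own text normalization (punctuation, numbers, casing) is not stated. The function names `word_error_rate`, `p50_latency`, and `cost_for_hours` are hypothetical.

```python
from statistics import median


def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words.

    Assumption: text is only lowercased and split on whitespace; the
    benchmark's actual normalization is not stated.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # Word-level Levenshtein DP, one row at a time: at the start of
    # iteration i, prev[j] is the edit distance between the first i-1
    # reference words and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word dropped
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution (or exact match)
            )
        prev = curr
    return prev[-1] / len(ref)


def p50_latency(latencies_s: list[float]) -> float:
    """Median per-request latency in seconds.

    For an even number of samples, statistics.median averages the two
    middle values.
    """
    return median(latencies_s)


def cost_for_hours(usd_per_min: float, hours: float) -> float:
    """Total cost in USD for a given number of audio hours at a per-minute rate."""
    return usd_per_min * hours * 60


if __name__ == "__main__":
    ref = "the quick brown fox jumps over the lazy dog"
    hyp = "the quick brown fox jumped over a lazy dog"
    print(f"{word_error_rate(ref, hyp):.1%}")  # 2 substitutions / 9 words → 22.2%
    print(p50_latency([0.9, 1.0, 1.2, 4.1, 1.1]))  # → 1.1 (the 4.1 s outlier does not move it)
    print(f"${cost_for_hours(0.0067, 1000):,.2f}")  # 1,000 h at $0.0067/min → $402.00
```

Because insertions are counted against the reference length, WER can exceed 100% for a hypothesis with many extra words. A p50 figure describes the typical request, not the slow tail, so two providers with similar p50 latency can still behave differently under load.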