How this benchmark is measured — and who measures it.
Every number on the /stt pages comes from one open harness, run against each provider's native API on recognized public audio, scored by word error rate. This page documents that harness in full, states plainly that Speko has a commercial interest, and lists exactly what v1 does and does not claim. If a metric is not yet measured, we say so and suppress it — we do not fill the gap with a guess.
The harness
The probe loads a fixed manifest of audio clips with human reference transcripts, sends each clip to every provider, and scores the returned text against the reference. Each provider is reached through its own native API with our own key — not through the Speko gateway — so Speko is not in the measurement path.
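The loop described above can be sketched roughly as follows. This is a minimal illustration, not the speko-bench-cli source: the `Clip`, `Provider`, and `ClipResult` shapes and the `runProbe` name are hypothetical, and a real provider adapter would call that provider's native transcription endpoint with its own key.

```typescript
// Hypothetical shapes; the real harness types are not shown on this page.
interface Clip {
  id: string;
  corpus: string;
  audioPath: string;
  reference: string; // human reference transcript
}

interface Provider {
  name: string;
  // Assumed adapter: wraps one provider's native transcription API.
  transcribe(audioPath: string): Promise<string>;
}

interface ClipResult {
  provider: string;
  clipId: string;
  corpus: string;
  reference: string;
  hypothesis: string;
}

// Send every clip in the fixed manifest to every provider and keep the
// reference/hypothesis pair so the run can be scored (and re-scored) later.
async function runProbe(manifest: Clip[], providers: Provider[]): Promise<ClipResult[]> {
  const results: ClipResult[] = [];
  for (const provider of providers) {
    for (const clip of manifest) {
      const hypothesis = await provider.transcribe(clip.audioPath);
      results.push({
        provider: provider.name,
        clipId: clip.id,
        corpus: clip.corpus,
        reference: clip.reference,
        hypothesis,
      });
    }
  }
  return results;
}
```

Keeping scoring out of this loop is a deliberate choice in the sketch: the loop only collects transcripts, so a later change to the scorer never requires new API calls.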
- Driver
- speko-bench-cli (TypeScript / Bun). Same clips, same parameters, every provider; re-runnable on each new model release.
- Endpoints
- Native provider APIs, called directly (transcription endpoints), temperature 0, no prompt biasing or domain hints — a clean baseline measured identically for everyone.
- Scoring
- Word Error Rate (word-level Levenshtein) with a Unicode-safe normalizer: NFC + lowercase, hyphens split, spelled-out numbers folded to digits (so "twentieth" matches "20th"), punctuation stripped. Bootstrap 95% CIs.
- Artifact
- Every run saves the per-clip reference and hypothesis, so any number can be re-scored offline with no new API calls. When the scorer is fixed, historical runs are re-scored at no extra cost.
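The scoring step above can be illustrated with a short sketch. It follows the rules listed under Scoring (NFC, lowercase, hyphens split, number words folded, punctuation stripped, word-level Levenshtein), but it is an approximation and not the harness's actual normalizer: the `NUMBER_WORDS` table is a tiny stand-in for real number folding, and the handling of an empty reference is an assumption.

```typescript
// Tiny illustrative lookup; the real number folding is far more complete.
const NUMBER_WORDS = new Map<string, string>([
  ["one", "1"], ["two", "2"], ["three", "3"],
  ["ten", "10"], ["twenty", "20"], ["twentieth", "20th"],
]);

// Normalize text into a list of comparable word tokens.
function normalize(text: string): string[] {
  return text
    .normalize("NFC")
    .toLowerCase()
    .replace(/[-\u2010-\u2015]/g, " ")          // split hyphenated words
    .replace(/[^\p{L}\p{M}\p{N}\s]/gu, "")      // strip punctuation, keep letters, marks, digits
    .split(/\s+/)
    .filter((w) => w.length > 0)
    .map((w) => NUMBER_WORDS.get(w) ?? w);      // fold known number words to digits
}

// Word-level Levenshtein distance (substitutions, insertions, deletions).
function editDistance(ref: string[], hyp: string[]): number {
  let prev = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const substitution = prev[j - 1] + (ref[i - 1] === hyp[j - 1] ? 0 : 1);
      curr.push(Math.min(substitution, prev[j] + 1, curr[j - 1] + 1));
    }
    prev = curr;
  }
  return prev[hyp.length];
}

// WER = word edits / reference word count.
function wer(reference: string, hypothesis: string): number {
  const ref = normalize(reference);
  const hyp = normalize(hypothesis);
  // Assumption: an empty reference scores 0 if the hypothesis is also empty, else 1.
  if (ref.length === 0) return hyp.length === 0 ? 0 : 1;
  return editDistance(ref, hyp) / ref.length;
}
```

With this sketch, `wer("The twentieth century.", "the 20th century")` is 0, because both sides normalize to the same three tokens, while `wer("the cat sat", "the cat")` is 1/3 (one deletion over three reference words).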
Corpora
The headline WER is a balanced mean across three clean-English corpora, 50 clips each. They are chosen for diversity of speaking style, so a single easy domain can't flatter a provider.
| Corpus | Domain | Clips | Note |
|---|---|---|---|
| VoxPopuli | European-Parliament speech · 38 speakers | 50 clean English | Formal, real (not read-aloud) speech. |
| FLEURS | Read Wikipedia sentences (Google FLEURS) | 50 clean English | Recognized public corpus; number-heavy text. |
| MultiMed | Medical / scientific talks | 50 clean English | Domain vocabulary; where providers separate most. |
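One way to read "balanced mean" is an unweighted average of per-corpus WERs, so each corpus counts equally regardless of how many clips it contributes. The sketch below assumes that reading and also assumes each corpus WER is the mean of its per-clip WERs; the `ClipScore` shape and the function name are hypothetical.

```typescript
interface ClipScore {
  corpus: string;
  wer: number; // per-clip word error rate
}

// Average WER within each corpus first, then average the corpus means.
function balancedMeanWer(scores: ClipScore[]): number {
  if (scores.length === 0) throw new Error("no scores to average");
  const byCorpus = new Map<string, number[]>();
  for (const s of scores) {
    const bucket = byCorpus.get(s.corpus) ?? [];
    bucket.push(s.wer);
    byCorpus.set(s.corpus, bucket);
  }
  const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
  const corpusMeans = Array.from(byCorpus.values()).map(mean);
  return mean(corpusMeans);
}
```

With exactly 50 clips per corpus, this equals the plain mean over all 150 clips. The two only diverge if clip counts differ, which is when balancing keeps a larger corpus from dominating the headline.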
Speko runs the benchmark and sells a routing product.
State it plainly: Speko sells a voice-routing product, so a benchmark we publish has a commercial stake. We don't pretend otherwise. Here is what keeps it trustworthy in spite of that:
- Native APIs, not our gateway. Each provider is measured through its own transcription endpoint with our own key, so the Speko gateway is not in the measurement path and cannot tilt a result.
- Public corpora, open scorer. VoxPopuli, FLEURS, and MultiMed are recognized public datasets; WER is a standard metric and the exact normalizer is documented above — not a private scoring service.
- Identical treatment. Same clips, temperature 0, no prompt biasing for anyone. No provider gets a domain hint another doesn't.
- Re-scorable. Per-clip transcripts are saved, so the numbers can be independently re-scored — which is how the 2026-06-01 normalizer fix was applied to history without re-running anything.
What v1 does NOT claim.
The scope of v1 is narrow on purpose. Reading a chart as broader than this list is a misread, not a finding.
- English only. Every published number is clean English; other languages are not yet measured.
- Three corpora only: VoxPopuli (parliamentary), FLEURS (read), MultiMed (medical). Spontaneous, conversational, and noisy real-world speech are out of scope.
- n=150 for the headline WER. Confidence intervals are wide (~±2 pts), so the top two providers are a statistical tie; read the ranking as tiers, not places.
- Latency is end-to-end from a single location, so it includes network RTT. Per-region p99 ships later with a multi-region probe fleet; a Tokyo number must not be compared with a us-east-1 one.
- Only two territories are measured: headline WER and silence-hallucination (Forensics). The other territories show "not yet measured" until their probes run. We never fill the gap with a guess.
- The robustness/degradation profile uses ffmpeg-derived synthetic noise (phone codec, white/pink noise, reverb), not field recordings. Treat it as indicative of synthetic degradation, not real-world acoustics.
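To make the n=150 caveat concrete, a percentile bootstrap over per-clip WERs can be sketched as below. This is one common way to compute a bootstrap 95% interval and is only an assumption about the approach; the page does not say whether the harness resamples per clip, stratifies by corpus, or uses a different bootstrap variant. The seeded generator is included only so the sketch is reproducible.

```typescript
// Small seeded PRNG (mulberry32) returning values in [0, 1).
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Percentile bootstrap: resample clips with replacement, take the mean of
// each resample, and read the 2.5th and 97.5th percentiles of those means.
function bootstrapCi95(
  values: number[],
  iterations = 2000,
  rand: () => number = Math.random,
): [number, number] {
  if (values.length === 0) throw new Error("need at least one value");
  const means: number[] = [];
  for (let b = 0; b < iterations; b++) {
    let sum = 0;
    for (let i = 0; i < values.length; i++) {
      sum += values[Math.floor(rand() * values.length)];
    }
    means.push(sum / values.length);
  }
  means.sort((x, y) => x - y);
  const lo = means[Math.floor(0.025 * (iterations - 1))];
  const hi = means[Math.ceil(0.975 * (iterations - 1))];
  return [lo, hi];
}
```

Calling `bootstrapCi95(perClipWers, 2000, mulberry32(42))` on two providers and checking whether the intervals overlap is the kind of comparison behind reading the ranking as tiers: with only 150 clips, a small gap in mean WER can sit well inside both intervals.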