Methodology & Reproducibility · Speko Benchmarks

The checking pipeline — gate-first

The same flow runs for both TTS and S2S (realtime). Each stage does only what it is valid for, and a failure at the gate drops a provider from every stage below — so fluent-sounding gibberish can never game the downstream metrics.

0 · CAPTURE
Record the provider

One clip per provider × language. S2S realtime output is stochastic, so we record multiple takes (n ≥ 5) per pair; TTS is deterministic, so a single take is representative.
1 · GATE
Intelligibility
reliable

faster-whisper large-v3 transcribes each clip; we score the transcript's character error rate (CER) against the known passage and check the detected language. PASS = correct language detected AND CER < 0.5. FAIL → dropped from every stage below. The gate is the precondition for everything else.
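The pass/fail rule above can be sketched in a few lines. This is a minimal illustration, not the pipeline's actual code: it assumes the transcript and detected language have already been produced by the ASR step, applies no text normalization (case, punctuation and whitespace handling are not specified in this document), and the function names are hypothetical.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Character-level edit distance (insert / delete / substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference char
                            curr[j - 1] + 1,      # insert a hypothesis char
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference passage must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


def gate(reference: str, hypothesis: str,
         expected_lang: str, detected_lang: str,
         max_cer: float = 0.5) -> bool:
    """PASS only if the language matches AND CER is below the threshold."""
    return detected_lang == expected_lang and cer(reference, hypothesis) < max_cer
```

Both conditions must hold: a clip in the wrong language fails even with a low CER, and a clip in the right language fails once CER reaches 0.5.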
2 · FINGERPRINT
Acoustic ID-card
reliable · descriptive

openSMILE eGeMAPS-88, deterministic and reference-free: register, voice quality (jitter / shimmer / HNR), brightness, dynamics, rhythm — plus provider z-deltas and objective defect flags. States what the output is, never a quality score; works in any language.
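One plausible reading of "provider z-deltas" is each provider's distance from the field mean on a given feature, in units of the field's standard deviation. The sketch below assumes that reading, with the field taken as all providers for one language and one feature. The reference population is not defined in this document, and the function name is hypothetical.

```python
from statistics import mean, pstdev


def z_deltas(feature_by_provider: dict[str, float]) -> dict[str, float]:
    """Standardize one acoustic feature across providers.

    Assumption: the field is every provider passed in, and the spread is the
    population standard deviation over that field.
    """
    values = list(feature_by_provider.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # Every provider has the same value, so nobody deviates.
        return {name: 0.0 for name in feature_by_provider}
    return {name: (v - mu) / sigma for name, v in feature_by_provider.items()}
```

A z-delta describes how unusual a voice is relative to the field. It carries no sign of better or worse, which matches the fingerprint's descriptive role.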
3 · ACCENT
Measured in the right dimension
diagnostic only

Tonal → tone

Thai · Mandarin · Vietnamese — per-syllable F0 contour vs the canonical tones.

Non-tonal → segmental

Filipino · Malay · Indonesian — English-vowel pull + rhythm; English-only sound intrusion.

Catches gate failures and large gaps, but inverts the ordering at the top of the field, so it is a diagnostic, never a ranking.
4 · HUMAN
Native rater
ground truth

A native speaker scores accent, pronunciation, naturalness, prosody, pacing and code-switching. This is the only reliable verdict on nativeness: it validates which automated signals to trust and arbitrates the close calls no acoustic metric can split.

The one rule: gate first, then acoustics describe and humans judge. The fingerprint says what a voice is; only a native says whether it sounds native. No automated metric is allowed to stand in as a naturalness score.

The lineup

Thirteen (provider, model) tuples, each routed through the Speko gateway on a pinned provider:model key. Every metric below is computed on one default voice per provider. That makes each headline a per-voice point estimate, not a whole-catalog verdict: a provider whose default voice is strong but whose wider catalog is weak looks uniformly strong here. Multi-voice sampling (median + spread per provider) is the first item in v1.1.

Capture & sample rate

The probe stores raw PCM exactly as each provider delivers it. Providers do not share a sample rate: most ship 24 kHz, but some ship 48 kHz and others 16 kHz. Every clip's true rate is recorded in its metadata, and the analyzers resample to a single canonical 24 kHz before any pitch, loudness, duration, or FFT measurement. Reading a clip at the wrong rate would scale its measured pitch by the ratio of assumed to true rate (a 48 kHz clip read as 24 kHz measures an octave low), so this normalization is load-bearing, not cosmetic.
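The pitch-scaling effect is easy to reproduce with a synthetic tone. The sketch below is illustrative only: it uses a crude zero-crossing pitch estimate in place of a real pitch tracker, does not reproduce the analyzers' resampler, and its function names are hypothetical.

```python
import math


def sine(freq_hz: float, rate_hz: int, seconds: float = 1.0) -> list[float]:
    """Generate a pure tone sampled at rate_hz."""
    n = int(rate_hz * seconds)
    return [math.sin(2 * math.pi * freq_hz * i / rate_hz) for i in range(n)]


def estimate_pitch(samples: list[float], assumed_rate_hz: int) -> float:
    """Estimate frequency by counting upward zero crossings per second.

    The duration is derived from the ASSUMED sample rate, which is exactly
    where a wrong rate corrupts the measurement.
    """
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if a < 0 <= b)
    duration_s = len(samples) / assumed_rate_hz
    return crossings / duration_s


clip = sine(220.0, rate_hz=48_000)       # provider delivered 48 kHz audio
correct = estimate_pitch(clip, 48_000)   # read at the true rate: roughly 220 Hz
wrong = estimate_pitch(clip, 24_000)     # read as 24 kHz: roughly 110 Hz
```

The same samples yield half the pitch when the rate is misread as half its true value, which is why the true rate travels with each clip's metadata.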

Sampling & statistics

Identity drift is reported as the mean β ± σ across multiple takes, and the ordering is gated on R²: when the fit is weak, the verdict says so rather than presenting noise as a trend.
Run-to-run variance is measured over 20 repeats of a fixed prompt. The headline σ is taken over non-independent pairwise comparisons, so read it as a relative ordering; the mean pairwise distance and its bootstrap interval are the preferred dispersion reports.
Most other territories are currently single-take point estimates without confidence intervals, and rankings carry no multiple-comparison correction. Treat sub-noise gaps as ties, not orderings — wider uncertainty reporting is in progress.
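The first two reports above can be sketched as follows. This is a simplified illustration under stated assumptions: β is taken to be the least-squares slope of a similarity score against time within one take, the R² cutoff `min_r2` is a hypothetical value, and each repeat is reduced to a single scalar so that distance is an absolute difference. The bootstrap resamples repeats, not pairs, because the pairwise comparisons are not independent.

```python
import random
from itertools import combinations
from statistics import mean, stdev


def fit_slope(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Ordinary least squares y = a + beta * x; returns (beta, r_squared)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    beta = sxy / sxx
    r2 = (sxy ** 2) / (sxx * syy) if syy > 0 else 0.0
    return beta, r2


def summarize_drift(takes, min_r2: float = 0.5) -> dict:
    """Mean beta +/- sd across takes, flagged when the fits are weak.

    takes: list of (xs, ys) pairs, one per take. min_r2 is an assumed cutoff.
    """
    fits = [fit_slope(xs, ys) for xs, ys in takes]
    betas = [b for b, _ in fits]
    mean_r2 = mean(r2 for _, r2 in fits)
    return {"beta_mean": mean(betas), "beta_sd": stdev(betas),
            "mean_r2": mean_r2, "trend_reported": mean_r2 >= min_r2}


def mean_pairwise_distance(points: list[float]) -> float:
    """Average absolute difference over all unordered pairs."""
    return mean(abs(a - b) for a, b in combinations(points, 2))


def bootstrap_ci(points: list[float], n_boot: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap interval for the mean pairwise distance."""
    rng = random.Random(seed)
    stats = sorted(
        mean_pairwise_distance([rng.choice(points) for _ in points])
        for _ in range(n_boot)
    )
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

Resampling with replacement introduces duplicate repeats whose distance is zero, so the interval leans slightly low for small n. That is one more reason to treat sub-noise gaps as ties.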

What each metric does and doesn't claim

Codec forensics fingerprints the stream Speko received. A lossy-codec fingerprint can originate at the model, the vendor's transport, or our own gateway decode — it flags a stream worth inspecting, not proof the vendor re-encoded.
Cross-lingual identity compares the same voice across languages, but two languages differ acoustically even for a flawless system, so the raw cosine is reported alongside a within-language baseline (best-case same-voice similarity) for calibration.
Tone shape (Thai) scores F0 contours against canonical Bangkok-Thai tone templates, anchored to the speaker's mid-tone median. Prompts whose syllable segmentation disagrees with the expected tone count are dropped, not guessed — coverage is reported per provider.
Anglicization measures realized native phones against their English-substitution fallbacks. The glottal stop, which has no clean contrast denominator, is reported separately rather than averaged into the index.
No metric here is anchored to human listening yet. These are DSP measurements; a human-preference layer (MOS / pairwise) with per-metric correlation is planned, and until then rankings should be read as "these specific acoustic properties," not "what listeners prefer."
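Two of the metrics above reduce to short formulas, sketched here under explicit assumptions. For cross-lingual identity, the voice embeddings are plain vectors and the gap between baseline and raw cosine is an illustrative convenience, not a published figure. For anglicization, the index is assumed to be the per-phone English-substitution rate averaged over phones, with excluded phones reported on their own. Neither definition is confirmed by this document, and all names are hypothetical.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def identity_report(cross_lingual: float, within_language: float) -> dict:
    """Report the raw cosine next to its same-language baseline."""
    return {"raw_cosine": cross_lingual,
            "baseline": within_language,
            "gap": within_language - cross_lingual}  # assumed convenience field


def anglicization(counts: dict[str, tuple[int, int]],
                  separate: tuple[str, ...] = ("ʔ",)) -> dict:
    """counts maps phone -> (native realizations, English substitutions).

    Phones listed in `separate` (the glottal stop by default) are reported
    individually and kept out of the averaged index.
    """
    rates, apart = [], {}
    for phone, (native, substituted) in counts.items():
        total = native + substituted
        if total == 0:
            continue  # phone never attempted: nothing to measure
        rate = substituted / total
        if phone in separate:
            apart[phone] = rate
        else:
            rates.append(rate)
    index = sum(rates) / len(rates) if rates else None
    return {"index": index, "reported_separately": apart}
```

Keeping the glottal stop in its own field means a provider that drops every glottal stop is visible as such, without that single phone dominating the averaged index.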

Reproducibility

Every published figure is generated solely from the analyzers' JSON output for a single pinned run and is never hand-edited. The eval-set corpus is version-locked so re-runs are comparable, and a pre-commit guard rejects any placeholder value in the data files that was not generated by the analyzers. Each territory page records its run ID and capture date. Per-provider request parameters (sampling temperature, seed) follow each gateway's defaults today; pinning and publishing them is part of making an outside re-run bit-comparable.
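A guard of this kind could look like the sketch below. It is not the project's actual hook: the placeholder markers, the treatment of null as a placeholder, and the function name are all assumptions made for illustration.

```python
import json

# Hypothetical markers; the real guard's rules are not described in this document.
PLACEHOLDERS = {"TBD", "TODO", "N/A", ""}


def find_placeholders(node, path: str = "$") -> list[str]:
    """Walk a parsed JSON document and return the paths of placeholder values."""
    hits: list[str] = []
    if isinstance(node, dict):
        for key, value in node.items():
            hits += find_placeholders(value, f"{path}.{key}")
    elif isinstance(node, list):
        for i, value in enumerate(node):
            hits += find_placeholders(value, f"{path}[{i}]")
    elif node is None or (isinstance(node, str)
                          and node.strip().upper() in PLACEHOLDERS):
        hits.append(path)
    return hits


doc = json.loads('{"run_id": "r1", "cer": 0.12, "notes": "TBD", "voices": [null, "a"]}')
offending = find_placeholders(doc)
```

A pre-commit hook would run this over each staged data file and exit non-zero when `offending` is non-empty, so a placeholder never reaches a published figure.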

Deferred to v2

Reliability (regional p99, burst behavior), multi-voice sampling per provider, a human-preference anchor, and frozen cross-run vocabularies for prosody diversity. We publish what is measured and label what is not.
