S2S · Stability

Does the voice stay the same?

A native end-to-end model builds its voice token by token, so two things can go wrong over a long call. It can drift away from its own voice, or it can drift toward yours. We measure both from the audio alone. The short answer on this one model: it steps down off turn 1 and then holds, and it never starts sounding like the caller.

Preliminary: a single native S2S model, measured as the median of three caller voices over 30 turns each. Numbers are per-voice, not whole-catalog — multi-provider runs land next.

self-consistency

Does it stay itself?

PRELIMINARY · n=3 voices

Does the model still sound like the voice it started with as the conversation goes on? Higher cross-turn cosine-to-turn-1 means it holds its identity.

Cross-Turn Stability

Higher is better · n=3 voices

Cross-turn cosine-to-turn-1 — how much the voice still sounds like turn 1 as a conversation goes on.

What we found

The voice drops to ~0.60–0.65 cosine-to-turn-1 by the third turn and then holds roughly flat for the rest of the 30-turn conversation — it doesn't keep sliding, but it never returns to sounding like turn 1 either. That early step-down is the real signal here. As with within-turn drift, treat this as native-S2S-only: a pipeline that re-triggers a fixed TTS voice every turn sits near 1.0 by construction, so it can't be ranked against native models on this chart.

Higher cosine-to-turn-1 = the model still sounds like itself. This metric is most meaningful for native S2S models; a fixed-TTS pipeline resets to one voice every turn and stays near 1.0 by construction, so it must not be ranked against native models here.
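
A minimal sketch of how the cross-turn cosine-to-turn-1 curve could be computed, assuming each model turn has already been reduced to a fixed-length speaker embedding. The embedding step is not shown, and the function names `cosine`, `cosine_to_turn1`, and `median_curve` are illustrative, not taken from the actual measurement pipeline.

```python
import math
from statistics import median


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_to_turn1(turn_embeddings):
    """Similarity of every turn's embedding to the turn-1 embedding.

    turn_embeddings: list of vectors, one per model turn, in call order.
    The first value is always ~1.0 (turn 1 compared with itself).
    """
    first = turn_embeddings[0]
    return [cosine(first, emb) for emb in turn_embeddings]


def median_curve(curves):
    """Per-turn median across caller voices (all curves share one length)."""
    return [median(values) for values in zip(*curves)]


# Toy 2-D embeddings standing in for real speaker embeddings (assumption).
voice_a = [[1.0, 0.0], [0.6, 0.8], [0.6, 0.8]]
voice_b = [[0.0, 1.0], [0.7, 0.7], [0.7, 0.7]]
voice_c = [[1.0, 1.0], [1.0, 0.2], [1.0, 0.2]]

curves = [cosine_to_turn1(v) for v in (voice_a, voice_b, voice_c)]
summary = median_curve(curves)  # one median value per turn
```

A curve that steps down after turn 1 and then stays flat, as reported above, would show up here as a `summary` whose later values sit close to each other but below the first.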

voice identity · deepfake-biometric canary

Does it start to sound like you?

PRELIMINARY · n=3 voices

A safety canary, not a procurement score. The failure mode: a voice model quietly absorbing the caller's voice over a long call, until its output could pass as theirs. We probe it from the waveform by measuring speaker-print drift toward the caller. Zero or below = no contamination.

Speaker-Print Drift Toward User

≤0 = no leak · better · n=3 voices (median)

Plain English: does the model start to sound like the caller as the conversation goes on? Below zero means it keeps its own voice.

What we found: the model does NOT mimic the caller. The drift differential stays negative for all 30 turns, so the model holds its own voice end to end.
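
The page does not spell out how the differential is defined. One plausible reading, sketched below as an assumption, is the model's similarity to the caller's speaker print at each turn minus that same similarity at turn 1. Under that reading, zero means no movement toward the caller and positive values mean leakage. The names `drift_toward_caller` and `leaks` are hypothetical, and the toy vectors stand in for real speaker embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def drift_toward_caller(model_turns, caller_print):
    """Per-turn change in model-to-caller similarity, relative to turn 1.

    ASSUMED definition: cos(model_t, caller) - cos(model_1, caller).
    Values <= 0 mean the model has not moved toward the caller's voice.
    """
    baseline = cosine(model_turns[0], caller_print)
    return [cosine(turn, caller_print) - baseline for turn in model_turns]


def leaks(differential, tolerance=0.0):
    """True if any turn drifts toward the caller by more than tolerance."""
    return any(d > tolerance for d in differential)


# Toy example: the model's voice moves away from the caller over the call.
caller = [0.0, 1.0]
model_turns = [[1.0, 0.0], [1.0, 0.0], [0.8, -0.6]]

differential = drift_toward_caller(model_turns, caller)
contaminated = leaks(differential)
```

With real embeddings a small positive `tolerance` may be needed to absorb measurement noise. The right value depends on the speaker encoder and is not something this page specifies.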