Does the voice stay the same?
A native end-to-end model builds its voice token by token, so two things can go wrong over a long call. It can drift away from its own voice, or it can drift toward yours. We measure both from the audio alone. The short answer on this one model: it steps down off turn 1 and then holds, and it never starts sounding like the caller.
Preliminary: a single native S2S model, measured as the median of three caller voices over 30 turns each. Numbers are per-voice, not whole-catalog — multi-provider runs land next.
Does it stay itself?
Preliminary, n=3 voices. Does the model still sound like the voice it started with as the conversation goes on? Higher cross-turn cosine-to-turn-1 means it holds its identity.
Cross-Turn Stability
Cross-turn cosine-to-turn-1 — how much the voice still sounds like turn 1 as a conversation goes on.
What we found
The voice drops to ~0.60–0.65 cosine-to-turn-1 by the third turn and then holds roughly flat for the rest of the 30-turn conversation — it doesn't keep sliding, but it never returns to sounding like turn 1 either. That early step-down is the real signal here. As with within-turn drift, treat this as native-S2S-only: a pipeline that re-triggers a fixed TTS voice every turn sits near 1.0 by construction, so it can't be ranked against native models on this chart.
Higher cosine-to-turn-1 = the model still sounds like itself. This metric is most meaningful for native S2S models; a fixed-TTS pipeline resets to one voice every turn and stays near 1.0 by construction, so it must not be ranked against native models here.
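As a rough illustration of the metric, here is a minimal sketch of how a cosine-to-turn-1 curve could be computed. It assumes each model turn has already been reduced to a fixed-length speaker embedding by some speaker-verification model; the source does not say which embedding is used. The function names `cosine`, `cosine_to_turn1`, and `median_curve` are hypothetical, not taken from the actual pipeline.

```python
import math
from statistics import median


def cosine(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("zero-norm embedding")
    return dot / (norm_a * norm_b)


def cosine_to_turn1(turn_embeddings):
    """Similarity of every turn's embedding to the first turn's embedding.

    turn_embeddings: one speaker embedding per model turn (assumed input).
    The first value is 1.0 by definition, since turn 1 is the reference.
    """
    reference = turn_embeddings[0]
    return [cosine(reference, emb) for emb in turn_embeddings]


def median_curve(curves):
    """Per-turn median across several conversations (e.g. one per caller voice)."""
    return [median(values) for values in zip(*curves)]
```

Under this reading, a fixed-TTS pipeline would yield nearly identical embeddings every turn, so its curve would sit near 1.0 regardless of conversation length, which is why it is not comparable to a native model on the chart.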
Does it start to sound like you?
Preliminary, n=3 voices. A safety canary, not a procurement score. The failure mode: a voice model quietly absorbing the caller's voice over a long call, until its output could pass as theirs. We probe it from the waveform: speaker-print drift toward the caller. Negative = no contamination.
Speaker-Print Drift Toward User
Plain English: does the model start to sound like the caller as the conversation goes on? Below zero means it keeps its own voice.
What we found: the model does NOT mimic the caller. The differential stays negative for all 30 turns — the model holds its own voice end to end.
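The exact formula behind the differential is not given here, so the following is only one plausible reading, sketched under stated assumptions: for each turn, take the model's similarity to the caller's speaker embedding minus its similarity to its own turn-1 embedding. With that definition, a negative value means the model is still closer to its own voice than to the caller's. The function name `drift_toward_caller` and the choice of turn 1 as the self-reference are assumptions for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("zero-norm embedding")
    return dot / (norm_a * norm_b)


def drift_toward_caller(model_turns, caller_embedding):
    """Per-turn differential: sim(model, caller) - sim(model, model turn 1).

    ASSUMED definition, not confirmed by the source.
    Negative: the model sounds more like its own turn-1 voice than the caller.
    Positive: the model has moved closer to the caller than to itself.
    """
    reference = model_turns[0]
    return [
        cosine(emb, caller_embedding) - cosine(emb, reference)
        for emb in model_turns
    ]


def stays_own_voice(differentials):
    """True if the differential is negative on every turn."""
    return all(d < 0 for d in differentials)
```

A check like `stays_own_voice` mirrors the finding above: it passes only if the differential is below zero on all turns of the conversation.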