
The audio context isn't really 128K tokens

It's about 5 minutes of conversation. Past that, the model accepts your audio and returns nothing. We pushed gpt-realtime-2 through a 60-turn English protocol; it crashed at turn 32, 5 minutes of accumulated assistant audio in. gpt-realtime ran the same prompts comfortably.

OpenAI’s gpt-realtime-2 ships with a 128K context window. We extended our standard 30-turn S2S benchmark to 60 turns in English, kept the rest of the harness identical, and watched.

At turn 32, with 5 minutes 0 seconds of accumulated assistant audio behind us, the model accepted our user audio chunk, returned no audio frames, and a moment later the WebSocket disconnected with no close frame. No response.failed event. No conversation-truncation event. No error message. The session went from “producing 10–15 second responses for the past several turns” to “completed an empty response” to “socket closed” with no graceful intermediate state.

gpt-realtime, run against exactly the same prompts on the same Cinla voice through the same harness, finished its 30-turn protocol with 3.08 minutes of total assistant audio. Comfortable room left.

The audio context isn’t really 128K tokens. It’s about 5 minutes of conversation. Past that, the model accepts your audio and returns nothing.

What we actually saw

Same Cinla voice generating the same conversational prompts. Same Speko /v1/sessions path. Only the model name changes.

Model            Turns completed   Median per-turn   P95 per-turn   Total assistant audio
gpt-realtime     30 / 30           6.05 s            8.15 s         3.08 min
gpt-realtime-2   31 / 60           9.65 s            16.15 s        5.00 min

The two columns that matter are Turns completed and Total assistant audio. gpt-realtime-2 was nearly 60% more verbose per turn (median 9.65 s against 6.05 s), and that verbosity put it at the 5-minute mark of accumulated assistant audio in fewer than 32 turns. It crashed there. Run shorter responses and it’ll go further. Run longer responses and it’ll crash sooner. The ceiling is in minutes of audio, not turns.

What “the model returns nothing” looks like on the wire

For the first 31 turns the protocol behaves normally — we stream user audio in, the model produces an audio response (anywhere from 3 to 17 seconds), we capture it, repeat.

At turn 32 the protocol does something we hadn’t seen before. We send our user audio. The WebSocket acknowledges it. The model’s response.created and response.done events both fire. But the model emits zero binary audio frames between those two events. A successful zero-byte response.

A few seconds after that, the WebSocket disconnects mid-flight with ConnectionClosedError: no close frame received or sent. No {t:"error"} event was issued from the server. No response.failed came through. No conversation-truncation event arrived. The session went from “producing 14-second responses” to “completed an empty response” to “socket closed” with no graceful intermediate state.

A buyer-relevant detail: there is no signal at the application layer that the session is dead. The first turn after the boundary returns successfully, because zero bytes is, protocol-wise, a valid completion. Your code sees response.done and moves on. The next turn disconnects the socket, and your code now has to mint a new session, losing all conversation state. The visible failure is a disconnect, but the actual failure happened the turn before, silently.
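
Since the only early signal is a completed response with no audio in it, an application can count audio bytes per response and treat an empty one as fatal. The following is a minimal sketch, not part of our harness: the class and method names are hypothetical, the event names are the ones described above, and it assumes an audio-only session where every healthy response carries at least one audio frame.

```python
from dataclasses import dataclass


@dataclass
class EmptyResponseGuard:
    """Counts audio bytes between response.created and response.done."""

    audio_bytes: int = 0
    in_response: bool = False
    session_dead: bool = False

    def on_event(self, event_type: str) -> None:
        if event_type == "response.created":
            # A new response starts: reset the per-response byte counter.
            self.in_response = True
            self.audio_bytes = 0
        elif event_type == "response.done" and self.in_response:
            self.in_response = False
            # A completed response with zero audio bytes is treated as a
            # dead session, not as a successful turn (assumption: audio-only).
            if self.audio_bytes == 0:
                self.session_dead = True

    def on_audio_frame(self, frame: bytes) -> None:
        # Only frames that arrive inside a response are counted.
        if self.in_response:
            self.audio_bytes += len(frame)
```

Checking session_dead right after response.done gives the application the chance to open a fresh session before the next user turn, instead of discovering the problem when the socket drops.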

Why 5 minutes and not 40

A 128K context at ~50 audio tokens per second of speech is theoretically ~43 minutes of audio. The 5-minute ceiling we hit isn’t the documented context window — it’s an effective bound, somewhere between the model’s hard token limit and the practical attention quality OpenAI is willing to ship.
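
The arithmetic behind that gap, as a small sketch. It takes “128K” as 128,000 tokens and uses the approximate 50 tokens per second quoted above; it counts assistant audio only, so user audio and any text in the context would push the real usage higher.

```python
CONTEXT_TOKENS = 128_000            # published window, reading "128K" as 128,000
AUDIO_TOKENS_PER_SECOND = 50        # approximate rate quoted in the text
OBSERVED_CEILING_SECONDS = 5 * 60   # accumulated assistant audio at the failure

# Minutes of audio the published window would hold at that rate.
theoretical_minutes = CONTEXT_TOKENS / AUDIO_TOKENS_PER_SECOND / 60

# Assistant-audio tokens accumulated when the session went silent.
tokens_at_ceiling = OBSERVED_CEILING_SECONDS * AUDIO_TOKENS_PER_SECOND

# Share of the published window those tokens represent.
fraction_used = tokens_at_ceiling / CONTEXT_TOKENS

print(f"theoretical: {theoretical_minutes:.1f} min")
print(f"tokens at ceiling: {tokens_at_ceiling}")
print(f"fraction of window: {fraction_used:.1%}")
```

Under these assumptions the session failed with roughly 15,000 assistant-audio tokens behind it, a little under 12% of the published window.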

The architectural reason this happens earlier than the token math suggests:

  • Audio tokens have a much higher local-correlation density than text tokens. Attention over 5 minutes of contiguous audio is computationally and qualitatively harder than attention over the same token count of text.
  • OpenAI’s serving stack appears to have an internal threshold where, when accumulated audio context degrades response quality below some bar, the model returns an empty completion rather than a degraded one. That’s a sensible design choice for content safety. It’s a hostile design choice for session continuity if the application isn’t expecting it.

The failure mode is the diagnostic clue. A clean refusal would have come through as response.failed. A graceful drop of the oldest turns would have come through as conversation.item.truncate. A timeout would have come through as a disconnect with a close frame. An empty completion followed by a disconnect with no close frame matches a “the model produced no tokens and the inference path got into an unrecoverable state” failure, which is consistent with hitting an empirically derived internal audio-context bound.

The buyer math

If you’re building on gpt-realtime-2 today, plan around ~5 minutes of total assistant audio per session. That converts into turn budgets that depend entirely on how chatty your model is at the prompt you give it.

Assistant response length                        Approximate turn budget
3 s per turn (terminal command, voice control)   ~100 turns
6 s per turn (Q&A, short reply)                  ~50 turns
10 s per turn (English conversational)           ~30 turns
15 s per turn (expressive long-form)             ~20 turns
30 s per turn (storytelling, narration)          ~10 turns
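
The table is just the ceiling divided by the response length. A minimal sketch, with a hypothetical function name and the ~5 minute ceiling (300 s) as the default:

```python
def turn_budget(seconds_per_turn: float, ceiling_seconds: float = 300.0) -> int:
    """Whole turns that fit under the audio ceiling at a given response length."""
    if seconds_per_turn <= 0:
        raise ValueError("seconds_per_turn must be positive")
    # Floor division: a turn that would cross the ceiling does not count.
    return int(ceiling_seconds // seconds_per_turn)


for seconds in (3, 6, 10, 15, 30):
    print(f"{seconds:>2} s per turn -> ~{turn_budget(seconds)} turns")
```

Plugging in gpt-realtime-2’s 9.65 s median gives 31, which lines up with the 31 turns it completed in our run. Treat that as a sanity check on the rule of thumb, not as a guarantee, since real response lengths vary turn to turn.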

If you need a longer session than that (a coaching call, a tutoring session, an actual phone agent that can pick up a thread for 15 minutes), you have three options:

  • Switch to gpt-realtime (shorter per-turn responses; a lower ceiling, but it still passes 30 turns at conversational length comfortably)
  • Issue a fresh session every 4 minutes and re-inject a conversation summary as context
  • Wait for OpenAI to ship a Realtime model whose effective audio context matches the published one
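
The second option can be sketched as a small budget tracker. This is an illustration under stated assumptions, not our production code: the class and its methods are hypothetical, the 4-minute threshold is measured here in accumulated assistant audio (a choice made for this sketch, since that is the quantity the ceiling tracks), and the “summary” is a placeholder that simply keeps the last few assistant turns where a real system would summarize.

```python
from dataclasses import dataclass, field


@dataclass
class SessionBudget:
    """Tracks accumulated assistant audio and signals when to rotate sessions."""

    rotate_after_seconds: float = 240.0  # 4 minutes, below the ~5 minute ceiling
    spoken_seconds: float = 0.0
    transcript: list[str] = field(default_factory=list)

    def record_turn(self, assistant_text: str, audio_seconds: float) -> None:
        # Accumulate assistant audio only; that is what the ceiling tracks.
        self.spoken_seconds += audio_seconds
        self.transcript.append(assistant_text)

    def should_rotate(self) -> bool:
        return self.spoken_seconds >= self.rotate_after_seconds

    def handoff_context(self, last_n: int = 6) -> str:
        # Placeholder for a real summarizer: keep the last few assistant turns.
        return "\n".join(self.transcript[-last_n:])
```

After each turn, the application would call should_rotate(), and when it returns True, open a new session seeded with handoff_context() before sending the next user audio.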

What we’d want from OpenAI

Three things would change this from a hostile failure into a debuggable one:

  1. A documented effective audio-context bound — the published 128K token figure doesn’t reflect what production sessions can actually use. A “tested up to N minutes of contiguous audio per session” number would let buyers plan around it.
  2. A response.failed event when the model produces an empty completion under context pressure — instead of a silent zero-byte success.
  3. A session.warning event before the boundary — applications would have one turn to gracefully end the conversation and warm a fresh session.

None of these requires changing model weights. They require changing what the API surfaces to the application.

What we’d want next from the benchmark

We’re going to formalize this as a deterministic metric on the comparison page:

  • Session capacity to silent failure — push each model with a fixed-prompt loop until the first zero-byte response, report the boundary in minutes.
  • Failure-mode taxonomy — distinguish graceful refusals, context truncations, zero-byte completions, and socket disconnects in the data. The dangerous one is zero-byte completions because they look successful to the application.
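
One possible shape for that taxonomy, as a hedged sketch. The enum, the function, and its inputs are hypothetical; the event names are the ones used earlier in this post, and the metric as we eventually publish it may differ.

```python
from enum import Enum


class FailureMode(Enum):
    NONE = "none"
    GRACEFUL_REFUSAL = "graceful_refusal"
    CONTEXT_TRUNCATION = "context_truncation"
    ZERO_BYTE_COMPLETION = "zero_byte_completion"
    SOCKET_DISCONNECT = "socket_disconnect"


def classify_turn(events: list[str], audio_bytes: int, socket_closed: bool) -> FailureMode:
    """Label one turn from its event names, audio byte count, and socket state."""
    if "response.failed" in events:
        return FailureMode.GRACEFUL_REFUSAL
    if "conversation.item.truncate" in events:
        return FailureMode.CONTEXT_TRUNCATION
    # Checked before the disconnect so the silent failure is what gets recorded.
    if "response.done" in events and audio_bytes == 0:
        return FailureMode.ZERO_BYTE_COMPLETION
    if socket_closed:
        return FailureMode.SOCKET_DISCONNECT
    return FailureMode.NONE
```

The ordering is the design choice that matters: a turn that completes with zero bytes and then loses its socket is labeled as a zero-byte completion, because that is where the session actually failed.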

The harness, raw audio, and run dirs are reproducible — happy to share with anyone running their own version of this.

The benchmark is meant to grow. The rows are meant to multiply. The ceilings are meant to be measured.