Audio Full-Duplex Protocol

This document defines the WebSocket protocol for Audio Full-Duplex mode. For video mode, see video-duplex-protocol.md.

One-Sentence Summary

One WebSocket connection: the client sends one second of audio every second, and the server streams audio and text back in real time.

Mode Constraints

Item	Value
Endpoint	`wss://host/v1/realtime?mode=audio`
Frame format	JSON text frames
Input audio	16 kHz, mono, float32 PCM, base64 encoded
Output audio	24 kHz, mono, float32 PCM, base64 encoded
Input video	Not recommended. Behavior is undefined if `video_frames` is sent
Total session duration limit	600 seconds (10 minutes), including waiting and idle time
Effective conversation time	~8 minutes
Context window	8192 tokens, fixed. The server closes the session when full

Lifecycle

┌─────────┐      ┌─────────┐      ┌─────────┐
│ Setup   │ ───→ │ Stream  │ ───→ │ Close   │
└─────────┘      └─────────┘      └─────────┘

Setup

Connection URL: wss://host/v1/realtime?mode=audio

session_id is generated by the server in the format rt_{timestamp_ms} and returned in the session.created event.

Client  ──WSS──→  Server
                  ← session.queued          (optional: only when queued)
                  ← session.queue_update    (optional: 0..N updates)
                  ← session.queue_done      (required: worker assigned)
Client  → session.update      "Use this prompt and voice"
Server  → session.created     "Ready, session_id=xxx"

Important: the client must wait for session.queue_done before sending session.update. The server always sends session.queue_done, even when no queueing occurs; if a worker is already idle, it is sent immediately.

Stream

Two independent streams run at the same time without blocking each other:

Upstream (Client → Server):          Downstream (Server → Client):
Every second, send:                  The model may push at any time:
  - 1 second of 16kHz audio            - 24kHz response audio chunks
  - no video                           - response text
                                       - listen-state signals

Close

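A session normally closes for one of three reasons:

reason	Triggered By	Meaning
`user_stop`	Client	User ends the session. Sent in `session.close`; the server confirms with `session.closed` and reason `stopped`
`timeout`	Server	Total session duration reaches 600 seconds
`context_full`	Server	The 8192-token context window is full

The server may also close a session with reason `server_shutdown` or `error`; see session.closed.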

Event Overview

Client → Server

Event	When to Send	Payload
`session.update`	Once at startup	System prompt and reference voice
`input_audio_buffer.append`	Once per second	1 second of audio, no video
`session.close`	When the client wants to end	Close reason

Server → Client

Event	Meaning	Payload
`session.created`	Session is configured	`session_id`
`response.output_audio.delta`	The model is speaking	Audio + text + `end_of_turn`
`response.listen`	The model is listening	Monitoring data such as KV cache length

Auxiliary events: session.queued, session.queue_update, session.queue_done, session.closed, error.

Message Format

session.update (Client → Server)

{
    "type": "session.update",
    "session": {
        "instructions": "You are a helpful English assistant.",
        "ref_audio": "<base64 WAV, 16kHz>",
        "tts_ref_audio": "<base64 WAV, 16kHz>"
    }
}

Field	Type	Required	Description
`instructions`	string	Yes	System prompt
`ref_audio`	string	No	LLM reference audio (base64 WAV, 16kHz), used for semantic/style cloning
`tts_ref_audio`	string	No	TTS reference audio (base64 WAV, 16kHz), used for acoustic voice cloning. Falls back to `ref_audio` when omitted

Audio Full-Duplex mode has no max_slice_nums field because there is no video input.
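
As an illustration of the fields above, the following Python sketch assembles a session.update frame. The function name build_session_update and its parameter names are hypothetical and not part of the protocol; the sketch assumes the caller already holds each reference clip as the bytes of a 16 kHz WAV file.

```python
import base64
import json
from typing import Optional


def build_session_update(
    instructions: str,
    ref_audio_wav: Optional[bytes] = None,
    tts_ref_audio_wav: Optional[bytes] = None,
) -> str:
    """Return a session.update JSON text frame.

    ref_audio_wav and tts_ref_audio_wav are hypothetical parameter names for
    raw WAV file bytes. Optional fields are omitted when they are None, in
    which case the documented fallback (tts_ref_audio -> ref_audio) applies.
    """
    session = {"instructions": instructions}
    if ref_audio_wav is not None:
        session["ref_audio"] = base64.b64encode(ref_audio_wav).decode("ascii")
    if tts_ref_audio_wav is not None:
        session["tts_ref_audio"] = base64.b64encode(tts_ref_audio_wav).decode("ascii")
    return json.dumps({"type": "session.update", "session": session})
```

Omitting tts_ref_audio rather than sending an empty string keeps the frame within the documented behavior, since the table only describes what happens when the field is absent.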

input_audio_buffer.append (Client → Server)

{
    "type": "input_audio_buffer.append",
    "audio": "<base64, 16000 samples = 1s, float32 PCM>",
    "force_listen": false
}

Field	Type	Required	Description
`audio`	string	Yes	16 kHz mono float32 PCM. One second = 16000 samples = 64000 bytes, base64 encoded. Minimum: 4000 samples (250ms)
`force_listen`	bool	No	Force the model into listen state, interrupting speech. Default: false

video_frames is not recommended in Audio Full-Duplex mode. Behavior is undefined if it is sent.
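
The encoding described above can be sketched in Python as follows. The helper name build_append is hypothetical, and the sketch assumes little-endian float32 samples, since this document does not state a byte order.

```python
import base64
import json
import struct
from typing import Sequence

INPUT_SAMPLE_RATE = 16_000  # Hz, from the Mode Constraints table
MIN_SAMPLES = 4_000         # 250 ms, the documented minimum chunk size


def build_append(samples: Sequence[float], force_listen: bool = False) -> str:
    """Pack samples as float32 PCM and wrap them in an append event.

    Assumption: little-endian byte order ("<" in the struct format).
    """
    if len(samples) < MIN_SAMPLES:
        raise ValueError(f"need at least {MIN_SAMPLES} samples, got {len(samples)}")
    pcm = struct.pack(f"<{len(samples)}f", *samples)  # 4 bytes per sample
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm).decode("ascii"),
        "force_listen": force_listen,
    })
```

For one second of audio (16000 samples) the packed payload is 64000 bytes before base64 encoding, matching the table above. Passing force_listen=True on any chunk is how a client would request an interruption.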

session.close (Client → Server)

{
    "type": "session.close",
    "reason": "user_stop"
}

Field	Type	Required	Description
`reason`	string	No	Close reason. Use `"user_stop"` for user-initiated close

session.created (Server → Client)

{
    "type": "session.created",
    "session_id": "rt_1714200000000",
    "prompt_length": 256
}

response.output_audio.delta (Server → Client)

{
    "type": "response.output_audio.delta",
    "text": "The weather is nice today",
    "audio": "<base64, 24000 samples = 1s, float32 PCM>",
    "end_of_turn": false,
    "kv_cache_length": 1024
}

Field	Type	Required	Description
`text`	string	Yes	Text fragment generated in this delta
`audio`	string	Yes	24 kHz mono float32 PCM, base64 encoded
`end_of_turn`	bool	Yes	Whether this turn has ended. `true` means the model has finished the utterance and will switch back to listening
`kv_cache_length`	int	Yes	Current KV cache token count. Limit: 8192

Because of the model architecture, text generation runs ahead of audio synthesis. The text and audio fields in the same response.output_audio.delta are therefore not strictly synchronized; text usually leads audio by several hundred milliseconds.

Example:

// delta 1
{ "text": "The weather is nice", "audio": "<audio: 'The weather'>" }

// delta 2
{ "text": " today", "audio": "<audio: 'is nice today'>" }

Clients should use audio playback progress as the primary user experience timeline. Text can be shown as preview text or subtitles.

Output Audio Length

Middle deltas: audio is exactly 1 second (24000 samples).
First delta: audio may be shorter than 1 second.
Last delta (end_of_turn=true): audio may be shorter than 1 second.
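
A minimal Python sketch of decoding one delta and measuring its audio length is shown below. The function names are hypothetical, and little-endian float32 is an assumption, since this document does not state a byte order.

```python
import base64
import json
import struct
from typing import List, Tuple

OUTPUT_SAMPLE_RATE = 24_000  # Hz, from the Mode Constraints table


def decode_delta(frame_text: str) -> Tuple[str, List[float], bool]:
    """Return (text, samples, end_of_turn) from a response.output_audio.delta."""
    msg = json.loads(frame_text)
    pcm = base64.b64decode(msg["audio"])
    if len(pcm) % 4:
        raise ValueError("float32 PCM length must be a multiple of 4 bytes")
    # Assumption: little-endian float32 samples.
    samples = list(struct.unpack(f"<{len(pcm) // 4}f", pcm))
    return msg["text"], samples, msg["end_of_turn"]


def duration_seconds(samples: List[float]) -> float:
    """Playback length of a decoded chunk at the 24 kHz output rate."""
    return len(samples) / OUTPUT_SAMPLE_RATE
```

Because the first and last deltas of a turn may be shorter than one second, a client would compute the duration from the sample count rather than assume a fixed chunk length.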

response.listen (Server → Client)

{
    "type": "response.listen",
    "kv_cache_length": 1024
}

The model is currently listening. On receiving this event, the client should stop playback and discard any audio that is still queued.
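
One way a client might apply this rule is sketched below in Python. The PlaybackQueue class and its method names are hypothetical; the sketch keeps text and audio separate, treats audio as the playback timeline, and empties the queue on response.listen.

```python
from collections import deque
from typing import Deque, List, Optional


class PlaybackQueue:
    """Client-side buffer of decoded audio awaiting playback (hypothetical helper)."""

    def __init__(self) -> None:
        self._chunks: Deque[List[float]] = deque()
        self.transcript = ""  # text received so far, usable as subtitles

    def on_delta(self, text: str, samples: List[float]) -> None:
        """Handle response.output_audio.delta: store text and audio separately."""
        self.transcript += text
        self._chunks.append(samples)

    def on_listen(self) -> int:
        """Handle response.listen: discard queued audio, return chunks dropped."""
        dropped = len(self._chunks)
        self._chunks.clear()
        return dropped

    def pop_next(self) -> Optional[List[float]]:
        """Return the next chunk for the audio device, or None when empty."""
        return self._chunks.popleft() if self._chunks else None

    def pending_samples(self) -> int:
        """Number of samples still waiting to be played."""
        return sum(len(chunk) for chunk in self._chunks)
```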

session.closed (Server → Client)

{
    "type": "session.closed",
    "reason": "timeout"
}

reason	Meaning
`stopped`	Confirmation after the client sends `session.close`
`timeout`	Total session duration reached 600 seconds in audio mode
`context_full`	8192-token context window is full
`server_shutdown`	Server is shutting down
`error`	Session terminated by an unrecoverable error
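
A client can anticipate the timeout and context_full closes by tracking elapsed time and the kv_cache_length values it receives. The Python sketch below is one possible approach; the SessionBudget class is hypothetical, and it assumes the 600-second clock starts when the WebSocket connects, which this document implies ("including waiting and idle time") but does not state exactly.

```python
import time
from typing import Optional

SESSION_LIMIT_S = 600        # total session duration limit
CONTEXT_LIMIT_TOKENS = 8192  # fixed context window


class SessionBudget:
    """Tracks how close a session is to its time and context limits."""

    def __init__(self, now: Optional[float] = None) -> None:
        # Assumption: the clock starts at connect time.
        self._start = time.monotonic() if now is None else now
        self.kv_cache_length = 0

    def observe(self, event: dict) -> None:
        """Record kv_cache_length from any server event that carries it."""
        if "kv_cache_length" in event:
            self.kv_cache_length = event["kv_cache_length"]

    def seconds_left(self, now: Optional[float] = None) -> float:
        """Seconds remaining before the 600-second limit, never negative."""
        now = time.monotonic() if now is None else now
        return max(0.0, SESSION_LIMIT_S - (now - self._start))

    def context_used(self) -> float:
        """Fraction of the 8192-token window used, per the latest event."""
        return self.kv_cache_length / CONTEXT_LIMIT_TOKENS
```

The optional now arguments exist only so the class can be exercised with fixed timestamps; in normal use they would be left out.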

Complete Timeline

Time ──────────────────────────────────────────────────────────→

                           ┌─────────────────────────────────┐
                           │  Phase 1: Connect & Queue       │
                           └─────────────────────────────────┘
Client:  WSS Connect ─────→
                           ← Server: session.queued
                           ← Server: session.queue_update
                           ← Server: session.queue_done

  The client must not send messages during queueing.
  If a worker is immediately available, queueing finishes immediately.

                           ┌─────────────────────────────────┐
                           │  Phase 2: Session Init          │
                           └─────────────────────────────────┘
Client:  session.update ─┐
Server:  session.created ←┘

                           ┌─────────────────────────────────┐
                           │  Phase 3: Full-Duplex Stream    │
                           └─────────────────────────────────┘
Client:  append(audio1) ──→
Client:  append(audio2) ──→
Client:  append(audio3) ──→     ← Server: listen
Client:  append(audio4) ──→
Client:  append(audio5) ──→     ← Server: listen
Client:  append(audio6) ──→     ← Server: output_audio.delta("Hello", audio, end_of_turn=false)
Client:  append(audio7) ──→     ← Server: output_audio.delta(",", audio, end_of_turn=false)
Client:  append(audio8) ──→     ← Server: output_audio.delta("how can I help?", audio, end_of_turn=true)
Client:  append(audio9) ──→     ← Server: listen
...

                           ┌─────────────────────────────────┐
                           │  Phase 4: Close                 │
                           └─────────────────────────────────┘
                           Any condition may trigger close:
                           - client sends session.close
                           - total duration >= 600s → timeout
                           - KV cache >= 8192 → context_full

Client:  session.close ──→
                           ← Server: session.closed {reason: "stopped"}

The client continues sending append events regardless of whether the server is listening or speaking. That is the full-duplex behavior.

Features Not Included as Protocol Events

Feature	Why No Separate Event Is Needed
Pause/Resume	Stopping `append` is equivalent to pausing; the model keeps listening
Cancel Generation	Use `force_listen=true` inside append to interrupt speech
Response Done Marker	`end_of_turn=true` already marks the end; no `response.done` event is needed
Context Window Configuration	Fixed at 8192 tokens

State Machine

          connect
             │
             ▼
    ┌──── QUEUED ──────┐
    │                  │    waiting for worker assignment
    │                  │    client must not send messages
    │                  │    may receive session.queued / session.queue_update
    └────────┬─────────┘
             │ receives session.queue_done
             ▼
    ┌─── CONNECTED ────┐
    │                  │    only session.update is allowed
    │                  │    otherwise → error (invalid_event)
    └────────┬─────────┘
             │ receives session.created
             ▼
    ┌──── ACTIVE ──────┐
    │                  │    append / close allowed
    │                  │    append may include force_listen=true
    └────────┬─────────┘
             │ close / timeout / context_full / error
             ▼
         CLOSED
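
The diagram above can be mirrored on the client side roughly as follows. This Python sketch is illustrative only: the class and method names are hypothetical, and it treats session.closed as the only event that moves the client to CLOSED, since most error events listed in this document do not close the connection.

```python
from enum import Enum


class State(Enum):
    QUEUED = "QUEUED"
    CONNECTED = "CONNECTED"
    ACTIVE = "ACTIVE"
    CLOSED = "CLOSED"


# Client events permitted in each state, as drawn in the diagram.
ALLOWED_TO_SEND = {
    State.QUEUED: set(),
    State.CONNECTED: {"session.update"},
    State.ACTIVE: {"input_audio_buffer.append", "session.close"},
    State.CLOSED: set(),
}


class ClientStateMachine:
    """Tracks the session state from the server events the client receives."""

    def __init__(self) -> None:
        self.state = State.QUEUED  # state immediately after connect

    def can_send(self, event_type: str) -> bool:
        """True if the client may send this event type in the current state."""
        return event_type in ALLOWED_TO_SEND[self.state]

    def on_server_event(self, event_type: str) -> State:
        """Advance the state for a received server event type."""
        if event_type == "session.closed":
            self.state = State.CLOSED
        elif self.state is State.QUEUED and event_type == "session.queue_done":
            self.state = State.CONNECTED
        elif self.state is State.CONNECTED and event_type == "session.created":
            self.state = State.ACTIVE
        return self.state
```

Checking can_send before every outgoing frame lets a client avoid sending during queueing or before session.created arrives.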

Error Codes

Client Errors

code	Meaning	WS Close
`not_ready`	Data sent before the session is ready	No
`unknown_event`	Unknown event type	No
`missing_field`	Required field missing	No
`invalid_payload`	Invalid field value, such as base64 decode failure	No

Server Errors

code	Meaning	WS Close
`service_unavailable`	Service is not ready	Yes (1013)
`queue_full`	Queue is full	Yes (1013)
`worker_busy`	No idle worker available	Yes (1013)
`worker_connect_failed`	Worker connection failed	Yes (1013)
`inference_error`	Inference failed	No, recoverable

Error Message Format

{
    "type": "error",
    "error": {
        "code": "invalid_payload",
        "message": "audio base64 decode failed",
        "type": "client_error"
    }
}
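
A client might parse error frames and decide whether to expect a WebSocket close along these lines. The Python sketch below derives its set of closing codes from the two tables above; the function name parse_error and the closes_connection key are hypothetical.

```python
import json
from typing import Optional

# Codes the Server Errors table marks as closing the WebSocket (close code 1013).
CLOSING_CODES = {
    "service_unavailable",
    "queue_full",
    "worker_busy",
    "worker_connect_failed",
}


def parse_error(frame_text: str) -> Optional[dict]:
    """Return a summary of an error frame, or None for any other frame type."""
    msg = json.loads(frame_text)
    if msg.get("type") != "error":
        return None
    err = msg["error"]
    return {
        "code": err["code"],
        "message": err.get("message", ""),
        "closes_connection": err["code"] in CLOSING_CODES,
    }
```

With the example frame above, the result has code "invalid_payload" and closes_connection set to False, so the client can keep streaming.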

Silent Drop

If the client sends chunks too quickly, the server may drop old chunks without returning an error. This ensures the newest chunk is always preferred.
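
To avoid sending faster than real time, a client can schedule each chunk against a fixed start time instead of sleeping for a fixed interval, which also keeps the schedule from drifting. The Python helper below is a minimal sketch; its name and parameters are hypothetical.

```python
CHUNK_SECONDS = 1.0  # one append per second, as described in this document


def seconds_until_chunk(start: float, chunk_index: int, now: float) -> float:
    """Seconds to wait before sending chunk number chunk_index (0-based).

    start and now are readings from the same monotonic clock. The result is
    0.0 when the chunk is already due.
    """
    due = start + chunk_index * CHUNK_SECONDS
    return max(0.0, due - now)
```

In a send loop this would typically be used as time.sleep(seconds_until_chunk(start, i, time.monotonic())) before sending chunk i.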

Invalid JSON

If a WebSocket frame cannot be parsed as JSON, the connection is closed with close code 1003.