Video Full-Duplex Protocol
Realtime full-duplex API protocol with video frames
This document defines the WebSocket protocol for Video Full-Duplex mode. For audio-only mode, see audio-duplex-protocol.md.
One-Sentence Summary
One WebSocket connection: the client sends one second of audio plus one video frame every second, and the server streams audio and text back in real time.
Mode Constraints
| Item | Value |
|---|---|
| Endpoint | wss://host/v1/realtime?mode=video |
| Frame format | JSON text frames |
| Input audio | 16 kHz, mono, float32 PCM, base64 encoded |
| Output audio | 24 kHz, mono, float32 PCM, base64 encoded |
| Input video | JPEG, base64 encoded, recommended on every append |
| Total session duration limit | 300 seconds (5 minutes), including waiting and idle time |
| Effective conversation time | About 90 seconds, since the total duration also covers queueing, initialization, and user silence |
| Context window | 8192 tokens, fixed. The server closes the session when full |
Lifecycle
┌─────────┐ ┌─────────┐ ┌─────────┐
│ Setup │ ───→ │ Stream │ ───→ │ Close │
└─────────┘      └─────────┘      └─────────┘
Setup
Connection URL: wss://host/v1/realtime?mode=video
session_id is generated by the server in the format rt_{timestamp_ms} and returned by session.created.
Client ──WSS──→ Server
← session.queued (optional: only when queued)
← session.queue_update (optional: 0..N updates)
← session.queue_done (required: worker assigned)
Client → session.update "Use this prompt and voice"
Server → session.created "Ready, session_id=xxx"
Important: the client must wait for session.queue_done before sending session.update. session.queue_done is a required event; if a worker is already idle, the server sends it immediately.
Stream
Two independent streams run at the same time without blocking each other:
Upstream (Client → Server), every second the client sends:
- 1 second of 16 kHz audio
- 1 JPEG video frame
Downstream (Server → Client), the model may push at any time:
- 24 kHz response audio chunks
- response text
- listen-state signals
Close
There are three close reasons:
| reason | Triggered By | Meaning |
|---|---|---|
user_stop | Client | User ends the session |
timeout | Server | Total session duration reaches 300 seconds |
context_full | Server | The 8192-token context window is full |
Event Overview
Client → Server
| Event | When to Send | Payload |
|---|---|---|
session.update | Once at startup | System prompt and reference voice |
input_audio_buffer.append | Once per second | 1 second of audio + 1 video frame |
session.close | When the client wants to end | Close reason |
Server → Client
| Event | Meaning | Payload |
|---|---|---|
session.created | Session is configured | session_id |
response.output_audio.delta | The model is speaking | Audio + text + end_of_turn |
response.listen | The model is listening | Monitoring data such as KV cache length |
Auxiliary events: session.queued, session.queue_update, session.queue_done, session.closed, error.
Message Format
session.update (Client → Server)
{
"type": "session.update",
"session": {
"instructions": "You are a helpful English assistant.",
"max_slice_nums": 1,
"ref_audio": "<base64 WAV, 16kHz>",
"tts_ref_audio": "<base64 WAV, 16kHz>"
}
}
| Field | Type | Required | Description |
|---|---|---|---|
instructions | string | Yes | System prompt |
max_slice_nums | int | No | Maximum video slices. 1 = fast, 64 tokens/frame; 4 = detailed, 192 tokens/frame. Default: 1 |
ref_audio | string | No | LLM reference audio (base64 WAV, 16kHz), used for semantic/style cloning |
tts_ref_audio | string | No | TTS reference audio (base64 WAV, 16kHz), used for acoustic voice cloning. Falls back to ref_audio when omitted |
input_audio_buffer.append (Client → Server)
{
"type": "input_audio_buffer.append",
"audio": "<base64, 16000 samples = 1s, float32 PCM>",
"video_frames": ["<base64 JPEG>"],
"force_listen": false,
"max_slice_nums": 1
}
| Field | Type | Required | Description |
|---|---|---|---|
audio | string | Yes | 16 kHz mono float32 PCM. One second = 16000 samples = 64000 bytes, base64 encoded. Minimum: 4000 samples (250ms) |
video_frames | string[] | No | JPEG frame list, usually one frame, base64 encoded. Recommended on every append in video mode; behavior is undefined if omitted |
force_listen | bool | No | Force the model into listen state, interrupting speech. Default: false |
max_slice_nums | int | No | Override the video slice count for this chunk (1 to 9) |
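Packing one upstream chunk looks like this. A minimal sketch, assuming a little-endian host (so `array("f")` produces the expected float32 PCM layout); the `build_append` helper name is illustrative. One second at 16 kHz is 16000 samples, which is 64000 bytes of float32 PCM before base64 encoding.

```python
import array
import base64
import json

MIN_SAMPLES = 4_000  # server minimum: 4000 samples = 250 ms

def build_append(samples: list[float], jpeg: bytes,
                 force_listen: bool = False) -> str:
    """Pack one second of float32 PCM plus one JPEG frame into an append event."""
    if len(samples) < MIN_SAMPLES:
        raise ValueError("chunk shorter than 250 ms")
    pcm = array.array("f", samples).tobytes()  # float32 PCM bytes
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm).decode("ascii"),
        "video_frames": [base64.b64encode(jpeg).decode("ascii")],
        "force_listen": force_listen,
    })
```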
session.close (Client → Server)
{
"type": "session.close",
"reason": "user_stop"
}
| Field | Type | Required | Description |
|---|---|---|---|
reason | string | No | Close reason. Use "user_stop" for user-initiated close |
session.created (Server → Client)
{
"type": "session.created",
"session_id": "rt_1714200000000",
"prompt_length": 256
}
response.output_audio.delta (Server → Client)
{
"type": "response.output_audio.delta",
"text": "The weather is nice today",
"audio": "<base64, 24000 samples = 1s, float32 PCM>",
"end_of_turn": false,
"kv_cache_length": 1024
}
| Field | Type | Required | Description |
|---|---|---|---|
text | string | Yes | Text fragment generated in this delta |
audio | string | Yes | 24 kHz mono float32 PCM, base64 encoded |
end_of_turn | bool | Yes | Whether this turn has ended. true means the model has finished the utterance and will switch back to listening |
kv_cache_length | int | Yes | Current KV cache token count. Limit: 8192 |
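Decoding a delta on the client side can be sketched as below. The `decode_delta` helper and the 512-token warning margin are assumptions for illustration, not part of the protocol; only the field names and the 8192-token limit come from this document.

```python
import array
import base64
import json

CONTEXT_LIMIT = 8192  # fixed context window; server closes the session when full

def decode_delta(raw: str) -> tuple[list[float], str, bool]:
    """Decode one response.output_audio.delta into (samples, text, end_of_turn).

    samples are 24 kHz mono float32 values ready to hand to an audio player.
    """
    event = json.loads(raw)
    pcm = base64.b64decode(event["audio"])
    samples = array.array("f", pcm).tolist()
    if event["kv_cache_length"] > CONTEXT_LIMIT - 512:
        # Arbitrary client-side margin: warn before a context_full close.
        print("warning: context nearly full, expect context_full close")
    return samples, event["text"], event["end_of_turn"]
```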
Text/Audio Alignment
Because of the model architecture, text generation leads audio synthesis. The text and audio fields inside the same output_audio.delta are not strictly synchronized; text usually leads audio by several hundred milliseconds.
Example:
// delta 1
{ "text": "The weather is nice", "audio": "<audio: 'The weather'>" }
// delta 2
{ "text": " today", "audio": "<audio: 'is nice today'>" }Clients should use audio playback progress as the primary user experience timeline. Text can be shown as preview text or subtitles.
Output Audio Length
- Middle deltas: audio is exactly 1 second (24000 samples).
- First delta: audio may be shorter than 1 second.
- Last delta (end_of_turn=true): audio may be shorter than 1 second.
response.listen (Server → Client)
{
"type": "response.listen",
"kv_cache_length": 1024
}
The model is currently listening. On receiving this event, the client should stop playback of any remaining queued audio.
session.closed (Server → Client)
{
"type": "session.closed",
"reason": "timeout"
}
| reason | Meaning |
|---|---|
stopped | Confirmation after the client sends session.close |
timeout | Total session duration reached 300 seconds in video mode |
context_full | 8192-token context window is full |
server_shutdown | Server is shutting down |
error | Session terminated by an unrecoverable error |
Complete Timeline
Time ──────────────────────────────────────────────────────────→
┌─────────────────────────────────┐
│ Phase 1: Connect & Queue │
└─────────────────────────────────┘
Client: WSS Connect ─────→
← Server: session.queued
← Server: session.queue_update
← Server: session.queue_done
The client should not send messages during queueing.
If a worker is immediately available, queueing finishes immediately.
┌─────────────────────────────────┐
│ Phase 2: Session Init │
└─────────────────────────────────┘
Client: session.update ─┐
Server: session.created ←┘
┌─────────────────────────────────┐
│ Phase 3: Full-Duplex Stream │
└─────────────────────────────────┘
Client: append(audio1 + frame1) ──→
Client: append(audio2 + frame2) ──→
Client: append(audio3 + frame3) ──→ ← Server: listen
Client: append(audio4 + frame4) ──→
Client: append(audio5 + frame5) ──→ ← Server: listen
Client: append(audio6 + frame6) ──→ ← Server: output_audio.delta("Hello", audio, end_of_turn=false)
Client: append(audio7 + frame7) ──→ ← Server: output_audio.delta(",", audio, end_of_turn=false)
Client: append(audio8 + frame8) ──→ ← Server: output_audio.delta("how can I help?", audio, end_of_turn=true)
Client: append(audio9 + frame9) ──→ ← Server: listen
...
┌─────────────────────────────────┐
│ Phase 4: Close │
└─────────────────────────────────┘
Any condition may trigger close:
- client sends session.close
- total duration >= 300s → timeout
- KV cache >= 8192 → context_full
Client: session.close ──→
← Server: session.closed {reason: "stopped"}
The client keeps sending append events regardless of whether the server is listening or speaking; that is what makes the stream full-duplex.
Features Not Included as Protocol Events
| Feature | Why No Separate Event Is Needed |
|---|---|
| Pause/Resume | Stopping append is equivalent to pausing; the model keeps listening |
| Cancel Generation | Use force_listen=true inside append to interrupt speech |
| Response Done Marker | end_of_turn=true already marks the end; no response.done event is needed |
| Context Window Configuration | Fixed at 8192 tokens |
State Machine
connect
│
▼
┌──── QUEUED ─────┐
│ │ waiting for worker assignment
│ │ client must not send messages
│ │ may receive session.queued / session.queue_update
└────────┬─────────┘
│ receives session.queue_done
▼
┌─── CONNECTED ───┐
│ │ only session.update is allowed
│ │ otherwise → error (invalid_event)
└────────┬─────────┘
│ receives session.created
▼
┌──── ACTIVE ─────┐
│ │ append (video_frames recommended) / close allowed
│ │ append may include force_listen=true
└────────┬─────────┘
│ close / timeout / context_full / error
▼
CLOSED
Error Codes
Client Errors
| code | Meaning | WS Close |
|---|---|---|
not_ready | Data sent before the session is ready | No |
unknown_event | Unknown event type | No |
missing_field | Required field missing | No |
invalid_payload | Invalid field value, such as base64/JPEG decode failure | No |
Server Errors
| code | Meaning | WS Close |
|---|---|---|
service_unavailable | Service is not ready | Yes (1013) |
queue_full | Queue is full | Yes (1013) |
worker_busy | No idle worker available | Yes (1013) |
worker_connect_failed | Worker connection failed | Yes (1013) |
inference_error | Inference failed | No, recoverable |
Error Message Format
{
"type": "error",
"error": {
"code": "missing_field",
"message": "audio field is required",
"type": "client_error"
}
}
Silent Drop
If the client sends chunks too quickly, the server may drop old chunks without returning an error. This ensures the newest chunk is always preferred.
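A client can mirror this drop-oldest behavior on its own side with a bounded queue, so the newest capture always wins if the network stalls. This is an optional client-side sketch, not a protocol requirement; the queue depth of 2 is an arbitrary illustration.

```python
from collections import deque

# At most 2 chunks in flight; appending a 3rd silently evicts the oldest,
# mirroring the server's drop-old-chunks behavior.
send_queue: deque = deque(maxlen=2)

for chunk_id in range(5):
    send_queue.append(f"chunk-{chunk_id}")

print(list(send_queue))  # only the two newest chunks survive
```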
Invalid JSON
If a WebSocket frame cannot be parsed as JSON, the connection is closed with close code 1003.