MiniCPM-o Docs
Realtime API

Video Full-Duplex Protocol

Realtime full-duplex API protocol with video frames

This document defines the WebSocket protocol for Video Full-Duplex mode. For audio-only mode, see audio-duplex-protocol.md.


One-Sentence Summary

One WebSocket connection: the client sends one second of audio plus one video frame every second, and the server streams audio and text back in real time.


Mode Constraints

| Item | Value |
| --- | --- |
| Endpoint | wss://host/v1/realtime?mode=video |
| Frame format | JSON text frames |
| Input audio | 16 kHz, mono, float32 PCM, base64 encoded |
| Output audio | 24 kHz, mono, float32 PCM, base64 encoded |
| Input video | JPEG, base64 encoded, recommended on every append |
| Total session duration limit | 300 seconds (5 minutes), including waiting and idle time |
| Effective conversation time | About 90 seconds; the total duration includes queueing, initialization, and user silence |
| Context window | 8192 tokens, fixed; the server closes the session when full |
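The audio constraints above translate directly into a pair of small codec helpers. This is a minimal sketch, assuming samples are held as Python floats; the function names `encode_input_audio` and `decode_output_audio` are illustrative, not part of the protocol:

```python
import base64
from array import array

SAMPLE_RATE_IN = 16_000   # input audio: 16 kHz mono float32 PCM
SAMPLE_RATE_OUT = 24_000  # output audio: 24 kHz mono float32 PCM

def encode_input_audio(samples: list[float]) -> str:
    """Pack one second of float32 samples and base64-encode them.

    array('f') is native float32; this assumes a little-endian IEEE-754
    platform, which covers all mainstream client targets.
    """
    if len(samples) != SAMPLE_RATE_IN:
        raise ValueError("a 1-second chunk must be exactly 16000 samples")
    raw = array("f", samples).tobytes()  # 16000 * 4 = 64000 bytes
    return base64.b64encode(raw).decode("ascii")

def decode_output_audio(b64: str) -> array:
    """Decode one base64 float32 chunk from response.output_audio.delta."""
    return array("f", base64.b64decode(b64))
```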

Lifecycle

┌─────────┐      ┌─────────┐      ┌─────────┐
│ Setup   │ ───→ │ Stream  │ ───→ │ Close   │
└─────────┘      └─────────┘      └─────────┘

Setup

Connection URL: wss://host/v1/realtime?mode=video

session_id is generated by the server in the format rt_{timestamp_ms} and returned by session.created.

Client  ──WSS──→  Server
                  ← session.queued          (optional: only when queued)
                  ← session.queue_update    (optional: 0..N updates)
                  ← session.queue_done      (required: worker assigned)
Client  → session.update      "Use this prompt and voice"
Server  → session.created     "Ready, session_id=xxx"

Important: the client must wait for session.queue_done before sending session.update. This event is always sent: if a worker is already idle, the server sends it immediately.

Stream

Two independent streams run at the same time without blocking each other:

Upstream (Client → Server):          Downstream (Server → Client):
Every second, send:                  The model may push at any time:
  - 1 second of 16kHz audio            - 24kHz response audio chunks
  - 1 JPEG video frame                 - response text
                                       - listen-state signals

Close

There are three close reasons:

| reason | Triggered By | Meaning |
| --- | --- | --- |
| user_stop | Client | User ends the session |
| timeout | Server | Total session duration reaches 300 seconds |
| context_full | Server | The 8192-token context window is full |

Event Overview

Client → Server

| Event | When to Send | Payload |
| --- | --- | --- |
| session.update | Once at startup | System prompt and reference voice |
| input_audio_buffer.append | Once per second | 1 second of audio + 1 video frame |
| session.close | When the client wants to end | Close reason |

Server → Client

| Event | Meaning | Payload |
| --- | --- | --- |
| session.created | Session is configured | session_id |
| response.output_audio.delta | The model is speaking | Audio + text + end_of_turn |
| response.listen | The model is listening | Monitoring data such as KV cache length |

Auxiliary events: session.queued, session.queue_update, session.queue_done, session.closed, error.


Message Format

session.update (Client → Server)

{
    "type": "session.update",
    "session": {
        "instructions": "You are a helpful English assistant.",
        "max_slice_nums": 1,
        "ref_audio": "<base64 WAV, 16kHz>",
        "tts_ref_audio": "<base64 WAV, 16kHz>"
    }
}
| Field | Type | Required | Description |
| --- | --- | --- | --- |
| instructions | string | Yes | System prompt |
| max_slice_nums | int | No | Maximum video slices. 1 = fast, 64 tokens/frame; 4 = detailed, 192 tokens/frame. Default: 1 |
| ref_audio | string | No | LLM reference audio (base64 WAV, 16 kHz), used for semantic/style cloning |
| tts_ref_audio | string | No | TTS reference audio (base64 WAV, 16 kHz), used for acoustic voice cloning. Falls back to ref_audio when omitted |

input_audio_buffer.append (Client → Server)

{
    "type": "input_audio_buffer.append",
    "audio": "<base64, 16000 samples = 1s, float32 PCM>",
    "video_frames": ["<base64 JPEG>"],
    "force_listen": false,
    "max_slice_nums": 1
}
| Field | Type | Required | Description |
| --- | --- | --- | --- |
| audio | string | Yes | 16 kHz mono float32 PCM. One second = 16000 samples = 64000 bytes, base64 encoded. Minimum: 4000 samples (250 ms) |
| video_frames | string[] | No | JPEG frame list, usually one frame, base64 encoded. Recommended on every append in video mode; behavior is undefined if omitted |
| force_listen | bool | No | Force the model into listen state, interrupting speech. Default: false |
| max_slice_nums | int | No | Override the video slice count for this chunk (1 to 9) |

session.close (Client → Server)

{
    "type": "session.close",
    "reason": "user_stop"
}
| Field | Type | Required | Description |
| --- | --- | --- | --- |
| reason | string | No | Close reason. Use "user_stop" for user-initiated close |

session.created (Server → Client)

{
    "type": "session.created",
    "session_id": "rt_1714200000000",
    "prompt_length": 256
}

response.output_audio.delta (Server → Client)

{
    "type": "response.output_audio.delta",
    "text": "The weather is nice today",
    "audio": "<base64, 24000 samples = 1s, float32 PCM>",
    "end_of_turn": false,
    "kv_cache_length": 1024
}
| Field | Type | Required | Description |
| --- | --- | --- | --- |
| text | string | Yes | Text fragment generated in this delta |
| audio | string | Yes | 24 kHz mono float32 PCM, base64 encoded |
| end_of_turn | bool | Yes | Whether this turn has ended; true means the model has finished the utterance and will switch back to listening |
| kv_cache_length | int | Yes | Current KV cache token count. Limit: 8192 |

Text/Audio Alignment

Because of the model architecture, text generation leads audio synthesis. The text and audio fields inside the same output_audio.delta are not strictly synchronized; text usually leads audio by several hundred milliseconds.

Example:

// delta 1
{ "text": "The weather is nice", "audio": "<audio: 'The weather'>" }

// delta 2
{ "text": " today", "audio": "<audio: 'is nice today'>" }

Clients should use audio playback progress as the primary user experience timeline. Text can be shown as preview text or subtitles.

Output Audio Length

  • Middle deltas: audio is exactly 1 second (24000 samples).
  • First delta: audio may be shorter than 1 second.
  • Last delta (end_of_turn=true): audio may be shorter than 1 second.
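Since chunk length varies at the edges of a turn, the client can derive the playable duration of each delta from the payload size rather than assuming one second. A small helper, with an illustrative name:

```python
import base64

OUT_RATE = 24_000  # output sample rate in Hz

def delta_audio_seconds(audio_b64: str) -> float:
    """Playable duration of one output_audio.delta chunk, from payload size."""
    n_bytes = len(base64.b64decode(audio_b64))
    return (n_bytes // 4) / OUT_RATE  # float32 = 4 bytes per sample
```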

response.listen (Server → Client)

{
    "type": "response.listen",
    "kv_cache_length": 1024
}

Sent while the model is listening. On receiving this event, the client should discard any queued playback audio that has not yet been played.
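The discard-on-listen rule can be sketched as a small client-side playback buffer. The class and method names are assumptions, not part of the protocol:

```python
from collections import deque

class PlaybackQueue:
    """Buffer of decoded output-audio chunks awaiting playback.

    Deltas are enqueued as they arrive; response.listen means the model has
    switched back to listening, so anything not yet played is discarded.
    """
    def __init__(self) -> None:
        self._chunks: deque = deque()

    def on_audio_delta(self, samples) -> None:
        self._chunks.append(samples)

    def on_listen(self) -> None:
        self._chunks.clear()  # drop un-played audio on response.listen

    def next_chunk(self):
        return self._chunks.popleft() if self._chunks else None
```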

session.closed (Server → Client)

{
    "type": "session.closed",
    "reason": "timeout"
}
| reason | Meaning |
| --- | --- |
| stopped | Confirmation after the client sends session.close |
| timeout | Total session duration reached 300 seconds in video mode |
| context_full | 8192-token context window is full |
| server_shutdown | Server is shutting down |
| error | Session terminated by an unrecoverable error |

Complete Timeline

Time ──────────────────────────────────────────────────────────→

                           ┌─────────────────────────────────┐
                           │  Phase 1: Connect & Queue       │
                           └─────────────────────────────────┘
Client:  WSS Connect ─────→
                           ← Server: session.queued
                           ← Server: session.queue_update
                           ← Server: session.queue_done

  The client should not send messages during queueing.
  If a worker is immediately available, queueing finishes immediately.

                           ┌─────────────────────────────────┐
                           │  Phase 2: Session Init          │
                           └─────────────────────────────────┘
Client:  session.update ─┐
Server:  session.created ←┘

                           ┌─────────────────────────────────┐
                           │  Phase 3: Full-Duplex Stream    │
                           └─────────────────────────────────┘
Client:  append(audio1 + frame1) ──→
Client:  append(audio2 + frame2) ──→
Client:  append(audio3 + frame3) ──→     ← Server: listen
Client:  append(audio4 + frame4) ──→
Client:  append(audio5 + frame5) ──→     ← Server: listen
Client:  append(audio6 + frame6) ──→     ← Server: output_audio.delta("Hello", audio, end_of_turn=false)
Client:  append(audio7 + frame7) ──→     ← Server: output_audio.delta(",", audio, end_of_turn=false)
Client:  append(audio8 + frame8) ──→     ← Server: output_audio.delta("how can I help?", audio, end_of_turn=true)
Client:  append(audio9 + frame9) ──→     ← Server: listen
...

                           ┌─────────────────────────────────┐
                           │  Phase 4: Close                 │
                           └─────────────────────────────────┘
                           Any condition may trigger close:
                           - client sends session.close
                           - total duration >= 300s → timeout
                           - KV cache >= 8192 → context_full

Client:  session.close ──→
                           ← Server: session.closed {reason: "stopped"}

The client keeps sending append events regardless of whether the server is listening or speaking; that is what makes the stream full-duplex.
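The downstream half of the loop above amounts to routing each frame by its type field. A minimal dispatcher sketch, assuming handlers are plain callables (the function name is illustrative):

```python
import json

def dispatch(frame: str, handlers: dict) -> str:
    """Route one downstream frame to a handler keyed on its "type" field.

    Unknown event types are ignored so that auxiliary events such as
    session.queue_update do not break the client.
    """
    event = json.loads(frame)
    etype = event.get("type", "")
    handler = handlers.get(etype)
    if handler is not None:
        handler(event)
    return etype
```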


Features Not Included as Protocol Events

FeatureWhy No Separate Event Is Needed
Pause/ResumeStopping append is equivalent to pausing; the model keeps listening
Cancel GenerationUse force_listen=true inside append to interrupt speech
Response Done Markerend_of_turn=true already marks the end; no response.done event is needed
Context Window ConfigurationFixed at 8192 tokens

State Machine

          connect
             │
             ▼
    ┌───── QUEUED ─────┐
    │                  │    waiting for worker assignment
    │                  │    client must not send messages
    │                  │    may receive session.queued / session.queue_update
    └────────┬─────────┘
             │ receives session.queue_done
             ▼
    ┌──── CONNECTED ───┐
    │                  │    only session.update is allowed
    │                  │    otherwise → error (invalid_event)
    └────────┬─────────┘
             │ receives session.created
             ▼
    ┌───── ACTIVE ─────┐
    │                  │    append (video_frames recommended) / close allowed
    │                  │    append may include force_listen=true
    └────────┬─────────┘
             │ close / timeout / context_full / error
             ▼
          CLOSED
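A client can enforce these states with a small lookup-table machine. This is a sketch under the transitions shown above; the class name and table names are illustrative:

```python
# Allowed client events per state, following the diagram above.
ALLOWED = {
    "QUEUED": set(),                                        # must stay silent
    "CONNECTED": {"session.update"},
    "ACTIVE": {"input_audio_buffer.append", "session.close"},
    "CLOSED": set(),
}

# Server events that advance the client-side state.
TRANSITIONS = {
    ("QUEUED", "session.queue_done"): "CONNECTED",
    ("CONNECTED", "session.created"): "ACTIVE",
    ("ACTIVE", "session.closed"): "CLOSED",
}

class ClientState:
    def __init__(self) -> None:
        self.state = "QUEUED"

    def on_server_event(self, event_type: str) -> None:
        self.state = TRANSITIONS.get((self.state, event_type), self.state)

    def can_send(self, event_type: str) -> bool:
        return event_type in ALLOWED[self.state]
```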

Error Codes

Client Errors

| code | Meaning | WS Close |
| --- | --- | --- |
| not_ready | Data sent before the session is ready | No |
| unknown_event | Unknown event type | No |
| missing_field | Required field missing | No |
| invalid_payload | Invalid field value, such as a base64/JPEG decode failure | No |

Server Errors

| code | Meaning | WS Close |
| --- | --- | --- |
| service_unavailable | Service is not ready | Yes (1013) |
| queue_full | Queue is full | Yes (1013) |
| worker_busy | No idle worker available | Yes (1013) |
| worker_connect_failed | Worker connection failed | Yes (1013) |
| inference_error | Inference failed | No, recoverable |

Error Message Format

{
    "type": "error",
    "error": {
        "code": "missing_field",
        "message": "audio field is required",
        "type": "client_error"
    }
}

Silent Drop

If the client sends chunks faster than real time, the server may drop older chunks without returning an error, so the newest chunk always takes priority.

Invalid JSON

If a WebSocket frame cannot be parsed as JSON, the connection is closed with close code 1003.
