MiniCPM-o Docs
Realtime API

Video Full-Duplex Protocol

Realtime full-duplex API protocol with video frames

This document defines the WebSocket protocol for Video Full-Duplex mode. For audio-only mode, see audio-duplex-protocol.md.


One-Sentence Summary

One WebSocket connection: the client sends one second of audio plus one video frame every second, and the server streams audio and text back in real time.


Mode Constraints

| Item | Value |
| --- | --- |
| Endpoint | wss://host/v1/realtime?mode=video |
| Frame format | JSON text frames |
| Input audio | 16 kHz, mono, float32 PCM, base64 encoded |
| Output audio | 24 kHz, mono, float32 PCM, base64 encoded |
| Input video | JPEG, base64 encoded, recommended on every append |
| Total session duration limit | 300 seconds (5 minutes), including waiting and idle time |
| Effective conversation time | About 90 seconds; the total duration includes queueing, initialization, and user silence |
| Context window | 8192 tokens, fixed; the server closes the session when full |
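The audio constraints above translate directly into a pair of small codec helpers. This is a minimal sketch, assuming samples are held as Python floats; the function names `encode_input_audio` and `decode_output_audio` are illustrative, not part of the protocol:

```python
import base64
from array import array

SAMPLE_RATE_IN = 16_000   # input audio: 16 kHz mono float32 PCM
SAMPLE_RATE_OUT = 24_000  # output audio: 24 kHz mono float32 PCM

def encode_input_audio(samples: list[float]) -> str:
    """Pack one second of float32 samples and base64-encode them.

    array('f') is native float32; this assumes a little-endian IEEE-754
    platform, which covers all mainstream client targets.
    """
    if len(samples) != SAMPLE_RATE_IN:
        raise ValueError("a 1-second chunk must be exactly 16000 samples")
    raw = array("f", samples).tobytes()  # 16000 * 4 = 64000 bytes
    return base64.b64encode(raw).decode("ascii")

def decode_output_audio(b64: str) -> array:
    """Decode one base64 float32 chunk from response.output_audio.delta."""
    return array("f", base64.b64decode(b64))
```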

Lifecycle

┌─────────┐      ┌─────────┐      ┌─────────┐
│ Setup   │ ───→ │ Stream  │ ───→ │ Close   │
└─────────┘      └─────────┘      └─────────┘

Setup

Connection URL: wss://host/v1/realtime?mode=video

session_id is generated by the server in the format rt_{timestamp_ms} and returned by session.created.

Client  ──WSS──→  Server
                  ← session.queued          (optional: only when queued)
                  ← session.queue_update    (optional: 0..N updates)
                  ← session.queue_done      (required: worker assigned)
Client  → session.update      "Use this prompt and voice"
Server  → session.created     "Ready, session_id=xxx"

Important: the client must wait for session.queue_done before sending session.update. This event is always sent: if a worker is already idle, the server sends it immediately.

Stream

Two independent streams run at the same time without blocking each other:

Upstream (Client → Server):          Downstream (Server → Client):
Every second, send:                  The model may push at any time:
  - 1 second of 16kHz audio            - 24kHz response audio chunks
  - 1 JPEG video frame                 - response text
                                       - listen-state signals

Close

There are three close reasons:

| reason | Triggered By | Meaning |
| --- | --- | --- |
| user_stop | Client | User ends the session |
| timeout | Server | Total session duration reaches 300 seconds |
| context_full | Server | The 8192-token context window is full |

Event Overview

Client → Server

| Event | When to Send | Payload |
| --- | --- | --- |
| session.update | Once at startup | System prompt and reference voice |
| input_audio_buffer.append | Once per second | 1 second of audio + 1 video frame |
| session.close | When the client wants to end | Close reason |

Server → Client

| Event | Meaning | Payload |
| --- | --- | --- |
| session.created | Session is configured | session_id |
| response.output_audio.delta | The model is speaking | Audio + text + end_of_turn |
| response.listen | The model is listening | Monitoring data such as KV cache length |

Auxiliary events: session.queued, session.queue_update, session.queue_done, session.closed, error.


Message Format

session.update (Client → Server)

{
    "type": "session.update",
    "session": {
        "instructions": "You are a helpful English assistant.",
        "max_slice_nums": 1,
        "ref_audio": "<base64 WAV, 16kHz>",
        "tts_ref_audio": "<base64 WAV, 16kHz>"
    }
}
| Field | Type | Required | Description |
| --- | --- | --- | --- |
| instructions | string | Yes | System prompt |
| max_slice_nums | int | No | Maximum video slices. 1 = fast, 64 tokens/frame; 4 = detailed, 192 tokens/frame. Default: 1 |
| ref_audio | string | No | LLM reference audio (base64 WAV, 16 kHz), used for semantic/style cloning |
| tts_ref_audio | string | No | TTS reference audio (base64 WAV, 16 kHz), used for acoustic voice cloning. Falls back to ref_audio when omitted |

input_audio_buffer.append (Client → Server)

{
    "type": "input_audio_buffer.append",
    "audio": "<base64, 16000 samples = 1s, float32 PCM>",
    "video_frames": ["<base64 JPEG>"],
    "force_listen": false,
    "max_slice_nums": 1
}
| Field | Type | Required | Description |
| --- | --- | --- | --- |
| audio | string | Yes | 16 kHz mono float32 PCM. One second = 16000 samples = 64000 bytes, base64 encoded. Minimum: 4000 samples (250 ms) |
| video_frames | string[] | No | JPEG frame list, usually one frame, base64 encoded. Recommended on every append in video mode; behavior is undefined if omitted |
| force_listen | bool | No | Force the model into listen state, interrupting speech. Default: false |
| max_slice_nums | int | No | Override the video slice count for this chunk (1 to 9) |

session.close (Client → Server)

{
    "type": "session.close",
    "reason": "user_stop"
}
| Field | Type | Required | Description |
| --- | --- | --- | --- |
| reason | string | No | Close reason. Use "user_stop" for user-initiated close |

session.created (Server → Client)

{
    "type": "session.created",
    "session_id": "rt_1714200000000",
    "prompt_length": 256
}

response.output_audio.delta (Server → Client)

{
    "type": "response.output_audio.delta",
    "text": "The weather is nice today",
    "audio": "<base64, 24000 samples = 1s, float32 PCM>",
    "end_of_turn": false,
    "kv_cache_length": 1024
}
| Field | Type | Required | Description |
| --- | --- | --- | --- |
| text | string | Yes | Text fragment generated in this delta |
| audio | string | Yes | 24 kHz mono float32 PCM, base64 encoded |
| end_of_turn | bool | Yes | Whether this turn has ended; true means the model has finished the utterance and will switch back to listening |
| kv_cache_length | int | Yes | Current KV cache token count. Limit: 8192 |

Text/Audio Alignment

Because of the model architecture, text generation leads audio synthesis. The text and audio fields inside the same output_audio.delta are not strictly synchronized; text usually leads audio by several hundred milliseconds.

Example:

// delta 1
{ "text": "The weather is nice", "audio": "<audio: 'The weather'>" }

// delta 2
{ "text": " today", "audio": "<audio: 'is nice today'>" }

Clients should use audio playback progress as the primary user experience timeline. Text can be shown as preview text or subtitles.

Output Audio Length

  • Middle deltas: audio is exactly 1 second (24000 samples).
  • First delta: audio may be shorter than 1 second.
  • Last delta (end_of_turn=true): audio may be shorter than 1 second.
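Since chunk length varies at the edges of a turn, the client can derive the playable duration of each delta from the payload size rather than assuming one second. A small helper, with an illustrative name:

```python
import base64

OUT_RATE = 24_000  # output sample rate in Hz

def delta_audio_seconds(audio_b64: str) -> float:
    """Playable duration of one output_audio.delta chunk, from payload size."""
    n_bytes = len(base64.b64decode(audio_b64))
    return (n_bytes // 4) / OUT_RATE  # float32 = 4 bytes per sample
```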

response.listen (Server → Client)

{
    "type": "response.listen",
    "kv_cache_length": 1024
}

Sent while the model is listening. On receiving this event, the client should discard any queued playback audio that has not yet been played.
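The discard-on-listen rule can be sketched as a small client-side playback buffer. The class and method names are assumptions, not part of the protocol:

```python
from collections import deque

class PlaybackQueue:
    """Buffer of decoded output-audio chunks awaiting playback.

    Deltas are enqueued as they arrive; response.listen means the model has
    switched back to listening, so anything not yet played is discarded.
    """
    def __init__(self) -> None:
        self._chunks: deque = deque()

    def on_audio_delta(self, samples) -> None:
        self._chunks.append(samples)

    def on_listen(self) -> None:
        self._chunks.clear()  # drop un-played audio on response.listen

    def next_chunk(self):
        return self._chunks.popleft() if self._chunks else None
```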

session.closed (Server → Client)

{
    "type": "session.closed",
    "reason": "timeout"
}
| reason | Meaning |
| --- | --- |
| stopped | Confirmation after the client sends session.close |
| timeout | Total session duration reached 300 seconds in video mode |
| context_full | 8192-token context window is full |
| server_shutdown | Server is shutting down |
| error | Session terminated by an unrecoverable error |

Complete Timeline

Time ──────────────────────────────────────────────────────────→

                           ┌─────────────────────────────────┐
                           │  Phase 1: Connect & Queue       │
                           └─────────────────────────────────┘
Client:  WSS Connect ─────→
                           ← Server: session.queued
                           ← Server: session.queue_update
                           ← Server: session.queue_done

  The client should not send messages during queueing.
  If a worker is immediately available, queueing finishes immediately.

                           ┌─────────────────────────────────┐
                           │  Phase 2: Session Init          │
                           └─────────────────────────────────┘
Client:  session.update ─┐
Server:  session.created ←┘

                           ┌─────────────────────────────────┐
                           │  Phase 3: Full-Duplex Stream    │
                           └─────────────────────────────────┘
Client:  append(audio1 + frame1) ──→
Client:  append(audio2 + frame2) ──→
Client:  append(audio3 + frame3) ──→     ← Server: listen
Client:  append(audio4 + frame4) ──→
Client:  append(audio5 + frame5) ──→     ← Server: listen
Client:  append(audio6 + frame6) ──→     ← Server: output_audio.delta("Hello", audio, end_of_turn=false)
Client:  append(audio7 + frame7) ──→     ← Server: output_audio.delta(",", audio, end_of_turn=false)
Client:  append(audio8 + frame8) ──→     ← Server: output_audio.delta("how can I help?", audio, end_of_turn=true)
Client:  append(audio9 + frame9) ──→     ← Server: listen
...

                           ┌─────────────────────────────────┐
                           │  Phase 4: Close                 │
                           └─────────────────────────────────┘
                           Any condition may trigger close:
                           - client sends session.close
                           - total duration >= 300s → timeout
                           - KV cache >= 8192 → context_full

Client:  session.close ──→
                           ← Server: session.closed {reason: "stopped"}

The client keeps sending append events regardless of whether the server is listening or speaking; that is what makes the stream full-duplex.
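The downstream half of the loop above amounts to routing each frame by its type field. A minimal dispatcher sketch, assuming handlers are plain callables (the function name is illustrative):

```python
import json

def dispatch(frame: str, handlers: dict) -> str:
    """Route one downstream frame to a handler keyed on its "type" field.

    Unknown event types are ignored so that auxiliary events such as
    session.queue_update do not break the client.
    """
    event = json.loads(frame)
    etype = event.get("type", "")
    handler = handlers.get(etype)
    if handler is not None:
        handler(event)
    return etype
```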


Features Not Included as Protocol Events

FeatureWhy No Separate Event Is Needed
Pause/ResumeStopping append is equivalent to pausing; the model keeps listening
Cancel GenerationUse force_listen=true inside append to interrupt speech
Response Done Markerend_of_turn=true already marks the end; no response.done event is needed
Context Window ConfigurationFixed at 8192 tokens

State Machine

          connect
             │
             ▼
    ┌───── QUEUED ─────┐
    │                  │    waiting for worker assignment
    │                  │    client must not send messages
    │                  │    may receive session.queued / session.queue_update
    └────────┬─────────┘
             │ receives session.queue_done
             ▼
    ┌──── CONNECTED ───┐
    │                  │    only session.update is allowed
    │                  │    otherwise → error (invalid_event)
    └────────┬─────────┘
             │ receives session.created
             ▼
    ┌───── ACTIVE ─────┐
    │                  │    append (video_frames recommended) / close allowed
    │                  │    append may include force_listen=true
    └────────┬─────────┘
             │ close / timeout / context_full / error
             ▼
          CLOSED
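A client can enforce these states with a small lookup-table machine. This is a sketch under the transitions shown above; the class name and table names are illustrative:

```python
# Allowed client events per state, following the diagram above.
ALLOWED = {
    "QUEUED": set(),                                        # must stay silent
    "CONNECTED": {"session.update"},
    "ACTIVE": {"input_audio_buffer.append", "session.close"},
    "CLOSED": set(),
}

# Server events that advance the client-side state.
TRANSITIONS = {
    ("QUEUED", "session.queue_done"): "CONNECTED",
    ("CONNECTED", "session.created"): "ACTIVE",
    ("ACTIVE", "session.closed"): "CLOSED",
}

class ClientState:
    def __init__(self) -> None:
        self.state = "QUEUED"

    def on_server_event(self, event_type: str) -> None:
        self.state = TRANSITIONS.get((self.state, event_type), self.state)

    def can_send(self, event_type: str) -> bool:
        return event_type in ALLOWED[self.state]
```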

Error Codes

Client Errors

| code | Meaning | WS Close |
| --- | --- | --- |
| not_ready | Data sent before the session is ready | No |
| unknown_event | Unknown event type | No |
| missing_field | Required field missing | No |
| invalid_payload | Invalid field value, such as a base64/JPEG decode failure | No |

Server Errors

| code | Meaning | WS Close |
| --- | --- | --- |
| service_unavailable | Service is not ready | Yes (1013) |
| queue_full | Queue is full | Yes (1013) |
| worker_busy | No idle worker available | Yes (1013) |
| worker_connect_failed | Worker connection failed | Yes (1013) |
| inference_error | Inference failed | No, recoverable |

Error Message Format

{
    "type": "error",
    "error": {
        "code": "missing_field",
        "message": "audio field is required",
        "type": "client_error"
    }
}

Silent Drop

If the client sends chunks faster than real time, the server may drop older chunks without returning an error, so the newest chunk always takes priority.

Invalid JSON

If a WebSocket frame cannot be parsed as JSON, the connection is closed with close code 1003.
