MiniCPM-o Docs
Realtime API

Audio Full-Duplex Protocol

Realtime audio-only full-duplex API protocol

This document defines the WebSocket protocol for Audio Full-Duplex mode. For video mode, see video-duplex-protocol.md.


One-Sentence Summary

One WebSocket connection: the client sends one second of audio every second, and the server streams audio and text back in real time.


Mode Constraints

| Item | Value |
| --- | --- |
| Endpoint | wss://host/v1/realtime?mode=audio |
| Frame format | JSON text frames |
| Input audio | 16 kHz, mono, float32 PCM, base64 encoded |
| Output audio | 24 kHz, mono, float32 PCM, base64 encoded |
| Input video | Not recommended. Behavior is undefined if video_frames is sent |
| Total session duration limit | 600 seconds (10 minutes), including waiting and idle time |
| Effective conversation time | ~8 minutes |
| Context window | 8192 tokens, fixed. The server closes the session when full |

Lifecycle

┌─────────┐      ┌─────────┐      ┌─────────┐
│ Setup   │ ───→ │ Stream  │ ───→ │ Close   │
└─────────┘      └─────────┘      └─────────┘

Setup

Connection URL: wss://host/v1/realtime?mode=audio

session_id is generated by the server in the format rt_{timestamp_ms} and returned by session.created.

Client  ──WSS──→  Server
                  ← session.queued          (optional: only when queued)
                  ← session.queue_update    (optional: 0..N updates)
                  ← session.queue_done      (required: worker assigned)
Client  → session.update      "Use this prompt and voice"
Server  → session.created     "Ready, session_id=xxx"

Important: the client must wait for session.queue_done before sending session.update. The server always sends session.queue_done, even when no queueing takes place: if a worker is already idle, the event arrives immediately after the connection opens.
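The ordering rule above can be sketched as a small, transport-agnostic helper. This is an illustration only: `setup_session`, `recv`, and `send` are hypothetical names, and the two callables stand in for whatever WebSocket library the client uses (one returns the next text frame, the other sends one).

```python
import json
from typing import Callable


def setup_session(recv: Callable[[], str],
                  send: Callable[[str], None],
                  instructions: str) -> str:
    """Run the Setup phase and return the server-generated session_id."""
    # Queue phase: send nothing until session.queue_done arrives.
    while True:
        event = json.loads(recv())
        kind = event.get("type")
        if kind == "session.queue_done":
            break
        if kind not in ("session.queued", "session.queue_update"):
            raise RuntimeError(f"unexpected event while queued: {kind}")

    # Init phase: configure the session, then expect session.created.
    send(json.dumps({
        "type": "session.update",
        "session": {"instructions": instructions},
    }))
    event = json.loads(recv())
    if event.get("type") != "session.created":
        raise RuntimeError(f"expected session.created, got: {event.get('type')}")
    return event["session_id"]
```

Treating any other event during queueing as fatal is an assumption of this sketch; a real client would probably inspect error events before giving up.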

Stream

Two independent streams run at the same time without blocking each other:

Upstream (Client → Server):          Downstream (Server → Client):
Every second, send:                  The model may push at any time:
  - 1 second of 16kHz audio            - 24kHz response audio chunks
  - no video                           - response text
                                       - listen-state signals
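A minimal sketch of how a client might cut captured audio into the one-second upstream chunks described above. The helper name `one_second_chunks` is hypothetical, and padding a short final chunk with silence up to the documented 250 ms minimum is an assumption of this sketch, not something the protocol prescribes.

```python
from typing import Iterator, List

SAMPLE_RATE_IN = 16000   # input sample rate from the Mode Constraints table
MIN_CHUNK = 4000         # documented minimum per append (250 ms)


def one_second_chunks(samples: List[float]) -> Iterator[List[float]]:
    """Yield consecutive chunks of at most one second of 16 kHz audio."""
    for start in range(0, len(samples), SAMPLE_RATE_IN):
        chunk = samples[start:start + SAMPLE_RATE_IN]
        if len(chunk) < MIN_CHUNK:
            # Assumption: pad a short tail with silence rather than drop it.
            chunk = chunk + [0.0] * (MIN_CHUNK - len(chunk))
        yield chunk
```

For example, 33000 samples produce two full chunks followed by a 1000-sample tail padded to 4000 samples.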

Close

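A session normally ends for one of the three reasons below. The server's session.closed event can also report server_shutdown and error, and it confirms a client-initiated close with the reason stopped rather than user_stop.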

| reason | Triggered By | Meaning |
| --- | --- | --- |
| user_stop | Client | User ends the session |
| timeout | Server | Total session duration reaches 600 seconds |
| context_full | Server | The 8192-token context window is full |
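Because both server-side limits are fixed, a client can estimate how close a session is to closing. The sketch below is illustrative: the function names are hypothetical, `elapsed_s` is assumed to be measured by the client from connection time, and `kv_cache_length` is assumed to be the latest value reported by the server. Checking the time limit before the context limit is an arbitrary choice made here.

```python
from typing import Optional

SESSION_LIMIT_S = 600          # total session duration limit
CONTEXT_LIMIT_TOKENS = 8192    # fixed context window


def remaining_budget(elapsed_s: float, kv_cache_length: int) -> dict:
    """Return how much time and context the session has left."""
    return {
        "seconds_left": max(0.0, SESSION_LIMIT_S - elapsed_s),
        "tokens_left": max(0, CONTEXT_LIMIT_TOKENS - kv_cache_length),
    }


def expected_close_reason(elapsed_s: float, kv_cache_length: int) -> Optional[str]:
    """Return the close reason the server is expected to report, if any."""
    if elapsed_s >= SESSION_LIMIT_S:
        return "timeout"
    if kv_cache_length >= CONTEXT_LIMIT_TOKENS:
        return "context_full"
    return None
```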

Event Overview

Client → Server

| Event | When to Send | Payload |
| --- | --- | --- |
| session.update | Once at startup | System prompt and reference voice |
| input_audio_buffer.append | Once per second | 1 second of audio, no video |
| session.close | When the client wants to end | Close reason |

Server → Client

| Event | Meaning | Payload |
| --- | --- | --- |
| session.created | Session is configured | session_id |
| response.output_audio.delta | The model is speaking | Audio + text + end_of_turn |
| response.listen | The model is listening | Monitoring data such as KV cache length |

Auxiliary events: session.queued, session.queue_update, session.queue_done, session.closed, error.
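Since every server frame is a JSON object with a type field, a client can route events with a plain lookup table. The following is a minimal sketch; `make_dispatcher` and the handler mapping are hypothetical names, and silently ignoring unhandled event types is an assumption of this example.

```python
import json
from typing import Callable, Dict

Handler = Callable[[dict], None]


def make_dispatcher(handlers: Dict[str, Handler]) -> Callable[[str], bool]:
    """Build a function that routes one JSON text frame to its handler."""
    def dispatch(frame: str) -> bool:
        event = json.loads(frame)
        handler = handlers.get(event.get("type"))
        if handler is None:
            return False   # no handler registered for this event type
        handler(event)
        return True
    return dispatch
```

A caller would register one handler per event name from the tables above (for example response.output_audio.delta and response.listen) and feed each received frame to the returned function.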


Message Format

session.update (Client → Server)

{
    "type": "session.update",
    "session": {
        "instructions": "You are a helpful English assistant.",
        "ref_audio": "<base64 WAV, 16kHz>",
        "tts_ref_audio": "<base64 WAV, 16kHz>"
    }
}

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| instructions | string | Yes | System prompt |
| ref_audio | string | No | LLM reference audio (base64 WAV, 16kHz), used for semantic/style cloning |
| tts_ref_audio | string | No | TTS reference audio (base64 WAV, 16kHz), used for acoustic voice cloning. Falls back to ref_audio when omitted |

Audio Full-Duplex mode has no max_slice_nums field because there is no video input.
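A minimal sketch of assembling a session.update message, assuming the reference audio is already available as the raw bytes of a 16 kHz WAV file. The function name `build_session_update` and its parameter names are hypothetical. Optional fields are left out entirely when not supplied, so the documented fallback from tts_ref_audio to ref_audio stays on the server side.

```python
import base64
import json
from typing import Optional


def build_session_update(instructions: str,
                         ref_wav: Optional[bytes] = None,
                         tts_ref_wav: Optional[bytes] = None) -> str:
    """Return the JSON text frame for session.update."""
    session = {"instructions": instructions}
    if ref_wav is not None:
        session["ref_audio"] = base64.b64encode(ref_wav).decode("ascii")
    if tts_ref_wav is not None:
        session["tts_ref_audio"] = base64.b64encode(tts_ref_wav).decode("ascii")
    return json.dumps({"type": "session.update", "session": session})
```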

input_audio_buffer.append (Client → Server)

{
    "type": "input_audio_buffer.append",
    "audio": "<base64, 16000 samples = 1s, float32 PCM>",
    "force_listen": false
}

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| audio | string | Yes | 16 kHz mono float32 PCM. One second = 16000 samples = 64000 bytes, base64 encoded. Minimum: 4000 samples (250ms) |
| force_listen | bool | No | Force the model into listen state, interrupting speech. Default: false |

video_frames is not recommended in Audio Full-Duplex mode. Behavior is undefined if it is sent.
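The audio field can be produced with the standard library alone. The sketch below packs samples as float32 and base64-encodes them. Little-endian byte order is an assumption, since the protocol text does not state one, and `build_append` is a hypothetical helper name.

```python
import base64
import json
import struct
from typing import Sequence

MIN_SAMPLES = 4000   # documented minimum per append (250 ms at 16 kHz)


def build_append(samples: Sequence[float], force_listen: bool = False) -> str:
    """Return the JSON text frame for input_audio_buffer.append."""
    if len(samples) < MIN_SAMPLES:
        raise ValueError("at least 4000 samples (250 ms) are required")
    # Assumption: little-endian float32; byte order is not specified.
    pcm = struct.pack(f"<{len(samples)}f", *samples)
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm).decode("ascii"),
        "force_listen": force_listen,
    })
```

One second of audio (16000 samples) yields 64000 bytes before base64 encoding, matching the figure in the field table. Passing force_listen=True on the next chunk is how a client would interrupt the model's speech.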

session.close (Client → Server)

{
    "type": "session.close",
    "reason": "user_stop"
}

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| reason | string | No | Close reason. Use "user_stop" for user-initiated close |

session.created (Server → Client)

{
    "type": "session.created",
    "session_id": "rt_1714200000000",
    "prompt_length": 256
}

response.output_audio.delta (Server → Client)

{
    "type": "response.output_audio.delta",
    "text": "The weather is nice today",
    "audio": "<base64, 24000 samples = 1s, float32 PCM>",
    "end_of_turn": false,
    "kv_cache_length": 1024
}

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| text | string | Yes | Text fragment generated in this delta |
| audio | string | Yes | 24 kHz mono float32 PCM, base64 encoded |
| end_of_turn | bool | Yes | Whether this turn has ended. true means the model has finished the utterance and will switch back to listening |
| kv_cache_length | int | Yes | Current KV cache token count. Limit: 8192 |

Text/Audio Alignment

Because of the model architecture, text generation leads audio synthesis. The text and audio fields inside the same output_audio.delta are not strictly synchronized; text usually leads audio by several hundred milliseconds.

Example:

// delta 1
{ "text": "The weather is nice", "audio": "<audio: 'The weather'>" }

// delta 2
{ "text": " today", "audio": "<audio: 'is nice today'>" }

Clients should use audio playback progress as the primary user experience timeline. Text can be shown as preview text or subtitles.

Output Audio Length

  • Middle deltas: audio is exactly 1 second (24000 samples).
  • First delta: audio may be shorter than 1 second.
  • Last delta (end_of_turn=true): audio may be shorter than 1 second.
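Because the first and last deltas of a turn can be shorter than one second, a client should derive the duration from the decoded payload rather than assume 24000 samples. A minimal sketch follows; `decode_delta_audio` is a hypothetical name, and little-endian float32 is assumed because the protocol text does not state a byte order.

```python
import base64
import struct
from typing import Tuple

SAMPLE_RATE_OUT = 24000   # output sample rate from the Mode Constraints table


def decode_delta_audio(event: dict) -> Tuple[Tuple[float, ...], float]:
    """Decode the audio field of a delta into samples and a duration in seconds."""
    raw = base64.b64decode(event["audio"])
    count = len(raw) // 4                      # 4 bytes per float32 sample
    # Assumption: little-endian float32; byte order is not specified.
    samples = struct.unpack(f"<{count}f", raw[:count * 4])
    return samples, count / SAMPLE_RATE_OUT
```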

response.listen (Server → Client)

{
    "type": "response.listen",
    "kv_cache_length": 1024
}

The model is currently listening. The client should stop any remaining queued playback audio when this event is received.
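One way to follow this rule is to keep undecoded audio in a queue that is emptied whenever response.listen arrives, while text accumulates separately as subtitle material. The class below is a sketch with hypothetical names; it does not play audio, it only tracks what is still waiting to be played.

```python
from collections import deque


class PlaybackQueue:
    """Track queued response audio and the text received so far."""

    def __init__(self) -> None:
        self.pending_audio = deque()   # base64 audio chunks not yet played
        self.transcript = ""           # text usually runs ahead of the audio
        self.turn_finished = False

    def on_delta(self, event: dict) -> None:
        """Handle response.output_audio.delta."""
        self.pending_audio.append(event["audio"])
        self.transcript += event["text"]
        self.turn_finished = event["end_of_turn"]

    def on_listen(self, event: dict) -> int:
        """Handle response.listen: drop queued audio, return how many chunks."""
        dropped = len(self.pending_audio)
        self.pending_audio.clear()
        return dropped
```

A playback thread would pop chunks from the left of pending_audio as the audio device consumes them, so only audio that has not started playing is discarded.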

session.closed (Server → Client)

{
    "type": "session.closed",
    "reason": "timeout"
}

| reason | Meaning |
| --- | --- |
| stopped | Confirmation after the client sends session.close |
| timeout | Total session duration reached 600 seconds in audio mode |
| context_full | 8192-token context window is full |
| server_shutdown | Server is shutting down |
| error | Session terminated by an unrecoverable error |

Complete Timeline

Time ──────────────────────────────────────────────────────────→

                           ┌─────────────────────────────────┐
                           │  Phase 1: Connect & Queue       │
                           └─────────────────────────────────┘
Client:  WSS Connect ─────→
                           ← Server: session.queued
                           ← Server: session.queue_update
                           ← Server: session.queue_done

  The client should not send messages during queueing.
  If a worker is immediately available, queueing finishes immediately.

                           ┌─────────────────────────────────┐
                           │  Phase 2: Session Init          │
                           └─────────────────────────────────┘
Client:  session.update ─┐
Server:  session.created ←┘

                           ┌─────────────────────────────────┐
                           │  Phase 3: Full-Duplex Stream    │
                           └─────────────────────────────────┘
Client:  append(audio1) ──→
Client:  append(audio2) ──→
Client:  append(audio3) ──→     ← Server: listen
Client:  append(audio4) ──→
Client:  append(audio5) ──→     ← Server: listen
Client:  append(audio6) ──→     ← Server: output_audio.delta("Hello", audio, end_of_turn=false)
Client:  append(audio7) ──→     ← Server: output_audio.delta(",", audio, end_of_turn=false)
Client:  append(audio8) ──→     ← Server: output_audio.delta("how can I help?", audio, end_of_turn=true)
Client:  append(audio9) ──→     ← Server: listen
...

                           ┌─────────────────────────────────┐
                           │  Phase 4: Close                 │
                           └─────────────────────────────────┘
                           Any condition may trigger close:
                           - client sends session.close
                           - total duration >= 600s → timeout
                           - KV cache >= 8192 → context_full

Client:  session.close ──→
                           ← Server: session.closed {reason: "stopped"}

The client continues sending append events regardless of whether the server is listening or speaking. That is the full-duplex behavior.


Features Not Included as Protocol Events

| Feature | Why No Separate Event Is Needed |
| --- | --- |
| Pause/Resume | Stopping append is equivalent to pausing; the model keeps listening |
| Cancel Generation | Use force_listen=true inside append to interrupt speech |
| Response Done Marker | end_of_turn=true already marks the end; no response.done event is needed |
| Context Window Configuration | Fixed at 8192 tokens |

State Machine

          connect
             │
             ▼
    ┌───── QUEUED ─────┐
    │                  │    waiting for worker assignment
    │                  │    client must not send messages
    │                  │    may receive session.queued / session.queue_update
    └────────┬─────────┘
             │ receives session.queue_done
             ▼
    ┌──── CONNECTED ───┐
    │                  │    only session.update is allowed
    │                  │    otherwise → error (invalid_event)
    └────────┬─────────┘
             │ receives session.created
             ▼
    ┌───── ACTIVE ─────┐
    │                  │    append / close allowed
    │                  │    append may include force_listen=true
    └────────┬─────────┘
             │ close / timeout / context_full / error
             ▼
          CLOSED
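The state machine can be mirrored on the client to catch out-of-order sends before they reach the server. This is a minimal sketch under the states drawn above; the class name `ClientState` and its methods are hypothetical, and the server remains the authority on what is actually accepted.

```python
# Client events permitted in each state, as drawn in the diagram.
ALLOWED = {
    "QUEUED": set(),
    "CONNECTED": {"session.update"},
    "ACTIVE": {"input_audio_buffer.append", "session.close"},
    "CLOSED": set(),
}

# Server events that advance the state.
TRANSITIONS = {
    ("QUEUED", "session.queue_done"): "CONNECTED",
    ("CONNECTED", "session.created"): "ACTIVE",
}


class ClientState:
    """Track the protocol state from the client's point of view."""

    def __init__(self) -> None:
        self.state = "QUEUED"

    def can_send(self, event_type: str) -> bool:
        return event_type in ALLOWED[self.state]

    def on_server_event(self, event_type: str) -> str:
        if event_type == "session.closed":
            self.state = "CLOSED"      # reachable from any state
        else:
            self.state = TRANSITIONS.get((self.state, event_type), self.state)
        return self.state
```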

Error Codes

Client Errors

| code | Meaning | WS Close |
| --- | --- | --- |
| not_ready | Data sent before the session is ready | No |
| unknown_event | Unknown event type | No |
| missing_field | Required field missing | No |
| invalid_payload | Invalid field value, such as base64 decode failure | No |

Server Errors

| code | Meaning | WS Close |
| --- | --- | --- |
| service_unavailable | Service is not ready | Yes (1013) |
| queue_full | Queue is full | Yes (1013) |
| worker_busy | No idle worker available | Yes (1013) |
| worker_connect_failed | Worker connection failed | Yes (1013) |
| inference_error | Inference failed | No (recoverable) |

Error Message Format

{
    "type": "error",
    "error": {
        "code": "invalid_payload",
        "message": "audio base64 decode failed",
        "type": "client_error"
    }
}
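A client can combine this message format with the two error tables to decide whether the connection is about to close. The sketch below is illustrative: `parse_error` is a hypothetical helper, and returning None for codes that are not in the tables is a choice made here, not protocol behavior.

```python
import json
from typing import Optional

# Whether the server closes the WebSocket after each documented error code.
CLOSES_CONNECTION = {
    "not_ready": False,
    "unknown_event": False,
    "missing_field": False,
    "invalid_payload": False,
    "service_unavailable": True,
    "queue_full": True,
    "worker_busy": True,
    "worker_connect_failed": True,
    "inference_error": False,
}


def parse_error(frame: str) -> dict:
    """Extract code, message, and expected connection fate from an error frame."""
    event = json.loads(frame)
    if event.get("type") != "error":
        raise ValueError("not an error event")
    err = event["error"]
    closes: Optional[bool] = CLOSES_CONNECTION.get(err["code"])
    return {"code": err["code"], "message": err.get("message", ""),
            "closes_connection": closes}
```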

Silent Drop

If the client sends chunks too quickly, the server may drop old chunks without returning an error. This ensures the newest chunk is always preferred.
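Since dropped chunks are not reported, the safest approach is to pace sends on the client. The sketch below computes how long to wait before each send so that chunks go out no faster than one per interval. `ChunkPacer` is a hypothetical helper, and the injectable clock exists only to make the logic testable.

```python
import time
from typing import Callable, Optional


class ChunkPacer:
    """Compute the delay needed to keep sends at most one per interval."""

    def __init__(self, interval_s: float = 1.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._interval = interval_s
        self._clock = clock
        self._next_due: Optional[float] = None

    def delay_before_send(self) -> float:
        """Return seconds to wait before the next chunk may be sent."""
        now = self._clock()
        if self._next_due is None:
            self._next_due = now           # first chunk goes out immediately
        delay = max(0.0, self._next_due - now)
        # Schedule the following slot from whichever is later: the planned
        # slot or the current time (so a late sender does not burst).
        self._next_due = max(self._next_due, now) + self._interval
        return delay
```

A sender loop would sleep for the returned delay (for example with time.sleep or asyncio.sleep) and then send the next append event.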

Invalid JSON

If a WebSocket frame cannot be parsed as JSON, the connection is closed with close code 1003.
