MiniCPM-o Docs
Realtime API

Audio Full-Duplex

Audio full-duplex protocol for the Realtime API

Audio full-duplex mode is for realtime voice-only conversation. The client continuously sends 16 kHz audio; the server returns listen, text, and audio output as response.output.delta.

Connection

wss://host/v1/realtime?mode=audio
ItemValue
Frame formatJSON text frames
Input audio16 kHz mono float32 PCM, base64 encoded
Input videoNot needed
Output audio24 kHz mono float32 PCM, base64 encoded
Total session duration limit600 seconds

Initialize

After connecting, wait for session.queue_done, then send session.init:

{
  "type": "session.init",
  "payload": {
    "system_prompt": "You are a helpful voice assistant.",
    "config": {
      "length_penalty": 1.1
    },
    "voice": {
      "ref_audio_base64": "<base64 float32 PCM>",
      "tts_ref_audio_base64": "<base64 float32 PCM>"
    }
  }
}
FieldTypeRequiredDescription
system_promptstringNoSystem prompt. instructions is also accepted
configobjectNoDuplex inference settings; length_penalty is commonly used
voice.ref_audio_base64stringNoLLM reference audio
voice.tts_ref_audio_base64stringNoTTS reference audio; can fall back to the LLM reference audio

The server sends session.created after initialization completes.

Send Input

{
  "type": "input.append",
  "input": {
    "audio": "<base64 float32 PCM, 16 kHz mono>",
    "force_listen": false
  }
}
FieldTypeRequiredDescription
audiostringYes16 kHz mono float32 PCM. The browser demo usually sends one chunk per second
force_listenboolNoForce the model back to listen state
hints.force_listenboolNoEquivalent position for force_listen

Audio mode does not need video_frames.

Receive Output

Model keeps listening:

{
  "type": "response.output.delta",
  "kind": "listen",
  "session_id": "sess_xxx",
  "metrics": {
    "kv_cache_length": 1024
  }
}

Text delta:

{
  "type": "response.output.delta",
  "kind": "text",
  "text": "Hello",
  "response_id": "resp_xxx",
  "metrics": {}
}

Audio delta:

{
  "type": "response.output.delta",
  "kind": "audio",
  "audio": "<base64 float32 PCM, 24 kHz mono>",
  "response_id": "resp_xxx",
  "metrics": {}
}

Text and audio are separate deltas and are not guaranteed to map one-to-one. Clients should update captions and the audio playback queue independently; kind=listen means the model has returned to listening for user input.

Timeline

Client connects
  <- session.queued / session.queue_update   optional
  <- session.queue_done
  -> session.init { payload: {...} }
  <- session.created
  -> input.append { input: { audio } }
  -> input.append { input: { audio } }
  <- response.output.delta { kind: "listen" }
  -> input.append { input: { audio } }
  <- response.output.delta { kind: "text", text: "..." }
  <- response.output.delta { kind: "audio", audio: "..." }
  <- response.output.delta { kind: "listen" }
  -> session.close
  <- session.closed

Audio full-duplex mode does not use response.done for every output turn. The boundary is represented by kind=listen, later input, and session.closed.

Close

{
  "type": "session.close",
  "reason": "user_stop"
}

The server makes a best effort to return session.closed. If the 600 second limit is reached, the gateway sends:

{
  "type": "session.closed",
  "reason": "timeout"
}

On this page