MiniCPM-o Docs
Architecture

Half-Duplex Audio Mode Details

The core inference mode behind the Half-Duplex Audio page. It communicates over the /ws/half_duplex/{session_id} WebSocket endpoint and implements VAD-based half-duplex voice conversation.

In One Sentence

VAD automatically detects when the user finishes speaking, feeds the speech segment to the model for reply generation, waits for playback to complete, then resumes listening — like a phone call with turn-taking.

Comparison with Other Modes

| | Chat Mode | Half-Duplex Mode | Duplex Mode |
|---|---|---|---|
| Interaction | Turn-based (manual trigger) | Turn-based (VAD auto-trigger) | Full-duplex (simultaneous) |
| Input processing | One-shot prefill of all messages | VAD detects speech segment -> prefill | Per-second streaming prefill of audio/video |
| Worker occupation | Inference duration only, released after | Exclusive for entire session (default 3 min) | Exclusive for entire session |
| Voice detection | None (frontend manual recording) | Server-side SileroVAD | None (model decides autonomously) |
| Use case | Text/multimodal Q&A | Voice conversation, hands-free | Real-time voice/video conversation |

Overall Flow

Simplified View

sequenceDiagram
    participant U as User
    participant VAD as SileroVAD
    participant MD as Model

    loop Half-duplex loop
        U->>VAD: Continuous audio stream
        VAD->>VAD: Detect speech start/end
        VAD->>MD: Speech segment (user finished)
        MD-->>U: Streaming reply (text + audio)
        Note over U: Wait for playback to finish
    end

Detailed View

sequenceDiagram
    participant FE as Frontend
    participant GW as Gateway
    participant WK as Worker
    participant VAD as StreamingVAD
    participant MD as HalfDuplexView

    FE->>GW: WS /ws/half_duplex/{session_id}
    GW->>WK: Exclusive Worker
    FE->>WK: prepare (system_prompt + config)
    WK->>MD: reset + prefill system prompt
    WK->>VAD: Initialize VAD

    loop Half-duplex loop
        FE->>WK: audio_chunk (continuous)
        WK->>VAD: feed(audio_chunk)
        alt User speaking
            WK-->>FE: vad_state (speaking=true)
        else User stopped speaking
            WK-->>FE: vad_state (speaking=false)
            WK-->>FE: generating
            WK->>MD: prefill(speech_segment)
            WK->>MD: generate(streaming)
            loop Streaming response
                MD-->>WK: text_delta + audio
                WK-->>FE: chunk
            end
            WK-->>FE: turn_done
            WK->>VAD: reset()
        end
    end

    FE->>WK: stop / timeout
    WK-->>FE: timeout

VAD Stage

The core of Half-Duplex is server-side VAD (Voice Activity Detection) using the SileroVAD ONNX model for real-time speech detection.

StreamingVAD

The StreamingVAD class in vad/vad.py encapsulates streaming VAD logic:

vad = StreamingVAD(options=StreamingVadOptions(
    threshold=0.8,              # Speech probability threshold
    min_speech_duration_ms=128, # Minimum speech segment length
    min_silence_duration_ms=800, # Minimum silence to confirm end of speech
    speech_pad_ms=30,           # Padding on each side of speech segment
))

# Feed chunks incrementally
for audio_chunk in audio_stream:
    speech_segment = vad.feed(audio_chunk)  # float32, 16kHz
    if speech_segment is not None:
        # User finished speaking
        model.prefill(speech_segment)
        model.generate()

How It Works

  1. Frontend sends an audio_chunk every 0.5 seconds (16kHz float32 PCM)
  2. StreamingVAD.feed() slides a 1024-sample window, calling SileroVAD for speech probability on each window
  3. Probability >= threshold: mark "speech started", accumulate audio to buffer
  4. Probability < (threshold - 0.15) sustained for >= min_silence_duration_ms: confirm "speech ended"
  5. Return accumulated speech segment, reset VAD state
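The five steps above can be sketched as a small hysteresis state machine. This is a minimal, self-contained illustration, not the real StreamingVAD: `speech_prob` is a stand-in energy heuristic replacing the SileroVAD ONNX model, and `min_speech_duration_ms`/`speech_pad_ms` are omitted for brevity.

```python
import numpy as np

SAMPLE_RATE = 16_000
WINDOW = 1024  # samples per probability window (step 2)

def speech_prob(window: np.ndarray) -> float:
    """Stand-in for the SileroVAD model: RMS energy mapped to [0, 1]."""
    return float(min(1.0, np.sqrt(np.mean(window ** 2)) * 10))

class SketchVAD:
    def __init__(self, threshold=0.8, min_silence_ms=800):
        self.threshold = threshold
        self.exit_threshold = threshold - 0.15   # hysteresis (step 4)
        self.min_silence = int(min_silence_ms / 1000 * SAMPLE_RATE)
        self.reset()

    def reset(self):
        self.buffer = []      # accumulated speech windows
        self.speaking = False
        self.silence = 0      # consecutive silent samples

    def feed(self, chunk: np.ndarray):
        """Returns the finished speech segment, or None while listening."""
        for i in range(0, len(chunk) - WINDOW + 1, WINDOW):
            window = chunk[i:i + WINDOW]
            p = speech_prob(window)
            if p >= self.threshold:                  # step 3: speech started
                self.speaking = True
                self.silence = 0
            elif self.speaking and p < self.exit_threshold:
                self.silence += WINDOW               # step 4: count silence
            if self.speaking:
                self.buffer.append(window)
                if self.silence >= self.min_silence:
                    segment = np.concatenate(self.buffer)  # step 5
                    self.reset()
                    return segment
        return None
```

Note that a window whose probability falls between `threshold - 0.15` and `threshold` neither starts speech nor counts as silence, which is exactly the hysteresis band that prevents flickering at the boundary.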

False Trigger Prevention

  • Cold start guard: Ignore all VAD for 0.5s after prepare to avoid mic initialization noise
  • AI playback suppression: Frontend stops sending audio_chunk while AI audio is playing to prevent echo feedback
  • Post-playback delay: After turn_done, wait for AI audio to finish + 800ms buffer before resuming
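The gating rules above live in the frontend (half_duplex.html, in JavaScript); the following Python sketch only illustrates the timing logic, with hypothetical names:

```python
# Hypothetical sketch of the frontend mic-gating rules: cold start guard,
# AI playback suppression, and post-playback delay. Not the real frontend code.
COLD_START_S = 0.5       # ignore VAD for 0.5s after prepare
POST_PLAYBACK_S = 0.8    # extra buffer after AI audio finishes

class MicGate:
    def __init__(self, prepared_at: float):
        self.prepared_at = prepared_at
        self.playback_ends_at = None  # set while AI audio is playing

    def on_ai_playback(self, ends_at: float):
        self.playback_ends_at = ends_at

    def should_send(self, now: float) -> bool:
        """Whether the frontend should send audio_chunk messages right now."""
        if now - self.prepared_at < COLD_START_S:
            return False  # cold start guard
        if self.playback_ends_at is not None:
            if now < self.playback_ends_at + POST_PLAYBACK_S:
                return False  # suppress during playback + 800ms buffer
            self.playback_ends_at = None
        return True
```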

Prefill + Generate Stage

After VAD detects a speech segment, HalfDuplexView handles inference. It reuses the model's streaming_prefill + streaming_generate capabilities:

  1. Speech segment -> Base64 encode -> AudioContent -> Message(role=USER)
  2. HalfDuplexView.prefill(request) — prefill user speech into KV Cache
  3. HalfDuplexView.generate() — streaming generation of text + audio chunks
  4. Each chunk sent to frontend via WebSocket

KV Cache persists throughout the session, supporting multi-turn context accumulation.
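Step 1 above can be sketched as follows. The dict shapes here are illustrative stand-ins for the project's Message/AudioContent request types, whose exact field names are an assumption:

```python
import base64

def build_user_turn(speech_segment: bytes) -> dict:
    """Wrap a VAD-detected speech segment as a user message (step 1 sketch).
    Field names are illustrative, not the actual request schema."""
    return {
        "role": "user",
        "content": [{
            "type": "audio",
            "audio_base64": base64.b64encode(speech_segment).decode("ascii"),
        }],
    }
```

The resulting message would then go through `HalfDuplexView.prefill()` and `generate()` as described in steps 2-4.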

WebSocket Protocol

Endpoint

wss://host/ws/half_duplex/{session_id}

Gateway proxies this connection to a Worker, which is exclusively occupied for the entire session.

Client -> Server

| Type | Fields | Description |
|---|---|---|
| prepare | system_prompt, config, ref_audio_base64, system_content | Initialize session |
| audio_chunk | audio_base64 | Send mic audio (float32 PCM, 16kHz) |
| stop | (none) | Stop session |

config structure:

{
  "vad": {"threshold": 0.8, "min_speech_duration_ms": 128, "min_silence_duration_ms": 800, "speech_pad_ms": 30},
  "generation": {"max_new_tokens": 256, "length_penalty": 1.1, "temperature": 0.7},
  "tts": {"enabled": true},
  "session": {"timeout_s": 180}
}

Server -> Client

| Type | Fields | Description |
|---|---|---|
| queued | position, estimated_wait_s | Queued |
| queue_done | (none) | Left queue |
| prepared | session_id, timeout_s, recording_session_id | Ready |
| vad_state | speaking | VAD state change (user started/stopped speaking) |
| generating | speech_duration_ms | Starting reply generation |
| chunk | text_delta, audio_data | One streaming chunk |
| turn_done | turn_index, text | Turn generation complete |
| timeout | elapsed_s | Session timeout |
| error | error | Error message |
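A client consuming this protocol mostly dispatches on the `type` field. A minimal sketch of such a dispatcher (the WebSocket transport itself is omitted; `state` is an illustrative accumulator, not part of the protocol):

```python
def handle(msg: dict, state: dict) -> None:
    """Route one decoded server->client JSON message (sketch)."""
    t = msg["type"]
    if t == "vad_state":
        state["user_speaking"] = msg["speaking"]
    elif t == "chunk":
        # Accumulate streamed text; msg.get("audio_data") would be
        # queued for playback in a real client.
        state["text"] = state.get("text", "") + msg.get("text_delta", "")
    elif t == "turn_done":
        state["turns"] = state.get("turns", 0) + 1
    elif t in ("timeout", "error"):
        state["done"] = True
```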

Call Chain

Frontend half_duplex.html
  └─ WebSocket /ws/half_duplex/{session_id}
      └─ Gateway (exclusive WS proxy)
          └─ Worker /ws/half_duplex
              ├─ prepare
              │   ├─ StreamingVAD init
              │   ├─ HalfDuplexView.prefill(system_prompt)
              │   ├─ TTS init (ref_audio)
              │   └─ TurnBasedSessionRecorder init
              └─ audio_chunk loop
                  ├─ StreamingVAD.feed(chunk)
                  ├─ Speech detected → HalfDuplexView.prefill(user_audio)
                  ├─ HalfDuplexView.generate() → streaming chunks
                  └─ VAD reset + wait for next turn

Frontend Parameter Pass-through

Settings panel parameters are sent to the backend via the config field in the prepare message. Parameters are saved to localStorage and only sent at session start — mid-session changes do not take effect.

| Category | Parameter | Default | Description |
|---|---|---|---|
| VAD | threshold | 0.8 | Speech detection threshold (higher = stricter) |
| VAD | min_speech_duration_ms | 128 | Minimum speech segment length |
| VAD | min_silence_duration_ms | 800 | Silence duration to confirm end of speech |
| VAD | speech_pad_ms | 30 | Padding on each side of speech segment |
| Generation | max_new_tokens | 256 | Maximum generated tokens |
| Generation | length_penalty | 1.1 | Length penalty coefficient |
| Generation | temperature | 0.7 | Sampling temperature |
| TTS | enabled | true | Enable voice response |
| Session | timeout_s | 180 | Session timeout (seconds) |
