MiniCPM-o Docs

Configuration & Deployment

System Requirements

| Requirement | Minimum |
| --- | --- |
| GPU | NVIDIA GPU with VRAM > 28 GB |
| OS | Linux |
| Python | 3.10 |
| CUDA | Compatible with PyTorch 2.8.0 |
| FFmpeg | For video frame extraction and inference result visualization |

Resource Consumption Reference

| Resource | Token2Wav (Default) |
| --- | --- |
| VRAM (per Worker, after initialization) | ~21.5 GB |
| Model loading time | ~16 s |
| Mode switching latency | < 0.1 ms |

torch.compile mode incurs an additional ~60s compilation time on first inference.


Dependency Installation

# 1. Install Python 3.10 (miniconda recommended)
mkdir -p ./miniconda3_install_tmp
wget https://repo.anaconda.com/miniconda/Miniconda3-py310_25.11.1-1-Linux-x86_64.sh \
    -O ./miniconda3_install_tmp/miniconda.sh
bash ./miniconda3_install_tmp/miniconda.sh -b -u -p ./miniconda3
source ./miniconda3/bin/activate

# 2. One-command installation
bash ./install.sh

install.sh automatically performs the following steps:

  1. Creates a Python virtual environment (venv) at .venv/base
  2. Installs PyTorch 2.8.0 + torchaudio
  3. Installs all dependencies from requirements.txt
  4. Verifies the installation

Manual Installation

source ./miniconda3/bin/activate
python -m venv .venv/base
source .venv/base/bin/activate

pip install "torch==2.8.0" "torchaudio==2.8.0"
pip install -r requirements.txt
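
To mirror the verification that install.sh performs (step 4 above), a quick sanity check can be run inside the virtual environment; a minimal sketch:

import torch
import transformers

print(torch.__version__)           # expect 2.8.0
print(torch.cuda.is_available())   # expect True on a correctly configured GPU host
print(transformers.__version__)    # expect 4.51.0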

Python Dependency List

| Category | Package | Version |
| --- | --- | --- |
| Core ML | transformers | 4.51.0 |
| | accelerate | 1.12.0 |
| | safetensors | >= 0.7.0 |
| MiniCPM-o | minicpmo-utils[all] | >= 1.0.5 |
| Web Service | fastapi | >= 0.128.0 |
| | uvicorn | >= 0.40.0 |
| | httpx | >= 0.28.0 |
| | websockets | >= 16.0 |
| | python-multipart | |
| Data | pydantic | >= 2.11.0 |
| | numpy | >= 2.2.0 |
| Utilities | tqdm | >= 4.67.0 |
| Testing | pytest | >= 9.0.0 |
| | pytest-asyncio | >= 1.3.0 |

Configuration

config.json

All configuration is centralized in the config.json file at the project root. Copy from config.example.json for first-time setup:

cp config.example.json config.json

config.json is listed in .gitignore and will not be committed.

Configuration Priority

CLI arguments > config.json > Pydantic defaults
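
A minimal sketch of this layering, using hypothetical flag names: config.json values override the Pydantic defaults, and only CLI arguments that were actually passed override config.json.

import argparse
import json

from pydantic import BaseModel

class ServiceConfig(BaseModel):
    gateway_port: int = 8006   # Pydantic defaults: lowest priority
    compile: bool = False

parser = argparse.ArgumentParser()
parser.add_argument("--port", type=int, default=None)  # None means "not passed"
parser.add_argument("--compile", action="store_true", default=None)
args = parser.parse_args()

with open("config.json") as f:
    file_cfg = json.load(f).get("service", {})         # middle priority

cli = {"gateway_port": args.port, "compile": args.compile}
overrides = {k: v for k, v in cli.items() if v is not None}  # highest priority
config = ServiceConfig(**{**file_cfg, **overrides})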

Complete Field Reference

model — Model Configuration

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| model_path | str | (required) | HuggingFace-format model directory or Hub ID |
| pt_path | str | null | Optional .pt weight override path |
| attn_implementation | str | "auto" | Attention implementation method |

audio — Audio Configuration

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| ref_audio_path | str | assets/ref_audio/ref_minicpm_signature.wav | Default TTS reference audio path |
| playback_delay_ms | int | 200 | Frontend audio playback delay (ms); higher values are smoother but add latency |

service — Service Configuration

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| gateway_port | int | 8006 | Gateway listening port |
| worker_base_port | int | 22400 | Worker base port (Worker N = base + N) |
| max_queue_size | int | 1000 | Maximum queued requests |
| request_timeout | float | 300.0 | Request timeout (seconds) |
| compile | bool | false | Enable torch.compile acceleration |
| data_dir | str | "data" | Data storage directory |
| eta_chat_s | float | 15.0 | Chat task baseline ETA (seconds) |
| eta_streaming_s | float | 20.0 | Streaming task baseline ETA (seconds) |
| eta_audio_duplex_s | float | 120.0 | Audio Duplex task baseline ETA (seconds) |
| eta_omni_duplex_s | float | 90.0 | Omni Duplex task baseline ETA (seconds) |
| eta_ema_alpha | float | 0.3 | ETA EMA smoothing coefficient |
| eta_ema_min_samples | int | 3 | ETA EMA minimum sample count |
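
The ETA fields combine as follows, sketched here as a standard exponential moving average (the codebase's exact update rule is not shown in this document): each task type starts from its baseline ETA, and once at least eta_ema_min_samples durations have been observed, every completed request blends its measured duration into the estimate with weight eta_ema_alpha.

def update_eta(current_eta_s: float, measured_s: float, alpha: float = 0.3) -> float:
    # Standard EMA: new = alpha * sample + (1 - alpha) * old
    return alpha * measured_s + (1 - alpha) * current_eta_s

# Example: chat baseline is 15.0 s and a request just took 22.0 s
print(update_eta(15.0, 22.0))  # 0.3 * 22.0 + 0.7 * 15.0 = 17.1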

duplex — Duplex Configuration

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| pause_timeout | float | 60.0 | Duplex pause timeout (seconds); the Worker is automatically released after timeout |

Minimal Configuration

{
  "model": {
    "model_path": "openbmb/MiniCPM-o-4_5"
  }
}

Full Configuration Example

{
  "model": {
    "model_path": "openbmb/MiniCPM-o-4_5",
    "pt_path": null,
    "attn_implementation": "auto"
  },
  "audio": {
    "ref_audio_path": "assets/ref_audio/ref_minicpm_signature.wav",
    "playback_delay_ms": 200,
    "chat_vocoder": "token2wav"
  },
  "service": {
    "gateway_port": 8006,
    "worker_base_port": 22400,
    "max_queue_size": 1000,
    "request_timeout": 300.0,
    "compile": false,
    "data_dir": "data",
    "eta_chat_s": 15.0,
    "eta_streaming_s": 20.0,
    "eta_audio_duplex_s": 120.0,
    "eta_omni_duplex_s": 90.0,
    "eta_ema_alpha": 0.3,
    "eta_ema_min_samples": 3
  },
  "duplex": {
    "pause_timeout": 60.0
  }
}

Attention Backend

Controls the attention implementation used for model inference, configured via the attn_implementation field in the model section.

| Value | Behavior | Use Case |
| --- | --- | --- |
| "auto" (default) | Detects flash-attn → flash_attention_2; otherwise → sdpa | Recommended |
| "flash_attention_2" | Forces Flash Attention 2 | When flash-attn is confirmed installed |
| "sdpa" | PyTorch built-in SDPA | When flash-attn cannot be compiled |
| "eager" | Naive attention | Debugging only |

Performance Comparison (A100): flash_attention_2 is ~5-15% faster than sdpa; sdpa is several times faster than eager.

Note: The Audio (Whisper) submodule always uses SDPA (incompatible with flash_attention_2). Vision / LLM / TTS follow the configuration.
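
The "auto" behavior can be approximated with a simple import probe; the sketch below illustrates the idea (the project's actual detection logic may differ):

import importlib.util

def resolve_attn_implementation(requested: str = "auto") -> str:
    """Map the configured value to a concrete attn_implementation string."""
    if requested != "auto":
        return requested
    # flash-attn importable -> Flash Attention 2; otherwise fall back to SDPA
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"

# e.g. AutoModel.from_pretrained(model_path, attn_implementation=resolve_attn_implementation())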


Starting & Stopping

One-Command Start (start_all.sh)

# Use all available GPUs (auto-detected)
bash start_all.sh

# Specify GPUs
CUDA_VISIBLE_DEVICES=0,1 bash start_all.sh

# Enable torch.compile (experimental)
bash start_all.sh --compile

# Downgrade to HTTP (not recommended; microphone/camera APIs require HTTPS)
bash start_all.sh --http

start_all.sh execution flow:

  1. Parses command-line arguments (--http, --compile)
  2. Reads port configuration from config.py
  3. Detects the number of available GPUs
  4. Launches one Worker process per GPU (nohup)
  5. Waits for all Workers to pass health checks (see the polling sketch after this list)
  6. Starts the Gateway process
  7. Outputs the access URL and log paths
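
Step 5 amounts to polling each Worker until it answers. A sketch using httpx (already in the dependency list); the /health endpoint path is an assumption for illustration, not confirmed by this document:

import time

import httpx

def wait_for_workers(ports: list[int], timeout_s: float = 300.0) -> None:
    """Poll each Worker port until it responds with 200, or raise after timeout_s."""
    deadline = time.monotonic() + timeout_s
    pending = set(ports)
    while pending and time.monotonic() < deadline:
        for port in list(pending):
            try:
                # NOTE: /health is a hypothetical endpoint used for illustration
                r = httpx.get(f"http://localhost:{port}/health", timeout=2.0)
                if r.status_code == 200:
                    pending.discard(port)
            except httpx.HTTPError:
                pass  # Worker still loading the model
        time.sleep(2.0)
    if pending:
        raise TimeoutError(f"Workers on ports {sorted(pending)} failed health checks")

wait_for_workers([22400, 22401])  # Worker N listens on worker_base_port + N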

Manual Start

# Worker (one per GPU)
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=. .venv/base/bin/python worker.py \
    --worker-index 0 --gpu-id 0

# Gateway
PYTHONPATH=. .venv/base/bin/python gateway.py \
    --port 8006 --workers localhost:22400

CLI Arguments

Worker Arguments:

python worker.py \
    --model-path /path/to/model \
    --pt-path /path/to/weights.pt \
    --ref-audio-path /path/to/ref.wav \
    --worker-index 0 \
    --gpu-id 0 \
    --compile

Gateway Arguments:

python gateway.py \
    --port 8006 \
    --workers localhost:22400,localhost:22401 \
    --http

Stopping the Service

pkill -f "gateway.py|worker.py"
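
For a more targeted shutdown, the PID files recorded under tmp/ (see Runtime Directory Structure below) can be used instead of pattern matching; a sketch, assuming each .pid file holds a single process ID:

import os
import signal
from pathlib import Path

# Send SIGTERM to the Gateway and each Worker via their PID files
for pid_file in Path("tmp").glob("*.pid"):
    pid = int(pid_file.read_text().strip())
    try:
        os.kill(pid, signal.SIGTERM)
        print(f"stopped {pid_file.stem} (PID {pid})")
    except ProcessLookupError:
        print(f"{pid_file.stem} (PID {pid}) already exited")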

Model Download

Automatic Download (Default)

When model_path is set to openbmb/MiniCPM-o-4_5, the model is automatically downloaded from HuggingFace on first startup.
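
The download can also be triggered ahead of time from Python via huggingface_hub, which avoids the wait on first startup; a minimal sketch:

from huggingface_hub import snapshot_download

# Pre-fetch all model files; omit local_dir to use the default HF cache
path = snapshot_download(
    repo_id="openbmb/MiniCPM-o-4_5",
    local_dir="/path/to/MiniCPM-o-4_5",
)
print(path)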

Manual Download

HuggingFace CLI:

pip install -U huggingface_hub
huggingface-cli download openbmb/MiniCPM-o-4_5 --local-dir /path/to/MiniCPM-o-4_5

hf-mirror (China):

export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download openbmb/MiniCPM-o-4_5 --local-dir /path/to/MiniCPM-o-4_5

ModelScope (China):

pip install modelscope
modelscope download --model OpenBMB/MiniCPM-o-4_5 --local_dir /path/to/MiniCPM-o-4_5

Testing

Schema Unit Tests (No GPU Required)

PYTHONPATH=. .venv/base/bin/python -m pytest tests/test_schemas.py -v

Processor Tests (GPU Required)

CUDA_VISIBLE_DEVICES=0 PYTHONPATH=. .venv/base/bin/python -m pytest \
    tests/test_chat.py tests/test_streaming.py tests/test_duplex.py -v -s

API Integration Tests (Service Must Be Running)

PYTHONPATH=. .venv/base/bin/python -m pytest tests/test_api.py -v -s

Test File Reference

| File | Description |
| --- | --- |
| test_schemas.py | Schema unit tests |
| test_chat.py | Chat inference tests |
| test_streaming.py | Streaming inference tests |
| test_duplex.py | Duplex inference tests |
| test_api.py | API integration tests |
| test_queue.py | Queue logic tests |
| test_queue_stress.py | Queue stress tests |
| test_integration.py | Integration tests |
| test_e2e.py | End-to-end tests |
| bench_duplex_ws.py | Duplex WebSocket performance benchmark |
| mock_worker.py | Mock Worker (for GPU-free testing) |
| js/queue-scenario.test.js | Frontend queue scenario tests (Vitest) |
| js/countdown-timer.test.js | Countdown component tests (Vitest) |

Runtime Directory Structure

data/
├── sessions/                  # Session recording data
│   ├── omni_abc123/
│   │   ├── meta.json
│   │   ├── recording.json
│   │   ├── user_audio/
│   │   ├── ai_audio/
│   │   └── ...
│   └── ...
└── ref_audio/                 # Uploaded reference audios
    ├── registry.json
    └── *.wav

tmp/
├── gateway.pid                # Gateway process PID
├── gateway.log                # Gateway log
├── worker_0.pid               # Worker 0 PID
├── worker_0.log               # Worker 0 log
└── diag_omni_*.jsonl          # Diagnostic logs
