Multiple deployment solutions for running MiniCPM-o models efficiently across different environments.
📖 Chinese Version | Back to Main
| Framework | Performance | Ease of Use | Scalability | Hardware | Best For |
|---|---|---|---|---|---|
| vLLM | High | Medium | High | GPU | Large-scale production services |
| SGLang | High | Medium | High | GPU | Structured generation tasks |
| Ollama | Medium | Excellent | Medium | CPU/GPU | Personal use, rapid prototyping |
| Llama.cpp | Medium | High | Medium | CPU | Edge devices, lightweight deployment |
vLLM
- High-throughput inference engine with PagedAttention memory management
- Continuous batching support and an OpenAI-compatible API
- Ideal for production API services and large-scale batch inference
- Recommended hardware: GPU with more than 18GB of VRAM
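Because vLLM exposes an OpenAI-compatible API, any OpenAI-style client can talk to it. A minimal standard-library sketch of building such a request; the port, endpoint path, and model id (`openbmb/MiniCPM-o-2_6`) are assumptions to adapt to your deployment:

```python
import json
from urllib import request

def build_chat_request(model: str, prompt: str, base_url: str) -> request.Request:
    """Build an OpenAI-compatible chat-completion request for a vLLM server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }
    return request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

# Server assumed started with something like `vllm serve openbmb/MiniCPM-o-2_6`;
# send the request with urllib.request.urlopen(req) once it is up.
req = build_chat_request("openbmb/MiniCPM-o-2_6",
                         "Describe the weather.",
                         "http://localhost:8000")
print(req.full_url)
```

Since the endpoint speaks the OpenAI protocol, the official `openai` Python client can be pointed at the same `base_url` instead of hand-rolling requests.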
SGLang (Structured Generation Language)
- Structured generation optimization with efficient KV cache management
- Supports complex control flow and optimized function calling
- Suitable for complex reasoning chains and structured text generation
- Recommended hardware: GPU with more than 18GB of VRAM
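Structured generation here means constraining the model's output to a schema. A hedged sketch of what a schema-constrained request body might look like against an OpenAI-compatible SGLang endpoint; the `response_format` field follows the OpenAI structured-output convention and the model id is an assumption, so check the SGLang docs for the exact field names its server expects:

```python
import json

# Hypothetical schema the generated JSON must satisfy.
person_schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}

# Request-body sketch, not a confirmed SGLang API shape.
payload = {
    "model": "openbmb/MiniCPM-o-2_6",  # assumed model id
    "messages": [{"role": "user", "content": "Extract the person as JSON."}],
    "response_format": {
        "type": "json_schema",
        "json_schema": {"name": "person", "schema": person_schema},
    },
}
print(json.dumps(payload)["response_format" in payload])
```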
Ollama
- One-click model management with a simple command-line interface
- Automatic quantization support and a REST API
- Perfect for personal development environments and research prototyping
- Hardware requirements: 8GB+ RAM, supports CPU/GPU
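By default Ollama's REST API listens on port 11434, with generation served from `/api/generate`. A minimal standard-library sketch of building a non-streaming request; the model tag `minicpm-o` is an assumption and must match whatever `ollama pull` fetched on your machine:

```python
import json
from urllib import request

def build_generate_request(prompt: str, model: str,
                           host: str = "http://localhost:11434") -> request.Request:
    """Build a non-streaming request for Ollama's /api/generate endpoint."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return request.Request(f"{host}/api/generate",
                           data=body.encode("utf-8"),
                           headers={"Content-Type": "application/json"})

# "minicpm-o" is an assumed tag; send with urllib.request.urlopen(req)
# once the Ollama server is running.
req = build_generate_request("Why is the sky blue?", "minicpm-o")
print(req.full_url)
```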
Llama.cpp
- Pure C++ implementation with CPU-optimized inference
- Support for multiple quantization formats; lightweight deployment
- Ideal for mobile devices and edge computing
- Hardware requirements: 4GB+ RAM, various CPU architectures
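The RAM figure scales with quantization: a rule of thumb is bits-per-weight × parameter count for the weights alone, plus overhead for the KV cache and activations. A back-of-the-envelope sketch; the bits-per-weight values approximate common GGUF quant types (real files add block metadata), and the ~8B parameter count for a MiniCPM-o-class model is an assumption:

```python
# Approximate bits per weight for common GGUF quantization types
# (rule-of-thumb values; real files carry extra block metadata).
BITS_PER_WEIGHT = {"F16": 16.0, "Q8_0": 8.5, "Q5_K_M": 5.5, "Q4_K_M": 4.5}

def approx_weight_gb(n_params_billion: float, quant: str) -> float:
    """Rough size of the weights alone in GB (excludes KV cache, activations)."""
    return n_params_billion * BITS_PER_WEIGHT[quant] / 8.0

# An assumed ~8B-parameter model at two quantization levels:
for q in ("F16", "Q4_K_M"):
    print(f"{q}: ~{approx_weight_gb(8.0, q):.1f} GB")
```

This is why 4-bit quantization is what makes the 4GB+ RAM class of devices reachable, while half precision needs an order more memory.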
- Production Environment (High Concurrency): vLLM - Best performance, optimal scalability
- Complex Reasoning Tasks: SGLang - Structured generation, function calling optimization
- Personal Development: Ollama - Simple to use, quick setup
- Edge Deployment: Llama.cpp - Lightweight, low power consumption
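The decision guide above can be sketched as a simple lookup; the scenario keys are paraphrases introduced here, and the fallback choice is an illustrative assumption:

```python
# Mirror of the decision guide above; keys are paraphrased scenarios.
RECOMMENDATION = {
    "production_high_concurrency": "vLLM",
    "complex_reasoning": "SGLang",
    "personal_development": "Ollama",
    "edge_deployment": "Llama.cpp",
}

def recommend(scenario: str) -> str:
    # Fall back to vLLM for unlisted serving scenarios (an assumed default).
    return RECOMMENDATION.get(scenario, "vLLM")

print(recommend("edge_deployment"))
```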
| Category | Framework | Cookbook Link | Upstream PR | Supported since(branch) | Supported since(release) |
|---|---|---|---|---|---|
| Edge(On-device) | Llama.cpp | Llama.cpp Doc | #15575(2025-08-26) | master(2025-08-26) | b6282 |
| | Ollama | Ollama Doc | #12078(2025-08-26) | Merging | Waiting for official release |
| Serving(Cloud) | vLLM | vLLM Doc | #23586(2025-08-26) | main(2025-08-27) | v0.10.2 |
| | SGLang | SGLang Doc | #9610(2025-08-26) | Merging | Waiting for official release |
| Finetuning | LLaMA-Factory | LLaMA-Factory Doc | #9022(2025-08-26) | main(2025-08-26) | Waiting for official release |
| Quantization | GGUF | GGUF Doc | — | — | — |
| | BNB | BNB Doc | — | — | — |
| | AWQ | AWQ Doc | — | — | — |
| Demos | Gradio Demo | Gradio Demo Doc | — | — | — |