Model Deployment Guide

Multiple deployment solutions for efficient MiniCPM-o model deployment across different environments.

📖 Chinese Version | Back to Main

Deployment Framework Comparison

| Framework | Performance | Ease of Use | Scalability | Hardware | Best For |
|-----------|-------------|-------------|-------------|----------|----------|
| vLLM | High | Medium | High | GPU | Large-scale production services |
| SGLang | High | Medium | High | GPU | Structured generation tasks |
| Ollama | Medium | Excellent | Medium | CPU/GPU | Personal use, rapid prototyping |
| Llama.cpp | Medium | High | Medium | CPU | Edge devices, lightweight deployment |

Framework Details

vLLM (Very Large Language Model)

  • High-throughput inference engine with PagedAttention memory management
  • Dynamic batching support, OpenAI-compatible API
  • Ideal for production API services and large-scale batch inference
  • Recommended hardware: GPU with more than 18GB of VRAM

SGLang (Structured Generation Language)

  • Structured generation optimization with efficient KV cache management
  • Support for complex control flow and optimized function calling
  • Suitable for complex reasoning chains and structured text generation
  • Recommended hardware: GPU with more than 18GB of VRAM

Ollama

  • One-click model management with a simple command-line interface
  • Automatic quantization support, REST API interface
  • Perfect for personal development environments and research prototyping
  • Hardware requirements: 8GB+ RAM; supports CPU/GPU

Llama.cpp

  • Pure C++ implementation with CPU-optimized inference
  • Multiple quantization formats, lightweight deployment
  • Ideal for mobile devices and edge computing
  • Hardware requirements: 4GB+ RAM; runs on various CPU architectures
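The RAM figures above follow roughly from quantized weight sizes. A back-of-the-envelope sketch, assuming an ~8B-parameter model (the scale of MiniCPM-o 2.6) and ignoring activation and KV-cache overhead:

```python
def approx_model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough weight-only memory footprint: params x bits / 8, in GB."""
    return n_params * bits_per_weight / 8 / 1e9

n = 8e9  # assumed ~8B parameters
for name, bits in [("FP16", 16), ("Q8_0", 8), ("Q4_K_M", 4.5)]:  # ~4.5 bits/weight is an approximation for Q4_K_M
    print(f"{name}: ~{approx_model_size_gb(n, bits):.1f} GB")
```

This is why FP16 serving (vLLM/SGLang) needs a >18GB GPU while 4-bit GGUF quants (Llama.cpp/Ollama) fit in a few GB of RAM; real usage adds context-length-dependent KV-cache memory on top.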

Selection Guide

  • Production Environment (High Concurrency): vLLM - Best performance, optimal scalability
  • Complex Reasoning Tasks: SGLang - Structured generation, function calling optimization
  • Personal Development: Ollama - Simple to use, quick setup
  • Edge Deployment: Llama.cpp - Lightweight, low power consumption
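The selection guide above can be sketched as a small lookup; the function name and scenario labels are hypothetical, purely illustrative:

```python
def recommend_framework(scenario: str) -> str:
    """Map a deployment scenario to the framework suggested in the guide above."""
    table = {
        "production": "vLLM",    # high concurrency, best scalability
        "reasoning": "SGLang",   # structured generation, function calling
        "personal": "Ollama",    # simple to use, quick setup
        "edge": "Llama.cpp",     # lightweight, low power consumption
    }
    try:
        return table[scenario]
    except KeyError:
        raise ValueError(f"unknown scenario: {scenario!r}") from None

print(recommend_framework("edge"))  # → Llama.cpp
```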

MiniCPM-V 4.5 Framework Support Matrix

| Category | Framework | Cookbook Link | Upstream PR | Supported since (branch) | Supported since (release) |
|----------|-----------|---------------|-------------|--------------------------|---------------------------|
| Edge (On-device) | Llama.cpp | Llama.cpp Doc | #15575 (2025-08-26) | master (2025-08-26) | b6282 |
| Edge (On-device) | Ollama | Ollama Doc | #12078 (2025-08-26) | Merging | Waiting for official release |
| Serving (Cloud) | vLLM | vLLM Doc | #23586 (2025-08-26) | main (2025-08-27) | v0.10.2 |
| Serving (Cloud) | SGLang | SGLang Doc | #9610 (2025-08-26) | Merging | Waiting for official release |
| Finetuning | LLaMA-Factory | LLaMA-Factory Doc | #9022 (2025-08-26) | main (2025-08-26) | Waiting for official release |
| Quantization | GGUF | GGUF Doc | | | |
| Quantization | BNB | BNB Doc | | | |
| Quantization | AWQ | AWQ Doc | | | |
| Demos | Gradio Demo | Gradio Demo Doc | | | |