
Conversation


@chaplain-grimaldus chaplain-grimaldus bot commented Jan 29, 2026

This PR contains the following updates:

| Package | Update | Change |
| --- | --- | --- |
| vllm-charts | minor | `v0.14.1` -> `v0.15.0` |

Warning

Some dependencies could not be looked up. Check the Dependency Dashboard for more information.


Release Notes

vllm-project/vllm (vllm-charts)

v0.15.0

Compare Source

Highlights

This release features 335 commits from 158 contributors (39 new)!

Model Support
Engine Core
  • Async scheduling + Pipeline Parallelism: --async-scheduling now works with pipeline parallelism (#​32359).
  • Mamba prefix caching: Block-aligned prefix caching for Mamba/hybrid models with --enable-prefix-caching --mamba-cache-mode align. Achieves ~2x speedup by caching Mamba states directly (#​30877); see the sketch after this list.
  • Session-based streaming input: New incremental input support for interactive workloads like ASR. Accepts async generators producing StreamingInput objects while maintaining KV cache alignment (#​28973).
  • Model Runner V2: VLM support (#​32546), architecture improvements.
  • LoRA: Inplace loading for memory efficiency (#​31326).
  • AOT compilation: torch.compile inductor artifacts support (#​25205).
  • Performance: KV cache offloading redundant load prevention (#​29087), FlashAttn attention/cache update separation (#​25954).
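
A minimal client-side sketch of the Mamba prefix caching item above, assuming the `--mamba-cache-mode` CLI flag maps to a `mamba_cache_mode` keyword argument on the `LLM` engine, as vLLM flags usually do; the model name is a placeholder and `enable_prefix_caching` is an existing engine argument:

```python
# Hedged sketch, not taken verbatim from the release notes: enable block-aligned
# Mamba prefix caching. Assumes --mamba-cache-mode align maps to the keyword
# mamba_cache_mode="align"; the checkpoint name is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(
    model="some-org/some-hybrid-mamba-model",  # placeholder hybrid/Mamba checkpoint
    enable_prefix_caching=True,                # existing engine argument
    mamba_cache_mode="align",                  # assumed keyword form of the new flag
)

# Requests that share a long prefix should now reuse cached Mamba states.
prefix = "You are a helpful assistant. Document: <long shared context> "
params = SamplingParams(max_tokens=64)
for out in llm.generate([prefix + "Summarize it.", prefix + "List the key points."], params):
    print(out.outputs[0].text)
```

The server-side equivalent, using the flags exactly as given in the note, would be `vllm serve <model> --enable-prefix-caching --mamba-cache-mode align`.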
Hardware & Performance
NVIDIA
  • Blackwell defaults: FlashInfer MLA is now the default MLA backend on Blackwell, with TRTLLM as the default prefill backend (#​32615).
  • MoE performance: 1.2-2% E2E throughput improvement via grouped topk kernel fusion (#​32058), NVFP4 small-batch decoding improvement (#​30885), faster cold start for MoEs with torch.compile (#​32805).
  • FP4 kernel optimization: Up to 65% faster FP4 quantization on Blackwell (SM100F) using 256-bit loads, ~4% E2E throughput improvement (#​32520).
  • Kernel improvements: topk_sigmoid kernel for MoE routing (#​31246), atomics reduce counting for SplitK skinny GEMMs (#​29843), fused cat+quant for FP8 KV cache in MLA (#​32950).
  • torch.compile: SiluAndMul and QuantFP8 CustomOp compilation (#​32806), Triton prefill attention performance (#​32403).
AMD ROCm
  • MoRI EP: High-performance all2all backend for Expert Parallel (#​28664).
  • Attention improvements: Shuffle KV cache layout and assembly paged attention kernel for AiterFlashAttentionBackend (#​29887).
  • FP4 support: MLA projection GEMMs with dynamic quantization (#​32238).
  • Consumer GPU support: Flash Attention Triton backend on RDNA3/RDNA4 (#​32944).
Other Platforms
  • TPU: Pipeline parallelism support (#​28506), backend option (#​32438).
  • Intel XPU: AgRsAll2AllManager for distributed communication (#​32654).
  • CPU: NUMA-aware acceleration for TP/DP inference on ARM (#​32792), PyTorch 2.10 (#​32869).
  • Whisper: torch.compile support (#​30385).
  • WSL: Platform compatibility fix for Windows Subsystem for Linux (#​32749).
Quantization
  • MXFP4: W4A16 support for compressed-tensors MoE models (#​32285).
  • Non-gated MoE: Quantization support with Marlin, NVFP4 CUTLASS, FP8, INT8, and compressed-tensors (#​32257).
  • Intel: Quantization Toolkit integration (#​31716).
  • FP8 KV cache: Per-tensor and per-attention-head quantization via llmcompressor (#​30141).
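
A hedged sketch of consuming the FP8 KV cache item above: it assumes a checkpoint exported by llmcompressor with KV cache scales and uses the existing `kv_cache_dtype` engine argument; the model name is a placeholder.

```python
# Hedged sketch: load a checkpoint whose KV cache quantization scales were produced
# with llmcompressor (per-tensor or per-attention-head, per the note above).
# kv_cache_dtype is an existing vLLM engine argument; the model name is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(
    model="some-org/model-with-fp8-kv-scales",  # placeholder llmcompressor export
    kv_cache_dtype="fp8",                       # keep the KV cache in FP8
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=16))[0].outputs[0].text)
```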
API & Frontend
  • Responses API: Partial message generation (#​32100), include_stop_str_in_output tuning (#​32383), prompt_cache_key support (#​32824).
  • OpenAI API: skip_special_tokens configuration (#​32345).
  • Score endpoint: Flexible input formats with data_1/data_2 and queries/documents (#​32577); see the request sketch after this list.
  • Render endpoints: New endpoints for prompt preprocessing (#​32473).
  • Whisper API: avg_logprob and compression_ratio in verbose_json segments (#​31059).
  • Security: FIPS 140-3 compliant hash option for enterprise/government users (#​32386), --ssl-ciphers CLI argument (#​30937).
  • UX improvements: Auto api_server_count based on dp_size (#​32525), wheel variant auto-detection during install (#​32948), custom profiler URI schemes (#​32393).
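
A hedged request sketch for the score endpoint item above, using the `queries`/`documents` field names from the note; the endpoint path, server address, and model name are assumptions rather than details from the release notes.

```python
# Hedged sketch: score documents against a query with the new field names.
# The /score path, port, and reranker checkpoint are placeholders/assumptions.
import requests

resp = requests.post(
    "http://localhost:8000/score",  # assumed server address and endpoint path
    json={
        "model": "some-org/some-reranker",  # placeholder scoring/reranker model
        "queries": "What is the capital of France?",
        "documents": [
            "Paris is the capital of France.",
            "Berlin is the capital of Germany.",
        ],
    },
)
print(resp.json())
```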
Dependencies
Breaking Changes & Deprecations
  • Metrics: Removed the deprecated vllm:time_per_output_token_seconds metric - use vllm:inter_token_latency_seconds instead (#​32661); see the query sketch after this list.
  • Environment variables: Removed deprecated environment variables (#​32812).
  • Quantization: DeepSpeedFp8 removed (#​32679), RTN removed (#​32697), HQQ deprecated (#​32681).
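
For the metric rename above, a hedged sketch of pointing a monitoring query at the replacement name; it assumes the new metric is, like the removed one, exposed as a Prometheus histogram, and the Prometheus address and p95 expression are illustrative assumptions, not from the release notes.

```python
# Hedged sketch: query the replacement metric name after the rename above.
# Assumes vllm:inter_token_latency_seconds is a histogram (as the removed metric was);
# the Prometheus URL and the quantile expression are placeholders.
import requests

PROM = "http://localhost:9090"  # placeholder Prometheus address
query = (
    "histogram_quantile(0.95, "
    "sum(rate(vllm:inter_token_latency_seconds_bucket[5m])) by (le))"
)
resp = requests.get(f"{PROM}/api/v1/query", params={"query": query})
print(resp.json()["data"]["result"])
```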
Bug Fixes
  • Speculative decoding: Eagle draft_model_config fix (#​31753).
  • DeepSeek: DeepSeek-V3.1 + DeepGEMM incompatible scale shapes fix (#​32361).
  • Distributed: DP+MoE inference fix via CpuCommunicator (#​31867), P/D with non-MoE DP fix (#​33037).
  • EPLB: Possible deadlock fix (#​32418).
  • NIXL: UCX memory leak fix by exporting UCX_MEM_MMAP_HOOK_MODE=none (#​32181).
  • Structured output: Outlines byte fallback handling fix (#​31391).

New Contributors 🎉

Full Changelog: vllm-project/vllm@v0.14.1...v0.15.0


Configuration

📅 Schedule: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.

Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 Ignore: Close this PR and you won't be reminded about this update again.


  • If you want to rebase/retry this PR, check this box

This PR has been generated by Renovate Bot.
