Extreme Infrastructure for GRPO & Large-Scale Reinforcement Learning.
RL-Kernel is a high-performance, memory-efficient infrastructure for Reinforcement Learning post-training. It addresses the memory and latency bottlenecks of Large Language Model alignment. The project is aimed at AI infrastructure engineers, algorithm researchers, and enterprise-scale model alignment workloads, and it provides specialized kernels for algorithms such as GRPO, PPO, and DPO.
RL-Kernel is designed to solve the
By implementing Pre-allocated Chunking, RL-Kernel maintains a constant additional VRAM overhead regardless of the group size (G).
Testbed: NVIDIA A100 80GB | Model: Llama-3-8B | Vocab: 128,256 | SeqLen: 512
| Group Size (G) | TRL (Standard) | PyTorch Native | RL-Kernel (Ours) | Status |
|---|---|---|---|---|
| G = 64 | OOM | 15.66 GB | 16.15 GB | Success |
| G = 128 | OOM | 31.31 GB | 31.80 GB | Success |
| G = 256 | FAILED (OOM) | 62.63 GB | 63.12 GB | Optimized |
Note: RL-Kernel is the only solution that successfully scales to G = 256 on a single A100, keeping its extra VRAM usage to a constant ~0.5 GB.
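As a rough sanity check on the table above, the PyTorch Native figures line up with the size of a single fp32 logits tensor of shape (G, SeqLen, Vocab), and the constant gap of about 0.49 GB is roughly what one reusable buffer of about a thousand rows over the vocabulary would occupy. Both readings are assumptions: the fp32 dtype, the use of GiB (2^30 bytes) as the unit, and the 1024-row chunk size are not stated in the benchmark and are used here only for illustration.

```python
# Back-of-envelope memory arithmetic for the benchmark table.
# Assumptions (not stated in the benchmark): fp32 logits, sizes in GiB.
GIB = 2**30
SEQ_LEN = 512
VOCAB = 128_256
BYTES_FP32 = 4


def logits_gib(group_size: int) -> float:
    """Size in GiB of one fp32 logits tensor of shape (G, SeqLen, Vocab)."""
    return group_size * SEQ_LEN * VOCAB * BYTES_FP32 / GIB


def chunk_buffer_gib(chunk_rows: int) -> float:
    """Size in GiB of a reusable fp32 buffer of shape (chunk_rows, Vocab)."""
    return chunk_rows * VOCAB * BYTES_FP32 / GIB


for g in (64, 128, 256):
    print(f"G={g}: full logits tensor = {logits_gib(g):.3f} GiB")

# Hypothetical chunk size: 1024 rows is a guess, not a documented setting.
print(f"1024-row buffer = {chunk_buffer_gib(1024):.3f} GiB")
```

Under these assumptions the three logits sizes land within 0.01 of the PyTorch Native column, while the buffer size depends only on the chunk size and the vocabulary, not on G.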
RL-Kernel integrates FlashInfer fused kernels to accelerate the sampling phase, which is the bottleneck of RL training.
| Batch Size | Native PyTorch | RL-Kernel (Fused) | Speedup |
|---|---|---|---|
| 32 | 176.79 ms | 1.08 ms | 163x |
| 64 | 10.54 ms | 1.31 ms | 8x |
| 128 | 18.89 ms | 1.86 ms | 10x |
| 256 | 36.23 ms | 2.94 ms | 12x |
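The speedup column can be reproduced directly from the two latency columns. The short check below recomputes each ratio; it suggests the reported values are the ratios truncated to whole numbers. Note that the native latency at batch size 32 is higher than at batch size 256, which may reflect a one-time warm-up cost rather than steady-state behavior; the benchmark does not say, so treat the 163x figure with some caution.

```python
# Latencies in milliseconds, copied from the table above:
# batch size -> (native PyTorch, RL-Kernel fused)
latencies_ms = {
    32: (176.79, 1.08),
    64: (10.54, 1.31),
    128: (18.89, 1.86),
    256: (36.23, 2.94),
}


def speedup(native_ms: float, fused_ms: float) -> float:
    """Ratio of native latency to fused latency."""
    return native_ms / fused_ms


speedups = {batch: speedup(*pair) for batch, pair in latencies_ms.items()}

for batch, ratio in speedups.items():
    print(f"batch={batch}: {ratio:.2f}x")
```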
Testbed: NVIDIA A100 80GB | Model: Qwen3-30B-A3B | Vocab: 151,936 | dtype: fp16
Model weights consume 56.9 GB, leaving only about 23 GB of the A100's 80 GB as headroom for training computation.
- Zero-Growth Memory Pool: Uses pre-allocated buffers and micro-chunking to prevent VRAM spikes during advantage calculation.
- Fused Sampling Pipeline: Direct integration with FlashInfer and vLLM backends, with sub-2 ms sampling latency at batch sizes up to 128 in the benchmark above.
- Universal Backend Abstraction: Unified API supporting both NVIDIA (CUDA/FlashInfer) and AMD (ROCm/AITER).
- Post-Training Ready: Drop-in replacement for standard sampling and logprob operators in TRL or DeepSpeed-Chat.
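The idea behind the pre-allocated buffer and micro-chunking can be sketched in plain Python. This is a minimal illustration under stated assumptions, not RL-Kernel's implementation: the function name `chunked_token_logprobs`, its parameters, and the list-based data layout are hypothetical, and a real kernel would do the same work on GPU tensors. The point is that logits are only ever materialized for `chunk_rows` positions at a time, in a scratch buffer that is allocated once and reused.

```python
import math


def chunked_token_logprobs(hidden, lm_head, token_ids, chunk_rows=2):
    """Log-probability of each target token, computed chunk_rows rows at a time.

    hidden:    n_rows vectors of size d (one per token position)
    lm_head:   vocab vectors of size d (one per vocabulary entry)
    token_ids: n_rows target token indices
    """
    n_rows, vocab = len(hidden), len(lm_head)
    # Scratch buffer for logits: allocated once, reused by every chunk,
    # so peak extra memory is chunk_rows * vocab regardless of n_rows.
    buf = [[0.0] * vocab for _ in range(chunk_rows)]
    out = [0.0] * n_rows

    for start in range(0, n_rows, chunk_rows):
        stop = min(start + chunk_rows, n_rows)
        # Step 1: project this chunk of hidden states into the buffer.
        for r, i in enumerate(range(start, stop)):
            for v in range(vocab):
                buf[r][v] = sum(h * w for h, w in zip(hidden[i], lm_head[v]))
        # Step 2: numerically stable log-softmax, then gather the target token.
        for r, i in enumerate(range(start, stop)):
            row = buf[r]
            peak = max(row)
            lse = peak + math.log(sum(math.exp(x - peak) for x in row))
            out[i] = row[token_ids[i]] - lse
    return out


hidden = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5], [1.0, 1.0], [-0.4, 0.2]]
lm_head = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
token_ids = [0, 2, 1, 2, 0]
logprobs = chunked_token_logprobs(hidden, lm_head, token_ids, chunk_rows=2)
```

The result does not depend on `chunk_rows`; the chunk size only trades peak memory against the number of passes.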
RL-Kernel sits between high-level alignment libraries and low-level GPU kernels, ensuring maximum throughput without sacrificing flexibility.
# Clone the repository
git clone https://github.com/Flink-ddd/RL-Kernel.git
cd RL-Kernel
# Install core dependencies (CUDA 12.4+ recommended)
pip install -e .

RL-Kernel is inspired by the kernel designs of vLLM and DeepSpeed. Built by an active contributor to the AI infrastructure ecosystem, it aims to push the boundaries of RL efficiency.
Target: Building the most efficient RLHF toolchain for the open-source community.
RL-Kernel builds on the shoulders of excellent open-source projects:
- FlashInfer — We integrate FlashInfer's fused sampling kernels as the NVIDIA backend for our sampling pipeline. The sub-2ms sampling latency results are enabled by FlashInfer's highly optimized CUDA operators.
- vLLM — Inspired by vLLM's kernel design philosophy and hardware-aware scheduling approach.
- DeepSpeed — Inspired by DeepSpeed's approach to memory-efficient training infrastructure.
We are grateful to these teams for their contributions to the open-source AI infrastructure ecosystem.


