Skip to content

RL-Align/RL-Kernel

Repository files navigation

RL-Kernel Logo

RL-Kernel

Extreme Infrastructure for GRPO & Large-Scale Reinforcement Learning.

License Hardware Discord Documentation Ask DeepWiki

RL-Kernel is a high-performance, memory-efficient infrastructure for Reinforcement Learning post-training. It eliminates the memory and latency bottlenecks in Large Language Model alignment, This project targets AI infrastructure engineers, algorithm researchers, and enterprise-level large model alignment scenarios, providing specialized kernels for algorithms like GRPO, PPO, and DPO.


Performance Benchmarks: Breaking the Memory Wall

RL-Kernel is designed to solve the $O(G \cdot L \cdot V)$ memory explosion in DeepSeek-style GRPO training. A typical scenario is as follows:

1. Logprob Computation (Training Stability)

By implementing Pre-allocated Chunking, RL-Kernel maintains constant additional VRAM overhead regardless of the group size ($G$).

Testbed: NVIDIA A100 80GB | Model: Llama-3-8B | Vocab: 128,256 | SeqLen: 512

Group Size ($G$) TRL (Standard) PyTorch Native RL-Kernel (Ours) Status
G = 64 OOM 15.66 GB 16.15 GB Success
G = 128 OOM 31.31 GB 31.80 GB Success
G = 256 FAILED (OOM) 62.63 GB 63.12 GB Optimized

Note: RL-Kernel is the only solution that successfully scales G=256 on a single A100 by keeping extra VRAM usage to a constant ~0.5GB.

Logprob VRAM Benchmark

2. Sampling Latency (Rollout Speed)

Integrating FlashInfer fused kernels to accelerate the bottleneck of RL training: the sampling phase.

Batch Size ($G$) Native PyTorch RL-Kernel (Fused) Speedup
32 176.79 ms 1.08 ms 163x
64 10.54 ms 1.31 ms 8x
128 18.89 ms 1.86 ms 10x
256 36.23 ms 2.94 ms 12x

Sampling Latency Benchmark

3. Real Model Validation (Qwen3-30B-A3B MoE)

Testbed: NVIDIA A100 80GB | Model: Qwen3-30B-A3B | Vocab: 151,936 | dtype: fp16

Model weights consume 56.9 GB — only 23 GB headroom remaining for training computation.

Real Model MoE Benchmark

Key Features

  • Zero-Growth Memory Pool: Uses pre-allocated buffers and micro-chunking to prevent VRAM spikes during advantage calculation.
  • Fused Sampling Pipeline: Direct integration with FlashInfer and vLLM backends for sub-2ms sampling latency.
  • Universal Backend Abstraction: Unified API supporting both NVIDIA (CUDA/FlashInfer) and AMD (ROCm/AITER).
  • Post-Training Ready: Drop-in replacement for standard sampling and logprob operators in TRL or DeepSpeed-Chat.

Architecture

RL-Kernel sits between high-level alignment libraries and low-level GPU kernels, ensuring maximum throughput without sacrificing flexibility.


Quick Start

Installation

# Clone the repository
git clone https://github.com/Flink-ddd/RL-Kernel.git
cd RL-Kernel

# Install core dependencies (CUDA 12.4+ recommended)
pip install -e .

Contributions

Inspired by the kernel designs of vLLM and DeepSpeed. As an active contributor to the AI Infrastructure ecosystem, RL-Kernel aims to push the boundaries of RL efficiency.

Target: Building the most efficient RLHF toolchain for the open-source community.

Acknowledgments

RL-Kernel builds on the shoulders of excellent open-source projects:

  • FlashInfer — We integrate FlashInfer's fused sampling kernels as the NVIDIA backend for our sampling pipeline. The sub-2ms sampling latency results are enabled by FlashInfer's highly optimized CUDA operators.
  • vLLM — Inspired by vLLM's kernel design philosophy and hardware-aware scheduling approach.
  • DeepSpeed — Inspired by DeepSpeed's approach to memory-efficient training infrastructure.

We are grateful to these teams for their contributions to the open-source AI infrastructure ecosystem.

About

Modern RL Post-training Infrastructure: Optimized for NVIDIA/AMD GPUs with a focus on vLLM integration, Triton kernels, and transparent hardware-aware scaling.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors