You can use the scripts in this repository to reproduce the performance numbers in the LMSYS Org blog post "Deploying DeepSeek on GB200 NVL72 with PD and Large Scale EP (Part II): 3.8x Prefill, 4.8x Decode Throughput".
Each prefill group uses 2 nodes. The scripts support 1 to 3 prefill groups; as currently configured, they start 1 prefill group.
Each decode group uses 12 nodes.
The router and client both run on the first node.
Here, “client” means the benchmark workload.
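As a quick sanity check on the allocation size, the node counts above can be added up as in the sketch below. The function name `total_nodes` is hypothetical, and the default of a single decode group is an assumption; this README does not state how many decode groups are started.

```python
PREFILL_NODES_PER_GROUP = 2   # from the text above
DECODE_NODES_PER_GROUP = 12   # from the text above


def total_nodes(prefill_groups: int, decode_groups: int = 1) -> int:
    """Return the number of nodes used by the prefill and decode servers.

    decode_groups=1 is an assumption made for illustration only.
    """
    if not 1 <= prefill_groups <= 3:
        raise ValueError("the scripts start 1 to 3 prefill groups")
    if decode_groups < 1:
        raise ValueError("at least one decode group is required")
    return (prefill_groups * PREFILL_NODES_PER_GROUP
            + decode_groups * DECODE_NODES_PER_GROUP)


if __name__ == "__main__":
    # Current scripts: 1 prefill group.
    print(total_nodes(1))
```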
Run all commands after logging into the first node.
Some steps must also be run inside Docker; those steps begin with bash enroot_exec_first_container.sh.
bash 0.prepare_docker.sh
bash 0.download_model.sh
bash 1.salloc.sh
In the allocation shell, run:
bash 2.get_node_list_env.sh
This generates node_list_env.sh, which the later server launch scripts source.
3.launch_prefill_server.sh
4.launch_decode_server.sh
Start the router after the decode and prefill servers have finished starting.
A server (decode or prefill) is ready when its log shows: “The server is fired up and ready to roll!”.
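If you would rather not watch the logs by hand, the readiness check can be scripted. The following is a minimal sketch, assuming the server log is a plain text file that you can read from the first node; the helper names `is_ready` and `wait_until_ready` and the polling defaults are hypothetical and not part of this repository.

```python
import time
from pathlib import Path

# Ready message quoted in the text above.
READY_MARKER = "The server is fired up and ready to roll!"


def is_ready(log_text: str) -> bool:
    """Return True if the log text contains the ready message."""
    return READY_MARKER in log_text


def wait_until_ready(log_path: str,
                     poll_seconds: float = 10.0,
                     timeout_seconds: float = 3600.0) -> bool:
    """Poll a log file until the ready message appears or the timeout expires."""
    path = Path(log_path)
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        if path.exists() and is_ready(path.read_text(errors="replace")):
            return True
        time.sleep(poll_seconds)
    return False
```

For example, `wait_until_ready("logs/launch_server_decode_node_rank_0.log")` would block until the rank 0 decode log reports ready. Call it once per server log; the file names of the other logs depend on how the launch scripts name them.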
# Enter Docker
bash enroot_exec_first_container.sh
# Start router
bash 5.launch_router.sh
# Enter Docker
bash enroot_exec_first_container.sh
# Start benchmark
bash 6.start_benchmark.sh
# Enter Docker
bash enroot_exec_first_container.sh
# After the decoder receives this command, each run_batch() sleeps 180s before model forward.
bash 7.start_slow_down_decode.sh
While the slowdown is active, decode makes almost no progress between sleeps, so KV caches produced by prefill keep accumulating.
When decode next runs a batch, it can therefore schedule more running requests (#running-req in the logs) in parallel.
The number of running requests can grow up to SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK.
Watch the decode logs; once running-req reaches SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK, run the next step: send slow_down null so decode returns to normal (no sleep).
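The check on the decode logs can be sketched as a small parser for the `#running-req` field. This is only an illustration: the helper names are hypothetical, and the limit must be whatever value you set for SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK in your launch scripts (the 1024 in the example is taken from the sample log in this README and is assumed to match that setting).

```python
import re
from typing import Iterable, Optional

# Matches the "#running-req: N" field of a "Decode batch" log line.
RUNNING_REQ_RE = re.compile(r"#running-req:\s*(\d+)")


def running_req(line: str) -> Optional[int]:
    """Return the #running-req value in a log line, or None if absent."""
    match = RUNNING_REQ_RE.search(line)
    return int(match.group(1)) if match else None


def reached_limit(log_lines: Iterable[str], limit: int) -> bool:
    """Return True once any log line reports #running-req >= limit."""
    for line in log_lines:
        value = running_req(line)
        if value is not None and value >= limit:
            return True
    return False


if __name__ == "__main__":
    sample = [
        "[2025-11-22 23:21:43 DP0 TP0 EP0] Scheduler.run_batch sleep 180.0s",
        "[2025-11-22 23:24:43 DP0 TP0 EP0] Decode batch, #running-req: 1024, "
        "#token: 1922048, token usage: 0.62",
    ]
    # limit=1024 is an assumption; use your own environment variable value.
    print(reached_limit(sample, limit=1024))
```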
# Enter Docker
bash enroot_exec_first_container.sh
# After the decoder receives this command, it stops sleeping before each forward.
bash 8.stop_slow_down_decode.sh
After sending this command, expect to wait up to about 180s before decode visibly reacts.
[2025-11-22 23:05:07 DP0 TP0 EP0] Capture cuda graph begin. This can take up to several minutes. avail mem=40.56 GB
[2025-11-22 23:05:07 DP0 TP0 EP0] Capture cuda graph bs [1024]
[2025-11-22 23:05:27 DP0 TP0 EP0] Capture cuda graph end. Time elapsed: 19.62 s. mem usage=31.07 GB. avail mem=9.49 GB.
[2025-11-22 23:05:30 DP0 TP0 EP0] max_total_num_tokens=3122368, chunked_prefill_size=16384, max_prefill_tokens=16384, max_running_requests=1024, context_len=2176, available_gpu_mem=9.49 GB
[2025-11-22 23:05:32 DP0 TP0 EP0] Decode batch, #running-req: 1, #token: 64, token usage: 0.00, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 0.42, #queue-req: 0,
[2025-11-22 23:05:32 DP0 TP0 EP0] Decode batch, #running-req: 1, #token: 64, token usage: 0.00, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 25.99, #queue-req: 0,
[2025-11-22 23:05:32 DP0 TP0 EP0] Decode batch, #running-req: 1, #token: 64, token usage: 0.00, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 27.31, #queue-req: 0,
[2025-11-22 23:05:32 DP0 TP0 EP0] Decode batch, #running-req: 1, #token: 64, token usage: 0.00, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 27.20, #queue-req: 0,
[2025-11-22 23:05:33 DP0 TP0 EP0] Decode batch, #running-req: 1, #token: 64, token usage: 0.00, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 27.08, #queue-req: 0,
[2025-11-22 23:05:33 DP0 TP0 EP0] Decode batch, #running-req: 1, #token: 64, token usage: 0.00, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 27.12, #queue-req: 0,
[2025-11-22 23:05:33 DP0 TP0 EP0] Decode batch, #running-req: 1, #token: 0, token usage: 0.00, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 12.76, #queue-req: 0,
[2025-11-22 23:05:33 DP0 TP0 EP0] Decode batch, #running-req: 1, #token: 0, token usage: 0.00, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 116.35, #queue-req: 0,
[2025-11-22 23:08:00 DP0 TP0 EP0] Cache flushed successfully!
[2025-11-22 23:15:42 DP0 TP0 EP0] Scheduler.run_batch sleep 180.0s
[2025-11-22 23:18:43 DP0 TP0 EP0] Scheduler.run_batch sleep 180.0s
[2025-11-22 23:21:43 DP0 TP0 EP0] Scheduler.run_batch sleep 180.0s
[2025-11-22 23:24:43 DP0 TP0 EP0] Decode batch, #running-req: 1024, #token: 1922048, token usage: 0.62, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 0.89, #queue-req: 853,
[2025-11-22 23:24:43 DP0 TP0 EP0] Decode batch, #running-req: 1024, #token: 1922048, token usage: 0.62, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 12112.66, #queue-req: 853,
[2025-11-22 23:24:43 DP0 TP0 EP0] Decode batch, #running-req: 1024, #token: 1922048, token usage: 0.62, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 12241.95, #queue-req: 853,
[2025-11-22 23:24:43 DP0 TP0 EP0] Decode batch, #running-req: 1024, #token: 1922048, token usage: 0.62, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 12156.81, #queue-req: 853,
Once decode has acknowledged the previous command, it is no longer sleeping and is back in normal decode mode, so you can capture a profile.
You can run the profiling script multiple times to collect profiles at different points in the run.
Each run saves 5 profiling steps.
Capture a profile with:
# Enter Docker
bash enroot_exec_first_container.sh
# Capture profile
bash 9.sglang_profile.sh
[2025-11-22 23:24:52 DP0 TP0 EP0] Profiling starts. Traces will be saved to: /lustre/fs1/portfolios/coreai/projects/coreai_devtech_all/users/shifangx/1.workspace/7.SGLang_PD/Scripts-SGLang/../torch_profiler (with profile id: 1763882692.8351896)
[2025-11-22 23:24:53 DP0 TP0 EP0] Decode batch, #running-req: 1024, #token: 2053120, token usage: 0.66, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 6892.73, #queue-req: 853,
[2025-11-22 23:24:53 DP0 TP0 EP0] Decode batch, #running-req: 1024, #token: 2053120, token usage: 0.66, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 13324.23, #queue-req: 853,
[2025-11-22 23:24:53 DP0 TP0 EP0] Decode batch, #running-req: 1024, #token: 2053120, token usage: 0.66, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 13146.41, #queue-req: 853,
[2025-11-22 23:24:53 DP0 TP0 EP0] Decode batch, #running-req: 1024, #token: 2053120, token usage: 0.66, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 13180.73, #queue-req: 853,
[2025-11-22 23:24:53 DP0 TP0 EP0] Stop profiling...
[2025-11-22 23:24:56 DP0 TP0 EP0] Profiling done. Traces are saved to: /lustre/fs1/portfolios/coreai/projects/coreai_devtech_all/users/shifangx/1.workspace/7.SGLang_PD/Scripts-SGLang/../torch_profiler
To balance load across EP ranks, there are two approaches:
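The first approach records the expert distribution during a run and uses the recording to place experts at launch. GB200 blog part 2 uses this approach.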
Launch the decode server with --expert-distribution-recorder-mode stat and --expert-distribution-recorder-buffer-size -1.
Before starting the benchmark, start recording with bash enroot_exec_first_container.sh; bash z.1.start_record.sh.
After running bash 8.stop_slow_down_decode.sh, wait 30 minutes, then dump the recording with bash enroot_exec_first_container.sh; bash z.2.dump_record.sh.
The dump is written under ${SGLANG_EXPERT_DISTRIBUTION_RECORDER_DIR}.
For more detail, see the LMSYS Org blog post "Deploying DeepSeek with PD Disaggregation and Large-Scale Expert Parallelism on 96 H100 GPUs".
Launch the decode server with --init-expert-location ${SGLANG_EXPERT_DISTRIBUTION_RECORDER_DIR}/expert_distribution_recorder_xxx.pt.
The second approach rebalances at runtime: launch the decode server with --eplb-algorithm deepseek and --enable-eplb. The EPLB manager performs extra work to rebalance load, which uses additional GPU memory.
Example workflow for computing throughput (e.g. for EP48).
Search for Profiling starts and Stop profiling in logs/launch_server_decode_node_rank_0.log.
You should see a line like the one below; 1760576844.4826152 is the Torch profile id, which is also the trace filename stem.
[2025-10-15 18:07:24 DP0 TP0 EP0] Profiling starts. Traces will be saved to: /lustre/fs1/portfolios/coreai/projects/coreai_devtech_all/users/shifangx/1.workspace/6.SGLang_PD/Scripts-SGLang/../torch_profiler (with profile id: 1760576844.4826152)
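Extracting the profile id from that log line can be scripted, which helps when you have captured several profiles. The sketch below assumes the log format shown above and the trace naming used in this example (`<profile id>-TP-0.trace.json.gz`); the helper names, and the generalization to other TP ranks, are assumptions.

```python
import re
from typing import Optional

# Matches the "(with profile id: ...)" suffix of a "Profiling starts." line.
PROFILE_ID_RE = re.compile(r"Profiling starts\..*\(with profile id: ([0-9.]+)\)")


def profile_id(line: str) -> Optional[str]:
    """Return the profile id from a 'Profiling starts.' log line, else None."""
    match = PROFILE_ID_RE.search(line)
    return match.group(1) if match else None


def trace_filename(pid: str, tp_rank: int = 0) -> str:
    """Build the trace file name; the tp_rank pattern is assumed from the TP-0 example."""
    return f"{pid}-TP-{tp_rank}.trace.json.gz"


if __name__ == "__main__":
    line = ("[2025-10-15 18:07:24 DP0 TP0 EP0] Profiling starts. Traces will be "
            "saved to: ../torch_profiler (with profile id: 1760576844.4826152)")
    pid = profile_id(line)
    if pid is not None:
        print(trace_filename(pid))
```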
You should also see decode batch lines; #running-req: 285 is the local batch size on DP0 for that line.
[2025-10-15 18:07:24 DP0 TP0 EP0] Decode batch. #running-req: 285, #token: 328960, token usage: 0.51, pre-allocated usage: 0.00, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 3344.60, #queue-req: 0,
Repeat for each DP rank, e.g. local batch sizes: 285, 256, 254, 265, 271, 227, 221, 269.
Global batch size: 285 + 256 + 254 + 265 + 271 + 227 + 221 + 269 = 2048.
Open the trace torch_profiler/1760576844.4826152-TP-0.trace.json.gz.
In this example, each forward step takes 65 ms.
Per-GPU throughput: 2048 / 0.065 / 8 ≈ 3938 tokens/s/GPU.
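The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch, assuming each forward step produces one token per running request, which is what the formula above implies; the function name `per_gpu_throughput` is hypothetical.

```python
from typing import Sequence


def per_gpu_throughput(local_batch_sizes: Sequence[int],
                       step_seconds: float,
                       num_gpus: int) -> float:
    """Return decode throughput in tokens/s/GPU.

    Assumes one generated token per running request per forward step.
    """
    if step_seconds <= 0 or num_gpus <= 0:
        raise ValueError("step_seconds and num_gpus must be positive")
    global_batch = sum(local_batch_sizes)
    return global_batch / step_seconds / num_gpus


if __name__ == "__main__":
    # Local batch sizes read from the decode logs of each DP rank (example above).
    local = [285, 256, 254, 265, 271, 227, 221, 269]
    print(sum(local))                                   # global batch size
    print(round(per_gpu_throughput(local, 0.065, 8)))   # tokens/s/GPU
```

Substitute the step time you read from your own trace and the number of GPUs in your decode group.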
SGLang developer guide: bench_serving
Together with SGLang: Best Practices for Serving DeepSeek-R1 on H20-96G | LMSYS Org
This SGLang PR includes B200 TP8 scripts and measured results; useful as a reference when setting up GB200 EP8.
Awesome-ML-SYS-Tutorial/sglang/code-walk-through
FlashInfer is an SGLang backend; the following paper explains its low-level design choices.
FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving
