You can use the scripts in this repository to reproduce the performance numbers in the LMSYS Org blog post "Deploying DeepSeek on GB200 NVL72 with PD and Large Scale EP (Part II): 3.8x Prefill, 4.8x Decode Throughput".

Node indices for each server role

Each prefill group uses 2 nodes. The scripts can start 1 to 3 prefill groups in total; as currently configured, they start 1 prefill group.

Each decode group uses 12 nodes.

The router and client both run on the first node.

Here, “client” means the benchmark workload.
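
The node counts above determine how many nodes the allocation needs. The following is a minimal sketch, not the repository's logic: it assumes a single decode group and assumes that prefill groups occupy the lowest node indices with the decode group after them. The real assignment is whatever 2.get_node_list_env.sh writes into node_list_env.sh; the function name `node_layout` and the ordering are hypothetical.

```python
PREFILL_NODES_PER_GROUP = 2   # from the text: each prefill group uses 2 nodes
DECODE_NODES_PER_GROUP = 12   # from the text: each decode group uses 12 nodes


def node_layout(num_prefill_groups=1):
    """Return node index ranges per role.

    ASSUMPTIONS (not confirmed by the scripts): prefill groups come first,
    a single decode group follows, and indices are contiguous from 0.
    """
    if not 1 <= num_prefill_groups <= 3:
        raise ValueError("the scripts start 1 to 3 prefill groups")
    layout = {}
    start = 0
    for group in range(num_prefill_groups):
        end = start + PREFILL_NODES_PER_GROUP
        layout[f"prefill_{group}"] = list(range(start, end))
        start = end
    layout["decode"] = list(range(start, start + DECODE_NODES_PER_GROUP))
    layout["total_nodes"] = start + DECODE_NODES_PER_GROUP
    return layout


# With the current default of 1 prefill group: 2 + 12 nodes.
print(node_layout(1)["total_nodes"])  # 14
```

The router and client share the first node, so they do not add to the total.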

Procedure

Run all commands after logging into the first node.

Some steps must also be run inside the Docker container; for those steps, the commands below first enter it with bash enroot_exec_first_container.sh.

Pull / prepare Docker

bash 0.prepare_docker.sh

Download model checkpoints

bash 0.download_model.sh

Request allocation (Slurm)

bash 1.salloc.sh

Determine the node list for each server

In the allocation shell, run:

bash 2.get_node_list_env.sh

This generates node_list_env.sh, which the later server launch scripts source.

Start prefill servers

bash 3.launch_prefill_server_fp4.sh

Start decode servers

bash 4.launch_decode_server_fp4.sh

Start the router

Start the router after the decode and prefill servers have finished starting.

A server (decode or prefill) is ready when its log shows: “The server is fired up and ready to roll!”.

# Enter Docker
bash enroot_exec_first_container.sh
# Start router
bash 5.launch_router.sh
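
The readiness check described above can be automated by polling the server logs for the ready message. This is a minimal sketch under stated assumptions: the helper names `is_ready` and `wait_until_ready` are hypothetical, and you must supply the log paths yourself (logs/launch_server_decode_node_rank_0.log is the decode log referenced later in this README; prefill log names are not stated here).

```python
import time
from pathlib import Path

# Ready message quoted from the server logs.
READY_MARKER = "The server is fired up and ready to roll!"


def is_ready(log_path):
    """Return True if the log file exists and contains the ready message."""
    path = Path(log_path)
    if not path.exists():
        return False
    return READY_MARKER in path.read_text(errors="replace")


def wait_until_ready(log_paths, poll_seconds=10.0, timeout_seconds=3600.0):
    """Poll until every log shows the ready message; return False on timeout."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        if all(is_ready(p) for p in log_paths):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_seconds)
```

For example, calling `wait_until_ready(["logs/launch_server_decode_node_rank_0.log"])` before bash 5.launch_router.sh blocks until that decode log reports ready or the timeout expires.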

Start the benchmark

# Enter Docker
bash enroot_exec_first_container.sh
# Start benchmark
bash 6.start_benchmark.sh

Slow down decode (intentional backlog)

# Enter Docker
bash enroot_exec_first_container.sh
# After the decoder receives this command, each run_batch() sleeps 180s before model forward.
bash 7.start_slow_down_decode.sh

Watch decode logs

After enabling the 180s slowdown, every decode run_batch() sleeps 180s, so KV caches produced by prefill keep accumulating.

Each decode run_batch() can therefore schedule more running requests (#running-req in the logs) in parallel.

The number of running requests can grow up to SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK.

Watch the decode logs; once running-req reaches SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK, run the next step: send slow_down null so decode returns to normal (no sleep).
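
Instead of reading the log by eye, you can extract the latest #running-req value and compare it with the target. This is a minimal sketch: the function names are hypothetical, and the regular expression assumes the "Decode batch" line format shown in the sample logs in this README.

```python
import re

# Matches the "#running-req: N" field of a decode batch log line.
RUNNING_REQ_RE = re.compile(r"#running-req: (\d+)")


def latest_running_req(log_text):
    """Return the #running-req value of the last matching log line, or None."""
    matches = RUNNING_REQ_RE.findall(log_text)
    return int(matches[-1]) if matches else None


def backlog_is_full(log_text, target):
    """True once the latest #running-req has reached the target value."""
    latest = latest_running_req(log_text)
    return latest is not None and latest >= target


sample = (
    "[2025-11-22 23:21:43 DP0 TP0 EP0] Scheduler.run_batch sleep 180.0s\n"
    "[2025-11-22 23:24:43 DP0 TP0 EP0] Decode batch, #running-req: 1024, "
    "#token: 1922048, token usage: 0.62, #queue-req: 853,\n"
)
print(latest_running_req(sample))     # 1024
print(backlog_is_full(sample, 1024))  # True
```

In practice, `target` would be the value of SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK used when the decode server was launched; reading it from the current shell's environment only works if the variable is also set there.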

Stop decode slowdown

# Enter Docker
bash enroot_exec_first_container.sh
# After the decoder receives this command, it stops sleeping before each forward.
bash 8.stop_slow_down_decode.sh

After sending this command, expect to wait roughly 180s before decode visibly reacts. The decode log for the whole sequence (startup, slowdown, recovery) looks like this:

[2025-11-22 23:05:07 DP0 TP0 EP0] Capture cuda graph begin. This can take up to several minutes. avail mem=40.56 GB
[2025-11-22 23:05:07 DP0 TP0 EP0] Capture cuda graph bs [1024]
[2025-11-22 23:05:27 DP0 TP0 EP0] Capture cuda graph end. Time elapsed: 19.62 s. mem usage=31.07 GB. avail mem=9.49 GB.
[2025-11-22 23:05:30 DP0 TP0 EP0] max_total_num_tokens=3122368, chunked_prefill_size=16384, max_prefill_tokens=16384, max_running_requests=1024, context_len=2176, available_gpu_mem=9.49 GB
[2025-11-22 23:05:32 DP0 TP0 EP0] Decode batch, #running-req: 1, #token: 64, token usage: 0.00, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 0.42, #queue-req: 0, 
[2025-11-22 23:05:32 DP0 TP0 EP0] Decode batch, #running-req: 1, #token: 64, token usage: 0.00, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 25.99, #queue-req: 0, 
[2025-11-22 23:05:32 DP0 TP0 EP0] Decode batch, #running-req: 1, #token: 64, token usage: 0.00, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 27.31, #queue-req: 0, 
[2025-11-22 23:05:32 DP0 TP0 EP0] Decode batch, #running-req: 1, #token: 64, token usage: 0.00, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 27.20, #queue-req: 0, 
[2025-11-22 23:05:33 DP0 TP0 EP0] Decode batch, #running-req: 1, #token: 64, token usage: 0.00, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 27.08, #queue-req: 0, 
[2025-11-22 23:05:33 DP0 TP0 EP0] Decode batch, #running-req: 1, #token: 64, token usage: 0.00, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 27.12, #queue-req: 0, 
[2025-11-22 23:05:33 DP0 TP0 EP0] Decode batch, #running-req: 1, #token: 0, token usage: 0.00, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 12.76, #queue-req: 0, 
[2025-11-22 23:05:33 DP0 TP0 EP0] Decode batch, #running-req: 1, #token: 0, token usage: 0.00, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 116.35, #queue-req: 0, 
[2025-11-22 23:08:00 DP0 TP0 EP0] Cache flushed successfully!
[2025-11-22 23:15:42 DP0 TP0 EP0] Scheduler.run_batch sleep 180.0s
[2025-11-22 23:18:43 DP0 TP0 EP0] Scheduler.run_batch sleep 180.0s
[2025-11-22 23:21:43 DP0 TP0 EP0] Scheduler.run_batch sleep 180.0s
[2025-11-22 23:24:43 DP0 TP0 EP0] Decode batch, #running-req: 1024, #token: 1922048, token usage: 0.62, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 0.89, #queue-req: 853, 
[2025-11-22 23:24:43 DP0 TP0 EP0] Decode batch, #running-req: 1024, #token: 1922048, token usage: 0.62, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 12112.66, #queue-req: 853, 
[2025-11-22 23:24:43 DP0 TP0 EP0] Decode batch, #running-req: 1024, #token: 1922048, token usage: 0.62, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 12241.95, #queue-req: 853, 
[2025-11-22 23:24:43 DP0 TP0 EP0] Decode batch, #running-req: 1024, #token: 1922048, token usage: 0.62, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 12156.81, #queue-req: 853,

Capture a Torch profile

Once decode has acknowledged the previous command, it is no longer sleeping and is back in normal decode mode. You can then capture a profile with 9.sglang_profile.sh, using the commands below.

You can run this multiple times to collect profiles at different times.

The script saves 5 profiling steps each run.

# Enter Docker
bash enroot_exec_first_container.sh
# Capture profile
bash 9.sglang_profile.sh

[2025-11-22 23:24:52 DP0 TP0 EP0] Profiling starts. Traces will be saved to: /lustre/fs1/portfolios/coreai/projects/coreai_devtech_all/users/shifangx/1.workspace/7.SGLang_PD/Scripts-SGLang/../torch_profiler (with profile id: 1763882692.8351896)
[2025-11-22 23:24:53 DP0 TP0 EP0] Decode batch, #running-req: 1024, #token: 2053120, token usage: 0.66, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 6892.73, #queue-req: 853, 
[2025-11-22 23:24:53 DP0 TP0 EP0] Decode batch, #running-req: 1024, #token: 2053120, token usage: 0.66, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 13324.23, #queue-req: 853, 
[2025-11-22 23:24:53 DP0 TP0 EP0] Decode batch, #running-req: 1024, #token: 2053120, token usage: 0.66, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 13146.41, #queue-req: 853, 
[2025-11-22 23:24:53 DP0 TP0 EP0] Decode batch, #running-req: 1024, #token: 2053120, token usage: 0.66, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 13180.73, #queue-req: 853, 
[2025-11-22 23:24:53 DP0 TP0 EP0] Stop profiling...
[2025-11-22 23:24:56 DP0 TP0 EP0] Profiling done. Traces are saved to: /lustre/fs1/portfolios/coreai/projects/coreai_devtech_all/users/shifangx/1.workspace/7.SGLang_PD/Scripts-SGLang/../torch_profiler

Load balance between EP ranks

To balance load across EP ranks, there are two approaches:

Solution 1: use a pre-recorded expert distribution file to initialize expert placement

GB200 blog part 2 uses this approach.

Step 1: create expert distribution data

Launch the decode server with --expert-distribution-recorder-mode stat and --expert-distribution-recorder-buffer-size -1.

Before starting the benchmark, start recording with bash enroot_exec_first_container.sh; bash z.1.start_expert_distribution_record.sh.

After running bash 8.stop_slow_down_decode.sh (the step that sends slow_down null), wait 30 minutes, then dump the recording with bash enroot_exec_first_container.sh; bash z.2.dump_expert_distribution_record.sh.

The dump is written under ${SGLANG_EXPERT_DISTRIBUTION_RECORDER_DIR}.

For more detail, see the LMSYS Org blog post "Deploying DeepSeek with PD Disaggregation and Large-Scale Expert Parallelism on 96 H100 GPUs".

Step 2: pass `--init-expert-location` pointing at the dumped file

Launch the decode server with --init-expert-location ${SGLANG_EXPERT_DISTRIBUTION_RECORDER_DIR}/expert_distribution_recorder_xxx.pt.
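
If several dumps accumulate in the recorder directory, you need to pick one to pass to --init-expert-location. This is a minimal sketch that selects the most recently modified file. It assumes dump files match expert_distribution_recorder_*.pt, a pattern inferred from the expert_distribution_recorder_xxx.pt placeholder above; the helper name `latest_expert_dump` is hypothetical.

```python
from pathlib import Path


def latest_expert_dump(record_dir):
    """Return the newest expert_distribution_recorder_*.pt file, or None.

    ASSUMPTION: the file name pattern is inferred from the placeholder
    name in this README, not taken from the SGLang source.
    """
    candidates = sorted(
        Path(record_dir).glob("expert_distribution_recorder_*.pt"),
        key=lambda p: p.stat().st_mtime,  # oldest first, newest last
    )
    return candidates[-1] if candidates else None
```

For example, `latest_expert_dump(os.environ["SGLANG_EXPERT_DISTRIBUTION_RECORDER_DIR"])` (after `import os`) returns the path to pass to the decode server, provided the variable is set in the calling environment.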

Solution 2: real-time rebalancing with `--eplb-algorithm deepseek` and `--enable-eplb`

Launch the decode server with --eplb-algorithm deepseek and --enable-eplb. The EPLB manager performs extra work to rebalance load. This uses additional GPU memory.

Calculating throughput

Example workflow for computing throughput (e.g. for EP48).

Global batch size

Search for "Profiling starts" and "Stop profiling" in logs/launch_server_decode_node_rank_0.log.

You should see a line like the one below; 1760576844.4826152 is the Torch profile id, which is also the trace filename stem.

[2025-10-15 18:07:24 DP0 TP0 EP0] Profiling starts. Traces will be saved to: /lustre/fs1/portfolios/coreai/projects/coreai_devtech_all/users/shifangx/1.workspace/6.SGLang_PD/Scripts-SGLang/../torch_profiler (with profile id: 1760576844.4826152)

You should also see decode batch lines; #running-req: 285 is the local batch size on DP0 for that line.

[2025-10-15 18:07:24 DP0 TP0 EP0] Decode batch. #running-req: 285, #token: 328960, token usage: 0.51, pre-allocated usage: 0.00, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 3344.60, #queue-req: 0,

Repeat for each DP rank, e.g. local batch sizes: 285, 256, 254, 265, 271, 227, 221, 269.

Global batch size: 285 + 256 + 254 + 265 + 271 + 227 + 221 + 269 = 2048.

Duration per forward step

Open the trace torch_profiler/1760576844.4826152-TP-0.trace.json.gz.

In this example, each forward step takes 65 ms.

Throughput per GPU

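Assuming each running request produces one token per forward step, per-GPU throughput is global batch size / step duration / number of GPUs: 2048 / 0.065 s / 8 ≈ 3938 tokens/s/GPU.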
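
The arithmetic above can be reproduced with a few lines of Python. This is a sketch using the example numbers from this section; the function name `per_gpu_throughput` is hypothetical, and it assumes one generated token per running request per forward step.

```python
def per_gpu_throughput(local_batch_sizes, step_seconds, num_gpus):
    """Tokens per second per GPU, assuming one token per request per step."""
    global_batch = sum(local_batch_sizes)
    return global_batch / step_seconds / num_gpus


# #running-req values read from each DP rank's log (example from this section).
local = [285, 256, 254, 265, 271, 227, 221, 269]

print(sum(local))                                   # 2048
print(round(per_gpu_throughput(local, 0.065, 8)))   # 3938
```

Substitute your own per-rank batch sizes, the step duration measured in the trace, and the number of GPUs in the decode group.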

References

SGLang developer guide

SGLang developer guide: bench_serving

SGLang blog series

Deploying DeepSeek with PD Disaggregation and Large-Scale Expert Parallelism on 96 H100 GPUs (LMSYS Org)

Deploying DeepSeek on GB200 NVL72 with PD and Large Scale EP (Part I): 2.7x Higher Decoding Throughput (LMSYS Org)

Deploying DeepSeek on GB200 NVL72 with PD and Large Scale EP (Part II): 3.8x Prefill, 4.8x Decode Throughput (LMSYS Org)

Together with SGLang: Best Practices for Serving DeepSeek-R1 on H20-96G (LMSYS Org)

DeepSeek V3 minimal example

This SGLang PR includes B200 TP8 scripts and measured results; useful as a reference when setting up GB200 EP8.

sgl-project/sglang Pull Request #10281: Enables TRT-LLM backend to be used for target_verify (by pranavm-nvidia)

SGLang code walk-through

Awesome-ML-SYS-Tutorial/sglang/code-walk-through

FlashInfer paper

FlashInfer is an SGLang backend; the paper explains its low-level design choices.

FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving
