Decentralized disaggregated deployment architecture #947
fuheaven wants to merge 463 commits into ModelTC:main
Conversation
- Tidy VAReader & OmniVAReader
- Tidy VARecorder & X264VARecorder
- VARecorder with stream, use buffer stream
- Tidy env WORKER_RANK, READER_RANK, RECORDER_RANK
- Support voice type choose
Co-authored-by: root <root@pt-de4c35727a1b4d1b9f27f422f06026ec-worker-0.pt-de4c35727a1b4d1b9f27f422f06026ec.ns-devsft-3460edd0.svc.cluster.local> Co-authored-by: root <root@pt-9b2035a55fe647eeb007584b238e5077-worker-0.pt-9b2035a55fe647eeb007584b238e5077.ns-devsft-3460edd0.svc.cluster.local>
ModelTC#597) …up process in DistributedManager
Co-authored-by: yihuiwen <yihuiwen@sensetime.com>
Co-authored-by: wangshankun <wangshankun@sensetime.com>
WanModel inherits from CompiledMethodsMixin, so it is guaranteed to have compile
1. Rename dcu to hygon_dcu
2. Fix flash attention bug
Co-authored-by: yihuiwen <yihuiwen@sensetime.com>
Co-authored-by: qinxinyi <qxy118045534@163.com>
Co-authored-by: yihuiwen <yihuiwen@sensetime.com>
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a three-stage disaggregated deployment mode for large generative models (such as Wan and Qwen Image) in the LightX2V framework. By splitting the inference pipeline into independent Encoder, Transformer, and Decoder services, it aims to significantly reduce GPU memory usage, improve system throughput, and support flexible deployment across devices and machines. Integration of the Mooncake transfer engine and LightLLM optimizations ensures efficient data transfer and a faster encoding stage, providing a more stable and scalable solution for high-resolution, long-duration generation scenarios.

Highlights
Code Review
This PR introduces a very important three-stage disaggregated deployment feature (Encoder + Transformer + Decoder). It is an excellent piece of engineering that can effectively optimize GPU memory usage and inference throughput for large-scale generative models in distributed environments.

The implementation is very comprehensive, covering low-level communication (based on Mooncake), the core logic (DisaggMixin), integration with the existing Runners, and the higher-level configuration, documentation, and test scripts. The overall design is well thought out, for example:

- Using DisaggMixin to share the disaggregated-deployment logic keeps the code structure clean.
- Models are loaded on demand per role (encoder, transformer, decode), effectively reducing GPU memory usage.
- Hash verification of transferred data guarantees data consistency.
- Detailed documentation (in Chinese) plus ready-to-use launch and test scripts greatly lower the barrier to entry.

I found some minor issues in the documentation and script comments and have left suggestions in the individual review comments; I hope they help polish this feature further. Overall, this is a high-quality contribution.
```bash
python -m lightx2v.server \
    --model_cls wan2.1 \
    --task t2v \
    --model_path $model_path \
    --config_json ${lightx2v_path}/configs/wan/wan_t2v_disagg_decode.json \
    --host 0.0.0.0 \
    --port 8004
```
Hi, this document is very detailed and very helpful for understanding and using the disaggregated deployment feature.

The example in section 3.1 (manually starting the services) uses the environment variables $model_path and ${lightx2v_path}. Readers who come straight to this section may not know how to set these two variables.

I suggest adding a short note here reminding users to set them first, referring to how they are defined in the script scripts/server/disagg/wan/start_wan_t2v_disagg.sh. For example:

> **Note**: The `$model_path` and `${lightx2v_path}` variables in the commands below must be set in advance. `$lightx2v_path` should point to the project root directory, and `$model_path` should point to the directory containing the model files.

This would improve the usability of the document.
    # GPU_T : Transformer (port 8005)
    #
    # Override GPUs via environment variables:
    #   GPU_ENCODER=4 GPU_TRANSFORMER=5 GPU_DECODER=6 ./start_wan_i2v_disagg_all.sh
    # GPU_T : Transformer (port 8003)
    #
    # Override GPUs via environment variables:
    #   GPU_ENCODER=4 GPU_TRANSFORMER=5 GPU_DECODER=6 ./start_wan_t2v_disagg_all.sh
    if self.text_encoder_type in ["lightllm_service", "lightllm_kernel"]:
        logger.info(f"Using LightLLM text encoder: {self.text_encoder_type}")

    def set_config(self, config_modify):
Call chain: client HTTP POST → server → worker.py:95 → runner.set_config(task_data) → runner.run_pipeline(input_info)

The logic here maps the flat disagg parameters from the HTTP request into the nested config["disagg_config"] dictionary. In decentralized mode, a single HTTP POST is enough for the client to specify which Transformer worker the request goes to and which Mooncake room to use.
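The flat-to-nested mapping described above can be sketched as a small helper. This is a minimal illustration, assuming `disagg_*`-prefixed request fields are folded into `config["disagg_config"]`; the exact key handling in worker.py may differ.

```python
def apply_task_config(config: dict, task_data: dict) -> dict:
    """Fold flat disagg_* fields from an HTTP request into config["disagg_config"]."""
    disagg = config.setdefault("disagg_config", {})
    for key, value in task_data.items():
        if key.startswith("disagg_"):
            # e.g. "disagg_phase1_receiver_engine_rank" -> "phase1_receiver_engine_rank"
            disagg[key[len("disagg_"):]] = value
        else:
            config[key] = value
    return config

cfg = apply_task_config({}, {
    "disagg_phase1_receiver_engine_rank": 1,   # which Transformer worker
    "data_bootstrap_room": 12345,              # which Mooncake room
})
```

With this shape, one POST body carries both routing and session information, so the client never needs to talk to the Transformer or Decoder nodes directly.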
            self.model = None
            self.text_encoders = None
            self.vae = self.load_vae()
        else:
If disagg_mode is not specified here, the default model-loading logic is used.
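The per-role loading in the diff above can be summarized as a small planner. This is an illustrative sketch only: the component names are inferred from this PR's description (the Encoder sends context / clip_encoder_out / vae_encoder_out; the decode node loads only the VAE), not taken from the actual runner code.

```python
def plan_model_loading(disagg_mode):
    """Return the set of components a node with the given role needs to load."""
    if disagg_mode == "encoder":
        return {"text_encoders", "clip_encoder", "vae_encoder"}
    if disagg_mode == "transformer":
        return {"dit"}                      # denoising transformer only
    if disagg_mode == "decode":
        return {"vae_decoder"}              # mirrors self.vae = self.load_vae()
    # No disagg_mode: fall back to the default monolithic loading logic.
    return {"text_encoders", "clip_encoder", "dit", "vae_encoder", "vae_decoder"}
```

Loading only the role's components is what drives the memory savings: each node holds one slice of the pipeline instead of the full model set.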
    encoder_config = self.config.copy()
    lightllm_config = self.config.get("lightllm_config", {})
    encoder_config.update(lightllm_config)
    encoder_config = dict(self.config)
    assert self.config.get("cpu_offload", False)
    if self.config.get("disagg_mode"):
        self.init_disagg(self.config)
    super().init_modules()
The base class's init is used here instead, because init_modules in the base class already contains the same lazy-load logic.
        logger.info(f"Qwen Image Runner got custom shape: {width}x{height}")
        return (width, height)

    cfg_h = self.config.get("target_height")
This block allows the output image resolution to be specified directly via the target_height / target_width fields in the config, without relying on the aspect_ratio presets or request parameters.
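A sketch of the lookup order this comment describes, assuming the `target_height`/`target_width` config keys; the fallback value here is purely illustrative, standing in for whatever the aspect_ratio preset would produce.

```python
def resolve_target_shape(config, fallback=(1328, 1328)):
    """Prefer an explicit target_width/target_height pair from the config file."""
    h = config.get("target_height")
    w = config.get("target_width")
    if h is not None and w is not None:
        return (int(w), int(h))   # explicit config beats aspect_ratio presets
    return fallback               # e.g. resolved from an aspect_ratio preset

shape = resolve_target_shape({"target_width": 1360, "target_height": 768})
```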
        return None

    def set_target_shape(self):
        # In disagg transformer mode, use the shape transmitted from encoder
Same as above: this was added for the convenience of POST requests, because I want to preset a default output resolution in the config file rather than passing target_shape or aspect_ratio with every HTTP request.
    def run_image_encoder(self):
        pass

    @ProfilingContext4DebugL2("Load models")
This section appears to duplicate the model loading in the base class, so it was removed.
    def init_scheduler(self):
        super().init_scheduler()
        if self.config.get("disagg_mode") == "decode":
            return
In decode mode we must return immediately and must not initialize the scheduler, which is why I also added a NullScheduler.
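A NullScheduler in the spirit described here might look like the following. The hook names (`step_pre`/`step_post`) and the `infer_steps` attribute are placeholders, not the actual scheduler interface.

```python
class NullScheduler:
    """No-op stand-in for disagg_mode == "decode": the decode node never
    runs denoising steps, so every scheduler hook does nothing."""

    infer_steps = 0  # placeholder: no denoising iterations happen on this node

    def step_pre(self, step_index=None):
        pass

    def step_post(self):
        pass

sched = NullScheduler()
```

Using a null object keeps the decode-node code path free of `if scheduler is None` checks while guaranteeing that no denoising state is ever touched.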
        - disagg decode: receive_transformer_outputs → VAE → save.
        """
        self.input_info = input_info
        disagg_mode = self.config.get("disagg_mode")
This is simply an if-dispatch on encoder / transformer / decode, each role taking a different pipeline.
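The role dispatch can be summarized as a pure function over `disagg_mode`. The stage names below are shorthand for the steps in the docstring (e.g. receive_transformer_outputs → VAE → save for decode), not real method names in the runner.

```python
def plan_pipeline(disagg_mode):
    """Ordered stages a node with this role executes in run_pipeline."""
    if disagg_mode == "encoder":
        return ["run_encoders", "send_phase1"]
    if disagg_mode == "transformer":
        return ["recv_phase1", "denoise", "send_phase2"]
    if disagg_mode == "decode":
        return ["recv_phase2", "vae_decode", "save"]
    # No disagg_mode: the monolithic pipeline runs everything locally.
    return ["run_encoders", "denoise", "vae_decode", "save"]
```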
    def init_scheduler(self):
        """Initialize scheduler."""
        pass
This NullScheduler was added to support the decode stage of the three-stage pipeline.
Summary
Integrated Mooncake's disaggregated deployment mode into the runner to provide LightX2V with full three-stage disaggregated inference capability. The inference pipeline can be split into Encoder, Transformer, and Decoder nodes, where the VAE Decoder is deployed independently on the Decoder node. Both the Wan and Qwen model families are supported.
On top of the three-stage foundation, this PR further introduces decentralized queue scheduling: a Controller process hosts RDMA metadata ring buffers, Transformer and Decoder run as pull-based workers, and the client only needs a single HTTP POST to the Encoder—no more three-way sequential requests. Multiple Transformer workers can be deployed across GPUs for parallel DiT execution.
Feature Highlights
- Decentralized queue scheduling (RDMABuffer): a Controller hosts request / phase1 / phase2 metadata rings; the Encoder publishes dispatch metadata after inference; Transformer and Decoder workers pull tasks from the rings automatically. The client sends one HTTP request to the Encoder instead of three sequential POSTs.
- Multiple Transformer workers (receiver_engine_rank) can run on different GPUs. Requests specify disagg_phase1_receiver_engine_rank to target a specific worker, enabling round-robin or explicit routing.
- rdma_faa upgraded from a read-modify-write shim to real IBV_WR_ATOMIC_FETCH_AND_ADD; new rdma_cas (IBV_WR_ATOMIC_CMP_AND_SWP) added. Both RDMAServer and RDMAClient register REMOTE_ATOMIC access flags.
- Queue metrics (queue_sizes, queue_total_pending, all_queues_empty) exposed via the Reporter's set_extra_metrics_provider() hook, providing real-time pipeline backlog visibility.

Disaggregated Architecture (Three-Stage Pipeline)
Based on the disagg_mode configuration, the inference pipeline is physically split into three independent services. Data flows through Phase1 (Encoder → Transformer) and Phase2 (Transformer → Decoder), requiring two Mooncake transfers.

Encoder Role (disagg_mode="encoder")

After startup, it performs feature extraction and sends tensors through Mooncake Phase1 to the Transformer node, including:

- context
- clip_encoder_out
- vae_encoder_out
- latent_shape

Transformer Role (disagg_mode="transformer")

After startup, it waits for Phase1 data. Upon receiving it, it performs the DiT denoising computation. If decoder_engine_rank is configured, it sends the denoised latents to the Decoder node via Mooncake Phase2 and does not perform local VAE decoding.

Decoder Role (disagg_mode="decode")

After startup, it enters a Phase2 receive-and-wait state. When it receives the latents from the Transformer, it performs VAE decoding and saves the result. Both task completion status and result files are stored on the Decoder node.
Decentralized Queue Scheduling
Architecture
How it differs from standard three-stage
Data flow
- The client sends a single HTTP POST to the Encoder (/v1/tasks/image/) with prompt, data_bootstrap_room (unique room ID), and disagg_phase1_receiver_engine_rank (target Transformer rank).
- The client polls /v1/tasks/{task_id}/status until completed.

Key components
- Controller (ControllerService.serve_rdma_dispatch_only()): hosts three RDMA ring buffers (request / phase1 / phase2), no model loading, always-on background process.
- RDMA ring buffer (rdma_buffer.py): shared ring buffer over RDMAServer/RDMAClient with slot-level atomic coordination for multi-producer/multi-consumer JSON dispatch.
- Queue workers (qwen_t2i_queue_workers.py): Transformer and Decoder worker loops that consume from the RDMA rings via disagg_try_consume_phase1()/disagg_try_consume_phase2(), then call disagg_transformer_prepare_dispatch()/disagg_decoder_prepare_dispatch() to set up per-request Mooncake sessions.
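The client-side flow above can be sketched as follows. The endpoint paths and request field names come from this PR's description; the response fields (`task_id`, `status`) and the use of plain `urllib` are assumptions for illustration only.

```python
import json
import time
import urllib.request

def build_task_request(prompt, receiver_rank, room_id):
    """POST body for /v1/tasks/image/ in decentralized mode."""
    return {
        "prompt": prompt,
        "data_bootstrap_room": room_id,                        # unique room ID
        "disagg_phase1_receiver_engine_rank": receiver_rank,   # target Transformer worker
    }

def submit_and_wait(encoder_url, body, poll_interval=2.0):
    """Single POST to the Encoder, then poll status until completed."""
    req = urllib.request.Request(
        f"{encoder_url}/v1/tasks/image/",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    task_id = json.load(urllib.request.urlopen(req))["task_id"]  # assumed response field
    while True:
        with urllib.request.urlopen(f"{encoder_url}/v1/tasks/{task_id}/status") as resp:
            status = json.load(resp)
        if status.get("status") == "completed":   # result files live on the Decoder node
            return status
        time.sleep(poll_interval)

body = build_task_request("a cat in the snow", receiver_rank=0, room_id=12345)
```

Note that the client only ever talks to the Encoder; routing to a specific Transformer worker and the eventual decode are handled by the RDMA metadata rings behind the scenes.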