[Models] fix fleet model fallback ep init #8039
89c6e7d to
f5a72c4
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review |
2026-06-11 15:17:38
📋 Review Summary
PR overview: Adjusts generation-model validation and the EP/TP initialization flow for the PaddleFleet fallback.
Scope of changes: fastdeploy/config.py, fastdeploy/model_executor/models/paddleformers/base_fleet.py
Impact tags: [FDConfig] [Models]
Issues
| Severity | File | Summary |
|---|---|---|
| 🔴 Bug | fastdeploy/model_executor/models/paddleformers/base_fleet.py:420 | A stale PaddleFleet TP group may still be reused after fleet.init() is called again |
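📝 PR Convention Check
The title includes the official [Models] tag and meets the title convention. The Motivation, Modifications, Usage or Command, and Accuracy Tests sections of the PR description are still template placeholders; consider replacing them with the complete content below.
Suggested title (can be copied as-is):
[Models] Fix PaddleFleet fallback EP initialization
Suggested PR description (can be copied as-is)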
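## Motivation
Fix generation-model validation and the reuse of stale Expert Parallel initialization state in the `--model-impl paddlefleet` fallback path, so that PaddleFleet does not initialize with a mismatched TP/EP parallel state.
## Modifications
- `fastdeploy/config.py`: allow `model_impl=paddlefleet` to continue down the fallback path when `runner_type=generate` and the model is not marked as a generation model in FastDeploy's native registry.
- `fastdeploy/model_executor/models/paddleformers/base_fleet.py`: configure CPU initialization and skip parameter initialization for the PaddleFleet fallback, and reset the Paddle Fleet hybrid topology before `fleet.init()`.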
## Usage or Command
N/A
## Accuracy Tests
N/A
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
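- [x] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.
Overall assessment
The direction and problem domain of this PR are reasonable, but the current implementation skips reinitialization whenever a PaddleFleet TP group already exists (is not None), so stale topology state may still be retained. Fix this initialization-consistency issue first, and add or update regression coverage for the PaddleFleet fallback initialization path.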
# Check if we need to initialize or reinitialize TP group
need_init = False
if parallel_state._TENSOR_MODEL_PARALLEL_GROUP is None:
🔴 Bug: The PaddleFleet TP group is initialized from the new HCG only when _TENSOR_MODEL_PARALLEL_GROUP is None, so an existing TP group whose topology is stale is kept.
Earlier in this function, only paddle.distributed.fleet.base.topology._HYBRID_PARALLEL_GROUP and paddle.distributed.parallel_helper.__parallel_ctx__clz__ are reset; paddlefleet.parallel_state._TENSOR_MODEL_PARALLEL_GROUP is not cleared. If that global was already initialized under an older FastDeploy/Fleet topology, then after fleet.init() rebuilds the HCG this branch skips parallel_state.initialize_model_parallel(hcg), and PaddleFleet's Column/RowParallelLinear layers keep sharding by the old TP group, which is inconsistent with the new EP/TP topology.
Suggested fix: before calling fleet.init() again, also clear the TP group and global ranks in PaddleFleet's parallel_state, or keep the old code's group size/topology mismatch check. When the current group does not match parallel_config.tensor_parallel_size or the new HCG, parallel_state.initialize_model_parallel(hcg) must be called again with the HCG from fleet.get_hybrid_communicate_group().
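The mismatch check suggested above can be sketched without Paddle installed. This is an illustration under stated assumptions, not the project's implementation: `StubGroup`, `StubHCG`, `StubParallelState`, `tp_group_is_stale`, and `ensure_tp_group` are hypothetical names, and the `nranks`/`ranks` fields are assumed stand-ins for whatever the real communication group exposes. Only the `_TENSOR_MODEL_PARALLEL_GROUP` global and the `initialize_model_parallel(hcg)` call are taken from the review.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class StubGroup:
    """Hypothetical stand-in for a TP communication group."""
    nranks: int
    ranks: Tuple[int, ...]


class StubHCG:
    """Hypothetical stand-in for the hybrid communicate group built by fleet.init()."""

    def __init__(self, tp_group: StubGroup) -> None:
        self._tp_group = tp_group

    def get_model_parallel_group(self) -> StubGroup:
        return self._tp_group


class StubParallelState:
    """Hypothetical stand-in for paddlefleet.parallel_state and its TP-group global."""

    def __init__(self, group: Optional[StubGroup] = None) -> None:
        self._TENSOR_MODEL_PARALLEL_GROUP = group

    def initialize_model_parallel(self, hcg: StubHCG) -> None:
        self._TENSOR_MODEL_PARALLEL_GROUP = hcg.get_model_parallel_group()


def tp_group_is_stale(state, hcg, tensor_parallel_size: int) -> bool:
    """Return True when the cached TP group is missing or disagrees with the new topology."""
    current = state._TENSOR_MODEL_PARALLEL_GROUP
    if current is None:
        return True
    fresh = hcg.get_model_parallel_group()
    return (
        current.nranks != tensor_parallel_size
        or tuple(current.ranks) != tuple(fresh.ranks)
    )


def ensure_tp_group(state, hcg, tensor_parallel_size: int) -> bool:
    """Reinitialize the TP group from the new HCG if needed; return whether it was rebuilt."""
    if tp_group_is_stale(state, hcg, tensor_parallel_size):
        state.initialize_model_parallel(hcg)
        return True
    return False


if __name__ == "__main__":
    # A TP=2 group left over from an earlier topology, followed by a new TP=4 HCG.
    state = StubParallelState(StubGroup(nranks=2, ranks=(0, 1)))
    hcg = StubHCG(StubGroup(nranks=4, ranks=(0, 1, 2, 3)))
    print(ensure_tp_group(state, hcg, tensor_parallel_size=4))
    print(state._TENSOR_MODEL_PARALLEL_GROUP)
```

A check that tests only `is None` would treat the leftover TP=2 group as valid. Comparing both the group size and the rank list against the freshly built HCG also catches the case where the size is unchanged but the rank membership has moved.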
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## develop #8039 +/- ##
==========================================
Coverage ? 67.71%
==========================================
Files ? 471
Lines ? 66341
Branches ? 10211
==========================================
Hits ? 44925
Misses ? 18549
Partials ? 2867
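The CI report is generated from the following code (refreshed every 30 minutes):
1 Required jobs: 9/10 passed
2 Failure details
🔴 Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: PR issue (confidence: high)
Error type: PR issue | Confidence: high
Key logs:
Suggested fix:
Related changes: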