[Models] fix fleet model fallback ep init by xiaoguoguo626807 · Pull Request #8039 · PaddlePaddle/FastDeploy

xiaoguoguo626807 · 2026-06-11T07:01:31Z

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

Usage or Command

Accuracy Tests

Checklist

Add at least a tag in the PR title.
- Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
- You can add new tags based on the PR content, but the semantics must be clear.
Format your code, run pre-commit before commit.
Add unit tests. Please write the reason in this PR if no unit tests.
Provide accuracy results.
If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

PaddlePaddle-bot

🤖 Paddle-CI-Agent | pr_review | 2026-06-11 15:17:38

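📋 Review summary

PR overview: Adjusts the generation-model validation and the EP/TP initialization flow for the PaddleFleet fallback.
Scope of changes: fastdeploy/config.py, fastdeploy/model_executor/models/paddleformers/base_fleet.py
Affected tags: [FDConfig] [Models]

Issues

Severity	File	Summary
🔴 Bug	`fastdeploy/model_executor/models/paddleformers/base_fleet.py:420`	After re-running `fleet.init()`, a stale PaddleFleet TP group may still be reused

📝 PR convention check

The title contains the official [Models] tag and conforms to the title convention. The Motivation, Modifications, Usage or Command, and Accuracy Tests sections of the PR description are still template placeholders; consider replacing them with the complete content below.

Suggested title (can be copied directly):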

[Models] Fix PaddleFleet fallback EP initialization

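Suggested PR description (can be copied directly):

## Motivation
Fix generation-model validation and the reuse of stale Expert Parallel initialization state in the `--model-impl paddlefleet` fallback path, so that PaddleFleet does not initialize with a mismatched TP/EP parallel state.

## Modifications
- `fastdeploy/config.py`: allow `model_impl=paddlefleet` to continue down the fallback path when `runner_type=generate` and the model is not marked as a generation model in FastDeploy's native registry.
- `fastdeploy/model_executor/models/paddleformers/base_fleet.py`: configure CPU initialization and skip parameter initialization for the PaddleFleet fallback, and reset the Paddle Fleet hybrid topology before `fleet.init()`.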

## Usage or Command
N/A

## Accuracy Tests
N/A

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [x] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.
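
The `fastdeploy/config.py` rule in the suggested Modifications above can be read as a small predicate. The sketch below is only an illustration of that rule: the function name `generate_runner_allowed` and its parameters are hypothetical and do not come from the FastDeploy source.

```python
def generate_runner_allowed(
    model_impl: str,
    runner_type: str,
    is_native_generation_model: bool,
) -> bool:
    """Hypothetical helper: decide whether config validation should pass.

    Assumption: validation only concerns runner_type == "generate", and the
    native registry check is bypassed solely for model_impl == "paddlefleet".
    """
    if runner_type != "generate":
        # This sketch does not model other runner types.
        return True
    if is_native_generation_model:
        # The native registry already marks the model as a generation model.
        return True
    # Not natively registered: only the PaddleFleet fallback may continue.
    return model_impl == "paddlefleet"
```

Under these assumptions, a model unknown to the native registry passes validation only when the PaddleFleet fallback is requested.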

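Overall assessment

The direction and problem domain of this PR are reasonable, but the current implementation skips reinitialization whenever an existing PaddleFleet TP group is non-empty, so stale topology state may still be retained. Fix this initialization consistency issue first, and add or update regression checks for the PaddleFleet fallback initialization path.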

PaddlePaddle-bot · 2026-06-11T07:20:33Z

-
-            # Check if we need to initialize or reinitialize TP group
-            need_init = False
            if parallel_state._TENSOR_MODEL_PARALLEL_GROUP is None:


🔴 Bug: Here the PaddleFleet TP group is initialized from the new HCG only when _TENSOR_MODEL_PARALLEL_GROUP is None, so an existing TP group whose topology is stale is kept.

Earlier in this function, only paddle.distributed.fleet.base.topology._HYBRID_PARALLEL_GROUP and paddle.distributed.parallel_helper.__parallel_ctx__clz__ are reset; paddlefleet.parallel_state._TENSOR_MODEL_PARALLEL_GROUP is not cleared. If that global was already initialized by an old FastDeploy/Fleet topology, then after fleet.init() rebuilds the HCG this code skips parallel_state.initialize_model_parallel(hcg), and PaddleFleet's Column/RowParallelLinear layers still shard according to the old TP group, which is inconsistent with the new EP/TP topology.

Suggested fix: before re-running fleet.init(), also clear the TP group/global ranks in PaddleFleet's parallel_state, or keep the old code's group size/topology mismatch check. When the current group does not match parallel_config.tensor_parallel_size or the new HCG, parallel_state.initialize_model_parallel(hcg) must be called again with the result of fleet.get_hybrid_communicate_group().
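
The mismatch check suggested in this comment might look roughly like the sketch below. It runs against stand-in objects rather than the real PaddleFleet module: `FakeGroup`, `FakeParallelState`, the dict used as an HCG, and the helper `ensure_tp_group` are all hypothetical, and comparing the group's `nranks` with the configured TP size is an assumption about how staleness could be detected.

```python
from dataclasses import dataclass


@dataclass
class FakeGroup:
    """Stand-in for a communication group; only its size matters here."""
    nranks: int


class FakeParallelState:
    """Stand-in for paddlefleet.parallel_state (not the real module)."""

    def __init__(self, group=None):
        self._TENSOR_MODEL_PARALLEL_GROUP = group
        self.init_calls = 0

    def initialize_model_parallel(self, hcg):
        # The fake derives the TP group size from the HCG stand-in.
        self._TENSOR_MODEL_PARALLEL_GROUP = FakeGroup(hcg["tp_size"])
        self.init_calls += 1


def ensure_tp_group(parallel_state, hcg, tensor_parallel_size):
    """Reinitialize when the TP group is missing or its size is stale.

    Returns True if initialize_model_parallel was called.
    """
    group = parallel_state._TENSOR_MODEL_PARALLEL_GROUP
    if group is not None and group.nranks == tensor_parallel_size:
        return False  # the existing group already matches the new topology
    # Drop the missing or stale group before rebuilding it from the new HCG.
    parallel_state._TENSOR_MODEL_PARALLEL_GROUP = None
    parallel_state.initialize_model_parallel(hcg)
    return True


# A group left over from a TP=1 topology is replaced when TP becomes 4.
state = FakeParallelState(FakeGroup(nranks=1))
reinitialized = ensure_tp_group(state, {"tp_size": 4}, tensor_parallel_size=4)
```

Unlike a bare `is None` check, this variant also rebuilds a group that exists but no longer matches `tensor_parallel_size`. A real fix would additionally have to clear any other PaddleFleet globals that `initialize_model_parallel` asserts on.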

codecov-commenter · 2026-06-11T07:42:40Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (develop@f161fea). Learn more about missing BASE report.

Additional details and impacted files

@@            Coverage Diff             @@
##             develop    #8039   +/-   ##
==========================================
  Coverage           ?   67.71%           
==========================================
  Files              ?      471           
  Lines              ?    66341           
  Branches           ?    10211           
==========================================
  Hits               ?    44925           
  Misses             ?    18549           
  Partials           ?     2867

Flag	Coverage Δ
GPU	`77.78% <100.00%> (?)`
XPU	`6.99% <0.00%> (?)`

Flags with carried forward coverage won't be shown. Click here to find out more.
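
As a quick consistency check on the totals above, the sketch below recomputes the coverage percentage from the reported hits, misses, and partials. Truncating (rather than rounding) to two decimal places is an assumption made here to match the reported figure; the helper name `coverage_percent` is illustrative.

```python
import math

# Totals copied from the coverage diff above.
hits, misses, partials = 44925, 18549, 2867
lines = hits + misses + partials  # should equal the reported line count


def coverage_percent(hits: int, lines: int, places: int = 2) -> float:
    """Hit ratio as a percentage, truncated to `places` decimals (assumed)."""
    scale = 10 ** places
    return math.floor(hits / lines * 100 * scale) / scale


print(lines)                          # 66341
print(coverage_percent(hits, lines))  # 67.71
```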


PaddlePaddle-bot · 2026-06-11T15:39:03Z

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-06-12 21:32:45

This CI report is generated from the following code (updated every 30 minutes):
PR commit: f5a72c4 | Merge base: f161fea (branch: develop)

1 Required jobs: 9/10 passed

Total runs (reruns)	Total jobs	✅ Passed	❌ Failed	⏳ Running	⏸️ Pending	Skipped
42(0)	42	38	4	0	0	0

Job	Error type	Confidence	Log
`Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage`	PR issue	High	Job

2 Failure details

🔴 Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: PR issue (confidence: high)

Error type: PR issue | Confidence: high
Analyzer: generic analysis (fallback)
Failed test cases:

Test case	Error summary
`tests/model_executor/fallback/test_fallback_fleet_model_coverge.py::TestInitPaddlefleetParallelState::test_seed_assertion_error_is_silenced`	`initialize_model_parallel` trips PaddleFleet's assertion that the global memory buffer is already initialized

Key log lines:

tests/model_executor/fallback/test_fallback_fleet_model_coverge.py:1070: model._init_paddlefleet_parallel_state(fd_config)
fastdeploy/model_executor/models/paddleformers/base_fleet.py:422: parallel_state.initialize_model_parallel(hcg)
/usr/local/lib/python3.10/dist-packages/paddlefleet/parallel_state.py:426: assert _GLOBAL_MEMORY_BUFFER is None
AssertionError: global memory buffer is already initialized

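Root cause summary: the PR changed PaddleFleet initialization without resetting the global memory buffer
At fastdeploy/model_executor/models/paddleformers/base_fleet.py:420-422, the PR removes the previous TP=1 manual group creation and TP size mismatch branches, and instead calls parallel_state.initialize_model_parallel(hcg) whenever _TENSOR_MODEL_PARALLEL_GROUP is None. The failing test sets _TENSOR_MODEL_PARALLEL_GROUP to None and is meant to verify that the AssertionError from model_parallel_cuda_manual_seed is swallowed at base_fleet.py:428-431. With the new logic, execution enters initialize_model_parallel first and is stopped by PaddleFleet's assertion on the already existing _GLOBAL_MEMORY_BUFFER, so the seed branch never runs. The unit test failure is therefore caused directly by this PR's change to the initialization logic in base_fleet.py.

Suggested fixes:

In base_fleet.py, before calling parallel_state.initialize_model_parallel(hcg), use the official PaddleFleet/Paddle reset/destroy API to clear any existing parallel_state global state, covering at least _GLOBAL_MEMORY_BUFFER. If TP=1 does not need model parallel to be reinitialized, keep or restore the original TP=1 manual group creation path.
If the new initialization semantics are intended, update the test near tests/model_executor/fallback/test_fallback_fleet_model_coverge.py:1049 accordingly: when it only verifies that the seed assertion is swallowed, mock or reset the global state related to ps.initialize_model_parallel, and remove assertions that still describe the old TP=1 branch.

Related changes: fastdeploy/model_executor/models/paddleformers/base_fleet.py:399-422; the PR removes the old TP=1/mismatch branch and adds a fleet state reset followed by unified initialization.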
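
The second suggestion, mocking `initialize_model_parallel` so the test reaches the seed branch, can be sketched with `unittest.mock`. Everything below is a self-contained stand-in: `FakeParallelState`, `manual_seed`, and `init_parallel_state` are hypothetical simplifications that only mimic the control flow described in the report, not the code in base_fleet.py or the actual test file.

```python
from unittest import mock


class FakeParallelState:
    """Stand-in for paddlefleet.parallel_state with leftover global state."""

    _TENSOR_MODEL_PARALLEL_GROUP = None
    _GLOBAL_MEMORY_BUFFER = object()  # simulates a buffer left by earlier tests

    @classmethod
    def initialize_model_parallel(cls, hcg):
        assert cls._GLOBAL_MEMORY_BUFFER is None, (
            "global memory buffer is already initialized"
        )
        cls._TENSOR_MODEL_PARALLEL_GROUP = hcg


def manual_seed(seed):
    """Stand-in for the seed call that raises when seeds are already set."""
    raise AssertionError("seed already set")


def init_parallel_state(ps, hcg, seed_fn):
    """Simplified version of the flow under test (an assumption)."""
    if ps._TENSOR_MODEL_PARALLEL_GROUP is None:
        ps.initialize_model_parallel(hcg)
    try:
        seed_fn(1234)
    except AssertionError:
        pass  # the seed assertion is intentionally silenced
    return True


def check_seed_assertion_is_silenced():
    # Patch the initializer so the leftover buffer cannot mask the seed branch.
    with mock.patch.object(FakeParallelState, "initialize_model_parallel") as init:
        result = init_parallel_state(FakeParallelState, "hcg", manual_seed)
    init.assert_called_once_with("hcg")
    return result
```

Without the patch, the same call fails inside `initialize_model_parallel` before the seed branch runs, which mirrors the failure in the key log lines above.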

xiaoguoguo626807 had a problem deploying to Metax_ci June 11, 2026 07:01 — with GitHub Actions Error

support ep

f5a72c4

xiaoguoguo626807 force-pushed the fleet_graph branch from 89c6e7d to f5a72c4 Compare June 11, 2026 07:04

xiaoguoguo626807 had a problem deploying to Metax_ci June 11, 2026 07:05 — with GitHub Actions Failure

PaddlePaddle-bot suggested changes Jun 11, 2026

xiaoguoguo626807 wants to merge 1 commit into PaddlePaddle:develop from xiaoguoguo626807:fleet_graph

3 participants
