
fix: DPU Extension Service status only include used DPUs #2323

Open
hanyux-nv wants to merge 2 commits into NVIDIA:main from hanyux-nv:es_dual_dpu

Conversation

@hanyux-nv

Contributor

Description

This PR fixes a DPU Extension Service status aggregation nvbug on multi-DPU hosts. Previously, the aggregation considered every DPU associated with the instance, even a DPU (such as a secondary DPU) that is not being used. Because an unused DPU never reports extension services as Running, the overall DPU Extension Service status for the instance never reaches Running either, which prevents the instance status from reaching Ready.

This PR restricts the aggregation to the DPUs actually used by the instance, which are derived from the instance network config using the get_used_dpus helper. The filtering is applied before:

  • Instance deployment, when the state machine evaluates extension service readiness in the WaitingForExtensionServicesConfig state
  • Instance status queries, when tenants read instance status through an RPC call

It is not applied before:

  • The instance termination check. When the state machine waits for extension services to be Terminated in WaitingForNetworkReconfig, it still checks all DPUs, so nothing changes there.
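
As a rough illustration of the split described above, the sketch below filters the readiness check to used DPUs while leaving the termination check unfiltered. Everything here is an assumption made for illustration: the `ServiceStatus` enum, the string DPU IDs, the `is_ready` and `is_terminated` functions, and in particular the signature of `get_used_dpus`, which only borrows the name of the helper mentioned in this PR and does not reflect its real implementation.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical per-DPU extension service status (not the repository's type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ServiceStatus {
    Pending,
    Running,
    Terminated,
}

/// Hypothetical stand-in for the helper named in the PR. Here it simply
/// collects the DPU IDs referenced by the instance network config.
pub fn get_used_dpus(dpus_in_network_config: &[&str]) -> HashSet<String> {
    dpus_in_network_config.iter().map(|id| id.to_string()).collect()
}

/// Readiness check: only the used DPUs have to report Running.
/// A used DPU with no reported status counts as not ready.
/// In this sketch an empty `used` set is vacuously ready.
pub fn is_ready(statuses: &HashMap<String, ServiceStatus>, used: &HashSet<String>) -> bool {
    used.iter()
        .all(|id| statuses.get(id) == Some(&ServiceStatus::Running))
}

/// Termination check: unfiltered, every reported DPU must be Terminated.
pub fn is_terminated(statuses: &HashMap<String, ServiceStatus>) -> bool {
    statuses.values().all(|s| *s == ServiceStatus::Terminated)
}
```

With a primary DPU reporting Running and an unused secondary DPU reporting Terminated, `is_ready` returns true once the secondary DPU is filtered out, whereas including it in the set would keep the instance from ever becoming ready.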

The DPU agent side is unchanged. When a DPU is not being used, the DPU agent does not deploy any extension services configured for the instance, and it reports the status of those extension services as "terminated". This is because the agent cannot tell why it is unused and on the admin network: either 1) it has an associated instance but has not been configured by the tenant, or 2) the associated instance is being terminated.
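
The agent-side ambiguity can be sketched as follows. This is a hypothetical model, not the agent's actual code: `AdminReason`, `DpuState`, `ReportedStatus`, and `agent_report` are invented names, and the Pending and Running cases on the tenant network are assumptions added only to make the example complete.

```rust
/// Hypothetical reasons a DPU sits on the admin network.
/// The agent cannot observe which one applies.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AdminReason {
    NotConfiguredByTenant,
    InstanceTerminating,
}

/// Hypothetical model of the DPU's situation.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DpuState {
    OnAdminNetwork(AdminReason),
    OnTenantNetwork { services_running: bool },
}

/// Hypothetical status reported by the agent for extension services.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ReportedStatus {
    Pending,
    Running,
    Terminated,
}

/// Both admin-network cases collapse to the same report, which is why
/// the filtering has to happen on the aggregation side instead.
pub fn agent_report(state: &DpuState) -> ReportedStatus {
    match state {
        // The reason is ignored: no services are deployed either way.
        DpuState::OnAdminNetwork(_) => ReportedStatus::Terminated,
        DpuState::OnTenantNetwork { services_running: true } => ReportedStatus::Running,
        DpuState::OnTenantNetwork { services_running: false } => ReportedStatus::Pending,
    }
}
```

Because the two admin-network reasons are indistinguishable in the report, the aggregation side is the only place that can decide whether a "terminated" DPU should count against readiness.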

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Related Issues (Optional)

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

@hanyux-nv hanyux-nv requested a review from bcavnvidia June 9, 2026 04:54
@hanyux-nv hanyux-nv requested a review from a team as a code owner June 9, 2026 04:54
@coderabbitai

coderabbitai Bot commented Jun 9, 2026

Contributor

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7bc73ff7-11c8-40a3-87cc-578bcb1363b3

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Comment @coderabbitai help to get the list of available commands and usage tips.

@wminckler

wminckler commented Jun 9, 2026

Contributor

You might have a conflict, or an error in resolving conflicts (the build failed to compile).

Comment thread crates/api-model/src/instance/config/network.rs Outdated
Comment thread crates/api-model/src/instance/config/network.rs Outdated
Comment thread crates/machine-controller/src/handler.rs Outdated
Signed-off-by: Felicity Xu <hanyux@nvidia.com>
@hanyux-nv hanyux-nv force-pushed the es_dual_dpu branch 2 times, most recently from df89332 to ebb4497 Compare June 10, 2026 04:14

3 participants