Support AMD MIxxx double-die #1097

@benoit-cty

Description:

The AMD Instinct MI250 accelerator contains two Graphics Compute Dies (GCDs) per physical card. However, when monitoring power (e.g., via rocm-smi or tools like CodeCarbon), only one GCD reports power usage, while the other reports zero. This prevents accurate energy accounting, especially in HPC/SLURM environments where a job may be allocated a single GCD.

Expected Behavior:

Both GCDs on the same MI250 card should report their individual power consumption, or the total card power should be clearly attributed to the active GCD(s).

Current Behavior:

Only one GCD provides non-zero power readings.
The second GCD always reports 0 W, even when under load.
This leads to underestimated energy measurements and complicates per-job accounting.
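
To make the underestimation concrete, here is a minimal sketch of how a monitoring tool might integrate power samples into energy. The function name `energy_wh` and the sample values are illustrative assumptions, not measurements and not CodeCarbon's actual implementation.

```python
def energy_wh(power_samples_w, interval_s):
    """Approximate energy in watt-hours from equally spaced power samples.

    Each sample is assumed to hold for `interval_s` seconds.
    """
    joules = sum(power_samples_w) * interval_s
    return joules / 3600.0


# Illustrative numbers only: two samples taken 30 minutes apart.
reporting_gcd = [300.0, 300.0]  # the GCD that carries the power reading
silent_gcd = [0.0, 0.0]         # the GCD that always reports zero

print(energy_wh(reporting_gcd, 1800))  # energy seen by a job on the reporting GCD
print(energy_wh(silent_gcd, 1800))     # energy seen by a job on the silent GCD
```

A job pinned to the silent GCD would be accounted as consuming no energy at all, regardless of its actual load.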

Steps to Reproduce:

Allocate a single GCD on an MI250 card via SLURM (e.g., --gres=gpu:1).
Run a workload on the allocated GCD.
Use rocm-smi --showpower or similar tools to monitor power draw.
Observe that the second GCD (on the same card) reports 0 W, despite the card’s total power draw.
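
The observation in the last step could be automated with something like the sketch below. It assumes the power report is available as JSON (for example from `rocm-smi --showpower --json`) in the form of an object keyed by device name, where each value maps metric labels to strings. The exact labels differ between ROCm releases, so the sketch matches any label containing "power". The helper names and the sample string are hypothetical and do not reproduce real tool output.

```python
import json


def parse_power_json(text):
    """Return {device_name: watts or None} from a JSON power report."""
    report = json.loads(text)
    readings = {}
    for device, metrics in report.items():
        if not isinstance(metrics, dict):
            continue  # skip non-device entries
        watts = None
        for label, value in metrics.items():
            if "power" in label.lower():
                try:
                    watts = float(value)
                except (TypeError, ValueError):
                    watts = None  # e.g. a non-numeric placeholder
                break
        readings[device] = watts
    return readings


def silent_devices(readings):
    """Names of devices whose power reading is missing or zero."""
    return sorted(name for name, watts in readings.items() if not watts)


# Hypothetical report shape, for illustration only.
sample = '{"card0": {"Power (W)": "312.0"}, "card1": {"Power (W)": "0.0"}}'
print(silent_devices(parse_power_json(sample)))
```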

Impact:

Inaccurate energy tracking for jobs sharing a card.
Difficulty distinguishing per-GCD power usage.
Tools like CodeCarbon may misreport energy if they rely on per-GCD metrics.

Suggested Fix:

Provide a way to query total card power (sum of both GCDs) when monitoring a single GCD.
Alternatively, expose power readings for both GCDs, even if only one is allocated to a job.
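
As a rough illustration of the first option, the sketch below returns the total power of the card that hosts a given GCD. It assumes that consecutive device indices (0 and 1, 2 and 3, ...) belong to the same physical card. That pairing is an assumption and should be confirmed against the actual node topology. The function names are hypothetical.

```python
def card_of(gcd_index, gcds_per_card=2):
    """Index of the physical card hosting a GCD (assumes consecutive pairing)."""
    return gcd_index // gcds_per_card


def card_power_w(readings_w, gcd_index, gcds_per_card=2):
    """Sum the power readings of every GCD on the same card as `gcd_index`.

    `readings_w` maps GCD index to watts; missing GCDs count as 0.
    """
    first = card_of(gcd_index, gcds_per_card) * gcds_per_card
    siblings = range(first, first + gcds_per_card)
    return sum(readings_w.get(i, 0.0) for i in siblings)


# Illustrative readings: only the first GCD of each card reports power.
readings = {0: 310.0, 1: 0.0, 2: 95.0, 3: 0.0}
print(card_power_w(readings, 1))  # card total as seen from the silent GCD 1
```

Note that this reports card-level power. If two jobs share a card, each would see the same total, so splitting it between them still requires a separate attribution rule.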

Context:

This issue affects users in SLURM/HPC environments where fine-grained energy monitoring is critical for carbon footprint tracking and resource management.

Additional Notes:

The MI300 series may have similar behavior; clarification would be helpful.
Workarounds (e.g., manually summing GCDs) are error-prone and not scalable.

Labels: enhancement
