bug: DPU reprovisioning stuck in Reprovisioning/PowerDown due to power manager race #1958

@williampnvidia

Version

v0.10.0-rc01-0-g7251824d

Describe the bug.

Summary

A host doing automatic DPU reprovisioning was stuck in Reprovisioning/PowerDown for ~14 hours.

The issue appears to be a race between DPU reprovisioning and power manager:

  • DPU reprovisioning needs the host powered off.
  • Power manager sees the host off and powers it back on because desired state is On.
  • DPU reprovisioning powers it off again.
  • The host loops and does not progress.
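The loop above can be reproduced with a toy model. This is only a sketch under stated assumptions: `Host`, `reprovision_tick`, `power_manager_tick`, and `simulate` are hypothetical names and do not come from the Carbide code base. The real controllers run on their own schedules, and the model treats Reprovisioning/VerifyFirmwareVersions as the next stage only because that is the stage this report observed after the workaround.

```python
from dataclasses import dataclass, field

POWER_DOWN = "Reprovisioning/PowerDown"
# Stage observed in this report after the workaround; any intermediate
# stages are omitted from this toy model.
NEXT_STAGE = "Reprovisioning/VerifyFirmwareVersions"


@dataclass
class Host:
    powered_on: bool = True
    desired_on: bool = True  # power manager's desired state is On
    stage: str = POWER_DOWN
    history: list = field(default_factory=list)


def reprovision_tick(host: Host) -> None:
    """Hypothetical reprovisioning step: it must observe the host powered off."""
    if host.stage != POWER_DOWN:
        return
    if host.powered_on:
        host.powered_on = False
        host.history.append("reprovision: power off")
    else:
        host.stage = NEXT_STAGE
        host.history.append("reprovision: advance")


def power_manager_tick(host: Host) -> None:
    """Hypothetical power manager step: enforce the desired power state."""
    if host.desired_on and not host.powered_on:
        host.powered_on = True
        host.history.append("power manager: power on")


def simulate(ticks: int, power_manager_enabled: bool) -> Host:
    """Alternate the two controllers for a fixed number of ticks."""
    host = Host()
    for _ in range(ticks):
        reprovision_tick(host)
        if power_manager_enabled:
            power_manager_tick(host)
    return host
```

In this model, `simulate(10, power_manager_enabled=True)` leaves the host in Reprovisioning/PowerDown with an alternating off/on history, while `simulate(10, power_manager_enabled=False)` advances after two ticks.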

Affected Host

Site: ytl-shard-1
Host: fm100ht9qp740o420uj6ampj1tslals9rbivj8550rua1hci3r1uu3pv6hg
DPU: fm100dsq65953kegar73t1jv2div28u6l36gniejjg9ju5636n0mcdv2lg0

Observed Behavior

Host remained in:

Reprovisioning/PowerDown

The BMC console showed the host trying to PXE boot, while Carbide reported it could not continue booting due to invalid state.

Developer log analysis indicated the state machine was waiting for DPUs to come up, while power manager was interfering with the reprovisioning power sequence.

Expected Behavior

During DPU reprovisioning, power manager should not interfere with the host power sequence owned by the reprovisioning state machine.

Host reprovisioning already appears to have this protection; DPU reprovisioning likely needs the same treatment.
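One possible shape for that protection, as a minimal sketch: the power manager skips any host whose current state belongs to a flow that owns the power sequence. The function name `should_power_on`, the `POWER_OWNING_PREFIXES` tuple, and the `"Ready"` state used in the usage note are assumptions for illustration. How host reprovisioning is actually protected in Carbide is not shown in this report.

```python
# Assumption: states under these prefixes are driven by a state machine that
# owns the host power sequence, so the power manager must not touch them.
POWER_OWNING_PREFIXES = ("Reprovisioning/",)


def should_power_on(state: str, powered_on: bool, desired_on: bool) -> bool:
    """Return True only when the power manager may power the host on.

    Hypothetical guard: the power-owning check runs first, so a host that is
    off on purpose during reprovisioning is left alone.
    """
    if state.startswith(POWER_OWNING_PREFIXES):
        return False
    return desired_on and not powered_on
```

With this guard, `should_power_on("Reprovisioning/PowerDown", False, True)` is `False`, while a powered-off host in a hypothetical `"Ready"` state with desired state On is still powered on.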

Workaround

Disabling power manager allowed the host to move forward:

carbide-admin-cli mh power-options update fm100ht9qp740o420uj6ampj1tslals9rbivj8550rua1hci3r1uu3pv6hg \
  --desired-power-state power-manager-disabled

After this, the host progressed to:

Reprovisioning/VerifyFirmwareVersions

Power manager should be re-enabled after reprovisioning completes:

carbide-admin-cli mh power-options update fm100ht9qp740o420uj6ampj1tslals9rbivj8550rua1hci3r1uu3pv6hg \
  --desired-power-state on
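If the workaround has to be applied repeatedly, the two commands can be paired so that the re-enable step is not forgotten. The sketch below uses only the subcommand and flag values quoted in this report. The helper names (`power_options_cmd`, `run_cli`, `power_manager_disabled`) are hypothetical, and restoring the desired state to `on` assumes it was `on` before the workaround. How to wait for reprovisioning to finish is not covered here and is left to the caller.

```python
import subprocess
from contextlib import contextmanager
from typing import Callable, Iterator, List

Runner = Callable[[List[str]], None]


def power_options_cmd(host_id: str, desired_state: str) -> List[str]:
    """Build the argv for the power-options update command quoted in this issue."""
    return [
        "carbide-admin-cli", "mh", "power-options", "update", host_id,
        "--desired-power-state", desired_state,
    ]


def run_cli(argv: List[str]) -> None:
    """Default runner: execute the command and raise if it fails."""
    subprocess.run(argv, check=True)


@contextmanager
def power_manager_disabled(host_id: str, run: Runner = run_cli) -> Iterator[None]:
    """Disable power manager for the duration of the `with` body.

    Assumption: the previous desired state was `on`, so `on` is restored,
    even if the body raises.
    """
    run(power_options_cmd(host_id, "power-manager-disabled"))
    try:
        yield
    finally:
        run(power_options_cmd(host_id, "on"))
```

The body of the `with` block would hold whatever step waits for reprovisioning to complete. Passing a different `run` callable, for example `print`, gives a dry run that only shows the commands.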

Acceptance Criteria

  • DPU reprovisioning does not get stuck in Reprovisioning/PowerDown.
  • Power manager does not power the host back on while DPU reprovisioning owns the power flow.
  • DPU reprovisioning has equivalent power-manager protection to host reprovisioning.
  • Add test coverage for this power-manager interaction.

Minimum reproducible example

Reprovision a DPU on an SMC x86 host in ytl-shard-1 or ytl.

Relevant log output

Other/Misc.

No response

Code of Conduct

  • I agree to follow NCX Infra Controller's Code of Conduct
  • I have searched the open bugs and have found no duplicates for this bug report

Metadata

Assignees: No one assigned
Labels: bug (a defect in existing software)
Status: Triage
Milestone: No milestone