Version
v0.10.0-rc01-0-g7251824d
Describe the bug.
Summary
A host doing automatic DPU reprovisioning was stuck in Reprovisioning/PowerDown for ~14 hours.
The issue appears to be a race between DPU reprovisioning and power manager:
- DPU reprovisioning needs the host powered off.
- Power manager sees the host off and powers it back on because the desired state is On.
- DPU reprovisioning powers it off again.
- The host loops and does not progress.
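The suspected loop can be sketched as a toy simulation. This is a hypothetical model, not Carbide code: the `Host` class, both tick functions, the strict alternation between the two actors, and the direct jump from PowerDown to VerifyFirmwareVersions are all assumptions made for illustration. Only the two state names and the two desired-power-state values are taken from this report.

```python
# Hypothetical model of the suspected race; none of these names come from Carbide.
from dataclasses import dataclass


@dataclass
class Host:
    powered_on: bool = True
    desired_power_state: str = "on"  # value accepted by the CLI in this report
    state: str = "Reprovisioning/PowerDown"
    power_on_count: int = 0  # how often power manager turned the host back on


def reprovisioning_tick(host: Host) -> None:
    """DPU reprovisioning: needs to observe the host off before advancing."""
    if host.state != "Reprovisioning/PowerDown":
        return
    if host.powered_on:
        host.powered_on = False  # power off, re-check on the next tick
    else:
        # Simplification: intermediate states are collapsed into one step.
        host.state = "Reprovisioning/VerifyFirmwareVersions"


def power_manager_tick(host: Host) -> None:
    """Power manager: reconciles toward the desired state with no guard."""
    if host.desired_power_state == "on" and not host.powered_on:
        host.powered_on = True
        host.power_on_count += 1


def simulate(host: Host, ticks: int) -> Host:
    """Alternate the two actors, reprovisioning first (assumed ordering)."""
    for _ in range(ticks):
        reprovisioning_tick(host)
        power_manager_tick(host)
    return host
```

Under these assumptions, `simulate(Host(), 10)` never leaves `Reprovisioning/PowerDown`, while `simulate(Host(desired_power_state="power-manager-disabled"), 10)` advances, which is consistent with the workaround described below.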
Affected Host
Site: ytl-shard-1
Host: fm100ht9qp740o420uj6ampj1tslals9rbivj8550rua1hci3r1uu3pv6hg
DPU: fm100dsq65953kegar73t1jv2div28u6l36gniejjg9ju5636n0mcdv2lg0
Observed Behavior
Host remained in:
Reprovisioning/PowerDown
The BMC console showed the host trying to PXE boot, while Carbide reported it could not continue booting due to invalid state.
Developer log analysis indicated the state machine was waiting for DPUs to come up, while power manager was interfering with the reprovisioning power sequence.
Expected Behavior
During DPU reprovisioning, power manager should not interfere with the host power sequence owned by the reprovisioning state machine.
Host reprovisioning already appears to have this protection; DPU reprovisioning likely needs the same treatment.
Workaround
Disabling power manager allowed the host to move forward:
carbide-admin-cli mh power-options update fm100ht9qp740o420uj6ampj1tslals9rbivj8550rua1hci3r1uu3pv6hg \
  --desired-power-state power-manager-disabled
After this, the host progressed to:
Reprovisioning/VerifyFirmwareVersions
Power manager should be re-enabled after reprovisioning completes:
carbide-admin-cli mh power-options update fm100ht9qp740o420uj6ampj1tslals9rbivj8550rua1hci3r1uu3pv6hg \
  --desired-power-state on
Acceptance Criteria
- DPU reprovisioning does not get stuck in Reprovisioning/PowerDown.
- Power manager does not power the host back on while DPU reprovisioning owns the power flow.
- DPU reprovisioning has equivalent power-manager protection to host reprovisioning.
- Add test coverage for this power-manager interaction.
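One possible shape for the requested test coverage is sketched below. It is only an illustration of the criteria above: `FakeBmc`, `reconcile_power`, the prefix check on the state name, and the `Ready` state are hypothetical and do not reflect Carbide's actual types or its existing host-reprovisioning guard.

```python
import unittest


class FakeBmc:
    """Records power commands instead of talking to hardware (hypothetical)."""

    def __init__(self, powered_on: bool) -> None:
        self.powered_on = powered_on
        self.commands = []

    def power_on(self) -> None:
        self.powered_on = True
        self.commands.append("on")


def reconcile_power(bmc: FakeBmc, host_state: str, desired_power_state: str) -> None:
    """One power manager pass with the proposed guard (assumed shape)."""
    # Assumption: every "Reprovisioning/..." sub-state means the
    # reprovisioning state machine owns the power flow.
    if host_state.startswith("Reprovisioning/"):
        return
    if desired_power_state == "on" and not bmc.powered_on:
        bmc.power_on()


class PowerManagerDuringDpuReprovisioning(unittest.TestCase):
    def test_no_power_on_while_in_power_down(self):
        bmc = FakeBmc(powered_on=False)
        reconcile_power(bmc, "Reprovisioning/PowerDown", "on")
        self.assertEqual(bmc.commands, [])
        self.assertFalse(bmc.powered_on)

    def test_power_on_outside_reprovisioning(self):
        bmc = FakeBmc(powered_on=False)
        reconcile_power(bmc, "Ready", "on")  # "Ready" is an invented state name
        self.assertEqual(bmc.commands, ["on"])
```

The first test encodes the second acceptance criterion directly; the second checks that the guard does not suppress normal reconciliation outside reprovisioning.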
Minimum reproducible example
Reprovision a DPU on an SMC x86 host in ytl-shard-1 or ytl.
Relevant log output
Other/Misc.
No response
Code of Conduct