You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Is this a new feature, an enhancement, or a change to existing functionality?
Change
How would you describe the priority of this feature request
Medium
Please provide a clear description of problem this feature solves
Summary
Automatic DPU reprovisioning can hit DPU BFB installer defects, particularly in DOCA 3.2.2 where DPU Golden Image update times out. In affected cases, hosts and DPUs get stuck in VerifyFirmareVersions state with NIC FW update failures.
The state-controller problem is that current DPU remediation reboots don't fix this failure mode and instead delay the path to external remediation up to 6h. For this defect class, the controller should stop retrying reboot based remediation and allow the workflow to park so external remediation can proceed immediately after a single host power cycle.
Observed Behavior
BF3 systems upgraded to 3.2.2 were stuck in ingestion at WaitingForPlatformConfiguration, with NIC FW update failures logged during reprovision
After automatic DPU reprovision on 3.2.2, some hosts were stuck in VerifyFirmareVersions state.
Repeated DPU remediation reboots don't recover these hosts. They only elongate time to remediation before an operator can perform the known manual recovery flow.
Feature Description
Expected Behavior
Carbide state-controller should not keep issuing DPU remediation reboots that don't change outcome
After a single host power cycle, the workflow should transition to a parked state, allowing external-remediation path instead of continuing reboot-based retries
Is this a new feature, an enhancement, or a change to existing functionality?
Change
How would you describe the priority of this feature request
Medium
Please provide a clear description of problem this feature solves
Summary
Automatic DPU reprovisioning can hit DPU BFB installer defects, particularly in DOCA 3.2.2 where DPU Golden Image update times out. In affected cases, hosts and DPUs get stuck in VerifyFirmareVersions state with NIC FW update failures.
The state-controller problem is that current DPU remediation reboots don't fix this failure mode and instead delay the path to external remediation up to 6h. For this defect class, the controller should stop retrying reboot based remediation and allow the workflow to park so external remediation can proceed immediately after a single host power cycle.
Observed Behavior
Repeated DPU remediation reboots don't recover these hosts. They only elongate time to remediation before an operator can perform the known manual recovery flow.
Feature Description
Expected Behavior
Describe your ideal solution
No response
Describe any alternatives you have considered
No response
Additional context
No response
Code of Conduct