fix: stop DPU remediation reboot retries #1968

@akorobkov-nvda

Description

Is this a new feature, an enhancement, or a change to existing functionality?

Change

How would you describe the priority of this feature request

Medium

Please provide a clear description of problem this feature solves

Summary

Automatic DPU reprovisioning can hit DPU BFB installer defects, particularly in DOCA 3.2.2, where the DPU Golden Image update times out. In affected cases, hosts and DPUs get stuck in the VerifyFirmareVersions state with NIC FW update failures.

The state-controller problem is that the current DPU remediation reboots don't fix this failure mode; they only delay the path to external remediation by up to 6 hours. For this defect class, the controller should stop retrying reboot-based remediation and let the workflow park after a single host power cycle, so that external remediation can proceed immediately.

Observed Behavior

  1. BF3 systems upgraded to 3.2.2 were stuck in ingestion at WaitingForPlatformConfiguration, with NIC FW update failures logged during reprovisioning.
  2. After automatic DPU reprovisioning on 3.2.2, some hosts were stuck in the VerifyFirmareVersions state.

Repeated DPU remediation reboots don't recover these hosts. They only lengthen the time before an operator can perform the known manual recovery flow.

Feature Description

Expected Behavior

  • The Carbide state-controller should not keep issuing DPU remediation reboots that don't change the outcome.
  • After a single host power cycle, the workflow should transition to a parked state that opens the external-remediation path, instead of continuing reboot-based retries.
  • External remediation should be possible without waiting hours for repeated controller retries. See feat: Add setting to automatically reprovision DPUs if they're stuck after SLA #993
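
The expected behavior above could be sketched roughly as follows. This is only an illustration of the requested policy, not Carbide's actual state-controller code: `RemediationState`, `Action`, `next_action`, and `MAX_HOST_POWER_CYCLES` are hypothetical names made up for this example.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Action(Enum):
    """Hypothetical actions the controller could take for a stuck host."""
    HOST_POWER_CYCLE = auto()
    PARK_FOR_EXTERNAL_REMEDIATION = auto()


@dataclass
class RemediationState:
    """Hypothetical per-host bookkeeping; not taken from the Carbide code base."""
    host_power_cycles: int = 0
    parked: bool = False


# "A single host power cycle", as requested in the expected behavior.
MAX_HOST_POWER_CYCLES = 1


def next_action(state: RemediationState) -> Action:
    """Pick the next step for a host stuck in this failure mode.

    The first call issues one host power cycle; every later call parks the
    workflow instead of issuing further reboot-based retries.
    """
    if state.host_power_cycles < MAX_HOST_POWER_CYCLES:
        state.host_power_cycles += 1
        return Action.HOST_POWER_CYCLE
    state.parked = True
    return Action.PARK_FOR_EXTERNAL_REMEDIATION
```

With this policy, a stuck host gets exactly one power cycle and is then parked on every subsequent evaluation, so the time to external remediation is bounded by one retry rather than by a retry window.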

Describe your ideal solution

No response

Describe any alternatives you have considered

No response

Additional context

No response

Code of Conduct

  • I agree to follow NCX Infra Controller's Code of Conduct
  • I have searched the open feature requests and have found no duplicates for this feature request

Metadata

Labels

feature: Feature (deprecated - use issue type, but it's needed for reporting now)

Projects

Status

Backlog

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests