bug: Machine enters Error during automatic DPU firmware update instead of Initializing #1956

@williampnvidia

Version

v0.10.0-rc01-0-g7251824d

Describe the bug.

Summary

After an instance was deleted, the Machine returned to Ready and then started an automatic DPU firmware update. During that update, the Machine moved to Error.

The Machine health shows a HostUpdateInProgress alert for DpuFirmware with PreventAllocations. The current REST status derivation appears to treat any PreventAllocations alert as Error, but this specific alert represents an in-progress automatic DPU firmware update.

Observed Behavior

Machine status becomes:

Error

Expected Behavior

Machine should report Initializing while the automatic DPU firmware update is running, rather than Error.

The PreventAllocations classification should still prevent new allocations while the update is active.

Likely Cause

REST machine status derivation treats any health alert with PreventAllocations as Error.

This is valid for many failure cases, but HostUpdateInProgress with target DpuFirmware is an expected update workflow state.

Acceptance Criteria

  • A Machine with HostUpdateInProgress / DpuFirmware / AutomaticDpuFirmwareUpdate health maps to Initializing, not
    Error.
  • The Machine remains unavailable for new allocation while the PreventAllocations alert is active.
  • Other health alerts with PreventAllocations still map to Error.
  • Test coverage includes the observed payload, including the paired HeartbeatTimeout alert.
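The criteria above can be sketched as a small status-derivation function. This is an illustration only, not the controller's actual code: the `HealthAlert` shape, its field names (`id`, `target`, `classifications`), the helpers `derive_status` and `is_allocatable`, and the `"Ready"` fallback are all assumptions made for the example. The paired HeartbeatTimeout alert is left out because its classifications are not shown in this report.

```python
from dataclasses import dataclass, field
from typing import List, Optional

PREVENT_ALLOCATIONS = "PreventAllocations"


@dataclass
class HealthAlert:
    # Hypothetical alert shape; the real payload fields may differ.
    id: str
    target: Optional[str] = None
    classifications: List[str] = field(default_factory=list)


def is_dpu_firmware_update(alert: HealthAlert) -> bool:
    # The expected workflow state described in this issue.
    return alert.id == "HostUpdateInProgress" and alert.target == "DpuFirmware"


def derive_status(alerts: List[HealthAlert], base_status: str = "Ready") -> str:
    # Only alerts that carry PreventAllocations influence the result here.
    blocking = [a for a in alerts if PREVENT_ALLOCATIONS in a.classifications]
    if not blocking:
        return base_status
    # Any blocking alert other than the firmware update still means Error.
    if any(not is_dpu_firmware_update(a) for a in blocking):
        return "Error"
    return "Initializing"


def is_allocatable(alerts: List[HealthAlert]) -> bool:
    # Allocation gating is independent of the displayed status.
    return not any(PREVENT_ALLOCATIONS in a.classifications for a in alerts)
```

Keeping `is_allocatable` separate from `derive_status` is one way to make sure that changing the displayed status to Initializing does not accidentally make the Machine allocatable while the update is running.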

Metadata

Labels

bug: A defect in existing software (deprecated - use issue type, but it's needed for reporting now)

Projects

Status

Code Complete

Milestone

No milestone
