feat: add automatic BIOS setup recovery + refactor #1931

Merged
krish-nvidia merged 9 commits into NVIDIA:main from krish-nvidia:retry-machine-setup on May 28, 2026

Conversation

Contributor

@krish-nvidia krish-nvidia commented May 26, 2026

Description

This PR adds bounded recovery for hosts stuck in BIOS setup polling during ingestion and assigned host platform configuration. PollingBiosSetup now tracks a retry count; if BIOS setup remains incomplete for more than the default 15-minute threshold, the state machine enters HandleBiosJobFailure, powers the host off, resets the BMC, powers back on, and re-runs machine_setup.
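
The escalation described above can be sketched roughly as follows. This is a minimal illustration, not the code in bios_config.rs: `PollOutcome`, `poll_bios_setup`, and `RECOVERY_STEPS` are hypothetical names, and the sketch assumes the time spent in PollingBiosSetup is compared directly against the threshold.

```rust
use std::time::Duration;

/// Hypothetical result of one PollingBiosSetup tick (illustrative names only).
#[derive(Debug, PartialEq)]
enum PollOutcome {
    /// BIOS setup reported complete; continue with the next step.
    Done,
    /// Not complete yet, but still within the stuck threshold.
    KeepPolling,
    /// Stuck past the threshold; escalate into HandleBiosJobFailure.
    Recover { retry_count: u32 },
}

/// Decide what a polling tick should do.
/// `bios_setup_done` stands in for the result of is_bios_setup().
fn poll_bios_setup(
    bios_setup_done: bool,
    time_in_state: Duration,
    stuck_threshold: Duration,
    retry_count: u32,
) -> PollOutcome {
    if bios_setup_done {
        return PollOutcome::Done;
    }
    if time_in_state > stuck_threshold {
        // Assumption: each escalation consumes one retry.
        PollOutcome::Recover { retry_count: retry_count + 1 }
    } else {
        PollOutcome::KeepPolling
    }
}

/// Recovery steps in the order the description lists them.
const RECOVERY_STEPS: [&str; 4] = [
    "power off host",
    "reset BMC",
    "power on host",
    "re-run machine_setup",
];
```

With the default 15-minute threshold, a host that has reported incomplete BIOS setup for 16 minutes would escalate, while one at 5 minutes would keep polling.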

Dell BIOS job failures and non-Dell stuck PollingBiosSetup recovery now share the same retry budget, so both paths fail consistently once automated recovery is exhausted. When that happens, the machine moves to BiosSetupFailed. From that failed state it automatically resumes to boot-order configuration once is_bios_setup() reports success, which lets operators remediate the host manually and have it continue.
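
A rough sketch of the shared budget and the failed-state resume, under stated assumptions: the state names HandleBiosJobFailure and BiosSetupFailed come from the description, while `BiosState`, `FailureSource`, `ConfigureBootOrder`, and both functions are hypothetical stand-ins. It also assumes a budget of N allows exactly N recovery cycles.

```rust
/// Simplified, hypothetical view of the states involved.
#[derive(Debug, Clone, PartialEq)]
enum BiosState {
    HandleBiosJobFailure { retry_count: u32 },
    BiosSetupFailed,
    /// Stand-in for the boot-order configuration step.
    ConfigureBootOrder,
}

/// Which path asked for recovery. Both draw from the same budget.
#[derive(Debug, Clone, Copy)]
enum FailureSource {
    DellBiosJob,
    StuckPolling,
}

/// Start another recovery cycle, or give up once the budget is spent.
/// The source is deliberately ignored: that is what makes the budget shared.
fn on_recovery_requested(_source: FailureSource, retries_used: u32, max_retries: u32) -> BiosState {
    if retries_used < max_retries {
        BiosState::HandleBiosJobFailure { retry_count: retries_used + 1 }
    } else {
        BiosState::BiosSetupFailed
    }
}

/// While in BiosSetupFailed, resume as soon as BIOS setup reports success
/// (`bios_setup_done` stands in for is_bios_setup()).
fn on_failed_tick(bios_setup_done: bool) -> BiosState {
    if bios_setup_done {
        BiosState::ConfigureBootOrder
    } else {
        BiosState::BiosSetupFailed
    }
}
```

Because the source is not consulted, a Dell job failure after three stuck-polling recoveries fails the same way as a fourth stuck-polling escalation would.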

The following values are configurable in nico-config under [machine_state_controller]:

  • max_bios_config_retries (default 3): max HandleBiosJobFailure recovery cycles before the host is moved to Failed.
  • polling_bios_setup_stuck_threshold (default 15m): how long PollingBiosSetup may sit on is_bios_setup() == false before escalating into recovery.
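
For orientation, the two settings and their defaults could be modeled like this. The field names and default values are taken from the list above; the struct name `BiosRecoveryConfig` and the use of `std::time::Duration` are assumptions, and the real config type in the repository may look different.

```rust
use std::time::Duration;

/// Hypothetical holder for the two settings under [machine_state_controller].
#[derive(Debug, Clone, PartialEq)]
struct BiosRecoveryConfig {
    /// Max HandleBiosJobFailure recovery cycles before giving up.
    max_bios_config_retries: u32,
    /// How long PollingBiosSetup may see is_bios_setup() == false
    /// before escalating into recovery.
    polling_bios_setup_stuck_threshold: Duration,
}

impl Default for BiosRecoveryConfig {
    fn default() -> Self {
        Self {
            max_bios_config_retries: 3,
            // 15m, expressed in seconds.
            polling_bios_setup_stuck_threshold: Duration::from_secs(15 * 60),
        }
    }
}
```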

This also refactors the BIOS setup/job handling logic out of handler.rs into crates/api/src/state_controller/machine/handler/bios_config.rs, keeping machine setup, Dell BIOS job polling, stuck polling escalation, and failed-state recovery in one focused module.

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Related Issues (Optional)

#1846

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

Signed-off-by: Krish Dandiwala <kdandiwala@nvidia.com>
@krish-nvidia krish-nvidia requested review from a team and Coco-Ben as code owners May 26, 2026 19:40
Contributor

@poroh poroh left a comment

Only nits that are nice to have.

9 comment threads on crates/api/src/state_controller/machine/handler/bios_config.rs (all outdated)
Collaborator

ajf commented May 26, 2026

@krish-nvidia is this the fix for #1846? If so can you add it to your commit message so we get the link properly?

@krish-nvidia
Contributor Author

> is this the fix for #1846? If so can you add it to your commit message so we get the link properly?

Yes it is, and I've already added it to the PR description (it also shows up in the GitHub issue). Is there something else I have to do?

Signed-off-by: Krish Dandiwala <kdandiwala@nvidia.com>
@krish-nvidia krish-nvidia requested a review from poroh May 27, 2026 13:31
Comment thread crates/api/src/state_controller/machine/handler/bios_config.rs Outdated
Signed-off-by: Krish Dandiwala <kdandiwala@nvidia.com>
@krish-nvidia krish-nvidia merged commit 6bd445a into NVIDIA:main May 28, 2026
54 checks passed
@krish-nvidia krish-nvidia deleted the retry-machine-setup branch May 28, 2026 13:57