Skip to content

chore(api): drop gating from the allow_zero_dpu_hosts flag#1976

Open
chet wants to merge 1 commit into
NVIDIA:mainfrom
chet:drop-allow-zero-dpu-hosts
Open

chore(api): drop gating from the allow_zero_dpu_hosts flag#1976
chet wants to merge 1 commit into
NVIDIA:mainfrom
chet:drop-allow-zero-dpu-hosts

Conversation

@chet
Copy link
Copy Markdown
Contributor

@chet chet commented May 28, 2026

Description

This rips out the allow_zero_dpu_hosts flag entirely. With per-host and site-wide dpu_mode available, we shouldn't need to say "allow zero DPU hosts" -- we should simply express what ExpectedMachines are expected to be DpuMode::NoDpu (or ::NicMode, or ::DpuMode), which we can also set at the site level now too.

For example we explicitly say:

  • If an ExpectedMachine is marked DpuMode::DpuMode (per-host or site-level default), we expect DPUs; if site-explorer doesn't find any, it logs a warning and skips ingestion.
  • If an ExpectedMachine is marked as DpuMode::NoDpu, we DON'T expect DPUs; if the BMC reports some, site-explorer ignores them and ingests the host as zero-DPU.
  • If an ExpectedMachine is marked as DpuMode::NicMode, site-explorer effectively ingests the host as zero-DPU; when it explores a NicMode host with DPUs, it will set_nic_mode on them as needed, naturally removing them from the DPU list across exploration cycles until all DPUs are reporting NicMode.

...no need for saying allow_zero_dpu_hosts. If an ExpectedMachine isn't marked NoDpu, then we don't expect it to be. If it is, then we do.

Any configs out there that still have allow_zero_dpu_hosts = true will keep parsing fine -- it just won't do anything going forward; setting dpu_mode: no_dpu or dpu_mode: nic_mode (or dpu_mode: dpu_mode) is all that needs to be set on the ExpectedMachine (or site-level config).

Went through existing tests to make sure we've got good coverage (and added some). Table below:

Case Action Coverage
Zero-DPU host w/ ::NoDpu set ingest as zero-DPU test_site_explorer_ingests_no_dpu_host (and test_site_explorer_main)
Zero-DPU host, w/ ::DpuMode set skip/warn/metric test_site_explorer_skips_unexpected_zero_dpu_host
Zero-DPU host w/ ::NicMode set ingest as zero-DPU test_site_explorer_ingests_nic_mode_host_with_no_observed_dpus
NicMode set w/ DPU still in DPU mode fix via set_nic_mode test_site_explorer_auto_corrects_nic_mode_per_expected_machine
Err(NoDpu) + expected_dpu_count == 0 return Ok(None) handle_no_dpu_error_treats_as_ok_on_zero_dpu_host
Err(NoDpu) + expected_dpu_count > 0 return the error handle_no_dpu_error_surfaces_when_dpus_were_expected
BMC call succeeded (any expected_dpu_count) return the success untouched handle_no_dpu_error_passes_through_success
BMC failed for a non-NoDpu reason (e.g. lockdown) return the error untouched (only NoDpu is ever swallowed) handle_no_dpu_error_does_not_touch_other_errors
Integration e2e for Zero-DPU (machine-a-tron) (None) test_machine_a_tron_singledpu_nic_mode + integration tests

Signed-off-by: Chet Nichols III chetn@nvidia.com

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Related Issues (Optional)

Breaking Changes

  • This PR contains breaking changes

Technically it just makes it so dpu_mode: no_dpu works now without needing to also set allow_zero_dpu_hosts: true.

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

This rips out the `allow_zero_dpu_hosts` flag entirely. With per-host and site-wide `dpu_mode` available, we shouldn't need to say "*allow zero DPU hosts*" -- we should simply express what `ExpectedMachines` are *expected* to be `DpuMode::NoDpu` (or `::NicMode`, or `::DpuMode`), which we can also set at the site level now too.

For example we explicitly say:
- If an `ExpectedMachine` is marked `DpuMode::DpuMode` (per-host or site-level default), we *expect* DPUs; if `site-explorer` doesn't find any, it logs a warning and skips ingestion.
- If an `ExpectedMachine` is marked as `DpuMode::NoDpu`, we *DON'T* expect DPUs; if the BMC reports some, `site-explorer` ignores them and ingests the host as zero-DPU.
- If an `ExpectedMachine` is marked as `DpuMode::NicMode`, `site-explorer` effectively ingests the host as zero-DPU; when it explores a `NicMode` host with DPUs, it will `set_nic_mode` on them as needed, naturally removing them from the DPU list across exploration cycles until all DPUs are reporting `NicMode`.

...no need for saying `allow_zero_dpu_hosts`. If an `ExpectedMachine` isn't marked `NoDpu`, then we don't expect it to be. If it is, then we do.

Any configs out there that still have `allow_zero_dpu_hosts = true` will keep parsing fine -- it just won't do anything going forward; setting `dpu_mode: no_dpu` or `dpu_mode: nic_mode` (or `dpu_mode: dpu_mode`) is all that needs to be set on the `ExpectedMachine` (or site-level config).

Went through existing tests to make sure we've got good coverage (and added some). Table below:

| Case                               | Action                                  | Coverage                                                                  |
|----------------------------------------------|--------------------------------------------------------|---------------------------------------------------------------------------|
| Zero-DPU host w/ `::NoDpu` set | ingest as zero-DPU             | `test_site_explorer_ingests_no_dpu_host` (and `test_site_explorer_main`) |
| Zero-DPU host, w/ `::DpuMode` set | skip/warn/metric | `test_site_explorer_skips_unexpected_zero_dpu_host`                      |
| Zero-DPU host w/ `::NicMode` set | ingest as zero-DPU | `test_site_explorer_ingests_nic_mode_host_with_no_observed_dpus`       |
| `NicMode` set w/ DPU still in DPU mode   | fix via `set_nic_mode` | `test_site_explorer_auto_corrects_nic_mode_per_expected_machine` |
| `Err(NoDpu) + expected_dpu_count == 0`       | return `Ok(None)`     | `handle_no_dpu_error_treats_as_ok_on_zero_dpu_host`                     |
| `Err(NoDpu) + expected_dpu_count > 0`        | return the error | `handle_no_dpu_error_surfaces_when_dpus_were_expected`                |
| BMC call succeeded (any `expected_dpu_count`) | return the success untouched | `handle_no_dpu_error_passes_through_success`                          |
| BMC failed for a non-`NoDpu` reason (e.g. lockdown) | return the error untouched (only `NoDpu` is ever swallowed) | `handle_no_dpu_error_does_not_touch_other_errors`                     |
| Integration e2e for Zero-DPU (machine-a-tron) | (None)             | `test_machine_a_tron_singledpu_nic_mode` + integration tests             |

Signed-off-by: Chet Nichols III <chetn@nvidia.com>
@chet chet requested a review from a team as a code owner May 28, 2026 05:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant