chore(api): drop gating from the allow_zero_dpu_hosts flag#1976
Open
chet wants to merge 1 commit into
Open
Conversation
This rips out the `allow_zero_dpu_hosts` flag entirely. With per-host and site-wide `dpu_mode` available, we shouldn't need to say "*allow zero DPU hosts*" -- we should simply express what `ExpectedMachines` are *expected* to be `DpuMode::NoDpu` (or `::NicMode`, or `::DpuMode`), which we can also set at the site level now too. For example we explicitly say: - If an `ExpectedMachine` is marked `DpuMode::DpuMode` (per-host or site-level default), we *expect* DPUs; if `site-explorer` doesn't find any, it logs a warning and skips ingestion. - If an `ExpectedMachine` is marked as `DpuMode::NoDpu`, we *DON'T* expect DPUs; if the BMC reports some, `site-explorer` ignores them and ingests the host as zero-DPU. - If an `ExpectedMachine` is marked as `DpuMode::NicMode`, `site-explorer` effectively ingests the host as zero-DPU; when it explores a `NicMode` host with DPUs, it will `set_nic_mode` on them as needed, naturally removing them from the DPU list across exploration cycles until all DPUs are reporting `NicMode`. ...no need for saying `allow_zero_dpu_hosts`. If an `ExpectedMachine` isn't marked `NoDpu`, then we don't expect it to be. If it is, then we do. Any configs out there that still have `allow_zero_dpu_hosts = true` will keep parsing fine -- it just won't do anything going forward; setting `dpu_mode: no_dpu` or `dpu_mode: nic_mode` (or `dpu_mode: dpu_mode`) is all that needs to be set on the `ExpectedMachine` (or site-level config). Went through existing tests to make sure we've got good coverage (and added some). Table below: | Case | Action | Coverage | |----------------------------------------------|--------------------------------------------------------|---------------------------------------------------------------------------| | Zero-DPU host w/ `::NoDpu` set | ingest as zero-DPU | `test_site_explorer_ingests_no_dpu_host` (and `test_site_explorer_main`) | | Zero-DPU host, w/ `::DpuMode` set | skip/warn/metric | `test_site_explorer_skips_unexpected_zero_dpu_host` | | Zero-DPU host w/ `::NicMode` set | ingest as zero-DPU | `test_site_explorer_ingests_nic_mode_host_with_no_observed_dpus` | | `NicMode` set w/ DPU still in DPU mode | fix via `set_nic_mode` | `test_site_explorer_auto_corrects_nic_mode_per_expected_machine` | | `Err(NoDpu) + expected_dpu_count == 0` | return `Ok(None)` | `handle_no_dpu_error_treats_as_ok_on_zero_dpu_host` | | `Err(NoDpu) + expected_dpu_count > 0` | return the error | `handle_no_dpu_error_surfaces_when_dpus_were_expected` | | BMC call succeeded (any `expected_dpu_count`) | return the success untouched | `handle_no_dpu_error_passes_through_success` | | BMC failed for a non-`NoDpu` reason (e.g. lockdown) | return the error untouched (only `NoDpu` is ever swallowed) | `handle_no_dpu_error_does_not_touch_other_errors` | | Integration e2e for Zero-DPU (machine-a-tron) | (None) | `test_machine_a_tron_singledpu_nic_mode` + integration tests | Signed-off-by: Chet Nichols III <chetn@nvidia.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
This rips out the
allow_zero_dpu_hostsflag entirely. With per-host and site-widedpu_modeavailable, we shouldn't need to say "allow zero DPU hosts" -- we should simply express whatExpectedMachinesare expected to beDpuMode::NoDpu(or::NicMode, or::DpuMode), which we can also set at the site level now too.For example we explicitly say:
ExpectedMachineis markedDpuMode::DpuMode(per-host or site-level default), we expect DPUs; ifsite-explorerdoesn't find any, it logs a warning and skips ingestion.ExpectedMachineis marked asDpuMode::NoDpu, we DON'T expect DPUs; if the BMC reports some,site-explorerignores them and ingests the host as zero-DPU.ExpectedMachineis marked asDpuMode::NicMode,site-explorereffectively ingests the host as zero-DPU; when it explores aNicModehost with DPUs, it willset_nic_modeon them as needed, naturally removing them from the DPU list across exploration cycles until all DPUs are reportingNicMode....no need for saying
allow_zero_dpu_hosts. If anExpectedMachineisn't markedNoDpu, then we don't expect it to be. If it is, then we do.Any configs out there that still have
allow_zero_dpu_hosts = truewill keep parsing fine -- it just won't do anything going forward; settingdpu_mode: no_dpuordpu_mode: nic_mode(ordpu_mode: dpu_mode) is all that needs to be set on theExpectedMachine(or site-level config).Went through existing tests to make sure we've got good coverage (and added some). Table below:
::NoDpusettest_site_explorer_ingests_no_dpu_host(andtest_site_explorer_main)::DpuModesettest_site_explorer_skips_unexpected_zero_dpu_host::NicModesettest_site_explorer_ingests_nic_mode_host_with_no_observed_dpusNicModeset w/ DPU still in DPU modeset_nic_modetest_site_explorer_auto_corrects_nic_mode_per_expected_machineErr(NoDpu) + expected_dpu_count == 0Ok(None)handle_no_dpu_error_treats_as_ok_on_zero_dpu_hostErr(NoDpu) + expected_dpu_count > 0handle_no_dpu_error_surfaces_when_dpus_were_expectedexpected_dpu_count)handle_no_dpu_error_passes_through_successNoDpureason (e.g. lockdown)NoDpuis ever swallowed)handle_no_dpu_error_does_not_touch_other_errorstest_machine_a_tron_singledpu_nic_mode+ integration testsSigned-off-by: Chet Nichols III chetn@nvidia.com
Type of Change
Related Issues (Optional)
Breaking Changes
Technically it just makes it so
dpu_mode: no_dpuworks now without needing to also setallow_zero_dpu_hosts: true.Testing
Additional Notes