feat: add OKE (OCI) flavor to nvidia-setup and nvidia-tuned#49
feat: add OKE (OCI) flavor to nvidia-setup and nvidia-tuned#49ayuskauskas wants to merge 16 commits into
Conversation
97f94c0 to
084a80a
Compare
📝 WalkthroughWalkthroughThis PR introduces OKE (Oracle Kubernetes Engine / OCI) platform support for Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes 🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 31
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
nvidia-setup/skyhook_dir/load_defaults.sh (1)
19-23: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick winUpdate the in-file export contract comment to match the new environment surface.
The comment block still says only core vars are exported, but Lines 46-50 now export many OKE-specific variables. Keep these in sync to avoid drift.
Also applies to: 46-50
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@nvidia-setup/skyhook_dir/load_defaults.sh` around lines 19 - 23, The comment block that describes the contract for load_defaults.sh (beginning with "Load nvidia-setup defaults and env overrides") currently lists only the core variables in the "Sets and exports:" section, but additional OKE-specific variables are being exported in the later section of the file (around lines 46-50). Update the "Sets and exports:" line in the comment block to include all the OKE-specific variables that are actually exported to keep the documentation in sync with the actual code behavior.nvidia-setup/skyhook_dir/steps/configure-chrony.sh (1)
1-1: 📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick winUse
#!/usr/bin/env bashshebang instead of/bin/bash.This makes the script portable across systems where bash may be installed at different paths.
As per coding guidelines: Shell scripts must use shebang
#!/usr/bin/env bash.🐛 Fix shebang
-#!/bin/bash +#!/usr/bin/env bash🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@nvidia-setup/skyhook_dir/steps/configure-chrony.sh` at line 1, Replace the shebang line in configure-chrony.sh with the portable version that uses env to locate bash. Instead of using the hardcoded path #!/bin/bash, change it to #!/usr/bin/env bash. This ensures the script will find the bash interpreter regardless of where it's installed on the system, making the script work across different system configurations.Source: Coding guidelines
nvidia-setup/skyhook_dir/steps_check/configure_chrony_check.sh (1)
1-1: 📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick winUse
#!/usr/bin/env bashshebang instead of/bin/bash.This makes the script portable across systems where bash may be installed at different paths.
As per coding guidelines: Shell scripts must use shebang
#!/usr/bin/env bash.🐛 Fix shebang
-#!/bin/bash +#!/usr/bin/env bash🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@nvidia-setup/skyhook_dir/steps_check/configure_chrony_check.sh` at line 1, The shebang line at the beginning of the script uses a hardcoded path `#!/bin/bash` which is not portable across systems where bash may be installed at different locations. Replace the shebang `#!/bin/bash` with `#!/usr/bin/env bash` so that the script will automatically locate bash from the system's PATH, ensuring compatibility across different Unix-like systems regardless of where bash is installed.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@nvidia-setup/README.md`:
- Line 94: The configure_hpc_networking section in the README.md contains an
incorrect acronym "OCA" that should be "OCI" for consistency with OKE and OCI
documentation standards. Locate the line describing the configure_hpc_networking
step that mentions "OCA rdma_network.json" and change "OCA" to "OCI" to match
the proper Oracle Cloud Infrastructure terminology used throughout the
documentation.
In `@nvidia-setup/skyhook_dir/defaults/oke-gb200.conf`:
- Line 11: The LUSTRE_ARTIFACTS_URL variable in oke-gb200.conf is currently set
to a placeholder string REPLACE_WITH_OCI_LUSTRE_DEB_URL instead of an actual
deployable URL. Replace this placeholder value with the correct OCI Lustre DEB
package repository URL so that the OKE GB200 Lustre bootstrap can properly fetch
packages during deployment.
In `@nvidia-setup/skyhook_dir/defaults/oke-h100.conf`:
- Line 11: Replace the placeholder value in the LUSTRE_ARTIFACTS_URL variable
with an actual, resolvable OCI Lustre DEB package URL. The current placeholder
value REPLACE_WITH_OCI_LUSTRE_DEB_URL needs to be replaced with the correct URL
that points to the OCI Lustre DEB artifacts repository so that the default
Lustre package retrieval works properly in the OKE H100 configuration.
In `@nvidia-setup/skyhook_dir/steps_check/configure_hpc_networking_check.sh`:
- Line 1: The shebang at the start of the configure_hpc_networking_check.sh
script uses the direct bash path form which is not portable across different
systems. Replace the current shebang `#!/bin/bash` with the portable form
`#!/usr/bin/env bash` which uses the env utility to locate bash in the system's
PATH, ensuring the script works consistently across different environments
following the established coding guidelines.
In `@nvidia-setup/skyhook_dir/steps_check/configure_limits_check.sh`:
- Line 1: The shebang line in the configure_limits_check.sh script uses a
hardcoded path to bash which may not be portable across all systems. Replace the
shebang `#!/bin/bash` with `#!/usr/bin/env bash` at the first line of the file
to ensure the script can find and execute bash regardless of its installation
location on the system.
In `@nvidia-setup/skyhook_dir/steps_check/install_doca_check.sh`:
- Line 1: The shebang in the install_doca_check.sh script is using the absolute
path form #!/bin/bash which is not portable across different systems. Replace
the shebang line at the beginning of the file with #!/usr/bin/env bash to use
the portable env form that will locate bash regardless of its installation path,
ensuring consistency with coding guidelines.
In `@nvidia-setup/skyhook_dir/steps_check/install_lustre_oke_check.sh`:
- Line 1: Replace the shebang line at the beginning of the
install_lustre_oke_check.sh script from `#!/bin/bash` to `#!/usr/bin/env bash`.
This makes the script portable across systems where bash may be installed at
different paths by leveraging the env command to locate bash dynamically rather
than assuming it is at /bin/bash.
In `@nvidia-setup/skyhook_dir/steps_check/install_oci_hpc_packages_check.sh`:
- Line 1: The shebang line at the beginning of the
install_oci_hpc_packages_check.sh script uses the hardcoded path #!/bin/bash
which is not portable across different systems. Replace the shebang with
#!/usr/bin/env bash to follow coding guidelines and ensure portability, as this
method uses the env utility to locate bash dynamically rather than assuming it
exists at a fixed path.
In `@nvidia-setup/skyhook_dir/steps/configure_hpc_networking.sh`:
- Line 1: The shebang line at the start of configure_hpc_networking.sh uses
`#!/bin/bash` which is not portable across different systems where bash may be
installed in non-standard locations. Replace the shebang with `#!/usr/bin/env
bash` which uses the env command to locate bash in the system PATH, making the
script portable across different Unix-like systems and environments where bash
installation locations may vary.
- Line 19: The shell script is missing the `-u` flag in the strict mode
settings. In the configure_hpc_networking.sh script, update the `set -eo
pipefail` command to include the `-u` flag, changing it to `set -euo pipefail`.
This addition ensures that any references to unset variables will cause the
script to exit immediately, improving error detection and debugging.
In `@nvidia-setup/skyhook_dir/steps/configure_limits.sh`:
- Line 19: The strict mode configuration at the beginning of the
configure_limits.sh script uses `set -eo pipefail` which does not enable
checking for unset variables. Update this command to include the `-u` flag by
changing it to `set -euo pipefail`. This will enable the script to catch unset
variables and provide more comprehensive error handling aligned with the coding
guidelines for strict mode.
- Line 1: Replace the shebang line in configure_limits.sh from `#!/bin/bash` to
`#!/usr/bin/env bash` to ensure portability across systems where bash may be
installed at different paths. The shebang is the first line of the script and
should be updated to use the env command to locate bash dynamically.
In `@nvidia-setup/skyhook_dir/steps/ensure_kernel.sh`:
- Around line 78-84: The flavor extraction using parameter expansion
`${KERNEL##*linux-}` silently produces an empty string if KERNEL does not
contain "linux-", causing the subsequent grep validation to always succeed
incorrectly (since grep -q "" matches everything). Before performing the flavor
extraction in the Meta/vendor kernels validation block, add a check to ensure
that KERNEL actually contains the "linux-" substring, and if it doesn't, exit
with an appropriate error message indicating that KERNEL format is invalid. This
ensures that the validation only proceeds when flavor extraction results in a
non-empty string.
In `@nvidia-setup/skyhook_dir/steps/files/oke/fix-ofed-symlinks.sh`:
- Line 1: The shell script fix-ofed-symlinks.sh is missing the required Apache
2.0 license header at the beginning of the file. Add the full Apache 2.0 license
header from .github/license-header.tmpl to the top of the script before the
shebang line. You can run the make license-fmt command from the repository root
to automatically add the required header to this file.
In `@nvidia-setup/skyhook_dir/steps/files/oke/lustre-modules-setup`:
- Around line 1-5: The lustre-modules-setup shell script is missing the required
Apache 2.0 license header at the beginning of the file. Add the full Apache 2.0
license header from the `.github/license-header.tmpl` template file before the
shebang line. You can use the `make license-fmt` command to automatically apply
the correct license header to all shell scripts in the project.
- Around line 31-32: The issue is that `local rc=$?` captures the exit status of
the if statement itself (which is 0 when the condition is false) rather than the
actual exit code from the dkms install command. To fix this, restructure the
code to run the dkms install command directly and immediately capture its exit
code into rc before evaluating it. Then use an if statement to check if rc
equals 0 (success, return 0), or if rc equals 3 (acceptable error, continue),
otherwise exit with the captured error code. This way the actual dkms exit code
is properly evaluated instead of the if statement's exit status.
In `@nvidia-setup/skyhook_dir/steps/files/oke/oci-create-vfs`:
- Around line 15-16: All variable expansions in the script must be quoted to
prevent word-splitting issues. In the lines containing get_vf_dev_name function
call with parameters $interface and $vf_idx, and in the cat command path using
$vf_dev_name, wrap each variable expansion with double quotes (e.g., "$variable"
instead of $variable). This applies to all similar instances throughout the
script, including the sections around lines 21-22 and 49-55, ensuring every
variable expansion is properly quoted to comply with shellcheck SC2086.
- Line 2: The script is missing the `-u` flag in the set command for strict
mode. Update the set command from `set -ex -o pipefail` to `set -euo pipefail`
to ensure unset variable references are caught and the script follows proper
shell scripting guidelines for error handling.
In `@nvidia-setup/skyhook_dir/steps/files/oke/oci-sriov-vf-config`:
- Line 2: The script's initial set command at line 2 is missing the `-u` flag
for strict mode. Update the `set -e -o pipefail` statement to `set -euo
pipefail` to enable strict mode with unset variable checking. This ensures the
script will exit immediately if any undefined variables are referenced,
improving script robustness and catching potential errors early.
- Around line 11-16: Quote all variable expansions throughout the SRIOV
configuration block to prevent word-splitting issues. In the conditional check
at the start of the block, wrap $enabled and $vfs with double quotes. In the
echo statement that references the PCI device, wrap $pci with double quotes. In
the mlxconfig command invocation, wrap $pci with double quotes when passing it
as the device argument with the -d flag.
In `@nvidia-setup/skyhook_dir/steps/install_doca.sh`:
- Line 1: Replace the current shebang `#!/bin/bash` with `#!/usr/bin/env bash`
at the beginning of the script to follow strict shell conventions. Immediately
after the shebang line, add `set -euo pipefail` to enable strict mode which will
catch unset variable references and cause the script to exit on errors, ensuring
proper error handling throughout the install_doca.sh lifecycle script.
In `@nvidia-setup/skyhook_dir/steps/install_lustre_oke.sh`:
- Line 1: Replace the shebang line at the beginning of install_lustre_oke.sh
from the absolute path `/bin/bash` to `#!/usr/bin/env bash` instead. This uses
the env command to locate bash in the system PATH, making the script portable
across different systems where bash may be installed in non-standard locations.
- Line 19: The strict mode at the beginning of the install_lustre_oke.sh script
currently uses only `set -eo pipefail`, which does not catch references to unset
variables. This can hide typos in variable names that would otherwise cause
failures. Add the `-u` flag to the set command to enable strict mode for unset
variable checking, making it enforce all three error handling behaviors: exit on
errors, exit on pipe failures, and error on unset variables.
In `@nvidia-setup/skyhook_dir/steps/install_oci_hpc_packages.sh`:
- Line 1: The shebang line at the start of the install_oci_hpc_packages.sh
script uses `#!/bin/bash` which assumes bash is located at the hardcoded path
`/bin/bash`. Replace this with `#!/usr/bin/env bash` to use the portable shebang
that leverages the system's PATH to locate bash, making the script more portable
across different environments and Unix systems where bash may be installed in
different locations.
- Line 19: The set command in the install_oci_hpc_packages.sh script currently
uses `set -eo pipefail` but is missing the `-u` flag which is required by coding
guidelines. Modify the set command at the beginning of the script to include the
`-u` flag, changing it from `set -eo pipefail` to include all three flags in the
correct order to enable strict mode that catches unset variable references in
addition to errors and pipe failures.
In `@nvidia-setup/VERSION_OVERVIEW.md`:
- Around line 30-31: There is no blank line between the OKE table (ending with
the row containing gb200 and multiple Y values) and the 0.3.x heading. Insert a
blank line between the table that ends on line 30 and the heading "# 0.3.x" to
maintain proper markdown formatting and comply with the requirement to have
blank lines around headings as per the coding guidelines.
In `@nvidia-tuned/profiles/service/oke/script.sh`:
- Around line 1-7: The script does not conform to repository guidelines for
lifecycle scripts. Replace the shebang `#!/bin/bash` with `#!/usr/bin/env bash`
on the first line. Update the strict mode from `set -e` to `set -euo pipefail`
to enforce additional safety checks (undefined variable handling and pipe
failure detection) as required by the repository standards for lifecycle
scripts.
In `@tests/integration/nvidia_setup/fixtures/resolve_kernel_wrapper.sh`:
- Line 1: The shebang line at the beginning of the resolve_kernel_wrapper.sh
script uses #!/bin/bash, which is not portable across different systems. Replace
the shebang #!/bin/bash with #!/usr/bin/env bash to follow repository guidelines
and ensure better portability across different Unix-like systems where bash may
be installed in different locations.
- Around line 3-17: The file contains duplicate license headers with both the
SPDX short form (the SPDX-FileCopyrightText and SPDX-License-Identifier
comments) and the full Apache License 2.0 text block. Per repository guidelines,
keep only the SPDX short form header and remove the blank line and the entire
full Apache license block starting from "Licensed under the Apache License"
through the final "limitations under the License" closing comment. Ensure only
the SPDX header comments remain at the top of the resolve_kernel_wrapper.sh
file.
In `@tests/integration/nvidia_setup/test_apply_oke.py`:
- Line 164: The assertion on line 164 combines two conditions with an "and"
operator, which makes it unclear which condition failed when the test fails.
Split this compound assertion into two separate assert statements, one checking
that "memlock" is in conf and another checking that "unlimited" is in conf, so
that each condition produces a distinct failure message for easier debugging.
- Around line 3-17: The test_apply_oke.py file contains duplicate license
headers with both the SPDX short form at the beginning and a full Apache 2.0
license block below it. Per repository guidelines, only the SPDX short form
header should be retained. Remove the blank line following the SPDX header and
the entire Apache 2.0 license block (all lines from "Licensed under the Apache
License" through "limitations under the License."), keeping only the SPDX
FileCopyrightText and Identifier comment lines at the top.
---
Outside diff comments:
In `@nvidia-setup/skyhook_dir/load_defaults.sh`:
- Around line 19-23: The comment block that describes the contract for
load_defaults.sh (beginning with "Load nvidia-setup defaults and env overrides")
currently lists only the core variables in the "Sets and exports:" section, but
additional OKE-specific variables are being exported in the later section of the
file (around lines 46-50). Update the "Sets and exports:" line in the comment
block to include all the OKE-specific variables that are actually exported to
keep the documentation in sync with the actual code behavior.
In `@nvidia-setup/skyhook_dir/steps_check/configure_chrony_check.sh`:
- Line 1: The shebang line at the beginning of the script uses a hardcoded path
`#!/bin/bash` which is not portable across systems where bash may be installed
at different locations. Replace the shebang `#!/bin/bash` with `#!/usr/bin/env
bash` so that the script will automatically locate bash from the system's PATH,
ensuring compatibility across different Unix-like systems regardless of where
bash is installed.
In `@nvidia-setup/skyhook_dir/steps/configure-chrony.sh`:
- Line 1: Replace the shebang line in configure-chrony.sh with the portable
version that uses env to locate bash. Instead of using the hardcoded path
#!/bin/bash, change it to #!/usr/bin/env bash. This ensures the script will find
the bash interpreter regardless of where it's installed on the system, making
the script work across different system configurations.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 51632c0d-e201-4d5e-8487-206f6755e474
⛔ Files ignored due to path filters (2)
nvidia-setup/CHANGELOG.mdis excluded by!**/CHANGELOG.mdnvidia-tuned/CHANGELOG.mdis excluded by!**/CHANGELOG.md
📒 Files selected for processing (39)
nvidia-setup/README.mdnvidia-setup/VERSION_OVERVIEW.mdnvidia-setup/config.jsonnvidia-setup/skyhook_dir/apply.shnvidia-setup/skyhook_dir/apply_check.shnvidia-setup/skyhook_dir/defaults/eks-gb200.confnvidia-setup/skyhook_dir/defaults/eks-h100.confnvidia-setup/skyhook_dir/defaults/oke-gb200.confnvidia-setup/skyhook_dir/defaults/oke-h100.confnvidia-setup/skyhook_dir/load_defaults.shnvidia-setup/skyhook_dir/steps/configure-chrony.shnvidia-setup/skyhook_dir/steps/configure_hpc_networking.shnvidia-setup/skyhook_dir/steps/configure_limits.shnvidia-setup/skyhook_dir/steps/ensure_kernel.shnvidia-setup/skyhook_dir/steps/files/oke/99-oci-network-mlx.rulesnvidia-setup/skyhook_dir/steps/files/oke/fix-ofed-symlinks.shnvidia-setup/skyhook_dir/steps/files/oke/lustre-modules-setupnvidia-setup/skyhook_dir/steps/files/oke/lustre-modules-setup.servicenvidia-setup/skyhook_dir/steps/files/oke/oci-create-vfsnvidia-setup/skyhook_dir/steps/files/oke/oci-sriov-vf-confignvidia-setup/skyhook_dir/steps/files/oke/rdma_network.jsonnvidia-setup/skyhook_dir/steps/install_doca.shnvidia-setup/skyhook_dir/steps/install_kernel.shnvidia-setup/skyhook_dir/steps/install_lustre_oke.shnvidia-setup/skyhook_dir/steps/install_oci_hpc_packages.shnvidia-setup/skyhook_dir/steps_check/configure_chrony_check.shnvidia-setup/skyhook_dir/steps_check/configure_hpc_networking_check.shnvidia-setup/skyhook_dir/steps_check/configure_limits_check.shnvidia-setup/skyhook_dir/steps_check/install_doca_check.shnvidia-setup/skyhook_dir/steps_check/install_lustre_oke_check.shnvidia-setup/skyhook_dir/steps_check/install_oci_hpc_packages_check.shnvidia-setup/skyhook_dir/utilities.shnvidia-tuned/README.mdnvidia-tuned/config.jsonnvidia-tuned/profiles/service/oke/script.shnvidia-tuned/profiles/service/oke/tuned.conf.templatetests/integration/nvidia_setup/fixtures/resolve_kernel_wrapper.shtests/integration/nvidia_setup/test_apply_oke.pytests/integration/nvidia_tuned/test_prepare_nvidia_profiles.py
| 1. **upgrade** – `apt-get update && apt-get upgrade -y` | ||
| 2. **install_doca** – Mellanox DOCA/OFED + `mstflint` from the DOCA repo (idempotent; fixes OFED symlinks) | ||
| 3. **install_oci_hpc_packages** – downloads + installs the OCI HPC `network-device-names` and `nvidia-gpu-configure` debs (matches Oracle's modern `use_plugins` set) | ||
| 4. **configure_hpc_networking** – drops the OCI RDMA udev rules, SR-IOV VF helpers, and the OCA `rdma_network.json` |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win
Fix OCI acronym typo in OKE networking step.
Line 94 says OCA rdma_network.json; this should be OCI to match the rest of the OKE/OCI documentation and avoid confusion.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@nvidia-setup/README.md` at line 94, The configure_hpc_networking section in
the README.md contains an incorrect acronym "OCA" that should be "OCI" for
consistency with OKE and OCI documentation standards. Locate the line describing
the configure_hpc_networking step that mentions "OCA rdma_network.json" and
change "OCA" to "OCI" to match the proper Oracle Cloud Infrastructure
terminology used throughout the documentation.
| LUSTRE_DKMS_NAME=lustre-client-modules | ||
| LUSTRE_DKMS_VERSION=2.15.90-v35817661 | ||
| LUSTRE_PKG_VERSION=2.15.90-v35817661-1 | ||
| LUSTRE_ARTIFACTS_URL=REPLACE_WITH_OCI_LUSTRE_DEB_URL |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Replace the Lustre artifacts placeholder with a deployable value.
LUSTRE_ARTIFACTS_URL is still a placeholder, so OKE GB200 Lustre bootstrap cannot fetch packages in a real run.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@nvidia-setup/skyhook_dir/defaults/oke-gb200.conf` at line 11, The
LUSTRE_ARTIFACTS_URL variable in oke-gb200.conf is currently set to a
placeholder string REPLACE_WITH_OCI_LUSTRE_DEB_URL instead of an actual
deployable URL. Replace this placeholder value with the correct OCI Lustre DEB
package repository URL so that the OKE GB200 Lustre bootstrap can properly fetch
packages during deployment.
| LUSTRE_DKMS_NAME=lustre-client-modules | ||
| LUSTRE_DKMS_VERSION=2.15.90-v35817661 | ||
| LUSTRE_PKG_VERSION=2.15.90-v35817661-1 | ||
| LUSTRE_ARTIFACTS_URL=REPLACE_WITH_OCI_LUSTRE_DEB_URL |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Do not ship OKE H100 defaults with a Lustre URL placeholder.
Line 11 keeps a non-resolvable placeholder, which breaks the default Lustre package retrieval path.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@nvidia-setup/skyhook_dir/defaults/oke-h100.conf` at line 11, Replace the
placeholder value in the LUSTRE_ARTIFACTS_URL variable with an actual,
resolvable OCI Lustre DEB package URL. The current placeholder value
REPLACE_WITH_OCI_LUSTRE_DEB_URL needs to be replaced with the correct URL that
points to the OCI Lustre DEB artifacts repository so that the default Lustre
package retrieval works properly in the OKE H100 configuration.
| @@ -0,0 +1,30 @@ | |||
| #!/bin/bash | |||
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win
Use the portable shebang for consistency.
The shebang should be #!/usr/bin/env bash per coding guidelines. As per coding guidelines, all shell scripts should use the portable env form to locate bash.
🔧 Proposed fix
-#!/bin/bash
+#!/usr/bin/env bash📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| #!/bin/bash | |
| #!/usr/bin/env bash |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@nvidia-setup/skyhook_dir/steps_check/configure_hpc_networking_check.sh` at
line 1, The shebang at the start of the configure_hpc_networking_check.sh script
uses the direct bash path form which is not portable across different systems.
Replace the current shebang `#!/bin/bash` with the portable form `#!/usr/bin/env
bash` which uses the env utility to locate bash in the system's PATH, ensuring
the script works consistently across different environments following the
established coding guidelines.
Source: Coding guidelines
| @@ -0,0 +1,24 @@ | |||
| #!/bin/bash | |||
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win
Use #!/usr/bin/env bash shebang instead of /bin/bash.
This makes the script portable across systems where bash may be installed at different paths.
As per coding guidelines: Shell scripts must use shebang #!/usr/bin/env bash.
🐛 Fix shebang
-#!/bin/bash
+#!/usr/bin/env bash📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| #!/bin/bash | |
| #!/usr/bin/env bash |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@nvidia-setup/skyhook_dir/steps_check/configure_limits_check.sh` at line 1,
The shebang line in the configure_limits_check.sh script uses a hardcoded path
to bash which may not be portable across all systems. Replace the shebang
`#!/bin/bash` with `#!/usr/bin/env bash` at the first line of the file to ensure
the script can find and execute bash regardless of its installation location on
the system.
Source: Coding guidelines
| #!/bin/bash | ||
|
|
||
| # SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. | ||
| # SPDX-License-Identifier: Apache-2.0 | ||
|
|
||
| # TuneD script plugin lifecycle: start | stop [full_rollback] | verify [ignore_missing] | ||
| set -e |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win
Update shebang and strict mode to match repository guidelines.
The script uses #!/bin/bash and set -e, but repository guidelines require #!/usr/bin/env bash and set -euo pipefail for all lifecycle scripts (excluding -check scripts).
🔧 Proposed fix
-#!/bin/bash
+#!/usr/bin/env bash
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
# TuneD script plugin lifecycle: start | stop [full_rollback] | verify [ignore_missing]
-set -e
+set -euo pipefail🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@nvidia-tuned/profiles/service/oke/script.sh` around lines 1 - 7, The script
does not conform to repository guidelines for lifecycle scripts. Replace the
shebang `#!/bin/bash` with `#!/usr/bin/env bash` on the first line. Update the
strict mode from `set -e` to `set -euo pipefail` to enforce additional safety
checks (undefined variable handling and pipe failure detection) as required by
the repository standards for lifecycle scripts.
Source: Coding guidelines
| @@ -0,0 +1,20 @@ | |||
| #!/bin/bash | |||
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win
Use portable shebang.
The script uses #!/bin/bash but repository guidelines require #!/usr/bin/env bash for portability.
🔧 Proposed fix
-#!/bin/bash
+#!/usr/bin/env bash📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| #!/bin/bash | |
| #!/usr/bin/env bash |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/integration/nvidia_setup/fixtures/resolve_kernel_wrapper.sh` at line 1,
The shebang line at the beginning of the resolve_kernel_wrapper.sh script uses
#!/bin/bash, which is not portable across different systems. Replace the shebang
#!/bin/bash with #!/usr/bin/env bash to follow repository guidelines and ensure
better portability across different Unix-like systems where bash may be
installed in different locations.
Source: Coding guidelines
| # SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. | ||
| # SPDX-License-Identifier: Apache-2.0 | ||
| # | ||
| # | ||
| # Licensed under the Apache License, Version 2.0 (the "License"); | ||
| # you may not use this file except in compliance with the License. | ||
| # You may obtain a copy of the License at | ||
| # | ||
| # http://www.apache.org/licenses/LICENSE-2.0 | ||
| # | ||
| # Unless required by applicable law or agreed to in writing, software | ||
| # distributed under the License is distributed on an "AS IS" BASIS, | ||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| # See the License for the specific language governing permissions and | ||
| # limitations under the License. |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🔴 Critical | ⚡ Quick win
Remove duplicate license header.
This file has both the SPDX short form (lines 3-4) and the full Apache 2.0 license block (lines 7-17). Per repository guidelines, source files should carry only the SPDX short form added by make license-fmt, and CI will fail on duplicated headers.
Remove lines 5-17 (the blank line and full Apache block), keeping only the SPDX header at lines 3-4.
🔧 Proposed fix
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
-#
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
. "${SKYHOOK_DIR}/skyhook_dir/utilities.sh"📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. | |
| # SPDX-License-Identifier: Apache-2.0 | |
| # | |
| # | |
| # Licensed under the Apache License, Version 2.0 (the "License"); | |
| # you may not use this file except in compliance with the License. | |
| # You may obtain a copy of the License at | |
| # | |
| # http://www.apache.org/licenses/LICENSE-2.0 | |
| # | |
| # Unless required by applicable law or agreed to in writing, software | |
| # distributed under the License is distributed on an "AS IS" BASIS, | |
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | |
| # See the License for the specific language governing permissions and | |
| # limitations under the License. | |
| # SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. | |
| # SPDX-License-Identifier: Apache-2.0 | |
| . "${SKYHOOK_DIR}/skyhook_dir/utilities.sh" | |
| resolve_full_kernel "$1" |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/integration/nvidia_setup/fixtures/resolve_kernel_wrapper.sh` around
lines 3 - 17, The file contains duplicate license headers with both the SPDX
short form (the SPDX-FileCopyrightText and SPDX-License-Identifier comments) and
the full Apache License 2.0 text block. Per repository guidelines, keep only the
SPDX short form header and remove the blank line and the entire full Apache
license block starting from "Licensed under the Apache License" through the
final "limitations under the License" closing comment. Ensure only the SPDX
header comments remain at the top of the resolve_kernel_wrapper.sh file.
Source: Coding guidelines
| # SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. | ||
| # SPDX-License-Identifier: Apache-2.0 | ||
| # | ||
| # | ||
| # Licensed under the Apache License, Version 2.0 (the "License"); | ||
| # you may not use this file except in compliance with the License. | ||
| # You may obtain a copy of the License at | ||
| # | ||
| # http://www.apache.org/licenses/LICENSE-2.0 | ||
| # | ||
| # Unless required by applicable law or agreed to in writing, software | ||
| # distributed under the License is distributed on an "AS IS" BASIS, | ||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| # See the License for the specific language governing permissions and | ||
| # limitations under the License. |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🔴 Critical | ⚡ Quick win
Remove duplicate license header.
This file has both the SPDX short form (lines 3-4) and the full Apache 2.0 license block (lines 7-17). Per repository guidelines, source files should carry only the SPDX short form, and CI will fail on duplicated headers.
Remove lines 5-17 (the blank line and full Apache block), keeping only the SPDX header at lines 3-4.
🔧 Proposed fix
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
-#
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
"""Tests for nvidia-setup oke flavor."""📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. | |
| # SPDX-License-Identifier: Apache-2.0 | |
| # | |
| # | |
| # Licensed under the Apache License, Version 2.0 (the "License"); | |
| # you may not use this file except in compliance with the License. | |
| # You may obtain a copy of the License at | |
| # | |
| # http://www.apache.org/licenses/LICENSE-2.0 | |
| # | |
| # Unless required by applicable law or agreed to in writing, software | |
| # distributed under the License is distributed on an "AS IS" BASIS, | |
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | |
| # See the License for the specific language governing permissions and | |
| # limitations under the License. | |
| # SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. | |
| # SPDX-License-Identifier: Apache-2.0 | |
| # | |
| # Licensed under the Apache License, Version 2.0 (the "License"); | |
| # you may not use this file except in compliance with the License. | |
| # You may obtain a copy of the License at | |
| # | |
| # http://www.apache.org/licenses/LICENSE-2.0 | |
| # | |
| # Unless required by applicable law or agreed to in writing, software | |
| # distributed under the License is distributed on an "AS IS" BASIS, | |
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | |
| # See the License for the specific language governing permissions and | |
| # limitations under the License. | |
| """Tests for nvidia-setup oke flavor.""" |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/integration/nvidia_setup/test_apply_oke.py` around lines 3 - 17, The
test_apply_oke.py file contains duplicate license headers with both the SPDX
short form at the beginning and a full Apache 2.0 license block below it. Per
repository guidelines, only the SPDX short form header should be retained.
Remove the blank line following the SPDX header and the entire Apache 2.0
license block (all lines from "Licensed under the Apache License" through
"limitations under the License."), keeping only the SPDX FileCopyrightText and
Identifier comment lines at the top.
Source: Coding guidelines
| ) | ||
| assert_exit_code(result, 0) | ||
| conf = runner.get_file_contents("/etc/security/limits.d/99-oci-hpc.conf") | ||
| assert "memlock" in conf and "unlimited" in conf |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win
Break down compound assertion.
The assertion on line 164 checks two conditions in one statement, making it harder to debug which part failed. Split into separate assertions for clearer failure messages.
♻️ Proposed refactor
conf = runner.get_file_contents("/etc/security/limits.d/99-oci-hpc.conf")
- assert "memlock" in conf and "unlimited" in conf
+ assert "memlock" in conf
+ assert "unlimited" in conf📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| assert "memlock" in conf and "unlimited" in conf | |
| conf = runner.get_file_contents("/etc/security/limits.d/99-oci-hpc.conf") | |
| assert "memlock" in conf | |
| assert "unlimited" in conf |
🧰 Tools
🪛 Ruff (0.15.18)
[warning] 164-164: Assertion should be broken down into multiple parts
Break down assertion into multiple parts
(PT018)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/integration/nvidia_setup/test_apply_oke.py` at line 164, The assertion
on line 164 combines two conditions with an "and" operator, which makes it
unclear which condition failed when the test fails. Split this compound
assertion into two separate assert statements, one checking that "memlock" is in
conf and another checking that "unlimited" is in conf, so that each condition
produces a distinct failure message for easier debugging.
Source: Linters/SAST tools
Summary
Adds a
service=okeflavor (acceleratorsh100,gb200) to nvidia-setup and nvidia-tuned, translating the NV-specific layer of an Oracle OKE GPU/HPC worker (fromoke-node-images) into runtime Skyhook packages. Design + plans are committed underdocs/superpowers/.Scope boundary: only the NV/HPC/storage/perf layer is translated. The base OKE worker bootstrap (oke-init, kubelet, cri-o, VNIC config) stays in the OKE base image and is intentionally out of scope.
nvidia-setup (
oke-h100,oke-gb200) —0.2.2→0.4.0okekernel verify/install: pinned6.8.0-1041-oracle(x86) or thelinux-nvidia-64kvendor meta-kernel (Grace/arm64), with the existing install+reboot/post-interrupt flow.install_doca(DOCA/OFED + mstflint),install_oci_hpc_packages(device-naming + gpu-configure debs),configure_hpc_networking(OCI udev/VF helpers + OCArdma_network.json),install_lustre_oke(client debs + DKMS + IMDS-gatedlustre-modules-setuploader/unit),configure_limits(memlock unlimited), and a parametrizedconfigure-chrony(OCI IMDS NTP).oke.nvidia-tuned (
service=oke) —0.3.0→0.4.0profiles/service/oke/(tuned.conf.template+script.sh) layering the OCI HPC[bootloader]cmdline (oci_hpc.*device naming, cstate/mitigations,pcie_ports=dpc-native) and[sysctl]deltas (incl. large-clustergc_thresh*+perf_event_paranoid) plus MAC-address policy on top of the existingnvidia-{accel}-{intent}profiles. No prepare/apply script changes. De-duped againstnvidia-base/nvidia-acs-disable/nvidia-h100-performance.Cross-checked against Oracle's public
oci-hpc-imagesValidated the translation against
oracle-quickstart/oci-hpc-images(Ubuntu-24 amd64/arm64). Folded in thegc_thresh*/perf_event_paranoidsysctls,pcie_ports=dpc-native, andmemlockulimits; documented open items in the spec.Open items (documented)
LUSTRE_ARTIFACTS_URLis a placeholder indefaults/oke-*.conf(tests useSKIP_SYSTEM_OPERATIONS); set it once the OCI artifact location is provided.3.3.0(oke-node-images) vs3.2.1(Oracle public) — chosen deliberately.Test plan
make validate-standalone PACKAGE=nvidia-setupmake validate-inherited PACKAGE=nvidia-tunedoketests pass on ubuntu-24.04, ubuntu-22.04, debian-12, rocky-9 (verified per-distro; the full parallel matrix only flakes on environmental apt/tuned-install timeouts)