Skip to content

feat: add OKE (OCI) flavor to nvidia-setup and nvidia-tuned#49

Open
ayuskauskas wants to merge 16 commits into
mainfrom
feat/nvidia-oke
Open

feat: add OKE (OCI) flavor to nvidia-setup and nvidia-tuned#49
ayuskauskas wants to merge 16 commits into
mainfrom
feat/nvidia-oke

Conversation

@ayuskauskas

Copy link
Copy Markdown
Collaborator

Summary

Adds a service=oke flavor (accelerators h100, gb200) to nvidia-setup and nvidia-tuned, translating the NV-specific layer of an Oracle OKE GPU/HPC worker (from oke-node-images) into runtime Skyhook packages. Design + plans are committed under docs/superpowers/.

Scope boundary: only the NV/HPC/storage/perf layer is translated. The base OKE worker bootstrap (oke-init, kubelet, cri-o, VNIC config) stays in the OKE base image and is intentionally out of scope.

nvidia-setup (oke-h100, oke-gb200) — 0.2.20.4.0

  • oke kernel verify/install: pinned 6.8.0-1041-oracle (x86) or the linux-nvidia-64k vendor meta-kernel (Grace/arm64), with the existing install+reboot/post-interrupt flow.
  • New steps (+ matching checks): install_doca (DOCA/OFED + mstflint), install_oci_hpc_packages (device-naming + gpu-configure debs), configure_hpc_networking (OCI udev/VF helpers + OCA rdma_network.json), install_lustre_oke (client debs + DKMS + IMDS-gated lustre-modules-setup loader/unit), configure_limits (memlock unlimited), and a parametrized configure-chrony (OCI IMDS NTP).
  • EFA/OFI are AWS-only and intentionally absent for oke.

nvidia-tuned (service=oke) — 0.3.00.4.0

  • New profiles/service/oke/ (tuned.conf.template + script.sh) layering the OCI HPC [bootloader] cmdline (oci_hpc.* device naming, cstate/mitigations, pcie_ports=dpc-native) and [sysctl] deltas (incl. large-cluster gc_thresh* + perf_event_paranoid) plus MAC-address policy on top of the existing nvidia-{accel}-{intent} profiles. No prepare/apply script changes. De-duped against nvidia-base/nvidia-acs-disable/nvidia-h100-performance.

Cross-checked against Oracle's public oci-hpc-images

Validated the translation against oracle-quickstart/oci-hpc-images (Ubuntu-24 amd64/arm64). Folded in the gc_thresh*/perf_event_paranoid sysctls, pcie_ports=dpc-native, and memlock ulimits; documented open items in the spec.

Open items (documented)

  • LUSTRE_ARTIFACTS_URL is a placeholder in defaults/oke-*.conf (tests use SKIP_SYSTEM_OPERATIONS); set it once the OCI artifact location is provided.
  • IOMMU/ACS is hardware-specific (oracle vs Oracle public vs CX8/GB200) and intentionally left to the workload profile — confirm the GB200/CX8 value before shipping a gb200 override.
  • DOCA version 3.3.0 (oke-node-images) vs 3.2.1 (Oracle public) — chosen deliberately.

Test plan

  • nvidia-setup full suite: 30 passed, 2 skipped
  • make validate-standalone PACKAGE=nvidia-setup
  • make validate-inherited PACKAGE=nvidia-tuned
  • nvidia-tuned oke tests pass on ubuntu-24.04, ubuntu-22.04, debian-12, rocky-9 (verified per-distro; the full parallel matrix only flakes on environmental apt/tuned-install timeouts)
  • Validate on real OKE H100 / GB200 hardware (RDMA naming, Lustre mount, kernel)

@lockwobr lockwobr requested a review from a team June 23, 2026 18:50
@github-actions github-actions Bot added doc Documentation addition, correction, or improvement package/nvidia-tuned package/nvidia-setup tests labels Jun 23, 2026
@coderabbitai

coderabbitai Bot commented Jun 23, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

This PR introduces OKE (Oracle Kubernetes Engine / OCI) platform support for nvidia-setup and nvidia-tuned, bumping both packages from version 0.3.0 to 0.4.0. On the nvidia-setup side, new oke-h100.conf and oke-gb200.conf defaults files pin kernel, DOCA, OCI HPC package, Lustre, and NTP values. load_defaults.sh exports ten new OKE-specific variables. New installation steps are added for DOCA/OFED (with an OFED symlink fixer), OCI HPC packages (network-device-names and GPU-configure debs), HPC networking configuration (udev rules, SR-IOV VF scripts, RDMA network JSON), a Lustre client DKMS bootstrapper with Oracle IMDS-driven conditional loading and LNet configuration, system limits (/etc/security/limits.d/99-oci-hpc.conf), and a parameterized chrony configuration step. ensure_kernel.sh and install_kernel.sh gain a KERNEL_META branch for meta-package-based kernel matching. utilities.sh now preserves -oracle and -nvidia-64k kernel flavor strings. apply.sh and apply_check.sh add oke-h100/oke-gb200 dispatch cases. On the nvidia-tuned side, a new service/oke/ profile is added with a full tuned.conf.template (bootloader cmdline + sysctl block) and script.sh lifecycle dispatcher. Integration tests and documentation updates accompany all changes.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'feat: add OKE (OCI) flavor to nvidia-setup and nvidia-tuned' directly and concisely summarizes the main objective of the changeset: adding OKE/OCI support to both packages.
Description check ✅ Passed The description is comprehensive and directly related to the changeset, detailing the OKE flavor additions, version bumps, new steps/scripts, cross-validation, and open items for both nvidia-setup and nvidia-tuned.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/nvidia-oke

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 31

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
nvidia-setup/skyhook_dir/load_defaults.sh (1)

19-23: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Update the in-file export contract comment to match the new environment surface.

The comment block still says only core vars are exported, but Lines 46-50 now export many OKE-specific variables. Keep these in sync to avoid drift.

Also applies to: 46-50

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nvidia-setup/skyhook_dir/load_defaults.sh` around lines 19 - 23, The comment
block that describes the contract for load_defaults.sh (beginning with "Load
nvidia-setup defaults and env overrides") currently lists only the core
variables in the "Sets and exports:" section, but additional OKE-specific
variables are being exported in the later section of the file (around lines
46-50). Update the "Sets and exports:" line in the comment block to include all
the OKE-specific variables that are actually exported to keep the documentation
in sync with the actual code behavior.
nvidia-setup/skyhook_dir/steps/configure-chrony.sh (1)

1-1: 📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win

Use #!/usr/bin/env bash shebang instead of /bin/bash.

This makes the script portable across systems where bash may be installed at different paths.

As per coding guidelines: Shell scripts must use shebang #!/usr/bin/env bash.

🐛 Fix shebang
-#!/bin/bash
+#!/usr/bin/env bash
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nvidia-setup/skyhook_dir/steps/configure-chrony.sh` at line 1, Replace the
shebang line in configure-chrony.sh with the portable version that uses env to
locate bash. Instead of using the hardcoded path #!/bin/bash, change it to
#!/usr/bin/env bash. This ensures the script will find the bash interpreter
regardless of where it's installed on the system, making the script work across
different system configurations.

Source: Coding guidelines

nvidia-setup/skyhook_dir/steps_check/configure_chrony_check.sh (1)

1-1: 📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win

Use #!/usr/bin/env bash shebang instead of /bin/bash.

This makes the script portable across systems where bash may be installed at different paths.

As per coding guidelines: Shell scripts must use shebang #!/usr/bin/env bash.

🐛 Fix shebang
-#!/bin/bash
+#!/usr/bin/env bash
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nvidia-setup/skyhook_dir/steps_check/configure_chrony_check.sh` at line 1,
The shebang line at the beginning of the script uses a hardcoded path
`#!/bin/bash` which is not portable across systems where bash may be installed
at different locations. Replace the shebang `#!/bin/bash` with `#!/usr/bin/env
bash` so that the script will automatically locate bash from the system's PATH,
ensuring compatibility across different Unix-like systems regardless of where
bash is installed.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@nvidia-setup/README.md`:
- Line 94: The configure_hpc_networking section in the README.md contains an
incorrect acronym "OCA" that should be "OCI" for consistency with OKE and OCI
documentation standards. Locate the line describing the configure_hpc_networking
step that mentions "OCA rdma_network.json" and change "OCA" to "OCI" to match
the proper Oracle Cloud Infrastructure terminology used throughout the
documentation.

In `@nvidia-setup/skyhook_dir/defaults/oke-gb200.conf`:
- Line 11: The LUSTRE_ARTIFACTS_URL variable in oke-gb200.conf is currently set
to a placeholder string REPLACE_WITH_OCI_LUSTRE_DEB_URL instead of an actual
deployable URL. Replace this placeholder value with the correct OCI Lustre DEB
package repository URL so that the OKE GB200 Lustre bootstrap can properly fetch
packages during deployment.

In `@nvidia-setup/skyhook_dir/defaults/oke-h100.conf`:
- Line 11: Replace the placeholder value in the LUSTRE_ARTIFACTS_URL variable
with an actual, resolvable OCI Lustre DEB package URL. The current placeholder
value REPLACE_WITH_OCI_LUSTRE_DEB_URL needs to be replaced with the correct URL
that points to the OCI Lustre DEB artifacts repository so that the default
Lustre package retrieval works properly in the OKE H100 configuration.

In `@nvidia-setup/skyhook_dir/steps_check/configure_hpc_networking_check.sh`:
- Line 1: The shebang at the start of the configure_hpc_networking_check.sh
script uses the direct bash path form which is not portable across different
systems. Replace the current shebang `#!/bin/bash` with the portable form
`#!/usr/bin/env bash` which uses the env utility to locate bash in the system's
PATH, ensuring the script works consistently across different environments
following the established coding guidelines.

In `@nvidia-setup/skyhook_dir/steps_check/configure_limits_check.sh`:
- Line 1: The shebang line in the configure_limits_check.sh script uses a
hardcoded path to bash which may not be portable across all systems. Replace the
shebang `#!/bin/bash` with `#!/usr/bin/env bash` at the first line of the file
to ensure the script can find and execute bash regardless of its installation
location on the system.

In `@nvidia-setup/skyhook_dir/steps_check/install_doca_check.sh`:
- Line 1: The shebang in the install_doca_check.sh script is using the absolute
path form #!/bin/bash which is not portable across different systems. Replace
the shebang line at the beginning of the file with #!/usr/bin/env bash to use
the portable env form that will locate bash regardless of its installation path,
ensuring consistency with coding guidelines.

In `@nvidia-setup/skyhook_dir/steps_check/install_lustre_oke_check.sh`:
- Line 1: Replace the shebang line at the beginning of the
install_lustre_oke_check.sh script from `#!/bin/bash` to `#!/usr/bin/env bash`.
This makes the script portable across systems where bash may be installed at
different paths by leveraging the env command to locate bash dynamically rather
than assuming it is at /bin/bash.

In `@nvidia-setup/skyhook_dir/steps_check/install_oci_hpc_packages_check.sh`:
- Line 1: The shebang line at the beginning of the
install_oci_hpc_packages_check.sh script uses the hardcoded path #!/bin/bash
which is not portable across different systems. Replace the shebang with
#!/usr/bin/env bash to follow coding guidelines and ensure portability, as this
method uses the env utility to locate bash dynamically rather than assuming it
exists at a fixed path.

In `@nvidia-setup/skyhook_dir/steps/configure_hpc_networking.sh`:
- Line 1: The shebang line at the start of configure_hpc_networking.sh uses
`#!/bin/bash` which is not portable across different systems where bash may be
installed in non-standard locations. Replace the shebang with `#!/usr/bin/env
bash` which uses the env command to locate bash in the system PATH, making the
script portable across different Unix-like systems and environments where bash
installation locations may vary.
- Line 19: The shell script is missing the `-u` flag in the strict mode
settings. In the configure_hpc_networking.sh script, update the `set -eo
pipefail` command to include the `-u` flag, changing it to `set -euo pipefail`.
This addition ensures that any references to unset variables will cause the
script to exit immediately, improving error detection and debugging.

In `@nvidia-setup/skyhook_dir/steps/configure_limits.sh`:
- Line 19: The strict mode configuration at the beginning of the
configure_limits.sh script uses `set -eo pipefail` which does not enable
checking for unset variables. Update this command to include the `-u` flag by
changing it to `set -euo pipefail`. This will enable the script to catch unset
variables and provide more comprehensive error handling aligned with the coding
guidelines for strict mode.
- Line 1: Replace the shebang line in configure_limits.sh from `#!/bin/bash` to
`#!/usr/bin/env bash` to ensure portability across systems where bash may be
installed at different paths. The shebang is the first line of the script and
should be updated to use the env command to locate bash dynamically.

In `@nvidia-setup/skyhook_dir/steps/ensure_kernel.sh`:
- Around line 78-84: The flavor extraction using parameter expansion
`${KERNEL##*linux-}` silently produces an empty string if KERNEL does not
contain "linux-", causing the subsequent grep validation to always succeed
incorrectly (since grep -q "" matches everything). Before performing the flavor
extraction in the Meta/vendor kernels validation block, add a check to ensure
that KERNEL actually contains the "linux-" substring, and if it doesn't, exit
with an appropriate error message indicating that KERNEL format is invalid. This
ensures that the validation only proceeds when flavor extraction results in a
non-empty string.

In `@nvidia-setup/skyhook_dir/steps/files/oke/fix-ofed-symlinks.sh`:
- Line 1: The shell script fix-ofed-symlinks.sh is missing the required Apache
2.0 license header at the beginning of the file. Add the full Apache 2.0 license
header from .github/license-header.tmpl to the top of the script before the
shebang line. You can run the make license-fmt command from the repository root
to automatically add the required header to this file.

In `@nvidia-setup/skyhook_dir/steps/files/oke/lustre-modules-setup`:
- Around line 1-5: The lustre-modules-setup shell script is missing the required
Apache 2.0 license header at the beginning of the file. Add the full Apache 2.0
license header from the `.github/license-header.tmpl` template file before the
shebang line. You can use the `make license-fmt` command to automatically apply
the correct license header to all shell scripts in the project.
- Around line 31-32: The issue is that `local rc=$?` captures the exit status of
the if statement itself (which is 0 when the condition is false) rather than the
actual exit code from the dkms install command. To fix this, restructure the
code to run the dkms install command directly and immediately capture its exit
code into rc before evaluating it. Then use an if statement to check if rc
equals 0 (success, return 0), or if rc equals 3 (acceptable error, continue),
otherwise exit with the captured error code. This way the actual dkms exit code
is properly evaluated instead of the if statement's exit status.

In `@nvidia-setup/skyhook_dir/steps/files/oke/oci-create-vfs`:
- Around line 15-16: All variable expansions in the script must be quoted to
prevent word-splitting issues. In the lines containing get_vf_dev_name function
call with parameters $interface and $vf_idx, and in the cat command path using
$vf_dev_name, wrap each variable expansion with double quotes (e.g., "$variable"
instead of $variable). This applies to all similar instances throughout the
script, including the sections around lines 21-22 and 49-55, ensuring every
variable expansion is properly quoted to comply with shellcheck SC2086.
- Line 2: The script is missing the `-u` flag in the set command for strict
mode. Update the set command from `set -ex -o pipefail` to `set -euo pipefail`
to ensure unset variable references are caught and the script follows proper
shell scripting guidelines for error handling.

In `@nvidia-setup/skyhook_dir/steps/files/oke/oci-sriov-vf-config`:
- Line 2: The script's initial set command at line 2 is missing the `-u` flag
for strict mode. Update the `set -e -o pipefail` statement to `set -euo
pipefail` to enable strict mode with unset variable checking. This ensures the
script will exit immediately if any undefined variables are referenced,
improving script robustness and catching potential errors early.
- Around line 11-16: Quote all variable expansions throughout the SRIOV
configuration block to prevent word-splitting issues. In the conditional check
at the start of the block, wrap $enabled and $vfs with double quotes. In the
echo statement that references the PCI device, wrap $pci with double quotes. In
the mlxconfig command invocation, wrap $pci with double quotes when passing it
as the device argument with the -d flag.

In `@nvidia-setup/skyhook_dir/steps/install_doca.sh`:
- Line 1: Replace the current shebang `#!/bin/bash` with `#!/usr/bin/env bash`
at the beginning of the script to follow strict shell conventions. Immediately
after the shebang line, add `set -euo pipefail` to enable strict mode which will
catch unset variable references and cause the script to exit on errors, ensuring
proper error handling throughout the install_doca.sh lifecycle script.

In `@nvidia-setup/skyhook_dir/steps/install_lustre_oke.sh`:
- Line 1: Replace the shebang line at the beginning of install_lustre_oke.sh
from the absolute path `/bin/bash` to `#!/usr/bin/env bash` instead. This uses
the env command to locate bash in the system PATH, making the script portable
across different systems where bash may be installed in non-standard locations.
- Line 19: The strict mode at the beginning of the install_lustre_oke.sh script
currently uses only `set -eo pipefail`, which does not catch references to unset
variables. This can hide typos in variable names that would otherwise cause
failures. Add the `-u` flag to the set command to enable strict mode for unset
variable checking, making it enforce all three error handling behaviors: exit on
errors, exit on pipe failures, and error on unset variables.

In `@nvidia-setup/skyhook_dir/steps/install_oci_hpc_packages.sh`:
- Line 1: The shebang line at the start of the install_oci_hpc_packages.sh
script uses `#!/bin/bash` which assumes bash is located at the hardcoded path
`/bin/bash`. Replace this with `#!/usr/bin/env bash` to use the portable shebang
that leverages the system's PATH to locate bash, making the script more portable
across different environments and Unix systems where bash may be installed in
different locations.
- Line 19: The set command in the install_oci_hpc_packages.sh script currently
uses `set -eo pipefail` but is missing the `-u` flag which is required by coding
guidelines. Modify the set command at the beginning of the script to include the
`-u` flag, changing it from `set -eo pipefail` to include all three flags in the
correct order to enable strict mode that catches unset variable references in
addition to errors and pipe failures.

In `@nvidia-setup/VERSION_OVERVIEW.md`:
- Around line 30-31: There is no blank line between the OKE table (ending with
the row containing gb200 and multiple Y values) and the 0.3.x heading. Insert a
blank line between the table that ends on line 30 and the heading "# 0.3.x" to
maintain proper markdown formatting and comply with the requirement to have
blank lines around headings as per the coding guidelines.

In `@nvidia-tuned/profiles/service/oke/script.sh`:
- Around line 1-7: The script does not conform to repository guidelines for
lifecycle scripts. Replace the shebang `#!/bin/bash` with `#!/usr/bin/env bash`
on the first line. Update the strict mode from `set -e` to `set -euo pipefail`
to enforce additional safety checks (undefined variable handling and pipe
failure detection) as required by the repository standards for lifecycle
scripts.

In `@tests/integration/nvidia_setup/fixtures/resolve_kernel_wrapper.sh`:
- Line 1: The shebang line at the beginning of the resolve_kernel_wrapper.sh
script uses #!/bin/bash, which is not portable across different systems. Replace
the shebang #!/bin/bash with #!/usr/bin/env bash to follow repository guidelines
and ensure better portability across different Unix-like systems where bash may
be installed in different locations.
- Around line 3-17: The file contains duplicate license headers with both the
SPDX short form (the SPDX-FileCopyrightText and SPDX-License-Identifier
comments) and the full Apache License 2.0 text block. Per repository guidelines,
keep only the SPDX short form header and remove the blank line and the entire
full Apache license block starting from "Licensed under the Apache License"
through the final "limitations under the License" closing comment. Ensure only
the SPDX header comments remain at the top of the resolve_kernel_wrapper.sh
file.

In `@tests/integration/nvidia_setup/test_apply_oke.py`:
- Line 164: The assertion on line 164 combines two conditions with an "and"
operator, which makes it unclear which condition failed when the test fails.
Split this compound assertion into two separate assert statements, one checking
that "memlock" is in conf and another checking that "unlimited" is in conf, so
that each condition produces a distinct failure message for easier debugging.
- Around line 3-17: The test_apply_oke.py file contains duplicate license
headers with both the SPDX short form at the beginning and a full Apache 2.0
license block below it. Per repository guidelines, only the SPDX short form
header should be retained. Remove the blank line following the SPDX header and
the entire Apache 2.0 license block (all lines from "Licensed under the Apache
License" through "limitations under the License."), keeping only the SPDX
FileCopyrightText and Identifier comment lines at the top.

---

Outside diff comments:
In `@nvidia-setup/skyhook_dir/load_defaults.sh`:
- Around line 19-23: The comment block that describes the contract for
load_defaults.sh (beginning with "Load nvidia-setup defaults and env overrides")
currently lists only the core variables in the "Sets and exports:" section, but
additional OKE-specific variables are being exported in the later section of the
file (around lines 46-50). Update the "Sets and exports:" line in the comment
block to include all the OKE-specific variables that are actually exported to
keep the documentation in sync with the actual code behavior.

In `@nvidia-setup/skyhook_dir/steps_check/configure_chrony_check.sh`:
- Line 1: The shebang line at the beginning of the script uses a hardcoded path
`#!/bin/bash` which is not portable across systems where bash may be installed
at different locations. Replace the shebang `#!/bin/bash` with `#!/usr/bin/env
bash` so that the script will automatically locate bash from the system's PATH,
ensuring compatibility across different Unix-like systems regardless of where
bash is installed.

In `@nvidia-setup/skyhook_dir/steps/configure-chrony.sh`:
- Line 1: Replace the shebang line in configure-chrony.sh with the portable
version that uses env to locate bash. Instead of using the hardcoded path
#!/bin/bash, change it to #!/usr/bin/env bash. This ensures the script will find
the bash interpreter regardless of where it's installed on the system, making
the script work across different system configurations.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 51632c0d-e201-4d5e-8487-206f6755e474

📥 Commits

Reviewing files that changed from the base of the PR and between 46116a7 and 675c01e.

⛔ Files ignored due to path filters (2)
  • nvidia-setup/CHANGELOG.md is excluded by !**/CHANGELOG.md
  • nvidia-tuned/CHANGELOG.md is excluded by !**/CHANGELOG.md
📒 Files selected for processing (39)
  • nvidia-setup/README.md
  • nvidia-setup/VERSION_OVERVIEW.md
  • nvidia-setup/config.json
  • nvidia-setup/skyhook_dir/apply.sh
  • nvidia-setup/skyhook_dir/apply_check.sh
  • nvidia-setup/skyhook_dir/defaults/eks-gb200.conf
  • nvidia-setup/skyhook_dir/defaults/eks-h100.conf
  • nvidia-setup/skyhook_dir/defaults/oke-gb200.conf
  • nvidia-setup/skyhook_dir/defaults/oke-h100.conf
  • nvidia-setup/skyhook_dir/load_defaults.sh
  • nvidia-setup/skyhook_dir/steps/configure-chrony.sh
  • nvidia-setup/skyhook_dir/steps/configure_hpc_networking.sh
  • nvidia-setup/skyhook_dir/steps/configure_limits.sh
  • nvidia-setup/skyhook_dir/steps/ensure_kernel.sh
  • nvidia-setup/skyhook_dir/steps/files/oke/99-oci-network-mlx.rules
  • nvidia-setup/skyhook_dir/steps/files/oke/fix-ofed-symlinks.sh
  • nvidia-setup/skyhook_dir/steps/files/oke/lustre-modules-setup
  • nvidia-setup/skyhook_dir/steps/files/oke/lustre-modules-setup.service
  • nvidia-setup/skyhook_dir/steps/files/oke/oci-create-vfs
  • nvidia-setup/skyhook_dir/steps/files/oke/oci-sriov-vf-config
  • nvidia-setup/skyhook_dir/steps/files/oke/rdma_network.json
  • nvidia-setup/skyhook_dir/steps/install_doca.sh
  • nvidia-setup/skyhook_dir/steps/install_kernel.sh
  • nvidia-setup/skyhook_dir/steps/install_lustre_oke.sh
  • nvidia-setup/skyhook_dir/steps/install_oci_hpc_packages.sh
  • nvidia-setup/skyhook_dir/steps_check/configure_chrony_check.sh
  • nvidia-setup/skyhook_dir/steps_check/configure_hpc_networking_check.sh
  • nvidia-setup/skyhook_dir/steps_check/configure_limits_check.sh
  • nvidia-setup/skyhook_dir/steps_check/install_doca_check.sh
  • nvidia-setup/skyhook_dir/steps_check/install_lustre_oke_check.sh
  • nvidia-setup/skyhook_dir/steps_check/install_oci_hpc_packages_check.sh
  • nvidia-setup/skyhook_dir/utilities.sh
  • nvidia-tuned/README.md
  • nvidia-tuned/config.json
  • nvidia-tuned/profiles/service/oke/script.sh
  • nvidia-tuned/profiles/service/oke/tuned.conf.template
  • tests/integration/nvidia_setup/fixtures/resolve_kernel_wrapper.sh
  • tests/integration/nvidia_setup/test_apply_oke.py
  • tests/integration/nvidia_tuned/test_prepare_nvidia_profiles.py

Comment thread nvidia-setup/README.md
1. **upgrade** – `apt-get update && apt-get upgrade -y`
2. **install_doca** – Mellanox DOCA/OFED + `mstflint` from the DOCA repo (idempotent; fixes OFED symlinks)
3. **install_oci_hpc_packages** – downloads + installs the OCI HPC `network-device-names` and `nvidia-gpu-configure` debs (matches Oracle's modern `use_plugins` set)
4. **configure_hpc_networking** – drops the OCI RDMA udev rules, SR-IOV VF helpers, and the OCA `rdma_network.json`

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Fix OCI acronym typo in OKE networking step.

Line 94 says OCA rdma_network.json; this should be OCI to match the rest of the OKE/OCI documentation and avoid confusion.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nvidia-setup/README.md` at line 94, The configure_hpc_networking section in
the README.md contains an incorrect acronym "OCA" that should be "OCI" for
consistency with OKE and OCI documentation standards. Locate the line describing
the configure_hpc_networking step that mentions "OCA rdma_network.json" and
change "OCA" to "OCI" to match the proper Oracle Cloud Infrastructure
terminology used throughout the documentation.

LUSTRE_DKMS_NAME=lustre-client-modules
LUSTRE_DKMS_VERSION=2.15.90-v35817661
LUSTRE_PKG_VERSION=2.15.90-v35817661-1
LUSTRE_ARTIFACTS_URL=REPLACE_WITH_OCI_LUSTRE_DEB_URL

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Replace the Lustre artifacts placeholder with a deployable value.

LUSTRE_ARTIFACTS_URL is still a placeholder, so OKE GB200 Lustre bootstrap cannot fetch packages in a real run.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nvidia-setup/skyhook_dir/defaults/oke-gb200.conf` at line 11, The
LUSTRE_ARTIFACTS_URL variable in oke-gb200.conf is currently set to a
placeholder string REPLACE_WITH_OCI_LUSTRE_DEB_URL instead of an actual
deployable URL. Replace this placeholder value with the correct OCI Lustre DEB
package repository URL so that the OKE GB200 Lustre bootstrap can properly fetch
packages during deployment.

LUSTRE_DKMS_NAME=lustre-client-modules
LUSTRE_DKMS_VERSION=2.15.90-v35817661
LUSTRE_PKG_VERSION=2.15.90-v35817661-1
LUSTRE_ARTIFACTS_URL=REPLACE_WITH_OCI_LUSTRE_DEB_URL

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Do not ship OKE H100 defaults with a Lustre URL placeholder.

Line 11 keeps a non-resolvable placeholder, which breaks the default Lustre package retrieval path.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nvidia-setup/skyhook_dir/defaults/oke-h100.conf` at line 11, Replace the
placeholder value in the LUSTRE_ARTIFACTS_URL variable with an actual,
resolvable OCI Lustre DEB package URL. The current placeholder value
REPLACE_WITH_OCI_LUSTRE_DEB_URL needs to be replaced with the correct URL that
points to the OCI Lustre DEB artifacts repository so that the default Lustre
package retrieval works properly in the OKE H100 configuration.

@@ -0,0 +1,30 @@
#!/bin/bash

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Use the portable shebang for consistency.

The shebang should be #!/usr/bin/env bash per coding guidelines. As per coding guidelines, all shell scripts should use the portable env form to locate bash.

🔧 Proposed fix
-#!/bin/bash
+#!/usr/bin/env bash
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
#!/bin/bash
#!/usr/bin/env bash
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nvidia-setup/skyhook_dir/steps_check/configure_hpc_networking_check.sh` at
line 1, The shebang at the start of the configure_hpc_networking_check.sh script
uses the direct bash path form which is not portable across different systems.
Replace the current shebang `#!/bin/bash` with the portable form `#!/usr/bin/env
bash` which uses the env utility to locate bash in the system's PATH, ensuring
the script works consistently across different environments following the
established coding guidelines.

Source: Coding guidelines

@@ -0,0 +1,24 @@
#!/bin/bash

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win

Use #!/usr/bin/env bash shebang instead of /bin/bash.

This makes the script portable across systems where bash may be installed at different paths.

As per coding guidelines: Shell scripts must use shebang #!/usr/bin/env bash.

🐛 Fix shebang
-#!/bin/bash
+#!/usr/bin/env bash
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
#!/bin/bash
#!/usr/bin/env bash
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nvidia-setup/skyhook_dir/steps_check/configure_limits_check.sh` at line 1,
The shebang line in the configure_limits_check.sh script uses a hardcoded path
to bash which may not be portable across all systems. Replace the shebang
`#!/bin/bash` with `#!/usr/bin/env bash` at the first line of the file to ensure
the script can find and execute bash regardless of its installation location on
the system.

Source: Coding guidelines

Comment on lines +1 to +7
#!/bin/bash

# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

# TuneD script plugin lifecycle: start | stop [full_rollback] | verify [ignore_missing]
set -e

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win

Update shebang and strict mode to match repository guidelines.

The script uses #!/bin/bash and set -e, but repository guidelines require #!/usr/bin/env bash and set -euo pipefail for all lifecycle scripts (excluding -check scripts).

🔧 Proposed fix
-#!/bin/bash
+#!/usr/bin/env bash

 # SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0

 # TuneD script plugin lifecycle: start | stop [full_rollback] | verify [ignore_missing]
-set -e
+set -euo pipefail
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nvidia-tuned/profiles/service/oke/script.sh` around lines 1 - 7, The script
does not conform to repository guidelines for lifecycle scripts. Replace the
shebang `#!/bin/bash` with `#!/usr/bin/env bash` on the first line. Update the
strict mode from `set -e` to `set -euo pipefail` to enforce additional safety
checks (undefined variable handling and pipe failure detection) as required by
the repository standards for lifecycle scripts.

Source: Coding guidelines

@@ -0,0 +1,20 @@
#!/bin/bash

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Use portable shebang.

The script uses #!/bin/bash but repository guidelines require #!/usr/bin/env bash for portability.

🔧 Proposed fix
-#!/bin/bash
+#!/usr/bin/env bash
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
#!/bin/bash
#!/usr/bin/env bash
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration/nvidia_setup/fixtures/resolve_kernel_wrapper.sh` at line 1,
The shebang line at the beginning of the resolve_kernel_wrapper.sh script uses
#!/bin/bash, which is not portable across different systems. Replace the shebang
#!/bin/bash with #!/usr/bin/env bash to follow repository guidelines and ensure
better portability across different Unix-like systems where bash may be
installed in different locations.

Source: Coding guidelines

Comment on lines +3 to +17
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🔴 Critical | ⚡ Quick win

Remove duplicate license header.

This file has both the SPDX short form (lines 3-4) and the full Apache 2.0 license block (lines 7-17). Per repository guidelines, source files should carry only the SPDX short form added by make license-fmt, and CI will fail on duplicated headers.

Remove lines 5-17 (the blank line and full Apache block), keeping only the SPDX header at lines 3-4.

🔧 Proposed fix
 # SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
-#
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.

 . "${SKYHOOK_DIR}/skyhook_dir/utilities.sh"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
. "${SKYHOOK_DIR}/skyhook_dir/utilities.sh"
resolve_full_kernel "$1"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration/nvidia_setup/fixtures/resolve_kernel_wrapper.sh` around
lines 3 - 17, The file contains duplicate license headers with both the SPDX
short form (the SPDX-FileCopyrightText and SPDX-License-Identifier comments) and
the full Apache License 2.0 text block. Per repository guidelines, keep only the
SPDX short form header and remove the blank line and the entire full Apache
license block starting from "Licensed under the Apache License" through the
final "limitations under the License" closing comment. Ensure only the SPDX
header comments remain at the top of the resolve_kernel_wrapper.sh file.

Source: Coding guidelines

Comment on lines +3 to +17
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🔴 Critical | ⚡ Quick win

Remove duplicate license header.

This file has both the SPDX short form (lines 3-4) and the full Apache 2.0 license block (lines 7-17). Per repository guidelines, source files should carry only the SPDX short form, and CI will fail on duplicated headers.

Remove lines 5-17 (the blank line and full Apache block), keeping only the SPDX header at lines 3-4.

🔧 Proposed fix
 # SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
-#
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.

 """Tests for nvidia-setup oke flavor."""
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Tests for nvidia-setup oke flavor."""
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration/nvidia_setup/test_apply_oke.py` around lines 3 - 17, The
test_apply_oke.py file contains duplicate license headers with both the SPDX
short form at the beginning and a full Apache 2.0 license block below it. Per
repository guidelines, only the SPDX short form header should be retained.
Remove the blank line following the SPDX header and the entire Apache 2.0
license block (all lines from "Licensed under the Apache License" through
"limitations under the License."), keeping only the SPDX FileCopyrightText and
Identifier comment lines at the top.

Source: Coding guidelines

)
assert_exit_code(result, 0)
conf = runner.get_file_contents("/etc/security/limits.d/99-oci-hpc.conf")
assert "memlock" in conf and "unlimited" in conf

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Break down compound assertion.

The assertion on line 164 checks two conditions in one statement, making it harder to debug which part failed. Split into separate assertions for clearer failure messages.

♻️ Proposed refactor
         conf = runner.get_file_contents("/etc/security/limits.d/99-oci-hpc.conf")
-        assert "memlock" in conf and "unlimited" in conf
+        assert "memlock" in conf
+        assert "unlimited" in conf
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
assert "memlock" in conf and "unlimited" in conf
conf = runner.get_file_contents("/etc/security/limits.d/99-oci-hpc.conf")
assert "memlock" in conf
assert "unlimited" in conf
🧰 Tools
🪛 Ruff (0.15.18)

[warning] 164-164: Assertion should be broken down into multiple parts

Break down assertion into multiple parts

(PT018)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration/nvidia_setup/test_apply_oke.py` at line 164, The assertion
on line 164 combines two conditions with an "and" operator, which makes it
unclear which condition failed when the test fails. Split this compound
assertion into two separate assert statements, one checking that "memlock" is in
conf and another checking that "unlimited" is in conf, so that each condition
produces a distinct failure message for easier debugging.

Source: Linters/SAST tools

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

doc Documentation addition, correction, or improvement package/nvidia-setup package/nvidia-tuned tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants