Skip to content

feat(nvidia-setup): EKS h100/gb200 → 6.17.0-1017-aws kernel + EFA 1.48.0#53

Open
ayuskauskas wants to merge 4 commits into
mainfrom
worktree-eks-update
Open

feat(nvidia-setup): EKS h100/gb200 → 6.17.0-1017-aws kernel + EFA 1.48.0#53
ayuskauskas wants to merge 4 commits into
mainfrom
worktree-eks-update

Conversation

@ayuskauskas

Copy link
Copy Markdown
Collaborator

Summary

  • Bump EKS h100/gb200 defaults to kernel 6.17.0-1017-aws and EFA 1.48.0 (skyhook_dir/defaults/eks-h100.conf, eks-gb200.conf).
  • resolve_full_kernel now appends the -64k page-size variant on arm64, so GB200 (Grace) installs 6.17.0-1017-aws-64k while H100 (x86_64) stays on 6.17.0-1017-aws. This fixes a latent bug: the old *-aws* dedup branch short-circuited the 64k logic, so GB200 was silently resolving to the non-64k flavor.
  • Docs: cut a 0.4.0 CHANGELOG section; relabel the already-tagged bcm feature as 0.3.0 - 2026-05-29 (it had been left under [Unreleased]); add a 0.4.x section to VERSION_OVERVIEW.

Test Plan

  • shellcheck skyhook_dir/utilities.sh — clean
  • resolve_full_kernel unit checks: x86_64 → 6.17.0-1017-aws, aarch64 → 6.17.0-1017-aws-64k, idempotent when base already ends in -64k
  • Docker-based integration suite (run separately, per usual workflow)

…nd EFA 1.48.0

- Bump eks-h100/eks-gb200 conf defaults to kernel 6.17.0-1017-aws, EFA 1.48.0
- resolve_full_kernel: append the -64k page-size variant on arm64 so GB200
  (Grace) installs 6.17.0-1017-aws-64k while H100 stays on 6.17.0-1017-aws.
  The old *-aws* dedup branch short-circuited the 64k logic, so GB200 was
  silently getting the non-64k flavor
- docs: add 0.4.0 to CHANGELOG, relabel the already-tagged bcm feature as
  0.3.0, and add a 0.4.x VERSION_OVERVIEW section
@ayuskauskas ayuskauskas self-assigned this Jun 10, 2026
@ayuskauskas ayuskauskas added the enhancement New feature or request label Jun 10, 2026
@github-actions github-actions Bot added doc Documentation addition, correction, or improvement package/nvidia-setup labels Jun 12, 2026
@coderabbitai

coderabbitai Bot commented Jun 12, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

This PR updates EKS GPU kernel and EFA defaults, changes kernel flavor resolution to use architecture-based -64k handling, and adds dpkg-healing retry logic to kernel package installation. The H100 and GB200 defaults are bumped to kernel 6.17.0-1019-aws and EFA 1.48.0. Documentation is updated to reflect the new defaults and the GB200 BCM matrix entry.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed It describes the main EKS kernel/EFA bump, though it omits the utility and docs updates.
Description check ✅ Passed It is clearly about the same EKS defaults, kernel resolution, and docs changes in this PR.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch worktree-eks-update

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@nvidia-setup/README.md`:
- Line 22: The table entry for GB200 is ambiguous because resolve_full_kernel()
appends "-64k" for arm64; update the table row that currently shows
"6.17.0-1017-aws" to the effective resolved value "6.17.0-1017-aws-64k" (or add
a one-line footnote explaining that resolve_full_kernel() appends "-64k" for
arm64 kernels) so readers see the exact kernel string used by the GB200 support
matrix.

In `@nvidia-setup/VERSION_OVERVIEW.md`:
- Line 26: The new top-level heading "# 0.4.x" should be demoted to a level-2
heading to keep a single H1 in the file; locate the heading text "0.4.x" (and
any other version headings) and replace the leading single "#" with "##" so all
version sections use H2 while preserving the existing single H1 at the top of
VERSION_OVERVIEW.md.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 2daf74d2-1ce0-418b-88a3-0efc1dbd834b

📥 Commits

Reviewing files that changed from the base of the PR and between 6b796b7 and fed1c9f.

⛔ Files ignored due to path filters (1)
  • nvidia-setup/CHANGELOG.md is excluded by !**/CHANGELOG.md
📒 Files selected for processing (5)
  • nvidia-setup/README.md
  • nvidia-setup/VERSION_OVERVIEW.md
  • nvidia-setup/skyhook_dir/defaults/eks-gb200.conf
  • nvidia-setup/skyhook_dir/defaults/eks-h100.conf
  • nvidia-setup/skyhook_dir/utilities.sh

Comment thread nvidia-setup/README.md Outdated
| eks | h100 | 6.14.0-1018-aws | 1.47.0 |
| eks | gb200 | 6.14.0-1018-aws | 1.47.0 |
| eks | h100 | 6.17.0-1017-aws | 1.48.0 |
| eks | gb200 | 6.17.0-1017-aws | 1.48.0 |

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Clarify the effective GB200 kernel value in the supported combinations table.

Line 22 lists 6.17.0-1017-aws, but resolve_full_kernel() now resolves arm64 to 6.17.0-1017-aws-64k. Please make the table explicit about the resolved GB200 value, or add a short note that arm64 appends -64k.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nvidia-setup/README.md` at line 22, The table entry for GB200 is ambiguous
because resolve_full_kernel() appends "-64k" for arm64; update the table row
that currently shows "6.17.0-1017-aws" to the effective resolved value
"6.17.0-1017-aws-64k" (or add a one-line footnote explaining that
resolve_full_kernel() appends "-64k" for arm64 kernels) so readers see the exact
kernel string used by the GB200 support matrix.

| bcm | h100 | n/a | n/a | N | N | N | Y |
| bcm | gb200 | n/a | n/a | N | N | N | Y |

# 0.4.x

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Use a single H1 for this file; make version sections H2.

Line 26 adds another top-level # heading. Please keep one H1 in the file and use ## for version sections, including this new 0.4.x section.

As per coding guidelines, "Keep markdown well-formed: single # H1 per file, nested headings in order without skipped levels."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nvidia-setup/VERSION_OVERVIEW.md` at line 26, The new top-level heading "#
0.4.x" should be demoted to a level-2 heading to keep a single H1 in the file;
locate the heading text "0.4.x" (and any other version headings) and replace the
leading single "#" with "##" so all version sections use H2 while preserving the
existing single H1 at the top of VERSION_OVERVIEW.md.

Source: Coding guidelines

…dpkg errors

Bump EKS h100/gb200 kernel defaults from 6.17.0-1017-aws to 6.17.0-1019-aws
across defaults, docs, and CHANGELOG.

Wrap the kernel apt-get install in apt_install_with_dpkg_heal: on a dpkg
error (e.g. dpkg left half-configured / "dpkg was interrupted"), run
`dpkg --configure -a` and retry the install once instead of failing the step.

Signed-off-by: Alex Yuskauskas <ayuskauskas@nvidia.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
nvidia-setup/README.md (1)

22-22: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Keep the GB200 entry aligned with the resolved arm64 kernel.

Line 22 still shows 6.17.0-1019-aws, but GB200 resolves to 6.17.0-1019-aws-64k. Please show the resolved string here, or add a note that arm64 appends -64k.

Proposed fix
-| eks     | gb200       | 6.17.0-1019-aws     |  1.48.0      |
+| eks     | gb200       | 6.17.0-1019-aws-64k |  1.48.0      |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nvidia-setup/README.md` at line 22, The GB200 README entry is still showing
the unresolved kernel string, so update the table row for the gb200 platform to
match the actual resolved arm64 kernel name used by the setup flow, or add a
note that arm64 appends the -64k suffix. Use the existing GB200/eks README entry
as the anchor and keep the documented kernel value aligned with what the
resolver returns.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@nvidia-setup/skyhook_dir/steps/install_kernel.sh`:
- Around line 40-61: Tighten the retry condition in apt_install_with_dpkg_heal
so it only triggers on the interrupted-dpkg signatures described in the comment,
not any output containing “dpkg”. Update the grep match in
apt_install_with_dpkg_heal to look for the specific failure phrases (such as
“dpkg was interrupted” or “Sub-process /usr/bin/dpkg returned an error code”)
before running dpkg --configure -a and retrying apt-get install.

---

Duplicate comments:
In `@nvidia-setup/README.md`:
- Line 22: The GB200 README entry is still showing the unresolved kernel string,
so update the table row for the gb200 platform to match the actual resolved
arm64 kernel name used by the setup flow, or add a note that arm64 appends the
-64k suffix. Use the existing GB200/eks README entry as the anchor and keep the
documented kernel value aligned with what the resolver returns.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 7e8b5433-99ab-4672-bdee-3d93c1c7e2e8

📥 Commits

Reviewing files that changed from the base of the PR and between fed1c9f and 858eb33.

⛔ Files ignored due to path filters (1)
  • nvidia-setup/CHANGELOG.md is excluded by !**/CHANGELOG.md
📒 Files selected for processing (6)
  • nvidia-setup/README.md
  • nvidia-setup/VERSION_OVERVIEW.md
  • nvidia-setup/skyhook_dir/defaults/eks-gb200.conf
  • nvidia-setup/skyhook_dir/defaults/eks-h100.conf
  • nvidia-setup/skyhook_dir/steps/install_kernel.sh
  • nvidia-setup/skyhook_dir/utilities.sh

Comment on lines +40 to +61
# Run `apt-get install` and self-heal from an interrupted dpkg state. If the
# install fails because dpkg was left half-configured (the classic
# "E: dpkg was interrupted, you must manually run 'dpkg --configure -a'" or
# "Sub-process /usr/bin/dpkg returned an error code" failures), repair the dpkg
# database with `dpkg --configure -a` and retry the install once.
apt_install_with_dpkg_heal() {
local output status
output="$(apt-get install -y "$@" 2>&1)" && status=0 || status=$?
printf '%s\n' "${output}"

if [[ "${status}" -eq 0 ]]; then
return 0
fi

if printf '%s\n' "${output}" | grep -qi 'dpkg'; then
echo "apt-get install failed with a dpkg error; running 'dpkg --configure -a' and retrying..." >&2
dpkg --configure -a
apt-get install -y "$@"
else
return "${status}"
fi
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Tighten the dpkg-failure detection match.

The docstring calls out two specific dpkg failure signatures, but grep -qi 'dpkg' matches any occurrence of the word in the captured output, including unrelated dependency/conflict errors that happen to mention "dpkg" (e.g. "dpkg: dependency problems prevent configuration"). In those cases the heal+retry message is misleading and the real root cause (a dependency problem, not an interrupted dpkg state) gets masked behind one wasted retry. Low risk since dpkg --configure -a + retry is harmless to run, but the diagnostic intent stated in the comment isn't actually what's implemented.

🩹 Proposed tighter match
-  if printf '%s\n' "${output}" | grep -qi 'dpkg'; then
+  if printf '%s\n' "${output}" | grep -qiE 'dpkg was interrupted|dpkg returned an error code'; then
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Run `apt-get install` and self-heal from an interrupted dpkg state. If the
# install fails because dpkg was left half-configured (the classic
# "E: dpkg was interrupted, you must manually run 'dpkg --configure -a'" or
# "Sub-process /usr/bin/dpkg returned an error code" failures), repair the dpkg
# database with `dpkg --configure -a` and retry the install once.
apt_install_with_dpkg_heal() {
local output status
output="$(apt-get install -y "$@" 2>&1)" && status=0 || status=$?
printf '%s\n' "${output}"
if [[ "${status}" -eq 0 ]]; then
return 0
fi
if printf '%s\n' "${output}" | grep -qi 'dpkg'; then
echo "apt-get install failed with a dpkg error; running 'dpkg --configure -a' and retrying..." >&2
dpkg --configure -a
apt-get install -y "$@"
else
return "${status}"
fi
}
# Run `apt-get install` and self-heal from an interrupted dpkg state. If the
# install fails because dpkg was left half-configured (the classic
# "E: dpkg was interrupted, you must manually run 'dpkg --configure -a'" or
# "Sub-process /usr/bin/dpkg returned an error code" failures), repair the dpkg
# database with `dpkg --configure -a` and retry the install once.
apt_install_with_dpkg_heal() {
local output status
output="$(apt-get install -y "$@" 2>&1)" && status=0 || status=$?
printf '%s\n' "${output}"
if [[ "${status}" -eq 0 ]]; then
return 0
fi
if printf '%s\n' "${output}" | grep -qiE 'dpkg was interrupted|dpkg returned an error code'; then
echo "apt-get install failed with a dpkg error; running 'dpkg --configure -a' and retrying..." >&2
dpkg --configure -a
apt-get install -y "$@"
else
return "${status}"
fi
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nvidia-setup/skyhook_dir/steps/install_kernel.sh` around lines 40 - 61,
Tighten the retry condition in apt_install_with_dpkg_heal so it only triggers on
the interrupted-dpkg signatures described in the comment, not any output
containing “dpkg”. Update the grep match in apt_install_with_dpkg_heal to look
for the specific failure phrases (such as “dpkg was interrupted” or “Sub-process
/usr/bin/dpkg returned an error code”) before running dpkg --configure -a and
retrying apt-get install.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

doc Documentation addition, correction, or improvement enhancement New feature or request package/nvidia-setup

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants