feat(nvidia-setup): EKS h100/gb200 → 6.17.0-1017-aws kernel + EFA 1.48.0#53
feat(nvidia-setup): EKS h100/gb200 → 6.17.0-1017-aws kernel + EFA 1.48.0#53ayuskauskas wants to merge 4 commits into
Conversation
…nd EFA 1.48.0 - Bump eks-h100/eks-gb200 conf defaults to kernel 6.17.0-1017-aws, EFA 1.48.0 - resolve_full_kernel: append the -64k page-size variant on arm64 so GB200 (Grace) installs 6.17.0-1017-aws-64k while H100 stays on 6.17.0-1017-aws. The old *-aws* dedup branch short-circuited the 64k logic, so GB200 was silently getting the non-64k flavor - docs: add 0.4.0 to CHANGELOG, relabel the already-tagged bcm feature as 0.3.0, and add a 0.4.x VERSION_OVERVIEW section
📝 WalkthroughWalkthroughThis PR updates EKS GPU kernel and EFA defaults, changes kernel flavor resolution to use architecture-based Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@nvidia-setup/README.md`:
- Line 22: The table entry for GB200 is ambiguous because resolve_full_kernel()
appends "-64k" for arm64; update the table row that currently shows
"6.17.0-1017-aws" to the effective resolved value "6.17.0-1017-aws-64k" (or add
a one-line footnote explaining that resolve_full_kernel() appends "-64k" for
arm64 kernels) so readers see the exact kernel string used by the GB200 support
matrix.
In `@nvidia-setup/VERSION_OVERVIEW.md`:
- Line 26: The new top-level heading "# 0.4.x" should be demoted to a level-2
heading to keep a single H1 in the file; locate the heading text "0.4.x" (and
any other version headings) and replace the leading single "#" with "##" so all
version sections use H2 while preserving the existing single H1 at the top of
VERSION_OVERVIEW.md.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 2daf74d2-1ce0-418b-88a3-0efc1dbd834b
⛔ Files ignored due to path filters (1)
nvidia-setup/CHANGELOG.mdis excluded by!**/CHANGELOG.md
📒 Files selected for processing (5)
nvidia-setup/README.mdnvidia-setup/VERSION_OVERVIEW.mdnvidia-setup/skyhook_dir/defaults/eks-gb200.confnvidia-setup/skyhook_dir/defaults/eks-h100.confnvidia-setup/skyhook_dir/utilities.sh
| | eks | h100 | 6.14.0-1018-aws | 1.47.0 | | ||
| | eks | gb200 | 6.14.0-1018-aws | 1.47.0 | | ||
| | eks | h100 | 6.17.0-1017-aws | 1.48.0 | | ||
| | eks | gb200 | 6.17.0-1017-aws | 1.48.0 | |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win
Clarify the effective GB200 kernel value in the supported combinations table.
Line 22 lists 6.17.0-1017-aws, but resolve_full_kernel() now resolves arm64 to 6.17.0-1017-aws-64k. Please make the table explicit about the resolved GB200 value, or add a short note that arm64 appends -64k.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@nvidia-setup/README.md` at line 22, The table entry for GB200 is ambiguous
because resolve_full_kernel() appends "-64k" for arm64; update the table row
that currently shows "6.17.0-1017-aws" to the effective resolved value
"6.17.0-1017-aws-64k" (or add a one-line footnote explaining that
resolve_full_kernel() appends "-64k" for arm64 kernels) so readers see the exact
kernel string used by the GB200 support matrix.
| | bcm | h100 | n/a | n/a | N | N | N | Y | | ||
| | bcm | gb200 | n/a | n/a | N | N | N | Y | | ||
|
|
||
| # 0.4.x |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win
Use a single H1 for this file; make version sections H2.
Line 26 adds another top-level # heading. Please keep one H1 in the file and use ## for version sections, including this new 0.4.x section.
As per coding guidelines, "Keep markdown well-formed: single # H1 per file, nested headings in order without skipped levels."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@nvidia-setup/VERSION_OVERVIEW.md` at line 26, The new top-level heading "#
0.4.x" should be demoted to a level-2 heading to keep a single H1 in the file;
locate the heading text "0.4.x" (and any other version headings) and replace the
leading single "#" with "##" so all version sections use H2 while preserving the
existing single H1 at the top of VERSION_OVERVIEW.md.
Source: Coding guidelines
…dpkg errors Bump EKS h100/gb200 kernel defaults from 6.17.0-1017-aws to 6.17.0-1019-aws across defaults, docs, and CHANGELOG. Wrap the kernel apt-get install in apt_install_with_dpkg_heal: on a dpkg error (e.g. dpkg left half-configured / "dpkg was interrupted"), run `dpkg --configure -a` and retry the install once instead of failing the step. Signed-off-by: Alex Yuskauskas <ayuskauskas@nvidia.com>
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
nvidia-setup/README.md (1)
22-22: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick winKeep the GB200 entry aligned with the resolved arm64 kernel.
Line 22 still shows
6.17.0-1019-aws, but GB200 resolves to6.17.0-1019-aws-64k. Please show the resolved string here, or add a note that arm64 appends-64k.Proposed fix
-| eks | gb200 | 6.17.0-1019-aws | 1.48.0 | +| eks | gb200 | 6.17.0-1019-aws-64k | 1.48.0 |🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@nvidia-setup/README.md` at line 22, The GB200 README entry is still showing the unresolved kernel string, so update the table row for the gb200 platform to match the actual resolved arm64 kernel name used by the setup flow, or add a note that arm64 appends the -64k suffix. Use the existing GB200/eks README entry as the anchor and keep the documented kernel value aligned with what the resolver returns.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@nvidia-setup/skyhook_dir/steps/install_kernel.sh`:
- Around line 40-61: Tighten the retry condition in apt_install_with_dpkg_heal
so it only triggers on the interrupted-dpkg signatures described in the comment,
not any output containing “dpkg”. Update the grep match in
apt_install_with_dpkg_heal to look for the specific failure phrases (such as
“dpkg was interrupted” or “Sub-process /usr/bin/dpkg returned an error code”)
before running dpkg --configure -a and retrying apt-get install.
---
Duplicate comments:
In `@nvidia-setup/README.md`:
- Line 22: The GB200 README entry is still showing the unresolved kernel string,
so update the table row for the gb200 platform to match the actual resolved
arm64 kernel name used by the setup flow, or add a note that arm64 appends the
-64k suffix. Use the existing GB200/eks README entry as the anchor and keep the
documented kernel value aligned with what the resolver returns.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 7e8b5433-99ab-4672-bdee-3d93c1c7e2e8
⛔ Files ignored due to path filters (1)
nvidia-setup/CHANGELOG.mdis excluded by!**/CHANGELOG.md
📒 Files selected for processing (6)
nvidia-setup/README.mdnvidia-setup/VERSION_OVERVIEW.mdnvidia-setup/skyhook_dir/defaults/eks-gb200.confnvidia-setup/skyhook_dir/defaults/eks-h100.confnvidia-setup/skyhook_dir/steps/install_kernel.shnvidia-setup/skyhook_dir/utilities.sh
| # Run `apt-get install` and self-heal from an interrupted dpkg state. If the | ||
| # install fails because dpkg was left half-configured (the classic | ||
| # "E: dpkg was interrupted, you must manually run 'dpkg --configure -a'" or | ||
| # "Sub-process /usr/bin/dpkg returned an error code" failures), repair the dpkg | ||
| # database with `dpkg --configure -a` and retry the install once. | ||
| apt_install_with_dpkg_heal() { | ||
| local output status | ||
| output="$(apt-get install -y "$@" 2>&1)" && status=0 || status=$? | ||
| printf '%s\n' "${output}" | ||
|
|
||
| if [[ "${status}" -eq 0 ]]; then | ||
| return 0 | ||
| fi | ||
|
|
||
| if printf '%s\n' "${output}" | grep -qi 'dpkg'; then | ||
| echo "apt-get install failed with a dpkg error; running 'dpkg --configure -a' and retrying..." >&2 | ||
| dpkg --configure -a | ||
| apt-get install -y "$@" | ||
| else | ||
| return "${status}" | ||
| fi | ||
| } |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win
Tighten the dpkg-failure detection match.
The docstring calls out two specific dpkg failure signatures, but grep -qi 'dpkg' matches any occurrence of the word in the captured output, including unrelated dependency/conflict errors that happen to mention "dpkg" (e.g. "dpkg: dependency problems prevent configuration"). In those cases the heal+retry message is misleading and the real root cause (a dependency problem, not an interrupted dpkg state) gets masked behind one wasted retry. Low risk since dpkg --configure -a + retry is harmless to run, but the diagnostic intent stated in the comment isn't actually what's implemented.
🩹 Proposed tighter match
- if printf '%s\n' "${output}" | grep -qi 'dpkg'; then
+ if printf '%s\n' "${output}" | grep -qiE 'dpkg was interrupted|dpkg returned an error code'; then📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # Run `apt-get install` and self-heal from an interrupted dpkg state. If the | |
| # install fails because dpkg was left half-configured (the classic | |
| # "E: dpkg was interrupted, you must manually run 'dpkg --configure -a'" or | |
| # "Sub-process /usr/bin/dpkg returned an error code" failures), repair the dpkg | |
| # database with `dpkg --configure -a` and retry the install once. | |
| apt_install_with_dpkg_heal() { | |
| local output status | |
| output="$(apt-get install -y "$@" 2>&1)" && status=0 || status=$? | |
| printf '%s\n' "${output}" | |
| if [[ "${status}" -eq 0 ]]; then | |
| return 0 | |
| fi | |
| if printf '%s\n' "${output}" | grep -qi 'dpkg'; then | |
| echo "apt-get install failed with a dpkg error; running 'dpkg --configure -a' and retrying..." >&2 | |
| dpkg --configure -a | |
| apt-get install -y "$@" | |
| else | |
| return "${status}" | |
| fi | |
| } | |
| # Run `apt-get install` and self-heal from an interrupted dpkg state. If the | |
| # install fails because dpkg was left half-configured (the classic | |
| # "E: dpkg was interrupted, you must manually run 'dpkg --configure -a'" or | |
| # "Sub-process /usr/bin/dpkg returned an error code" failures), repair the dpkg | |
| # database with `dpkg --configure -a` and retry the install once. | |
| apt_install_with_dpkg_heal() { | |
| local output status | |
| output="$(apt-get install -y "$@" 2>&1)" && status=0 || status=$? | |
| printf '%s\n' "${output}" | |
| if [[ "${status}" -eq 0 ]]; then | |
| return 0 | |
| fi | |
| if printf '%s\n' "${output}" | grep -qiE 'dpkg was interrupted|dpkg returned an error code'; then | |
| echo "apt-get install failed with a dpkg error; running 'dpkg --configure -a' and retrying..." >&2 | |
| dpkg --configure -a | |
| apt-get install -y "$@" | |
| else | |
| return "${status}" | |
| fi | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@nvidia-setup/skyhook_dir/steps/install_kernel.sh` around lines 40 - 61,
Tighten the retry condition in apt_install_with_dpkg_heal so it only triggers on
the interrupted-dpkg signatures described in the comment, not any output
containing “dpkg”. Update the grep match in apt_install_with_dpkg_heal to look
for the specific failure phrases (such as “dpkg was interrupted” or “Sub-process
/usr/bin/dpkg returned an error code”) before running dpkg --configure -a and
retrying apt-get install.
Summary
6.17.0-1017-awsand EFA1.48.0(skyhook_dir/defaults/eks-h100.conf,eks-gb200.conf).resolve_full_kernelnow appends the-64kpage-size variant on arm64, so GB200 (Grace) installs6.17.0-1017-aws-64kwhile H100 (x86_64) stays on6.17.0-1017-aws. This fixes a latent bug: the old*-aws*dedup branch short-circuited the 64k logic, so GB200 was silently resolving to the non-64k flavor.0.4.0CHANGELOG section; relabel the already-tagged bcm feature as0.3.0 - 2026-05-29(it had been left under[Unreleased]); add a0.4.xsection to VERSION_OVERVIEW.Test Plan
shellcheck skyhook_dir/utilities.sh— cleanresolve_full_kernelunit checks: x86_64 →6.17.0-1017-aws, aarch64 →6.17.0-1017-aws-64k, idempotent when base already ends in-64k