feat(ci): consolidate KWOK tier workflows into single reusable runner#1515
feat(ci): consolidate KWOK tier workflows into single reusable runner#1515mohityadav8 wants to merge 10 commits into
Conversation
- Replace kwok-tier3-shard.yaml with kwok-test-run.yaml; the checkout/
load-versions/kwok-test step block now lives in exactly one place
- Define readonly DEPLOYERS once in discover; Tier 1, Tier 2, and Tier 3
all derive from it — no more hardcoded matrix axes or sync comments
- Emit tier1_pairs and tier2_pairs from discover; all three tiers call
kwok-test-run.yaml with pre-built {recipe,deployer} pairs
- Tier 2 deployer coverage: helm-only (deliberate; full coverage in Tier 3)
- Fix workflow_dispatch early-exit to also emit tier1_pairs/tier2_pairs
- Add Tier 1 pair-count guard (warning at >200)
- Update paths triggers and ADR-003 workflow diagram
Closes NVIDIA#1172
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThe PR generalizes Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/kwok-recipes.yaml:
- Around line 267-276: Resolve the merge conflict in the KWOK workflow by
removing all literal conflict markers and keeping the intended
`kwok-test-run.yaml` path. In the affected sections of `kwok-recipes.yaml`,
delete the stale `kwok-tier3-shard.yaml` branch and the duplicate local
`deployers` redeclaration, preserving the existing wiring used by the workflow
jobs. Make sure the final YAML is clean and parsable with no `<<<<<<<`,
`=======`, or `>>>>>>>` markers anywhere in the affected blocks.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 72ca4c1f-3e02-48c4-b803-e095a52a291d
📒 Files selected for processing (2)
.github/workflows/kwok-recipes.yaml.github/workflows/kwok-test-run.yaml
- Replace kwok-tier3-shard.yaml with kwok-test-run.yaml (step block in one place) - readonly DEPLOYERS single source of truth; Tier 1/2/3 all derive from it - discover emits tier1_pairs + tier2_pairs; all tiers call kwok-test-run.yaml - Tier 2 helm-only (deliberate, documented in ADR-003) - Fix workflow_dispatch early-exit to emit tier1_pairs/tier2_pairs Closes NVIDIA#1172
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
.github/workflows/kwok-recipes.yaml (2)
87-90: 📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick winKeep Tier 2’s helm-only policy tied to
DEPLOYERS.Line 260 bypasses the shared deployer source, so removing or renaming
helminDEPLOYERSwould not update Tier 2 despite the “change this one line” contract.Proposed fix
readonly DEPLOYERS='["helm","argocd-oci","argocd-helm-oci","argocd-git","flux-oci","flux-git"]' + readonly TIER2_DEPLOYER="helm"- tier2_pairs=$(echo "$tier2" | jq -c '[.[] | {recipe: ., deployer: "helm"}]') + if ! echo "$deployers" | jq -e --arg d "$TIER2_DEPLOYER" 'index($d) != null' >/dev/null; then + echo "::error::Tier 2 deployer '${TIER2_DEPLOYER}' is not present in DEPLOYERS" + exit 1 + fi + tier2_pairs=$(jq -cn \ + --argjson recipes "$tier2" \ + --arg deployer "$TIER2_DEPLOYER" ' + [ $recipes[] | {recipe: ., deployer: $deployer} ] + ')Also applies to: 255-260
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/kwok-recipes.yaml around lines 87 - 90, The Tier 2 helm-only check is bypassing the shared DEPLOYERS source, which breaks the single-source-of-truth contract. Update the Tier 2 gating logic in the workflow so it derives the helm policy from DEPLOYERS instead of hardcoding or separately listing helm, and make sure the early-exit path in the workflow_dispatch handling uses the same source. Use the DEPLOYERS declaration and the Tier 2 branch logic as the symbols to locate the affected code.
248-253: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick winEnforce the Tier 1 hard matrix cap, not just the warning threshold.
Tier 1 is unbatched, so
tier1_pairsabove 256 violates the reusable workflow’s matrix contract and will fail before the test runner executes.Proposed fix
tier1_pair_count=$(echo "$tier1_pairs" | jq 'length') - if (( tier1_pair_count > 200 )); then + if (( tier1_pair_count > 256 )); then + echo "::error::Tier 1 has ${tier1_pair_count} pairs (>256) — add batching before invoking kwok-test-run.yaml" + exit 1 + elif (( tier1_pair_count > 200 )); then echo "::warning::Tier 1 has ${tier1_pair_count} pairs (>200) — consider adding batching before it reaches 256" fi🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/kwok-recipes.yaml around lines 248 - 253, In the kwok-recipes workflow, the Tier 1 check in the matrix-building logic only warns when the pair count grows past the soft threshold, but it does not enforce the reusable workflow’s hard 256-item cap. Update the Tier 1 validation around the tier1_pair_count guard to fail fast when tier1_pairs exceeds 256, using the existing tier1_pairs/tier1_pair_count and jq length flow, so the job stops before matrix creation instead of continuing with an invalid unbatched matrix.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In @.github/workflows/kwok-recipes.yaml:
- Around line 87-90: The Tier 2 helm-only check is bypassing the shared
DEPLOYERS source, which breaks the single-source-of-truth contract. Update the
Tier 2 gating logic in the workflow so it derives the helm policy from DEPLOYERS
instead of hardcoding or separately listing helm, and make sure the early-exit
path in the workflow_dispatch handling uses the same source. Use the DEPLOYERS
declaration and the Tier 2 branch logic as the symbols to locate the affected
code.
- Around line 248-253: In the kwok-recipes workflow, the Tier 1 check in the
matrix-building logic only warns when the pair count grows past the soft
threshold, but it does not enforce the reusable workflow’s hard 256-item cap.
Update the Tier 1 validation around the tier1_pair_count guard to fail fast when
tier1_pairs exceeds 256, using the existing tier1_pairs/tier1_pair_count and jq
length flow, so the job stops before matrix creation instead of continuing with
an invalid unbatched matrix.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: e737e3ff-fa91-4862-939a-b5a26a7b6bfa
📒 Files selected for processing (2)
.github/workflows/kwok-recipes.yaml.github/workflows/kwok-test-run.yaml
|
@mohityadav8 this PR now has merge conflicts with |
…te-kwok-workflows # Conflicts: # .github/workflows/kwok-recipes.yaml
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
.github/workflows/kwok-recipes.yaml (1)
1-1: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick winRemove the UTF-8 BOM from the workflow files — both
.github/workflows/kwok-recipes.yamland.github/workflows/kwok-test-run.yamlstart with a BOM, and GitHub Actions workflows should be plain UTF-8 YAML without one.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/kwok-recipes.yaml at line 1, Remove the UTF-8 BOM from the workflow file so it starts as plain UTF-8 YAML; update the affected GitHub Actions workflow content at the top of the file (and apply the same fix to kwok-test-run.yaml) so there is no leading BOM before the copyright line.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In @.github/workflows/kwok-recipes.yaml:
- Line 1: Remove the UTF-8 BOM from the workflow file so it starts as plain
UTF-8 YAML; update the affected GitHub Actions workflow content at the top of
the file (and apply the same fix to kwok-test-run.yaml) so there is no leading
BOM before the copyright line.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: ce191cae-cedc-4508-919b-e3ac9442cac8
📒 Files selected for processing (2)
.github/workflows/kwok-recipes.yaml.github/workflows/kwok-test-run.yaml
….com/mohityadav8/aicr into feat/1172-consolidate-kwok-workflows
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
.github/workflows/kwok-recipes.yaml (1)
266-271: 🩺 Stability & Availability | 🟡 Minor | ⚡ Quick winHard-fail before emitting an oversized Tier 1 matrix.
This guard warns at
>200, but still letstier1_pairsexceed the reusable workflow’s 256-matrix cap, which would fail later with a less actionable workflow error. Add a cap check before writingtier1_pairs.Suggested fix
tier1_pair_count=$(echo "$tier1_pairs" | jq 'length') - if (( tier1_pair_count > 200 )); then + if (( tier1_pair_count > 256 )); then + echo "::error::Tier 1 has ${tier1_pair_count} pairs (>256) — add batching before passing this to kwok-test-run.yaml" + exit 1 + elif (( tier1_pair_count > 200 )); then echo "::warning::Tier 1 has ${tier1_pair_count} pairs (>200) — consider adding batching before it reaches 256" fi🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/kwok-recipes.yaml around lines 266 - 271, The Tier 1 size guard in the workflow only warns and can still let `tier1_pairs` exceed the reusable workflow matrix limit. Update the `tier1_pair_count` check in `kwok-recipes.yaml` to hard-fail before `tier1_pairs` is written when it is at or above the 256 cap, using the existing Tier 1 guard block so the workflow stops with a clear message instead of failing later.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In @.github/workflows/kwok-recipes.yaml:
- Around line 266-271: The Tier 1 size guard in the workflow only warns and can
still let `tier1_pairs` exceed the reusable workflow matrix limit. Update the
`tier1_pair_count` check in `kwok-recipes.yaml` to hard-fail before
`tier1_pairs` is written when it is at or above the 256 cap, using the existing
Tier 1 guard block so the workflow stops with a clear message instead of failing
later.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: d8e6ac22-b5a8-4d4b-b34b-d79b2f6a348d
📒 Files selected for processing (2)
.github/workflows/kwok-recipes.yaml.github/workflows/kwok-test-run.yaml
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
.github/workflows/kwok-recipes.yaml (1)
105-108: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick winKeep Tier 2 within the shared pair contract.
Line 281 hard-codes
helmoutsideDEPLOYERSand does not guard the pair count before passing it tokwok-test-run.yaml. Derive the helm-only policy fromDEPLOYERS, then fail fast if Tier 2 would exceed the reusable workflow’spairscontract.Proposed fix
- # Deployer list — single source of truth, consumed by all tiers and by - # the workflow_dispatch early-exit below. To add or remove a deployer, - # change this one line; Tier 1, Tier 2, and Tier 3 all derive from it. + # Deployer list — single source of truth for supported deployers, + # consumed by Tier 1/Tier 3 and filtered by Tier 2's helm-only policy. + # To add or remove a deployer, change this one line. readonly DEPLOYERS='["helm","argocd-oci","argocd-helm-oci","argocd-git","flux-oci","flux-git"]'- tier2_pairs=$(echo "$tier2" | jq -c '[.[] | {recipe: ., deployer: "helm"}]') + tier2_deployers=$(jq -cn --argjson deployers "$deployers" '[ $deployers[] | select(. == "helm") ]') + tier2_deployer_count=$(echo "$tier2_deployers" | jq 'length') + if (( tier2_deployer_count != 1 )); then + echo "::error::Tier 2 requires helm to be present exactly once in DEPLOYERS" + exit 1 + fi + + tier2_pairs=$(jq -cn \ + --argjson recipes "$tier2" \ + --argjson deployers "$tier2_deployers" ' + [ $recipes[] as $r | $deployers[] as $d | {recipe: $r, deployer: $d} ] + ') + tier2_pair_count=$(echo "$tier2_pairs" | jq 'length') + if (( tier2_pair_count > 256 )); then + echo "::error::Tier 2 has ${tier2_pair_count} pairs (>256) — add batching before passing this to kwok-test-run.yaml" + exit 1 + elif (( tier2_pair_count > 200 )); then + echo "::warning::Tier 2 has ${tier2_pair_count} pairs (>200) — consider adding batching before it reaches 256" + fi- echo "Tier 2 (diff-aware): $(echo "$tier2" | jq 'length') recipe(s) × 1 deployer (helm) = $(echo "$tier2_pairs" | jq 'length') pair(s)" + echo "Tier 2 (diff-aware): $(echo "$tier2" | jq 'length') recipe(s) × ${tier2_deployer_count} deployer (helm) = ${tier2_pair_count} pair(s)"Also applies to: 276-281, 339-347
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/kwok-recipes.yaml around lines 105 - 108, Tier 2 is bypassing the shared DEPLOYERS contract by hard-coding helm and sending an unvalidated pair list to kwok-test-run.yaml. Update the Tier 2 logic in the workflow to derive the helm-only policy from DEPLOYERS instead of naming helm directly, and add a fail-fast guard before invoking the reusable workflow so it rejects any Tier 2 pair count that would violate the pairs contract. Use the existing DEPLOYERS symbol and the Tier 2 dispatch path to keep the source of truth centralized.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In @.github/workflows/kwok-recipes.yaml:
- Around line 105-108: Tier 2 is bypassing the shared DEPLOYERS contract by
hard-coding helm and sending an unvalidated pair list to kwok-test-run.yaml.
Update the Tier 2 logic in the workflow to derive the helm-only policy from
DEPLOYERS instead of naming helm directly, and add a fail-fast guard before
invoking the reusable workflow so it rejects any Tier 2 pair count that would
violate the pairs contract. Use the existing DEPLOYERS symbol and the Tier 2
dispatch path to keep the source of truth centralized.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: c758ee98-f82f-4040-a495-75ba20507db1
📒 Files selected for processing (1)
.github/workflows/kwok-recipes.yaml
mchmarny
left a comment
There was a problem hiding this comment.
Nice consolidation — single-source DEPLOYERS, the fail-closed Tier 1 pair-count guard, and the workflow_dispatch early-exit fix are all solid. The Tier 2 helm-only switch is behavior-preserving: the kwok-test action already defaults to helm, so the old deployer-less Tier 2 matrix was running helm anyway. Path-trigger parity is maintained too.
Three non-blocking issues (all inline): an accidental UTF-8 BOM on line 1, and a doc gap — ADR-003 still describes the old kwok-tier3-shard.yaml and tier2 × deployer, and the §"Tier 2 deployer coverage" section the comments cite doesn't exist, yet the PR body claims the ADR diagram was updated. Plus one nit on the deployers alias. The workflow logic itself looks correct.
Heads-up: the PR is currently needs-rebase (mergeable state blocked) — rebase onto origin/main before merge.
| @@ -1,4 +1,4 @@ | |||
| # Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved. | |||
| # Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved. | |||
There was a problem hiding this comment.
This line now starts with a UTF-8 BOM (EF BB BF) — main doesn't have it, so it crept in via an editor. Strip it; yamllint can flag a BOM and it's just noise on a workflow file.
| # shellcheck disable=SC2034 | ||
| deployers="${DEPLOYERS}" |
There was a problem hiding this comment.
nit: this alias isn't needed. readonly only blocks writes, so --argjson deployers "$DEPLOYERS" works directly in the jq calls below. And DEPLOYERS/deployers is read later (deployer_count, the tier1 pairs jq), so SC2034 wouldn't actually fire — the disable comment is misleading. Drop the alias and use $DEPLOYERS throughout.
njhensley
left a comment
There was a problem hiding this comment.
📋 Multi-persona review — PR #1515
▎ Method: 4 persona passes (Correctness, Security, Operability/CI-DX, Domain-Architecture/Docs), each finding independently confirmed or refuted by a senior meta-reviewer against the resolved code. Line links/anchors pinned to head dcd8f12c.
▎ Tier legend: 🔴 Blocker · 🟠 Major · 🟡 Minor · 🔵 Nitpick · ✅ Confirmed non-issue
Overall assessment — Approve with comments
The workflow logic is clean and well-factored: collapsing the six-deployer list to a single readonly DEPLOYERS, deriving all tiers from it, and generalizing the ex–Tier-3 shard into one shared kwok-test-run.yaml runner is a real maintainability win. Every jq/bash path checks out (pair + batch generators dry-run correctly, all call sites guard the empty-matrix case, the pair-count guards are fail-closed with correct boundaries), and the security surface (permissions, secrets, pull_request trigger, SHA-pinned checkout) is unchanged and not over-privileged. No correctness or security blockers.
What keeps it from a clean approve is documentation consistency, which this repo treats as a same-PR requirement: the PR body states it "updates the ADR-003 workflow diagram," but the diff touches only the two workflow files — ADR-003 is untouched, so it now carries three dangling references to the renamed file and a stale diagram, and the workflow's own new comments point to an ADR section that doesn't exist. None of this breaks CI, but it ships known-stale architecture docs.
Findings anchored to files outside this diff (can't be inline)
🟠 Major — ADR-003 left stale: dangling refs to the renamed workflow + stale diagram (PR body claims it was updated)
docs/design/003-scaling-recipe-tests.md L9, L153, L178 (and diagram at L175)
This PR renames kwok-tier3-shard.yaml → kwok-test-run.yaml, but ADR-003 still names the old file at L9, L153, and L178 — git grep kwok-tier3-shard at head returns exactly these three hits, all now pointing at a file that no longer exists. The diagram at L175 still shows Tier 2 as matrix: tier2 × deployer, and the model no longer has any deployer: matrix axis at all. The PR Summary explicitly claims it updates "the ADR-003 workflow diagram," but gh pr view --json files shows only the two workflow files change.
Fix: In this PR, update ADR-003 — replace the three kwok-tier3-shard.yaml references with kwok-test-run.yaml (and reword "shard lane" → shared runner for all tiers), and redraw the diagram (Tier 1 = recipes × all deployers, Tier 2 = recipes × helm only, Tier 3 = full cross-product batched → kwok-test-run.yaml). If the ADR edits are truly out of scope, at minimum correct the PR body so it stops claiming an update it didn't make.
🟡 Minor — Onboarding docs still tell contributors to edit a deployer: matrix this PR removes
docs/contributor/tests.md L435 (parallel wording at docs/design/008-kwok-deployer-matrix.md L246)
This PR removes the literal deployer: [helm, …] matrix axes from Tier 1/Tier 3; deployers now come solely from readonly DEPLOYERS ("change this one line"). But tests.md:435 still says "Add the value to the deployer: matrix in Tier 1 and Tier 3" and ADR-008:246 says "Add deployer: [...] to Tier 1 and Tier 3 matrices" — those matrices no longer exist. (The "leave Tier 2 alone" guidance remains correct.)
Fix: Update both to: "Add the value to the DEPLOYERS array in discover (single source of truth; Tier 1 and Tier 3 derive from it). Tier 2 stays helm-only by design." Ideally folded into this PR since it's the same behavioral change.
🔵 Nitpick — discover still exports tier1/tier2/tier3 outputs that nothing consumes
.github/workflows/kwok-recipes.yaml L83–85 (unchanged context, so not inline)
Downstream jobs now reference only tier1_pairs, tier2_pairs, and tier3_batches. The tier1/tier2/tier3 job outputs are consumed only by discover's own summary echos. Harmless leftover surface — optionally drop the three unused outputs: entries (keep the internal shell vars, which the summary/batching still need).
The remaining findings (🟠 phantom-ADR comment, 🟡 BOM, 🔵 dead SC2034) are attached as inline comments on the diff.
✅ Confirmed non-issues (checked and cleared)
- Branch-protection safe. The required check is the aggregate
KWOK Test Summaryjob (per ADR-003), which is unchanged. Converting Tier 1/Tier 2 to reusable-workflow calls renames only the per-recipe leaf check-runs, which aren't individually required. - Empty-matrix guarded everywhere. All three
uses:call sites gate on!= '[]' && != ''; tier3 batches are non-empty by construction; theworkflow_dispatchearly-exit sets every consumed output. - Pair-count guards correct. Warn
>200/ error>256measures the exact quantity passed to the matrix;256(GitHub's cap) is allowed,257+fails closed. No off-by-one. - Tier 2 helm-only is behavior-preserving, not a coverage cut. The
kwok-testaction hasdefault: 'helm'and the old Tier 2 passed nodeployer— so it already ran helm-only. The PR just makes it explicit; full deployer coverage still runs in Tier 3 on push/nightly. - Security unchanged. Caller and reusable workflow are both
contents: read; no secrets passed or needed;on: pull_request(notpull_request_target) keeps forks read-only/secret-less; checkout stays SHA-pinned withpersist-credentials: false; the reusable ref is a local path. The one real script-injection sink (inputs.recipe/inputs.deployer→run:in.github/actions/kwok-test/action.yml) is pre-existing and untouched, low severity in a read-only secret-less runner — optional future hardening, not for this PR. - Rename hygiene within the workflows is complete. Path triggers updated; no residual
kwok-tier3-shardreferences inside the two workflow files (the only stragglers are in docs — see the Major above).
Summary
| Tier | Count | Items |
|---|---|---|
| 🔴 Blocker | 0 | — |
| 🟠 Major | 2 | ADR-003 stale (dangling refs + diagram, PR body overclaims); workflow comments cite phantom ADR section (inline) |
| 🟡 Minor | 2 | UTF-8 BOM on line 1 (inline); onboarding docs reference removed deployer: matrix |
| 🔵 Nitpick | 2 | Dead SC2034 + misleading alias comment (inline); unused discover outputs |
Recommendation: Approve with comments. No code blockers — the workflow refactor is sound. Before merge, please fold the ADR-003 updates (and ideally tests.md/ADR-008) into this PR to satisfy the repo's same-PR docs rule, since the PR body already claims the ADR was updated; strip the BOM while you're in the file. The two nitpicks are optional.
Reviewed with a multi-persona + adversarial meta-review workflow.
| # Coverage-policy decision: Tier 2 uses helm only to keep PR wall-clock | ||
| # time proportional to the change scope. Full deployer coverage runs in | ||
| # Tier 3 on every push to main and on the nightly schedule. See ADR-003 | ||
| # §"Tier 2 deployer coverage" for rationale and how to revisit this. |
There was a problem hiding this comment.
🟠 Major — New workflow comments cite a nonexistent ADR-003 §"Tier 2 deployer coverage"
This comment (and the duplicate at L337-338 on the test-tier2 job) directs readers to ADR-003 §"Tier 2 deployer coverage" for rationale. No such section exists — grep -in "deployer coverage" against docs/design/003-scaling-recipe-tests.md at this head returns nothing. It's a defect introduced by this PR: a shipped pointer into a phantom anchor.
Blast radius: A contributor revisiting the helm-only trade-off follows the pointer and finds nothing, defeating the comment's stated "how to revisit this" intent. Same dead pointer recurs at L337-338.
Fix: Add a ### Tier 2 deployer coverage subsection to ADR-003 capturing the rationale (helm-only keeps PR wall-clock proportional to change scope; full coverage runs in Tier 3 on push/nightly) so both comments resolve — or repoint them at the existing rationale in docs/contributor/tests.md / ADR-008.
| @@ -1,4 +1,4 @@ | |||
| # Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved. | |||
| # Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved. | |||
There was a problem hiding this comment.
🟡 Minor — UTF-8 BOM accidentally prepended to kwok-recipes.yaml
Line 1 now begins with bytes EF BB BF. Verified authoritatively: the origin/main blob starts 23 20 43 (# C); this head blob starts ef bb bf — the BOM is introduced by this PR (an editor artifact; it's the diff's first hunk). No other workflow in the repo carries a BOM.
Blast radius: None functional — GitHub Actions and yamllint 1.38.0 both tolerate a leading BOM (lint passes), so CI won't fail. Cost is hygiene: git-blame noise, an encoding that differs from every sibling file, and a latent trap for tooling that keys on the first byte.
Fix: Strip it so the first hunk disappears: sed -i '1s/^\xEF\xBB\xBF//' .github/workflows/kwok-recipes.yaml (or re-save as UTF-8 without BOM).
| # --- Tier 3: full matrix (all testable overlays) --- | ||
| tier3="$all" | ||
|
|
||
| # Local alias so jq --argjson calls below can use $deployers |
There was a problem hiding this comment.
🔵 Nitpick — Dead # shellcheck disable=SC2034 on a used variable, plus an inaccurate comment
The deployers="${DEPLOYERS}" alias is read at L262, L302, and L318, so SC2034 (unused variable) would never fire — the disable is a no-op. The comment also claims the alias is needed "so jq --argjson calls can use $deployers," but ${DEPLOYERS} can be passed to --argjson directly (exactly as done at L115 in the dispatch branch), so the alias buys nothing.
Blast radius: None functional; maintainer-confusing — a future reader trusts the "intentional / unused" comment and may keep propagating an alias that could just be deleted.
Fix: Either drop the alias and use ${DEPLOYERS} at the three call sites, or keep it and delete the misleading comment + SC2034 directive.
Summary
kwok-tier3-shard.yamlwithkwok-test-run.yaml; thecheckout/load-versions/kwok-teststep block now lives in exactly one place.DEPLOYERSonce indiscover; Tier 1, Tier 2, and Tier 3 all derive from it, removing hardcoded matrix axes and sync comments.tier1_pairsandtier2_pairsfromdiscover; all three tiers callkwok-test-run.yamlwith pre-built{recipe, deployer}pairs.workflow_dispatchearly-exit to also emittier1_pairsandtier2_pairs.Closes #1172