Skip to content

feat(ci): consolidate KWOK tier workflows into single reusable runner#1515

Open
mohityadav8 wants to merge 10 commits into
NVIDIA:mainfrom
mohityadav8:feat/1172-consolidate-kwok-workflows
Open

feat(ci): consolidate KWOK tier workflows into single reusable runner#1515
mohityadav8 wants to merge 10 commits into
NVIDIA:mainfrom
mohityadav8:feat/1172-consolidate-kwok-workflows

Conversation

@mohityadav8

@mohityadav8 mohityadav8 commented Jun 28, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Replace kwok-tier3-shard.yaml with kwok-test-run.yaml; the checkout/load-versions/kwok-test step block now lives in exactly one place.
  • Define readonly DEPLOYERS once in discover; Tier 1, Tier 2, and Tier 3 all derive from it, removing hardcoded matrix axes and sync comments.
  • Emit tier1_pairs and tier2_pairs from discover; all three tiers call kwok-test-run.yaml with pre-built {recipe, deployer} pairs.
  • Keep Tier 2 deployer coverage Helm-only (deliberate; full coverage remains in Tier 3).
  • Fix workflow_dispatch early-exit to also emit tier1_pairs and tier2_pairs.
  • Add a Tier 1 pair-count guard (warning at >200).
  • Update workflow path triggers and the ADR-003 workflow diagram.

Closes #1172

- Replace kwok-tier3-shard.yaml with kwok-test-run.yaml; the checkout/

  load-versions/kwok-test step block now lives in exactly one place

- Define readonly DEPLOYERS once in discover; Tier 1, Tier 2, and Tier 3

  all derive from it — no more hardcoded matrix axes or sync comments

- Emit tier1_pairs and tier2_pairs from discover; all three tiers call

  kwok-test-run.yaml with pre-built {recipe,deployer} pairs

- Tier 2 deployer coverage: helm-only (deliberate; full coverage in Tier 3)

- Fix workflow_dispatch early-exit to also emit tier1_pairs/tier2_pairs

- Add Tier 1 pair-count guard (warning at >200)

- Update paths triggers and ADR-003 workflow diagram

Closes NVIDIA#1172
@copy-pr-bot

copy-pr-bot Bot commented Jun 28, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Jun 28, 2026

Copy link
Copy Markdown

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

The PR generalizes kwok-test-run.yaml into a shared reusable workflow for all KWOK tiers and updates kwok-recipes.yaml to precompute {recipe,deployer} pair lists for Tier 1, Tier 2, and Tier 3. The discover job now emits tier1_pairs and tier2_pairs, the classification logic uses a shared deployer list, and the tier jobs call kwok-test-run.yaml with pairs inputs instead of separate matrix setups. Trigger path filters are updated to reference the new reusable workflow.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main CI workflow consolidation change.
Description check ✅ Passed The description is directly related to the workflow consolidation and deployment-coverage changes.
Linked Issues check ✅ Passed The workflow now uses one reusable runner, one deployer source, and explicit Tier 2 helm-only coverage as requested.
Out of Scope Changes check ✅ Passed The changes stay within the KWOK workflow refactor and related documentation or trigger updates.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/kwok-recipes.yaml:
- Around line 267-276: Resolve the merge conflict in the KWOK workflow by
removing all literal conflict markers and keeping the intended
`kwok-test-run.yaml` path. In the affected sections of `kwok-recipes.yaml`,
delete the stale `kwok-tier3-shard.yaml` branch and the duplicate local
`deployers` redeclaration, preserving the existing wiring used by the workflow
jobs. Make sure the final YAML is clean and parsable with no `<<<<<<<`,
`=======`, or `>>>>>>>` markers anywhere in the affected blocks.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 72ca4c1f-3e02-48c4-b803-e095a52a291d

📥 Commits

Reviewing files that changed from the base of the PR and between 4b24b7d and 8de6e8f.

📒 Files selected for processing (2)
  • .github/workflows/kwok-recipes.yaml
  • .github/workflows/kwok-test-run.yaml

Comment thread .github/workflows/kwok-recipes.yaml Outdated
- Replace kwok-tier3-shard.yaml with kwok-test-run.yaml (step block in one place)

- readonly DEPLOYERS single source of truth; Tier 1/2/3 all derive from it

- discover emits tier1_pairs + tier2_pairs; all tiers call kwok-test-run.yaml

- Tier 2 helm-only (deliberate, documented in ADR-003)

- Fix workflow_dispatch early-exit to emit tier1_pairs/tier2_pairs

Closes NVIDIA#1172
@github-actions github-actions Bot added size/L and removed size/M labels Jun 29, 2026
@mohityadav8 mohityadav8 marked this pull request as ready for review June 29, 2026 10:25
@mohityadav8 mohityadav8 requested a review from a team as a code owner June 29, 2026 10:25

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
.github/workflows/kwok-recipes.yaml (2)

87-90: 📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win

Keep Tier 2’s helm-only policy tied to DEPLOYERS.

Line 260 bypasses the shared deployer source, so removing or renaming helm in DEPLOYERS would not update Tier 2 despite the “change this one line” contract.

Proposed fix
           readonly DEPLOYERS='["helm","argocd-oci","argocd-helm-oci","argocd-git","flux-oci","flux-git"]'
+          readonly TIER2_DEPLOYER="helm"
-          tier2_pairs=$(echo "$tier2" | jq -c '[.[] | {recipe: ., deployer: "helm"}]')
+          if ! echo "$deployers" | jq -e --arg d "$TIER2_DEPLOYER" 'index($d) != null' >/dev/null; then
+            echo "::error::Tier 2 deployer '${TIER2_DEPLOYER}' is not present in DEPLOYERS"
+            exit 1
+          fi
+          tier2_pairs=$(jq -cn \
+            --argjson recipes "$tier2" \
+            --arg deployer "$TIER2_DEPLOYER" '
+              [ $recipes[] | {recipe: ., deployer: $deployer} ]
+            ')

Also applies to: 255-260

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/kwok-recipes.yaml around lines 87 - 90, The Tier 2
helm-only check is bypassing the shared DEPLOYERS source, which breaks the
single-source-of-truth contract. Update the Tier 2 gating logic in the workflow
so it derives the helm policy from DEPLOYERS instead of hardcoding or separately
listing helm, and make sure the early-exit path in the workflow_dispatch
handling uses the same source. Use the DEPLOYERS declaration and the Tier 2
branch logic as the symbols to locate the affected code.

248-253: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Enforce the Tier 1 hard matrix cap, not just the warning threshold.

Tier 1 is unbatched, so tier1_pairs above 256 violates the reusable workflow’s matrix contract and will fail before the test runner executes.

Proposed fix
           tier1_pair_count=$(echo "$tier1_pairs" | jq 'length')
-          if (( tier1_pair_count > 200 )); then
+          if (( tier1_pair_count > 256 )); then
+            echo "::error::Tier 1 has ${tier1_pair_count} pairs (>256) — add batching before invoking kwok-test-run.yaml"
+            exit 1
+          elif (( tier1_pair_count > 200 )); then
             echo "::warning::Tier 1 has ${tier1_pair_count} pairs (>200) — consider adding batching before it reaches 256"
           fi
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/kwok-recipes.yaml around lines 248 - 253, In the
kwok-recipes workflow, the Tier 1 check in the matrix-building logic only warns
when the pair count grows past the soft threshold, but it does not enforce the
reusable workflow’s hard 256-item cap. Update the Tier 1 validation around the
tier1_pair_count guard to fail fast when tier1_pairs exceeds 256, using the
existing tier1_pairs/tier1_pair_count and jq length flow, so the job stops
before matrix creation instead of continuing with an invalid unbatched matrix.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In @.github/workflows/kwok-recipes.yaml:
- Around line 87-90: The Tier 2 helm-only check is bypassing the shared
DEPLOYERS source, which breaks the single-source-of-truth contract. Update the
Tier 2 gating logic in the workflow so it derives the helm policy from DEPLOYERS
instead of hardcoding or separately listing helm, and make sure the early-exit
path in the workflow_dispatch handling uses the same source. Use the DEPLOYERS
declaration and the Tier 2 branch logic as the symbols to locate the affected
code.
- Around line 248-253: In the kwok-recipes workflow, the Tier 1 check in the
matrix-building logic only warns when the pair count grows past the soft
threshold, but it does not enforce the reusable workflow’s hard 256-item cap.
Update the Tier 1 validation around the tier1_pair_count guard to fail fast when
tier1_pairs exceeds 256, using the existing tier1_pairs/tier1_pair_count and jq
length flow, so the job stops before matrix creation instead of continuing with
an invalid unbatched matrix.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: e737e3ff-fa91-4862-939a-b5a26a7b6bfa

📥 Commits

Reviewing files that changed from the base of the PR and between 8de6e8f and f5d8fdb.

📒 Files selected for processing (2)
  • .github/workflows/kwok-recipes.yaml
  • .github/workflows/kwok-test-run.yaml

@mohityadav8 mohityadav8 marked this pull request as draft June 29, 2026 11:06
@github-actions

Copy link
Copy Markdown
Contributor

@mohityadav8 this PR now has merge conflicts with main. Please rebase to resolve them.

…te-kwok-workflows

# Conflicts:
#	.github/workflows/kwok-recipes.yaml
@github-actions github-actions Bot added size/M and removed size/L labels Jun 29, 2026
@mohityadav8 mohityadav8 marked this pull request as ready for review June 29, 2026 22:41

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
.github/workflows/kwok-recipes.yaml (1)

1-1: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Remove the UTF-8 BOM from the workflow files — both .github/workflows/kwok-recipes.yaml and .github/workflows/kwok-test-run.yaml start with a BOM, and GitHub Actions workflows should be plain UTF-8 YAML without one.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/kwok-recipes.yaml at line 1, Remove the UTF-8 BOM from the
workflow file so it starts as plain UTF-8 YAML; update the affected GitHub
Actions workflow content at the top of the file (and apply the same fix to
kwok-test-run.yaml) so there is no leading BOM before the copyright line.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In @.github/workflows/kwok-recipes.yaml:
- Line 1: Remove the UTF-8 BOM from the workflow file so it starts as plain
UTF-8 YAML; update the affected GitHub Actions workflow content at the top of
the file (and apply the same fix to kwok-test-run.yaml) so there is no leading
BOM before the copyright line.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: ce191cae-cedc-4508-919b-e3ac9442cac8

📥 Commits

Reviewing files that changed from the base of the PR and between f5d8fdb and bd126d4.

📒 Files selected for processing (2)
  • .github/workflows/kwok-recipes.yaml
  • .github/workflows/kwok-test-run.yaml

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
.github/workflows/kwok-recipes.yaml (1)

266-271: 🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win

Hard-fail before emitting an oversized Tier 1 matrix.

This guard warns at >200, but still lets tier1_pairs exceed the reusable workflow’s 256-matrix cap, which would fail later with a less actionable workflow error. Add a cap check before writing tier1_pairs.

Suggested fix
           tier1_pair_count=$(echo "$tier1_pairs" | jq 'length')
-          if (( tier1_pair_count > 200 )); then
+          if (( tier1_pair_count > 256 )); then
+            echo "::error::Tier 1 has ${tier1_pair_count} pairs (>256) — add batching before passing this to kwok-test-run.yaml"
+            exit 1
+          elif (( tier1_pair_count > 200 )); then
             echo "::warning::Tier 1 has ${tier1_pair_count} pairs (>200) — consider adding batching before it reaches 256"
           fi
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/kwok-recipes.yaml around lines 266 - 271, The Tier 1 size
guard in the workflow only warns and can still let `tier1_pairs` exceed the
reusable workflow matrix limit. Update the `tier1_pair_count` check in
`kwok-recipes.yaml` to hard-fail before `tier1_pairs` is written when it is at
or above the 256 cap, using the existing Tier 1 guard block so the workflow
stops with a clear message instead of failing later.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In @.github/workflows/kwok-recipes.yaml:
- Around line 266-271: The Tier 1 size guard in the workflow only warns and can
still let `tier1_pairs` exceed the reusable workflow matrix limit. Update the
`tier1_pair_count` check in `kwok-recipes.yaml` to hard-fail before
`tier1_pairs` is written when it is at or above the 256 cap, using the existing
Tier 1 guard block so the workflow stops with a clear message instead of failing
later.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: d8e6ac22-b5a8-4d4b-b34b-d79b2f6a348d

📥 Commits

Reviewing files that changed from the base of the PR and between bd126d4 and 83820ec.

📒 Files selected for processing (2)
  • .github/workflows/kwok-recipes.yaml
  • .github/workflows/kwok-test-run.yaml

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
.github/workflows/kwok-recipes.yaml (1)

105-108: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Keep Tier 2 within the shared pair contract.

Line 281 hard-codes helm outside DEPLOYERS and does not guard the pair count before passing it to kwok-test-run.yaml. Derive the helm-only policy from DEPLOYERS, then fail fast if Tier 2 would exceed the reusable workflow’s pairs contract.

Proposed fix
-          # Deployer list — single source of truth, consumed by all tiers and by
-          # the workflow_dispatch early-exit below. To add or remove a deployer,
-          # change this one line; Tier 1, Tier 2, and Tier 3 all derive from it.
+          # Deployer list — single source of truth for supported deployers,
+          # consumed by Tier 1/Tier 3 and filtered by Tier 2's helm-only policy.
+          # To add or remove a deployer, change this one line.
           readonly DEPLOYERS='["helm","argocd-oci","argocd-helm-oci","argocd-git","flux-oci","flux-git"]'
-          tier2_pairs=$(echo "$tier2" | jq -c '[.[] | {recipe: ., deployer: "helm"}]')
+          tier2_deployers=$(jq -cn --argjson deployers "$deployers" '[ $deployers[] | select(. == "helm") ]')
+          tier2_deployer_count=$(echo "$tier2_deployers" | jq 'length')
+          if (( tier2_deployer_count != 1 )); then
+            echo "::error::Tier 2 requires helm to be present exactly once in DEPLOYERS"
+            exit 1
+          fi
+
+          tier2_pairs=$(jq -cn \
+            --argjson recipes "$tier2" \
+            --argjson deployers "$tier2_deployers" '
+              [ $recipes[] as $r | $deployers[] as $d | {recipe: $r, deployer: $d} ]
+            ')
+          tier2_pair_count=$(echo "$tier2_pairs" | jq 'length')
+          if (( tier2_pair_count > 256 )); then
+            echo "::error::Tier 2 has ${tier2_pair_count} pairs (>256) — add batching before passing this to kwok-test-run.yaml"
+            exit 1
+          elif (( tier2_pair_count > 200 )); then
+            echo "::warning::Tier 2 has ${tier2_pair_count} pairs (>200) — consider adding batching before it reaches 256"
+          fi
-          echo "Tier 2 (diff-aware):   $(echo "$tier2" | jq 'length') recipe(s) × 1 deployer (helm) = $(echo "$tier2_pairs" | jq 'length') pair(s)"
+          echo "Tier 2 (diff-aware):   $(echo "$tier2" | jq 'length') recipe(s) × ${tier2_deployer_count} deployer (helm) = ${tier2_pair_count} pair(s)"

Also applies to: 276-281, 339-347

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/kwok-recipes.yaml around lines 105 - 108, Tier 2 is
bypassing the shared DEPLOYERS contract by hard-coding helm and sending an
unvalidated pair list to kwok-test-run.yaml. Update the Tier 2 logic in the
workflow to derive the helm-only policy from DEPLOYERS instead of naming helm
directly, and add a fail-fast guard before invoking the reusable workflow so it
rejects any Tier 2 pair count that would violate the pairs contract. Use the
existing DEPLOYERS symbol and the Tier 2 dispatch path to keep the source of
truth centralized.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In @.github/workflows/kwok-recipes.yaml:
- Around line 105-108: Tier 2 is bypassing the shared DEPLOYERS contract by
hard-coding helm and sending an unvalidated pair list to kwok-test-run.yaml.
Update the Tier 2 logic in the workflow to derive the helm-only policy from
DEPLOYERS instead of naming helm directly, and add a fail-fast guard before
invoking the reusable workflow so it rejects any Tier 2 pair count that would
violate the pairs contract. Use the existing DEPLOYERS symbol and the Tier 2
dispatch path to keep the source of truth centralized.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: c758ee98-f82f-4040-a495-75ba20507db1

📥 Commits

Reviewing files that changed from the base of the PR and between 83820ec and 79bd45c.

📒 Files selected for processing (1)
  • .github/workflows/kwok-recipes.yaml

@mchmarny mchmarny left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice consolidation — single-source DEPLOYERS, the fail-closed Tier 1 pair-count guard, and the workflow_dispatch early-exit fix are all solid. The Tier 2 helm-only switch is behavior-preserving: the kwok-test action already defaults to helm, so the old deployer-less Tier 2 matrix was running helm anyway. Path-trigger parity is maintained too.

Three non-blocking issues (all inline): an accidental UTF-8 BOM on line 1, and a doc gap — ADR-003 still describes the old kwok-tier3-shard.yaml and tier2 × deployer, and the §"Tier 2 deployer coverage" section the comments cite doesn't exist, yet the PR body claims the ADR diagram was updated. Plus one nit on the deployers alias. The workflow logic itself looks correct.

Heads-up: the PR is currently needs-rebase (mergeable state blocked) — rebase onto origin/main before merge.

@@ -1,4 +1,4 @@
# Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line now starts with a UTF-8 BOM (EF BB BF) — main doesn't have it, so it crept in via an editor. Strip it; yamllint can flag a BOM and it's just noise on a workflow file.

Comment thread .github/workflows/kwok-recipes.yaml
Comment on lines +256 to +257
# shellcheck disable=SC2034
deployers="${DEPLOYERS}"

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this alias isn't needed. readonly only blocks writes, so --argjson deployers "$DEPLOYERS" works directly in the jq calls below. And DEPLOYERS/deployers is read later (deployer_count, the tier1 pairs jq), so SC2034 wouldn't actually fire — the disable comment is misleading. Drop the alias and use $DEPLOYERS throughout.

@njhensley njhensley left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📋 Multi-persona review — PR #1515

▎ Method: 4 persona passes (Correctness, Security, Operability/CI-DX, Domain-Architecture/Docs), each finding independently confirmed or refuted by a senior meta-reviewer against the resolved code. Line links/anchors pinned to head dcd8f12c.
▎ Tier legend: 🔴 Blocker · 🟠 Major · 🟡 Minor · 🔵 Nitpick · ✅ Confirmed non-issue

Overall assessment — Approve with comments

The workflow logic is clean and well-factored: collapsing the six-deployer list to a single readonly DEPLOYERS, deriving all tiers from it, and generalizing the ex–Tier-3 shard into one shared kwok-test-run.yaml runner is a real maintainability win. Every jq/bash path checks out (pair + batch generators dry-run correctly, all call sites guard the empty-matrix case, the pair-count guards are fail-closed with correct boundaries), and the security surface (permissions, secrets, pull_request trigger, SHA-pinned checkout) is unchanged and not over-privileged. No correctness or security blockers.

What keeps it from a clean approve is documentation consistency, which this repo treats as a same-PR requirement: the PR body states it "updates the ADR-003 workflow diagram," but the diff touches only the two workflow files — ADR-003 is untouched, so it now carries three dangling references to the renamed file and a stale diagram, and the workflow's own new comments point to an ADR section that doesn't exist. None of this breaks CI, but it ships known-stale architecture docs.


Findings anchored to files outside this diff (can't be inline)

🟠 Major — ADR-003 left stale: dangling refs to the renamed workflow + stale diagram (PR body claims it was updated)
docs/design/003-scaling-recipe-tests.md L9, L153, L178 (and diagram at L175)

This PR renames kwok-tier3-shard.yamlkwok-test-run.yaml, but ADR-003 still names the old file at L9, L153, and L178 — git grep kwok-tier3-shard at head returns exactly these three hits, all now pointing at a file that no longer exists. The diagram at L175 still shows Tier 2 as matrix: tier2 × deployer, and the model no longer has any deployer: matrix axis at all. The PR Summary explicitly claims it updates "the ADR-003 workflow diagram," but gh pr view --json files shows only the two workflow files change.
Fix: In this PR, update ADR-003 — replace the three kwok-tier3-shard.yaml references with kwok-test-run.yaml (and reword "shard lane" → shared runner for all tiers), and redraw the diagram (Tier 1 = recipes × all deployers, Tier 2 = recipes × helm only, Tier 3 = full cross-product batched → kwok-test-run.yaml). If the ADR edits are truly out of scope, at minimum correct the PR body so it stops claiming an update it didn't make.

🟡 Minor — Onboarding docs still tell contributors to edit a deployer: matrix this PR removes
docs/contributor/tests.md L435 (parallel wording at docs/design/008-kwok-deployer-matrix.md L246)

This PR removes the literal deployer: [helm, …] matrix axes from Tier 1/Tier 3; deployers now come solely from readonly DEPLOYERS ("change this one line"). But tests.md:435 still says "Add the value to the deployer: matrix in Tier 1 and Tier 3" and ADR-008:246 says "Add deployer: [...] to Tier 1 and Tier 3 matrices" — those matrices no longer exist. (The "leave Tier 2 alone" guidance remains correct.)
Fix: Update both to: "Add the value to the DEPLOYERS array in discover (single source of truth; Tier 1 and Tier 3 derive from it). Tier 2 stays helm-only by design." Ideally folded into this PR since it's the same behavioral change.

🔵 Nitpick — discover still exports tier1/tier2/tier3 outputs that nothing consumes
.github/workflows/kwok-recipes.yaml L83–85 (unchanged context, so not inline)

Downstream jobs now reference only tier1_pairs, tier2_pairs, and tier3_batches. The tier1/tier2/tier3 job outputs are consumed only by discover's own summary echos. Harmless leftover surface — optionally drop the three unused outputs: entries (keep the internal shell vars, which the summary/batching still need).

The remaining findings (🟠 phantom-ADR comment, 🟡 BOM, 🔵 dead SC2034) are attached as inline comments on the diff.


✅ Confirmed non-issues (checked and cleared)

  • Branch-protection safe. The required check is the aggregate KWOK Test Summary job (per ADR-003), which is unchanged. Converting Tier 1/Tier 2 to reusable-workflow calls renames only the per-recipe leaf check-runs, which aren't individually required.
  • Empty-matrix guarded everywhere. All three uses: call sites gate on != '[]' && != ''; tier3 batches are non-empty by construction; the workflow_dispatch early-exit sets every consumed output.
  • Pair-count guards correct. Warn >200 / error >256 measures the exact quantity passed to the matrix; 256 (GitHub's cap) is allowed, 257+ fails closed. No off-by-one.
  • Tier 2 helm-only is behavior-preserving, not a coverage cut. The kwok-test action has default: 'helm' and the old Tier 2 passed no deployer — so it already ran helm-only. The PR just makes it explicit; full deployer coverage still runs in Tier 3 on push/nightly.
  • Security unchanged. Caller and reusable workflow are both contents: read; no secrets passed or needed; on: pull_request (not pull_request_target) keeps forks read-only/secret-less; checkout stays SHA-pinned with persist-credentials: false; the reusable ref is a local path. The one real script-injection sink (inputs.recipe/inputs.deployerrun: in .github/actions/kwok-test/action.yml) is pre-existing and untouched, low severity in a read-only secret-less runner — optional future hardening, not for this PR.
  • Rename hygiene within the workflows is complete. Path triggers updated; no residual kwok-tier3-shard references inside the two workflow files (the only stragglers are in docs — see the Major above).

Summary

Tier Count Items
🔴 Blocker 0
🟠 Major 2 ADR-003 stale (dangling refs + diagram, PR body overclaims); workflow comments cite phantom ADR section (inline)
🟡 Minor 2 UTF-8 BOM on line 1 (inline); onboarding docs reference removed deployer: matrix
🔵 Nitpick 2 Dead SC2034 + misleading alias comment (inline); unused discover outputs

Recommendation: Approve with comments. No code blockers — the workflow refactor is sound. Before merge, please fold the ADR-003 updates (and ideally tests.md/ADR-008) into this PR to satisfy the repo's same-PR docs rule, since the PR body already claims the ADR was updated; strip the BOM while you're in the file. The two nitpicks are optional.

Reviewed with a multi-persona + adversarial meta-review workflow.

# Coverage-policy decision: Tier 2 uses helm only to keep PR wall-clock
# time proportional to the change scope. Full deployer coverage runs in
# Tier 3 on every push to main and on the nightly schedule. See ADR-003
# §"Tier 2 deployer coverage" for rationale and how to revisit this.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 Major — New workflow comments cite a nonexistent ADR-003 §"Tier 2 deployer coverage"

This comment (and the duplicate at L337-338 on the test-tier2 job) directs readers to ADR-003 §"Tier 2 deployer coverage" for rationale. No such section exists — grep -in "deployer coverage" against docs/design/003-scaling-recipe-tests.md at this head returns nothing. It's a defect introduced by this PR: a shipped pointer into a phantom anchor.

Blast radius: A contributor revisiting the helm-only trade-off follows the pointer and finds nothing, defeating the comment's stated "how to revisit this" intent. Same dead pointer recurs at L337-338.

Fix: Add a ### Tier 2 deployer coverage subsection to ADR-003 capturing the rationale (helm-only keeps PR wall-clock proportional to change scope; full coverage runs in Tier 3 on push/nightly) so both comments resolve — or repoint them at the existing rationale in docs/contributor/tests.md / ADR-008.

@@ -1,4 +1,4 @@
# Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Minor — UTF-8 BOM accidentally prepended to kwok-recipes.yaml

Line 1 now begins with bytes EF BB BF. Verified authoritatively: the origin/main blob starts 23 20 43 (# C); this head blob starts ef bb bf — the BOM is introduced by this PR (an editor artifact; it's the diff's first hunk). No other workflow in the repo carries a BOM.

Blast radius: None functional — GitHub Actions and yamllint 1.38.0 both tolerate a leading BOM (lint passes), so CI won't fail. Cost is hygiene: git-blame noise, an encoding that differs from every sibling file, and a latent trap for tooling that keys on the first byte.

Fix: Strip it so the first hunk disappears: sed -i '1s/^\xEF\xBB\xBF//' .github/workflows/kwok-recipes.yaml (or re-save as UTF-8 without BOM).

# --- Tier 3: full matrix (all testable overlays) ---
tier3="$all"

# Local alias so jq --argjson calls below can use $deployers

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 Nitpick — Dead # shellcheck disable=SC2034 on a used variable, plus an inaccurate comment

The deployers="${DEPLOYERS}" alias is read at L262, L302, and L318, so SC2034 (unused variable) would never fire — the disable is a no-op. The comment also claims the alias is needed "so jq --argjson calls can use $deployers," but ${DEPLOYERS} can be passed to --argjson directly (exactly as done at L115 in the dispatch branch), so the alias buys nothing.

Blast radius: None functional; maintainer-confusing — a future reader trusts the "intentional / unused" comment and may keep propagating an alias that could just be deleted.

Fix: Either drop the alias and use ${DEPLOYERS} at the three call sites, or keep it and delete the misleading comment + SC2034 directive.

@mohityadav8 mohityadav8 requested a review from a team as a code owner July 1, 2026 19:03
@mohityadav8 mohityadav8 requested a review from njhensley July 1, 2026 19:06
@mohityadav8 mohityadav8 requested a review from mchmarny July 2, 2026 11:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Consolidate KWOK tier workflows: single reusable test workflow + one deployer-list source

3 participants