Skip to content

run-sweep: gate full-sweep PRs behind a sequential canary#1503

Open
Oseltamivir wants to merge 3 commits into
mainfrom
sweep-canary-gate
Open

run-sweep: gate full-sweep PRs behind a sequential canary#1503
Oseltamivir wants to merge 3 commits into
mainfrom
sweep-canary-gate

Conversation

@Oseltamivir
Copy link
Copy Markdown
Collaborator

@Oseltamivir Oseltamivir commented May 18, 2026

Summary

When a PR carries full-sweep-enabled (and not evals-only), pick the lowest-conc single-node benchmark entry as a canary and run it before fanning out the full sweep. If the canary fails, the eight fan-out jobs are skipped to save cluster time on shared failures (bad image tag, removed CLI flag, etc.).

Design choices:

  • Canary candidacy is restricted to single_node['1k1k' | '8k1k'] and excludes entries with run-eval: true, so the canary is always a pure benchmark smoke test using the existing single-node template.
  • The canary entry is removed from the regular fan-out's matrix (via remaining-search-space-config) only when the canary actually succeeded. On canary skip / cancel / canary-select failure, the regular fan-out falls back to the full search-space-config so coverage is preserved.
  • The fan-out gate proceeds only on canary-sweep.result == 'success' || == 'skipped' (allowlist, not != 'failure'), so a cancelled canary also blocks fan-out — preventing the case where in-flight fan-out jobs continue after the user explicitly cancels the canary.
  • A new non-canary-full-sweep-enabled label runs the full sweep without the canary gate (escape hatch when the canary is flaky or unrepresentative). The three sweep labels are now mutually exclusive.
  • Non-full-sweep PRs, draft PRs, pushes to main, and the reuse path all behave identically to before via existing gates.

The aggregated results_bmk artifact picks up both the canary's row and the regular fan-out's rows via the existing bmk_* glob — each entry appears exactly once.

Docker-tag-monitor research block split out to #1538.

Test plan

Each test ran sequentially on this branch via PR synchronize events. gptoss-fp4-h100-vllm was used as the changelog target for T1/T2/T4 (h100 has shorter CI queue than b200/b300); kimik2.5-int4-h100-vllm (agentic-only) was used for T3.

  • T1 — Canary success → fan-out runs with dedup (run 26467367124)

    • Result: canary-sweep ran the lowest-conc 1k1k entry (tp=2, conc=4) and succeeded. Downstream single-node-1k1k ran against remaining-search-space-config (the canary entry de-duped). Final tally: 41 success, 7 skipped, 0 failures.
    • Catches: jq remove_one correctness and the (canary == success && remaining) || full matrix ternary.
  • T2 — Canary failure → fan-out skipped (run 26471353341)

    • Setup: image override to vllm/vllm-openai:nonexistent-tag-T2-canary-failure-test.
    • Result: canary-sweep.result == 'failure'; all 8 fan-out jobs (single-node 1k1k/8k1k, single-node agentic/eval, all multi-node variants) and collect-results correctly skipped via the success || skipped allowlist.
    • Catches: the value-prop of this PR — gate actually blocks fan-out on shared failure.
  • T3 — No canary candidate → fan-out on full search space (run 26471597565)

    • Setup: switched changelog target to kimik2.5-int4-h100-vllm (agentic-only config; no fixed-seq-len entries → empty single_node['1k1k'] and ['8k1k'] arrays).
    • Result: canary-select succeeded but returned []; canary-sweep skipped via the != '' && != '[]' guard; agentic fan-out ran on the full search space (ternary fell through canary-sweep.result == 'success'); 1k1k/8k1k/eval/multi-node all correctly skipped because their matrices are empty for this config.
    • Catches: the fallback branch of the ternary — broken matrices wouldn't silently skip single-node sweeps.
  • T4 — full-sweep-enabled + evals-only → canary skipped via label gate (run 26475589608)

    • Result: canary-select skipped by its !contains(... 'evals-only') guard despite a valid canary candidate being available on gptoss-fp4-h100-vllm; canary-sweep skipped; fan-out matrices populated with the full search space.
    • Catches: the evals-only carve-out, which neither T1/T2 (canary runs) nor T3 (canary has no candidate) exercise.
  • T5 — Push to main → canary never fires

    • Method: workflow-logic inspection only (no live run).
    • Expected: canary-select.if requires github.event_name == 'pull_request', so on a push event it evaluates false → canary-sweep skipped → every sweep-* job runs via the success || skipped allowlist on the full search-space-config. Identical to pre-PR push-to-main behavior.
  • T8 — non-canary-full-sweep-enabled label → canary skipped, full fan-out (run 26471525429)

    • Setup: swap PR label from full-sweep-enabled to non-canary-full-sweep-enabled.
    • Result: canary-select skipped (its gate requires exact-match on full-sweep-enabled); canary-sweep skipped; fan-out matrices populated with the full search space (no dedup since canary-sweep.result != 'success').
    • Catches: the new label correctly takes the no-canary path AND the matrix ternary correctly falls through to search-space-config when canary is skipped.

Comment on lines +182 to +210
needs: canary-select
if: ${{ needs.canary-select.outputs.canary-config != '' && needs.canary-select.outputs.canary-config != '[]' }}
uses: ./.github/workflows/benchmark-tmpl.yml
name: canary /
strategy:
fail-fast: false
matrix:
config: ${{ fromJson(needs.canary-select.outputs.canary-config) }}
secrets: inherit
with:
exp-name: ${{ matrix.config.exp-name }}
isl: ${{ matrix.config.isl }}
osl: ${{ matrix.config.osl }}
max-model-len: ${{ matrix.config.max-model-len }}
runner: ${{ matrix.config.runner }}
image: ${{ matrix.config.image }}
model: ${{ matrix.config.model }}
model-prefix: ${{ matrix.config.model-prefix }}
framework: ${{ matrix.config.framework }}
precision: ${{ matrix.config.precision }}
tp: ${{ matrix.config.tp }}
ep: ${{ matrix.config.ep }}
dp-attn: ${{ matrix.config.dp-attn }}
conc: ${{ matrix.config.conc }}
spec-decoding: ${{ matrix.config.spec-decoding }}
disagg: ${{ matrix.config.disagg }}
run-eval: false

sweep-multi-node-1k1k:
@Oseltamivir Oseltamivir changed the title Sequential canary for full-sweep PRs + pre-flight research in docker-tag-monitor run-sweep: gate full-sweep PRs behind a sequential canary May 20, 2026
Oseltamivir added a commit that referenced this pull request May 26, 2026
Per PR #1503 test plan (T1 — canary success → fan-out runs with dedup).
The minimaxm2.5-fp8-b200-vllm config has both single_node 1k1k and 8k1k
scenarios, so canary-select has lowest-conc candidates to pick from. The
master config image (vllm/vllm-openai:v0.21.0) is known-good, so the
canary is expected to succeed and the fan-out should run against
remaining-search-space-config (canary entry removed → no duplicate run).
Comment thread perf-changelog.yaml Outdated
Oseltamivir added a commit that referenced this pull request May 26, 2026
Per PR #1503 test plan (T1 — canary success → fan-out runs with dedup).
The minimaxm2.5-fp8-b200-vllm config has both single_node 1k1k and 8k1k
scenarios, so canary-select has lowest-conc candidates to pick from. The
master config image (vllm/vllm-openai:v0.21.0) is known-good, so the
canary is expected to succeed and the fan-out should run against
remaining-search-space-config (canary entry removed → no duplicate run).
always() &&
needs.setup.result == 'success' &&
needs.setup.outputs.reuse-enabled != 'true' &&
(needs.canary-sweep.result == 'success' || needs.canary-sweep.result == 'skipped') &&
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Canary gate doesn't handle cancelled state as documented

Low Severity

The PR description states the fan-out gate "blocks only on canary-sweep.result == 'failure' — every other state (success, skipped, cancelled) proceeds, so a bug in the canary mechanism never blocks the rest of the sweep." However, the condition (needs.canary-sweep.result == 'success' || needs.canary-sweep.result == 'skipped') does not allow cancelled through. Using needs.canary-sweep.result != 'failure' would match the documented design intent and be more resilient.

Additional Locations (2)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 0ba1cf7. Configure here.

@github-actions
Copy link
Copy Markdown
Contributor

Oseltamivir added a commit that referenced this pull request May 26, 2026
Per PR #1503 test plan (T1 — canary success → fan-out runs with dedup).
The minimaxm2.5-fp8-b200-vllm config has both single_node 1k1k and 8k1k
scenarios, so canary-select has lowest-conc candidates to pick from. The
master config image (vllm/vllm-openai:v0.21.0) is known-good, so the
canary is expected to succeed and the fan-out should run against
remaining-search-space-config (canary entry removed → no duplicate run).
@github-actions
Copy link
Copy Markdown
Contributor

@Oseltamivir Oseltamivir added non-canary-full-sweep-enabled Run the full sweep without the canary gate (full search space, no trim) and removed full-sweep-enabled labels May 26, 2026
@github-actions
Copy link
Copy Markdown
Contributor

@Oseltamivir Oseltamivir added full-sweep-enabled and removed non-canary-full-sweep-enabled Run the full sweep without the canary gate (full search space, no trim) labels May 26, 2026
@github-actions
Copy link
Copy Markdown
Contributor

@Oseltamivir Oseltamivir added the evals-only Restrict full-sweep to eval matrices only; bypass canary smoke-test label May 26, 2026
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 759a193. Configure here.

Comment thread perf-changelog.yaml Outdated
- gptoss-fp4-h100-vllm
description:
- "[T4 / PR #1503 canary-gate test] Touch gptoss-fp4-h100-vllm (which has 1k1k/8k1k entries → valid canary candidate exists) and add `evals-only` label alongside `full-sweep-enabled`. Expected: canary-select SKIPPED via the `!contains(..., 'evals-only')` filter on line 151 — the canary suppression is label-driven, not absence-of-candidate. (Retrigger #2.)"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1503
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test-only changelog entry committed to production file

Low Severity

The entry at the bottom of perf-changelog.yaml is clearly test scaffolding — its description contains "[T4 / PR #1503 canary-gate test]" and explains the expected test behavior. This will ship to main on merge and trigger an unnecessary benchmark sweep for gptoss-fp4-h100-vllm on any future workflow invocation that reads the changelog.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 759a193. Configure here.

@Oseltamivir Oseltamivir removed the evals-only Restrict full-sweep to eval matrices only; bypass canary smoke-test label May 26, 2026
@github-actions
Copy link
Copy Markdown
Contributor

When a PR carries `full-sweep-enabled` (and not `evals-only`), pick the
lowest-conc single-node benchmark entry as a canary and run it before
fanning out the full sweep. If the canary fails, the eight fan-out jobs
are skipped to save cluster time on shared failures (bad image tag,
removed CLI flag, etc.).

Design choices:
- Canary candidacy is restricted to single_node['1k1k' | '8k1k'] and
  excludes entries with run-eval: true, so the canary is always a pure
  benchmark smoke test using the existing single-node template.
- The canary entry is removed from the regular fan-out's matrix (via
  remaining-search-space-config) only when the canary actually succeeded.
  On canary skip / cancel / canary-select failure, the regular fan-out
  falls back to the full search-space-config so coverage is preserved.
- The fan-out gate blocks only on `canary-sweep.result == 'failure'` --
  every other state (success, skipped, cancelled) proceeds, so a bug in
  the canary mechanism never blocks the rest of the sweep.
- Non-full-sweep PRs, draft PRs, pushes to main, and the reuse path all
  behave identically to before via existing gates.

The aggregated results_bmk artifact picks up both the canary's row and
the regular fan-out's rows via the existing bmk_* glob -- each entry
appears exactly once.
Previously the fan-out gate was `needs.canary-sweep.result != 'failure'`,
which let `cancelled` (and any unknown future result) fall through. A
cancelled canary then ran the FULL fan-out matrix — including the canary's
own entry — without canary validation, the worst-case outcome.

Replace with an explicit allowlist:
  (result == 'success' || result == 'skipped')

- success: canary passed → fan-out runs with deduped (remaining) matrix
- skipped: no canary candidate (multi-node-only / evals-only) → fan-out
  runs with full matrix (no canary to dedup against)
- failure / cancelled / anything else: fan-out blocked

The matrix-ternary `result == 'success' && remaining-search-space-config
|| full-search-space-config` already had the right shape and is untouched.

Applies to all 8 fan-out jobs (single/multi-node 1k1k+8k1k, agentic,
multi-node-agentic, evals, multi-node-evals).
…nary)

New label that triggers the full sweep without running the canary gate.
Acts as an escape hatch when:
- The canary is known to be flaky / unreliable for this config
- The user wants the full sweep without the canary delay
- The canary's chosen entry is not a representative smoke test

Behavior matrix:
  sweep-enabled                  - trims to max(conc) per parallelism, with canary
  full-sweep-enabled             - full intermediate conc sweep, with canary
  non-canary-full-sweep-enabled  - full intermediate conc sweep, NO canary

How it works (no changes to existing label semantics):
- Setup gate triggers on any of the 3 labels (was: 2)
- canary-select gate still requires `full-sweep-enabled` (exact array match,
  so `non-canary-full-sweep-enabled` does NOT match) → canary skips → all
  fan-out jobs run on full search space via the `== 'skipped'` allowlist
- TRIM_CONC env is unchanged — only `sweep-enabled` enables trim, so the
  new label correctly behaves as "full sweep"
- The reject-conflicting-labels step is now a 3-way exclusion: at most
  one of {sweep-enabled, full-sweep-enabled, non-canary-full-sweep-enabled}
- The same gate updates apply to the comment-visualizer job
- Concurrency-group filter excludes the new label too so toggling it
  uses the same `'active'` group key as the other sweep labels
@Oseltamivir Oseltamivir added the evals-only Restrict full-sweep to eval matrices only; bypass canary smoke-test label May 26, 2026
@Oseltamivir Oseltamivir removed the evals-only Restrict full-sweep to eval matrices only; bypass canary smoke-test label May 26, 2026
@github-actions
Copy link
Copy Markdown
Contributor

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

2 participants