run-sweep: gate full-sweep PRs behind a sequential canary#1503
run-sweep: gate full-sweep PRs behind a sequential canary#1503Oseltamivir wants to merge 3 commits into
Conversation
| needs: canary-select | ||
| if: ${{ needs.canary-select.outputs.canary-config != '' && needs.canary-select.outputs.canary-config != '[]' }} | ||
| uses: ./.github/workflows/benchmark-tmpl.yml | ||
| name: canary / | ||
| strategy: | ||
| fail-fast: false | ||
| matrix: | ||
| config: ${{ fromJson(needs.canary-select.outputs.canary-config) }} | ||
| secrets: inherit | ||
| with: | ||
| exp-name: ${{ matrix.config.exp-name }} | ||
| isl: ${{ matrix.config.isl }} | ||
| osl: ${{ matrix.config.osl }} | ||
| max-model-len: ${{ matrix.config.max-model-len }} | ||
| runner: ${{ matrix.config.runner }} | ||
| image: ${{ matrix.config.image }} | ||
| model: ${{ matrix.config.model }} | ||
| model-prefix: ${{ matrix.config.model-prefix }} | ||
| framework: ${{ matrix.config.framework }} | ||
| precision: ${{ matrix.config.precision }} | ||
| tp: ${{ matrix.config.tp }} | ||
| ep: ${{ matrix.config.ep }} | ||
| dp-attn: ${{ matrix.config.dp-attn }} | ||
| conc: ${{ matrix.config.conc }} | ||
| spec-decoding: ${{ matrix.config.spec-decoding }} | ||
| disagg: ${{ matrix.config.disagg }} | ||
| run-eval: false | ||
|
|
||
| sweep-multi-node-1k1k: |
28ecd9a to
78b51e3
Compare
Per PR #1503 test plan (T1 — canary success → fan-out runs with dedup). The minimaxm2.5-fp8-b200-vllm config has both single_node 1k1k and 8k1k scenarios, so canary-select has lowest-conc candidates to pick from. The master config image (vllm/vllm-openai:v0.21.0) is known-good, so the canary is expected to succeed and the fan-out should run against remaining-search-space-config (canary entry removed → no duplicate run).
Per PR #1503 test plan (T1 — canary success → fan-out runs with dedup). The minimaxm2.5-fp8-b200-vllm config has both single_node 1k1k and 8k1k scenarios, so canary-select has lowest-conc candidates to pick from. The master config image (vllm/vllm-openai:v0.21.0) is known-good, so the canary is expected to succeed and the fan-out should run against remaining-search-space-config (canary entry removed → no duplicate run).
65d7856 to
0c1a94d
Compare
| always() && | ||
| needs.setup.result == 'success' && | ||
| needs.setup.outputs.reuse-enabled != 'true' && | ||
| (needs.canary-sweep.result == 'success' || needs.canary-sweep.result == 'skipped') && |
There was a problem hiding this comment.
Canary gate doesn't handle cancelled state as documented
Low Severity
The PR description states the fan-out gate "blocks only on canary-sweep.result == 'failure' — every other state (success, skipped, cancelled) proceeds, so a bug in the canary mechanism never blocks the rest of the sweep." However, the condition (needs.canary-sweep.result == 'success' || needs.canary-sweep.result == 'skipped') does not allow cancelled through. Using needs.canary-sweep.result != 'failure' would match the documented design intent and be more resilient.
Additional Locations (2)
Reviewed by Cursor Bugbot for commit 0ba1cf7. Configure here.
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26467367124 |
Per PR #1503 test plan (T1 — canary success → fan-out runs with dedup). The minimaxm2.5-fp8-b200-vllm config has both single_node 1k1k and 8k1k scenarios, so canary-select has lowest-conc candidates to pick from. The master config image (vllm/vllm-openai:v0.21.0) is known-good, so the canary is expected to succeed and the fan-out should run against remaining-search-space-config (canary entry removed → no duplicate run).
9286ade to
de73181
Compare
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26471353341 |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26471519153 |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26471525429 |
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 759a193. Configure here.
| - gptoss-fp4-h100-vllm | ||
| description: | ||
| - "[T4 / PR #1503 canary-gate test] Touch gptoss-fp4-h100-vllm (which has 1k1k/8k1k entries → valid canary candidate exists) and add `evals-only` label alongside `full-sweep-enabled`. Expected: canary-select SKIPPED via the `!contains(..., 'evals-only')` filter on line 151 — the canary suppression is label-driven, not absence-of-candidate. (Retrigger #2.)" | ||
| pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1503 |
There was a problem hiding this comment.
Test-only changelog entry committed to production file
Low Severity
The entry at the bottom of perf-changelog.yaml is clearly test scaffolding — its description contains "[T4 / PR #1503 canary-gate test]" and explains the expected test behavior. This will ship to main on merge and trigger an unnecessary benchmark sweep for gptoss-fp4-h100-vllm on any future workflow invocation that reads the changelog.
Reviewed by Cursor Bugbot for commit 759a193. Configure here.
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26471597565 |
When a PR carries `full-sweep-enabled` (and not `evals-only`), pick the lowest-conc single-node benchmark entry as a canary and run it before fanning out the full sweep. If the canary fails, the eight fan-out jobs are skipped to save cluster time on shared failures (bad image tag, removed CLI flag, etc.). Design choices: - Canary candidacy is restricted to single_node['1k1k' | '8k1k'] and excludes entries with run-eval: true, so the canary is always a pure benchmark smoke test using the existing single-node template. - The canary entry is removed from the regular fan-out's matrix (via remaining-search-space-config) only when the canary actually succeeded. On canary skip / cancel / canary-select failure, the regular fan-out falls back to the full search-space-config so coverage is preserved. - The fan-out gate blocks only on `canary-sweep.result == 'failure'` -- every other state (success, skipped, cancelled) proceeds, so a bug in the canary mechanism never blocks the rest of the sweep. - Non-full-sweep PRs, draft PRs, pushes to main, and the reuse path all behave identically to before via existing gates. The aggregated results_bmk artifact picks up both the canary's row and the regular fan-out's rows via the existing bmk_* glob -- each entry appears exactly once.
Previously the fan-out gate was `needs.canary-sweep.result != 'failure'`, which let `cancelled` (and any unknown future result) fall through. A cancelled canary then ran the FULL fan-out matrix — including the canary's own entry — without canary validation, the worst-case outcome. Replace with an explicit allowlist: (result == 'success' || result == 'skipped') - success: canary passed → fan-out runs with deduped (remaining) matrix - skipped: no canary candidate (multi-node-only / evals-only) → fan-out runs with full matrix (no canary to dedup against) - failure / cancelled / anything else: fan-out blocked The matrix-ternary `result == 'success' && remaining-search-space-config || full-search-space-config` already had the right shape and is untouched. Applies to all 8 fan-out jobs (single/multi-node 1k1k+8k1k, agentic, multi-node-agentic, evals, multi-node-evals).
…nary)
New label that triggers the full sweep without running the canary gate.
Acts as an escape hatch when:
- The canary is known to be flaky / unreliable for this config
- The user wants the full sweep without the canary delay
- The canary's chosen entry is not a representative smoke test
Behavior matrix:
sweep-enabled - trims to max(conc) per parallelism, with canary
full-sweep-enabled - full intermediate conc sweep, with canary
non-canary-full-sweep-enabled - full intermediate conc sweep, NO canary
How it works (no changes to existing label semantics):
- Setup gate triggers on any of the 3 labels (was: 2)
- canary-select gate still requires `full-sweep-enabled` (exact array match,
so `non-canary-full-sweep-enabled` does NOT match) → canary skips → all
fan-out jobs run on full search space via the `== 'skipped'` allowlist
- TRIM_CONC env is unchanged — only `sweep-enabled` enables trim, so the
new label correctly behaves as "full sweep"
- The reject-conflicting-labels step is now a 3-way exclusion: at most
one of {sweep-enabled, full-sweep-enabled, non-canary-full-sweep-enabled}
- The same gate updates apply to the comment-visualizer job
- Concurrency-group filter excludes the new label too so toggling it
uses the same `'active'` group key as the other sweep labels
2f6aa0c to
afb408b
Compare
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26475589608 |


Summary
When a PR carries
full-sweep-enabled(and notevals-only), pick the lowest-conc single-node benchmark entry as a canary and run it before fanning out the full sweep. If the canary fails, the eight fan-out jobs are skipped to save cluster time on shared failures (bad image tag, removed CLI flag, etc.).Design choices:
single_node['1k1k' | '8k1k']and excludes entries withrun-eval: true, so the canary is always a pure benchmark smoke test using the existing single-node template.remaining-search-space-config) only when the canary actually succeeded. On canary skip / cancel / canary-select failure, the regular fan-out falls back to the full search-space-config so coverage is preserved.canary-sweep.result == 'success' || == 'skipped'(allowlist, not!= 'failure'), so a cancelled canary also blocks fan-out — preventing the case where in-flight fan-out jobs continue after the user explicitly cancels the canary.non-canary-full-sweep-enabledlabel runs the full sweep without the canary gate (escape hatch when the canary is flaky or unrepresentative). The three sweep labels are now mutually exclusive.The aggregated
results_bmkartifact picks up both the canary's row and the regular fan-out's rows via the existingbmk_*glob — each entry appears exactly once.Docker-tag-monitor research block split out to #1538.
Test plan
Each test ran sequentially on this branch via PR
synchronizeevents.gptoss-fp4-h100-vllmwas used as the changelog target for T1/T2/T4 (h100 has shorter CI queue than b200/b300);kimik2.5-int4-h100-vllm(agentic-only) was used for T3.T1 — Canary success → fan-out runs with dedup (run 26467367124)
canary-sweepran the lowest-conc 1k1k entry (tp=2, conc=4) and succeeded. Downstreamsingle-node-1k1kran againstremaining-search-space-config(the canary entry de-duped). Final tally: 41 success, 7 skipped, 0 failures.remove_onecorrectness and the(canary == success && remaining) || fullmatrix ternary.T2 — Canary failure → fan-out skipped (run 26471353341)
vllm/vllm-openai:nonexistent-tag-T2-canary-failure-test.canary-sweep.result == 'failure'; all 8 fan-out jobs (single-node 1k1k/8k1k, single-node agentic/eval, all multi-node variants) andcollect-resultscorrectly skipped via thesuccess || skippedallowlist.T3 — No canary candidate → fan-out on full search space (run 26471597565)
kimik2.5-int4-h100-vllm(agentic-only config; no fixed-seq-len entries → emptysingle_node['1k1k']and['8k1k']arrays).canary-selectsucceeded but returned[];canary-sweepskipped via the!= '' && != '[]'guard;agenticfan-out ran on the full search space (ternary fell throughcanary-sweep.result == 'success'); 1k1k/8k1k/eval/multi-node all correctly skipped because their matrices are empty for this config.T4 —
full-sweep-enabled+evals-only→ canary skipped via label gate (run 26475589608)canary-selectskipped by its!contains(... 'evals-only')guard despite a valid canary candidate being available ongptoss-fp4-h100-vllm;canary-sweepskipped; fan-out matrices populated with the full search space.T5 — Push to
main→ canary never firescanary-select.ifrequiresgithub.event_name == 'pull_request', so on apushevent it evaluates false →canary-sweepskipped → everysweep-*job runs via thesuccess || skippedallowlist on the full search-space-config. Identical to pre-PR push-to-main behavior.T8 —
non-canary-full-sweep-enabledlabel → canary skipped, full fan-out (run 26471525429)full-sweep-enabledtonon-canary-full-sweep-enabled.canary-selectskipped (its gate requires exact-match onfull-sweep-enabled);canary-sweepskipped; fan-out matrices populated with the full search space (no dedup sincecanary-sweep.result != 'success').search-space-configwhen canary is skipped.