ci: collapse 5 required PR-time checks into a single Merge Gate verdict#867
ci: collapse 5 required PR-time checks into a single Merge Gate verdict#867danielmeppiel merged 2 commits intomainfrom
Conversation
Merge Gate now waits for ALL 5 PR-time required checks (Build & Test (Linux) from ci.yml, plus Build/Smoke/Integration/Release-Validate (Linux) stubs from ci-integration-pr-stub.yml) and emits one verdict. This is the Tide / bors pattern: one gate to rule them all. Once this PR merges, branch protection can be flipped to require ONLY 'Merge Gate / gate'. The other 5 checks remain informational. This decouples the protection ruleset from CI workflow topology -- adding or renaming an underlying check no longer requires a ruleset edit. Script changes: - EXPECTED_CHECK (single string) -> EXPECTED_CHECKS (comma-separated). - Per-check state in parallel indexed arrays (works on bash 3.2+, no associative arrays required). - Fail-fast on the first failing check (exit 1 with annotation). - Timeout categorizes missing (never appeared) vs stuck (appeared but did not complete) and emits distinct error messages. - Same exit-code semantics as before, applied to the aggregate. Workflow changes: - Pass the 5 expected check names via EXPECTED_CHECKS env. - 'Merge Gate / gate' is excluded from the wait list by construction (it would deadlock waiting for itself). Tested live against PR #862 head SHA (all 5 OK -> exit 0) and against SHA 0000... (all 5 missing -> exit 2 with clear error annotation). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
This PR updates the Merge Gate CI orchestrator to become the single required PR-time verdict by aggregating all currently-required PR checks (Tier 1 + Tier 2 stub checks) via the GitHub Checks API, enabling branch protection to require only Merge Gate / gate.
Changes:
- Expand
merge_gate_wait.shfrom waiting on a single check to polling a comma-separated list of expected check names and aggregating them into one pass/fail result. - Update
merge-gate.ymlto pass the full list of required PR-time check names into the wait script. - Add a CHANGELOG entry describing the new single-authority Merge Gate model.
Show a summary per file
| File | Description |
|---|---|
CHANGELOG.md |
Adds an Unreleased entry describing Merge Gate aggregating all PR-time required checks into one verdict. |
.github/workflows/merge-gate.yml |
Passes EXPECTED_CHECKS (all 5 PR-time check names) to the gate wait script; updates header comments to reflect the single-authority model. |
.github/scripts/ci/merge_gate_wait.sh |
Implements parsing/polling/aggregation for multiple required checks with fail-fast behavior and timeout categorization. |
Copilot's findings
- Files reviewed: 3/3 changed files
- Comments generated: 3
| - New `enterprise/governance-guide.md` documentation page: flagship governance reference for CISO / VPE / Platform Tech Lead audiences, covering enforcement points, bypass contract, failure semantics, air-gapped operation, rollout playbook, and known gaps. Trims duplicated content in `governance.md`, `apm-policy.md`, and `integrations/github-rulesets.md`. Adds `templates/apm-policy-starter.yml`. (#851) | ||
| - `apm install` now supports Azure DevOps AAD bearer-token auth via `az account get-access-token`, with PAT-first fallback for orgs that disable PAT creation. Closes #852 (#856) | ||
| - New CI safety net: `merge-gate.yml` orchestrator turns dropped `pull_request` webhook deliveries into clear red checks instead of stuck `Expected -- Waiting for status to be reported`. Triggers on both `pull_request` and `pull_request_target` for redundancy. (#865) (PR follow-up to #856 CI flake) | ||
| - `merge-gate.yml` now aggregates ALL PR-time required checks (`Build & Test (Linux)` + 4 stubs from `ci-integration-pr-stub.yml`) into a single `Merge Gate / gate` verdict. Branch protection requires only this single check, decoupling the ruleset from CI workflow topology (Tide / bors pattern). |
There was a problem hiding this comment.
Changelog entries are required to end with the PR number per the repo's Keep a Changelog convention. This new line should end with something like "(#<PR_NUMBER>)" (and keep it as a single line entry under [Unreleased]).
| poll_count=$((poll_count + 1)) | ||
| pending_count=0 | ||
|
|
||
| for i in "${!checks[@]}"; do | ||
| c="${checks[i]}" | ||
| [ "${check_status[i]}" = "pending" ] || continue | ||
| pending_count=$((pending_count + 1)) | ||
|
|
There was a problem hiding this comment.
pending_count is incremented for checks that were pending at the start of the poll iteration, even if they become ok during that same iteration. This means the script can sleep an extra POLL_SEC after the last required check completes. Consider recomputing pending count after updating check_status, or decrementing when a check transitions out of pending.
|
|
||
| # Per-check state held in two parallel indexed arrays (avoids bash 4+ | ||
| # associative arrays so the script also works on stock macOS bash 3.2). | ||
| # Status values: pending, ok, fail, missing |
There was a problem hiding this comment.
The comment says status values include "missing", but check_status is only ever set to pending, ok, or fail. Either update the comment to match reality, or set check_status[i]="missing" when a check has never been observed so the state model stays consistent.
| # Status values: pending, ok, fail, missing | |
| # Status values: pending, ok, fail |
Phase 2 complete: ruleset flipped
Before (6 required):
After (1 required):
The 5 underlying checks continue to run on every PR; they just become informational signals consumed by the gate. Any future check addition/rename is now a one-line edit to Rollback
|
The dual-trigger pattern (pull_request + pull_request_target with concurrency cancel-in-progress) shipped in #865 was over-engineered. It produced TWO 'gate' check-runs per SHA -- one SUCCESS, one CANCELLED -- and branch protection's status-check rollup treats CANCELLED as failure, so PRs were silently BLOCKED unless an admin overrode (which masked the bug on #867). GitHub Actions has no primitive for 'either of these events succeeded'. World-class OSS projects (kubernetes, rust, deno, next.js) accept this and use a single trigger. The cost: a dropped 'pull_request' webhook (rare; observed once on PR #856) requires manual recovery. Recovery paths now documented at top of file: - push empty commit - gh workflow run merge-gate.yml -f pr_number=NNN - close + reopen PR Replaces the dual-trigger + bootstrap-fetch dance with a clean two-job flow: resolve-sha (handles workflow_dispatch input or PR head) then gate (sparse checkout + run script). Same script, same exit codes, same EXPECTED_CHECKS env. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* chore: prepare v0.9.2 release Bumps version to 0.9.2 and finalizes CHANGELOG with one-line summaries for each PR merged since 0.9.1. Highlights: - ADO AAD bearer-token auth (#856) - Governance Guide + enterprise docs IA refactor (#851, #858) - Merge Gate orchestrator + single-authority aggregation (#865, #867) - Landing + first-package docs rewrite (#855, #866) - gh-aw imports migration (#864) - Custom-port surfacing fix (#804) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * ci: simplify merge-gate to single pull_request trigger The dual-trigger pattern (pull_request + pull_request_target with concurrency cancel-in-progress) shipped in #865 was over-engineered. It produced TWO 'gate' check-runs per SHA -- one SUCCESS, one CANCELLED -- and branch protection's status-check rollup treats CANCELLED as failure, so PRs were silently BLOCKED unless an admin overrode (which masked the bug on #867). GitHub Actions has no primitive for 'either of these events succeeded'. World-class OSS projects (kubernetes, rust, deno, next.js) accept this and use a single trigger. The cost: a dropped 'pull_request' webhook (rare; observed once on PR #856) requires manual recovery. Recovery paths now documented at top of file: - push empty commit - gh workflow run merge-gate.yml -f pr_number=NNN - close + reopen PR Replaces the dual-trigger + bootstrap-fetch dance with a clean two-job flow: resolve-sha (handles workflow_dispatch input or PR head) then gate (sparse checkout + run script). Same script, same exit codes, same EXPECTED_CHECKS env. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * ci: collapse merge-gate into a single job (one check-run in PR UI) The two-job split (resolve-sha + gate) created two visible check-runs. Inlining the SHA resolution as a step within the gate job leaves only one check-run -- 'Merge Gate / gate' -- on the PR. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* docs(ci): document merge-gate architecture and ruleset context gotcha
- Document merge-gate.yml as the single-authority PR-time aggregator
- Mark ci-integration-pr-stub.yml as DEPRECATED (slated for deletion)
- Renumber workflow list (now 6 entries, was misnumbered with two #3s)
- New section: Branch Protection & Required Checks
- Ruleset 'context' field MUST match check-run name ('gate'), not the
UI display string ('Merge Gate / gate'). Storing the display string
causes permanent 'Expected - Waiting for status to be reported' that
blocked PR #860 today
- Adding new required checks goes through EXPECTED_CHECKS in
merge-gate.yml, not the ruleset
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* chore(ci): remove deprecated PR-time stub workflow
The four PR stubs (Build/Smoke/Integration/Release Validation - Linux)
were a holdover from the pre-merge-gate model where branch protection
required each Tier 2 check name directly. After #867, branch protection
requires only the single 'gate' check from merge-gate.yml, so the stubs
are dead weight that fire on every PR for no reason.
Changes:
- Delete .github/workflows/ci-integration-pr-stub.yml
- Reduce EXPECTED_CHECKS in merge-gate.yml to just 'Build & Test (Linux)'
(the only PR-time check we still emit)
- Update merge-gate.yml + ci-integration.yml header comments
- Update cicd.instructions.md (drop DEPRECATED entry, renumber to 5
workflows)
- Drop stale CODEOWNERS reference to the deleted file
- CHANGELOG entry under [Unreleased] > Removed
Stacked on #874 (docs).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---------
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Why
PR #865 shipped Merge Gate but only watching one check (
Build & Test (Linux)). The 4 stub checks (Build (Linux),Smoke Test (Linux),Integration Tests (Linux),Release Validation (Linux)) remained separately required in branch protection. That's half the architecture.This PR finishes the job: Merge Gate aggregates all PR-time required checks into a single verdict. Branch protection then only needs to require
Merge Gate / gate. This is the Tide (Kubernetes) / bors pattern.Before vs after
What changes
merge_gate_wait.shEXPECTED_CHECK(single) ->EXPECTED_CHECKS(comma-separated list). Parallel indexed arrays for per-check state (no bash 4+ associative arrays required). Fail-fast on first failing check. Timeout categorizes missing vs stuck.merge-gate.ymlEXPECTED_CHECKSenv. Updated header comment to reflect single-authority model.CHANGELOG.mdWhy this is better
Merge Gate / gategreen". One question, one answer.EXPECTED_CHECKSinmerge-gate.yml. No ruleset edit, no admin permissions needed.Trade-offs (honest)
merge-gate.ymlitself, the PR is stuck. Mitigated by dual triggers (pull_request+pull_request_targetwith concurrency dedup) -- both webhook channels would have to drop simultaneously to brick a PR. Same trade Tide/bors make.EXPECTED_CHECKS. This is by design (single source of truth), but it does mean the gate can be out of sync with the ruleset if someone adds a required check via the UI without updating the workflow. Mitigation: the check list is in version control, so PR review catches drift.Test results (live)
shellcheck merge_gate_wait.shdb0099eOK (success)0000...0000SHA, 1m timeoutdeclare -A)Phase 2: flip plan (post-merge, manual via API)
Once you ack the merge, I will run:
After flip: only 1 required check (
Merge Gate / gate). The 5 underlying checks become informational and continue to run as today.Rollback
If the gate misbehaves post-flip, restore the prior 5 required-check list via the same API. The underlying workflows continue to run independently of branch protection, so reverting is a one-API-call operation with zero workflow changes needed.
Out of scope
ci.yml,ci-integration-pr-stub.yml,ci-integration.yml). They keep doing exactly what they do today.ci-integration-pr-stub.yml. The stubs still need to run for the merge-queue model (real work inci-integration.ymlonmerge_group). Only the required-check ruleset entry changes.