Skip to content

ci: collapse 5 required PR-time checks into a single Merge Gate verdict#867

Merged
danielmeppiel merged 2 commits intomainfrom
ci/merge-gate-aggregate
Apr 23, 2026
Merged

ci: collapse 5 required PR-time checks into a single Merge Gate verdict#867
danielmeppiel merged 2 commits intomainfrom
ci/merge-gate-aggregate

Conversation

@danielmeppiel
Copy link
Copy Markdown
Collaborator

Why

PR #865 shipped Merge Gate but only watching one check (Build & Test (Linux)). The 4 stub checks (Build (Linux), Smoke Test (Linux), Integration Tests (Linux), Release Validation (Linux)) remained separately required in branch protection. That's half the architecture.

This PR finishes the job: Merge Gate aggregates all PR-time required checks into a single verdict. Branch protection then only needs to require Merge Gate / gate. This is the Tide (Kubernetes) / bors pattern.

Before vs after

                BEFORE                                           AFTER
  +-----------------------------+               +-----------------------------+
  | Branch protection requires: |               | Branch protection requires: |
  |   Build & Test (Linux)      |               |   Merge Gate / gate         |
  |   Build (Linux)             |               +--------------+--------------+
  |   Smoke Test (Linux)        |                              | aggregates
  |   Integration Tests (Linux) |                              v
  |   Release Validation (Linux)|               +-----------------------------+
  |   Merge Gate / gate         |               | merge_gate_wait.sh polls:   |
  +-----------------------------+               |   Build & Test (Linux)      |
       6 required checks                        |   Build (Linux)             |
                                                |   Smoke Test (Linux)        |
                                                |   Integration Tests (Linux) |
                                                |   Release Validation (Linux)|
                                                +-----------------------------+
                                                       1 required check
                                                       5 informational checks

What changes

File Change
merge_gate_wait.sh EXPECTED_CHECK (single) -> EXPECTED_CHECKS (comma-separated list). Parallel indexed arrays for per-check state (no bash 4+ associative arrays required). Fail-fast on first failing check. Timeout categorizes missing vs stuck.
merge-gate.yml Pass all 5 check names via EXPECTED_CHECKS env. Updated header comment to reflect single-authority model.
CHANGELOG.md One line.

Why this is better

  • Single source of truth. "Is this PR mergeable" = "is Merge Gate / gate green". One question, one answer.
  • Decoupled. Adding a new required check = edit EXPECTED_CHECKS in merge-gate.yml. No ruleset edit, no admin permissions needed.
  • Same recovery story everywhere. Any stuck check (not just Build & Test) gets the same clear red annotation with retry instructions.
  • Cleaner UI. PR check list shows one required + N informational instead of 6 mandatory.
  • Industry-standard pattern. Tide (Kubernetes) and bors (Rust) both work this way for the same reasons.

Trade-offs (honest)

  • Single point of failure. If GitHub drops the dispatch for merge-gate.yml itself, the PR is stuck. Mitigated by dual triggers (pull_request + pull_request_target with concurrency dedup) -- both webhook channels would have to drop simultaneously to brick a PR. Same trade Tide/bors make.
  • Adding a check requires editing EXPECTED_CHECKS. This is by design (single source of truth), but it does mean the gate can be out of sync with the ruleset if someone adds a required check via the UI without updating the workflow. Mitigation: the check list is in version control, so PR review catches drift.

Test results (live)

Test Inputs Result
shellcheck shellcheck merge_gate_wait.sh clean
Aggregate success All 5 names against PR #862 head SHA db0099e exit 0, all 5 reported OK (success)
Missing-check timeout 2 names against 0000...0000 SHA, 1m timeout exit 2, lists both names + recovery instructions
bash 3.2 compatibility macOS stock bash passes (parallel indexed arrays, no declare -A)
Self-bootstrap curl 404 fallback preserved from #865

Phase 2: flip plan (post-merge, manual via API)

Once you ack the merge, I will run:

# Read current ruleset
gh api repos/microsoft/apm/rulesets/9294522 > /tmp/ruleset.json

# New required_status_checks: ONLY Merge Gate / gate
jq '(.rules[] | select(.type=="required_status_checks") | .parameters.required_status_checks) =
  [{"context":"Merge Gate / gate","integration_id":15368}]' /tmp/ruleset.json > /tmp/ruleset_new.json

jq '{name, target, enforcement, bypass_actors, conditions, rules}' /tmp/ruleset_new.json \
  | gh api -X PUT repos/microsoft/apm/rulesets/9294522 --input -

After flip: only 1 required check (Merge Gate / gate). The 5 underlying checks become informational and continue to run as today.

Rollback

If the gate misbehaves post-flip, restore the prior 5 required-check list via the same API. The underlying workflows continue to run independently of branch protection, so reverting is a one-API-call operation with zero workflow changes needed.

Out of scope

  • Changes to underlying CI workflows (ci.yml, ci-integration-pr-stub.yml, ci-integration.yml). They keep doing exactly what they do today.
  • Deleting ci-integration-pr-stub.yml. The stubs still need to run for the merge-queue model (real work in ci-integration.yml on merge_group). Only the required-check ruleset entry changes.

Merge Gate now waits for ALL 5 PR-time required checks (Build & Test
(Linux) from ci.yml, plus Build/Smoke/Integration/Release-Validate
(Linux) stubs from ci-integration-pr-stub.yml) and emits one verdict.
This is the Tide / bors pattern: one gate to rule them all.

Once this PR merges, branch protection can be flipped to require ONLY
'Merge Gate / gate'. The other 5 checks remain informational. This
decouples the protection ruleset from CI workflow topology -- adding
or renaming an underlying check no longer requires a ruleset edit.

Script changes:
- EXPECTED_CHECK (single string) -> EXPECTED_CHECKS (comma-separated).
- Per-check state in parallel indexed arrays (works on bash 3.2+, no
  associative arrays required).
- Fail-fast on the first failing check (exit 1 with annotation).
- Timeout categorizes missing (never appeared) vs stuck (appeared but
  did not complete) and emits distinct error messages.
- Same exit-code semantics as before, applied to the aggregate.

Workflow changes:
- Pass the 5 expected check names via EXPECTED_CHECKS env.
- 'Merge Gate / gate' is excluded from the wait list by construction
  (it would deadlock waiting for itself).

Tested live against PR #862 head SHA (all 5 OK -> exit 0) and against
SHA 0000... (all 5 missing -> exit 2 with clear error annotation).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 23, 2026 08:32
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the Merge Gate CI orchestrator to become the single required PR-time verdict by aggregating all currently-required PR checks (Tier 1 + Tier 2 stub checks) via the GitHub Checks API, enabling branch protection to require only Merge Gate / gate.

Changes:

  • Expand merge_gate_wait.sh from waiting on a single check to polling a comma-separated list of expected check names and aggregating them into one pass/fail result.
  • Update merge-gate.yml to pass the full list of required PR-time check names into the wait script.
  • Add a CHANGELOG entry describing the new single-authority Merge Gate model.
Show a summary per file
File Description
CHANGELOG.md Adds an Unreleased entry describing Merge Gate aggregating all PR-time required checks into one verdict.
.github/workflows/merge-gate.yml Passes EXPECTED_CHECKS (all 5 PR-time check names) to the gate wait script; updates header comments to reflect the single-authority model.
.github/scripts/ci/merge_gate_wait.sh Implements parsing/polling/aggregation for multiple required checks with fail-fast behavior and timeout categorization.

Copilot's findings

  • Files reviewed: 3/3 changed files
  • Comments generated: 3

Comment thread CHANGELOG.md
- New `enterprise/governance-guide.md` documentation page: flagship governance reference for CISO / VPE / Platform Tech Lead audiences, covering enforcement points, bypass contract, failure semantics, air-gapped operation, rollout playbook, and known gaps. Trims duplicated content in `governance.md`, `apm-policy.md`, and `integrations/github-rulesets.md`. Adds `templates/apm-policy-starter.yml`. (#851)
- `apm install` now supports Azure DevOps AAD bearer-token auth via `az account get-access-token`, with PAT-first fallback for orgs that disable PAT creation. Closes #852 (#856)
- New CI safety net: `merge-gate.yml` orchestrator turns dropped `pull_request` webhook deliveries into clear red checks instead of stuck `Expected -- Waiting for status to be reported`. Triggers on both `pull_request` and `pull_request_target` for redundancy. (#865) (PR follow-up to #856 CI flake)
- `merge-gate.yml` now aggregates ALL PR-time required checks (`Build & Test (Linux)` + 4 stubs from `ci-integration-pr-stub.yml`) into a single `Merge Gate / gate` verdict. Branch protection requires only this single check, decoupling the ruleset from CI workflow topology (Tide / bors pattern).
Copy link

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changelog entries are required to end with the PR number per the repo's Keep a Changelog convention. This new line should end with something like "(#<PR_NUMBER>)" (and keep it as a single line entry under [Unreleased]).

Copilot generated this review using guidance from repository custom instructions.
Comment on lines 91 to +98
poll_count=$((poll_count + 1))
pending_count=0

for i in "${!checks[@]}"; do
c="${checks[i]}"
[ "${check_status[i]}" = "pending" ] || continue
pending_count=$((pending_count + 1))

Copy link

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pending_count is incremented for checks that were pending at the start of the poll iteration, even if they become ok during that same iteration. This means the script can sleep an extra POLL_SEC after the last required check completes. Consider recomputing pending count after updating check_status, or decrementing when a check transitions out of pending.

Copilot uses AI. Check for mistakes.

# Per-check state held in two parallel indexed arrays (avoids bash 4+
# associative arrays so the script also works on stock macOS bash 3.2).
# Status values: pending, ok, fail, missing
Copy link

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment says status values include "missing", but check_status is only ever set to pending, ok, or fail. Either update the comment to match reality, or set check_status[i]="missing" when a check has never been observed so the state model stays consistent.

Suggested change
# Status values: pending, ok, fail, missing
# Status values: pending, ok, fail

Copilot uses AI. Check for mistakes.
@danielmeppiel danielmeppiel merged commit 523cc7a into main Apr 23, 2026
9 of 10 checks passed
@danielmeppiel danielmeppiel deleted the ci/merge-gate-aggregate branch April 23, 2026 08:39
@danielmeppiel
Copy link
Copy Markdown
Collaborator Author

Phase 2 complete: ruleset flipped

main-protection ruleset (id 9294522) now requires only one status check:

Before (6 required):

  • Build & Test (Linux)
  • Build (Linux)
  • Smoke Test (Linux)
  • Integration Tests (Linux)
  • Release Validation (Linux)
  • Merge Gate / gate

After (1 required):

  • Merge Gate / gate

The 5 underlying checks continue to run on every PR; they just become informational signals consumed by the gate. Any future check addition/rename is now a one-line edit to EXPECTED_CHECKS in merge-gate.yml -- no admin permission, no ruleset edit.

Rollback

gh api -X PUT repos/microsoft/apm/rulesets/9294522 --input <file> with the previous 6-check list.

danielmeppiel added a commit that referenced this pull request Apr 23, 2026
The dual-trigger pattern (pull_request + pull_request_target with
concurrency cancel-in-progress) shipped in #865 was over-engineered.
It produced TWO 'gate' check-runs per SHA -- one SUCCESS, one
CANCELLED -- and branch protection's status-check rollup treats
CANCELLED as failure, so PRs were silently BLOCKED unless an admin
overrode (which masked the bug on #867).

GitHub Actions has no primitive for 'either of these events
succeeded'. World-class OSS projects (kubernetes, rust, deno, next.js)
accept this and use a single trigger. The cost: a dropped 'pull_request'
webhook (rare; observed once on PR #856) requires manual recovery.

Recovery paths now documented at top of file:
  - push empty commit
  - gh workflow run merge-gate.yml -f pr_number=NNN
  - close + reopen PR

Replaces the dual-trigger + bootstrap-fetch dance with a clean
two-job flow: resolve-sha (handles workflow_dispatch input or PR head)
then gate (sparse checkout + run script). Same script, same exit
codes, same EXPECTED_CHECKS env.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
danielmeppiel added a commit that referenced this pull request Apr 23, 2026
* chore: prepare v0.9.2 release

Bumps version to 0.9.2 and finalizes CHANGELOG with one-line summaries
for each PR merged since 0.9.1.

Highlights:
- ADO AAD bearer-token auth (#856)
- Governance Guide + enterprise docs IA refactor (#851, #858)
- Merge Gate orchestrator + single-authority aggregation (#865, #867)
- Landing + first-package docs rewrite (#855, #866)
- gh-aw imports migration (#864)
- Custom-port surfacing fix (#804)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* ci: simplify merge-gate to single pull_request trigger

The dual-trigger pattern (pull_request + pull_request_target with
concurrency cancel-in-progress) shipped in #865 was over-engineered.
It produced TWO 'gate' check-runs per SHA -- one SUCCESS, one
CANCELLED -- and branch protection's status-check rollup treats
CANCELLED as failure, so PRs were silently BLOCKED unless an admin
overrode (which masked the bug on #867).

GitHub Actions has no primitive for 'either of these events
succeeded'. World-class OSS projects (kubernetes, rust, deno, next.js)
accept this and use a single trigger. The cost: a dropped 'pull_request'
webhook (rare; observed once on PR #856) requires manual recovery.

Recovery paths now documented at top of file:
  - push empty commit
  - gh workflow run merge-gate.yml -f pr_number=NNN
  - close + reopen PR

Replaces the dual-trigger + bootstrap-fetch dance with a clean
two-job flow: resolve-sha (handles workflow_dispatch input or PR head)
then gate (sparse checkout + run script). Same script, same exit
codes, same EXPECTED_CHECKS env.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* ci: collapse merge-gate into a single job (one check-run in PR UI)

The two-job split (resolve-sha + gate) created two visible check-runs.
Inlining the SHA resolution as a step within the gate job leaves only
one check-run -- 'Merge Gate / gate' -- on the PR.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
danielmeppiel added a commit that referenced this pull request Apr 23, 2026
* docs(ci): document merge-gate architecture and ruleset context gotcha

- Document merge-gate.yml as the single-authority PR-time aggregator
- Mark ci-integration-pr-stub.yml as DEPRECATED (slated for deletion)
- Renumber workflow list (now 6 entries, was misnumbered with two #3s)
- New section: Branch Protection & Required Checks
  - Ruleset 'context' field MUST match check-run name ('gate'), not the
    UI display string ('Merge Gate / gate'). Storing the display string
    causes permanent 'Expected - Waiting for status to be reported' that
    blocked PR #860 today
  - Adding new required checks goes through EXPECTED_CHECKS in
    merge-gate.yml, not the ruleset

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* chore(ci): remove deprecated PR-time stub workflow

The four PR stubs (Build/Smoke/Integration/Release Validation - Linux)
were a holdover from the pre-merge-gate model where branch protection
required each Tier 2 check name directly. After #867, branch protection
requires only the single 'gate' check from merge-gate.yml, so the stubs
are dead weight that fire on every PR for no reason.

Changes:
- Delete .github/workflows/ci-integration-pr-stub.yml
- Reduce EXPECTED_CHECKS in merge-gate.yml to just 'Build & Test (Linux)'
  (the only PR-time check we still emit)
- Update merge-gate.yml + ci-integration.yml header comments
- Update cicd.instructions.md (drop DEPRECATED entry, renumber to 5
  workflows)
- Drop stale CODEOWNERS reference to the deleted file
- CHANGELOG entry under [Unreleased] > Removed

Stacked on #874 (docs).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants