From 51955cd6a155e8d2fe830d37f674738e53243012 Mon Sep 17 00:00:00 2001
From: danielmeppiel
Date: Thu, 23 Apr 2026 10:31:36 +0200
Subject: [PATCH] ci: aggregate all required PR-time checks into single Merge Gate verdict

Merge Gate now waits for ALL 5 PR-time required checks (Build & Test (Linux) from ci.yml, plus Build/Smoke/Integration/Release-Validate (Linux) stubs from ci-integration-pr-stub.yml) and emits one verdict. This is the Tide / bors pattern: one gate to rule them all.

Once this PR merges, branch protection can be flipped to require ONLY 'Merge Gate / gate'. The other 5 checks remain informational. This decouples the protection ruleset from CI workflow topology -- adding or renaming an underlying check no longer requires a ruleset edit.

Script changes:
- EXPECTED_CHECK (single string) -> EXPECTED_CHECKS (comma-separated).
- Per-check state in parallel indexed arrays (works on bash 3.2+, no associative arrays required).
- Fail-fast on the first failing check (exit 1 with annotation).
- Timeout categorizes missing (never appeared) vs stuck (appeared but did not complete) and emits distinct error messages.
- Same exit-code semantics as before, applied to the aggregate.

Workflow changes:
- Pass the 5 expected check names via EXPECTED_CHECKS env.
- 'Merge Gate / gate' is excluded from the wait list by construction (it would deadlock waiting for itself).

Tested live against PR #862 head SHA (all 5 OK -> exit 0) and against SHA 0000... (all 5 missing -> exit 2 with clear error annotation).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .github/scripts/ci/merge_gate_wait.sh | 186 ++++++++++++++++++-------- .github/workflows/merge-gate.yml | 39 +++--- CHANGELOG.md | 1 + 3 files changed, 149 insertions(+), 77 deletions(-) diff --git a/.github/scripts/ci/merge_gate_wait.sh b/.github/scripts/ci/merge_gate_wait.sh index 6a56ddf1..c5b17d91 100755 --- a/.github/scripts/ci/merge_gate_wait.sh +++ b/.github/scripts/ci/merge_gate_wait.sh @@ -1,41 +1,47 @@ #!/usr/bin/env bash -# merge_gate_wait.sh -- poll the GitHub Checks API for an expected required -# check on a given SHA and emit a single pass/fail verdict. Used by -# .github/workflows/merge-gate.yml as the orchestrator's core logic. +# merge_gate_wait.sh -- poll the GitHub Checks API for a list of expected +# required checks on a given SHA and emit a single pass/fail verdict. Used +# by .github/workflows/merge-gate.yml as the orchestrator's core logic. # # Why this script exists: # GitHub's required-status-checks model is name-based, not workflow-based. # When the underlying workflow fails to dispatch (transient webhook -# delivery failure on `pull_request`), the required check stays in +# delivery failure on 'pull_request'), the required check stays in # "Expected -- Waiting" forever and the PR is silently stuck. This script # turns that ambiguous yellow into an unambiguous red after a bounded # liveness window, so reviewers see a real failure with a real message. # +# It also lets us collapse N separately-required checks into a single +# required gate (Tide / bors pattern). Branch protection only requires +# "Merge Gate / gate"; this script verifies all underlying checks. +# # Inputs (environment variables): -# GH_TOKEN required. Token with `checks:read` for the repo. +# GH_TOKEN required. Token with 'checks:read' for the repo. # REPO required. owner/repo (e.g. microsoft/apm). # SHA required. Head SHA of the PR. -# EXPECTED_CHECK optional. Check-run name to wait for. 
-# Default: "Build & Test (Linux)". +# EXPECTED_CHECKS required. Comma-separated list of check-run names to +# wait for. Whitespace around commas is trimmed. +# Example: "Build & Test (Linux),Build (Linux)" # TIMEOUT_MIN optional. Total wall-clock budget in minutes. # Default: 30. # POLL_SEC optional. Poll interval in seconds. Default: 30. # # Exit codes: -# 0 expected check completed with conclusion success | skipped | neutral -# 1 expected check completed with a failing conclusion -# 2 expected check never appeared within TIMEOUT_MIN (THE BUG we catch) -# 3 expected check appeared but did not complete within TIMEOUT_MIN +# 0 all expected checks completed with success | skipped | neutral +# 1 at least one expected check completed with a failing conclusion +# 2 at least one expected check never appeared within TIMEOUT_MIN +# (THE BUG we catch -- dropped 'pull_request' webhook) +# 3 at least one expected check appeared but did not complete in time # 4 invalid arguments / environment set -euo pipefail -EXPECTED_CHECK="${EXPECTED_CHECK:-Build & Test (Linux)}" +EXPECTED_CHECKS="${EXPECTED_CHECKS:-}" TIMEOUT_MIN="${TIMEOUT_MIN:-30}" POLL_SEC="${POLL_SEC:-30}" -if [ -z "${GH_TOKEN:-}" ] || [ -z "${REPO:-}" ] || [ -z "${SHA:-}" ]; then - echo "ERROR: GH_TOKEN, REPO, and SHA are required." >&2 +if [ -z "${GH_TOKEN:-}" ] || [ -z "${REPO:-}" ] || [ -z "${SHA:-}" ] || [ -z "$EXPECTED_CHECKS" ]; then + echo "ERROR: GH_TOKEN, REPO, SHA, and EXPECTED_CHECKS are required." >&2 exit 4 fi @@ -49,68 +55,136 @@ if ! command -v jq >/dev/null 2>&1; then exit 4 fi +# Parse EXPECTED_CHECKS into an array (split on comma, trim whitespace). +declare -a checks=() +IFS=',' read -ra raw <<< "$EXPECTED_CHECKS" +for c in "${raw[@]}"; do + trimmed="$(echo "$c" | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//')" + [ -n "$trimmed" ] && checks+=("$trimmed") +done + +if [ "${#checks[@]}" -eq 0 ]; then + echo "ERROR: EXPECTED_CHECKS parsed to an empty list." 
>&2 + exit 4 +fi + +# Per-check state held in two parallel indexed arrays (avoids bash 4+ +# associative arrays so the script also works on stock macOS bash 3.2). +# Status values: pending, ok, fail, missing +declare -a check_status=() +declare -a check_url=() +for _ in "${checks[@]}"; do + check_status+=("pending") + check_url+=("") +done + deadline=$(( $(date +%s) + TIMEOUT_MIN * 60 )) poll_count=0 -ever_seen="false" -echo "[merge-gate] waiting for check '${EXPECTED_CHECK}' on ${REPO}@${SHA}" +echo "[merge-gate] waiting for ${#checks[@]} check(s) on ${REPO}@${SHA}" +for c in "${checks[@]}"; do + echo "[merge-gate] - ${c}" +done echo "[merge-gate] timeout=${TIMEOUT_MIN}m poll=${POLL_SEC}s" while [ "$(date +%s)" -lt "$deadline" ]; do poll_count=$((poll_count + 1)) + pending_count=0 + + for i in "${!checks[@]}"; do + c="${checks[i]}" + [ "${check_status[i]}" = "pending" ] || continue + pending_count=$((pending_count + 1)) + + # Filter by check-run name server-side. Most-recent first. + encoded=$(jq -rn --arg n "$c" '$n|@uri') + payload=$(gh api \ + -H "Accept: application/vnd.github+json" \ + "repos/${REPO}/commits/${SHA}/check-runs?check_name=${encoded}&per_page=10" \ + 2>/dev/null) || payload='{"check_runs":[]}' + + total=$(echo "$payload" | jq '.check_runs | length' 2>/dev/null || echo 0) + case "$total" in ''|*[!0-9]*) total=0 ;; esac + + if [ "$total" -eq 0 ]; then + echo "[merge-gate] poll #${poll_count}: '${c}' not yet present" + continue + fi - # Filter by check-run name server-side. Most-recent check-run is first. 
- payload=$(gh api \ - -H "Accept: application/vnd.github+json" \ - "repos/${REPO}/commits/${SHA}/check-runs?check_name=$(jq -rn --arg n "$EXPECTED_CHECK" '$n|@uri')&per_page=10" \ - 2>/dev/null) || payload='{"check_runs":[]}' - - total=$(echo "$payload" | jq '.check_runs | length' 2>/dev/null || echo 0) - case "$total" in - ''|*[!0-9]*) total=0 ;; - esac - - if [ "$total" -gt 0 ]; then - ever_seen="true" - # Take the most recently started run for this name. status=$(echo "$payload" | jq -r '.check_runs | sort_by(.started_at) | reverse | .[0].status') conclusion=$(echo "$payload" | jq -r '.check_runs | sort_by(.started_at) | reverse | .[0].conclusion') url=$(echo "$payload" | jq -r '.check_runs | sort_by(.started_at) | reverse | .[0].html_url') + check_url[i]="$url" - echo "[merge-gate] poll #${poll_count}: status=${status} conclusion=${conclusion}" - - if [ "$status" = "completed" ]; then - echo "[merge-gate] tier 1 finished: ${conclusion}" - echo "[merge-gate] details: ${url}" - case "$conclusion" in - success|skipped|neutral) - exit 0 - ;; - *) - echo "::error title=Tier 1 failed::'${EXPECTED_CHECK}' reported '${conclusion}'. See ${url}" - exit 1 - ;; - esac + if [ "$status" != "completed" ]; then + echo "[merge-gate] poll #${poll_count}: '${c}' status=${status}" + continue fi - else - echo "[merge-gate] poll #${poll_count}: '${EXPECTED_CHECK}' not yet present" + + case "$conclusion" in + success|skipped|neutral) + check_status[i]="ok" + echo "[merge-gate] poll #${poll_count}: '${c}' OK (${conclusion})" + ;; + *) + check_status[i]="fail" + echo "[merge-gate] poll #${poll_count}: '${c}' FAILED (${conclusion})" + echo "::error title=Required check failed::'${c}' reported '${conclusion}'. See ${url}" + # Fail fast: one failed check is enough to block the gate. 
+ exit 1 + ;; + esac + done + + if [ "$pending_count" -eq 0 ]; then + echo "[merge-gate] all ${#checks[@]} check(s) completed successfully" + exit 0 fi sleep "$POLL_SEC" done -if [ "$ever_seen" = "false" ]; then - cat <<EOF >&2 -::error title=Tier 1 never started::The required check '${EXPECTED_CHECK}' did not appear for SHA ${SHA} within ${TIMEOUT_MIN} minutes. - -This usually indicates a transient GitHub Actions webhook delivery failure for the 'pull_request' event. Recovery: - 1. Push an empty commit to retrigger: git commit --allow-empty -m 'ci: retrigger' && git push - 2. If that fails, close and reopen the PR. +# Timeout reached. Categorize what's missing vs stuck. +missing=() +stuck=() +for i in "${!checks[@]}"; do + c="${checks[i]}" + case "${check_status[i]}" in + pending) + if [ -z "${check_url[i]}" ]; then + missing+=("$c") + else + stuck+=("$c") + fi + ;; + esac +done -This gate (Merge Gate) catches the failure mode so it surfaces as a clear red check instead of a stuck 'Expected -- Waiting'. See .github/workflows/merge-gate.yml. -EOF +if [ "${#missing[@]}" -gt 0 ]; then + { + echo "::error title=Required check never started::The following check(s) did not appear for SHA ${SHA} within ${TIMEOUT_MIN} minutes:" + for c in "${missing[@]}"; do echo " - ${c}"; done + echo "" + echo "This usually indicates a transient GitHub Actions webhook delivery failure. Recovery:" + echo " 1. Push an empty commit to retrigger: git commit --allow-empty -m 'ci: retrigger' && git push" + echo " 2. If that fails, close and reopen the PR." + echo "" + echo "Merge Gate catches this failure mode so it surfaces as a clear red check instead of a stuck 'Expected -- Waiting'. See .github/workflows/merge-gate.yml." + } >&2 exit 2 fi -echo "::error title=Tier 1 timeout::Build & Test (Linux) appeared but did not complete within ${TIMEOUT_MIN} minutes."
>&2 +{ + echo "::error title=Required check timeout::The following check(s) appeared but did not complete within ${TIMEOUT_MIN} minutes:" + for i in "${!stuck[@]}"; do + c="${stuck[i]}" + # Find the original index to look up the URL. + for j in "${!checks[@]}"; do + if [ "${checks[$j]}" = "$c" ]; then + echo " - ${c} -> ${check_url[$j]}" + break + fi + done + done +} >&2 exit 3 diff --git a/.github/workflows/merge-gate.yml b/.github/workflows/merge-gate.yml index 1605f8d4..cffa3058 100644 --- a/.github/workflows/merge-gate.yml +++ b/.github/workflows/merge-gate.yml @@ -1,6 +1,6 @@ -# Merge Gate -- shadow-mode orchestrator that aggregates required PR checks -# into a single verdict and turns "stuck pull_request webhook" failures into -# loud red checks instead of silent yellow "Expected -- Waiting" forever. +# Merge Gate -- single-authority orchestrator that aggregates ALL required +# PR-time checks into one verdict. Branch protection requires only this +# check; this workflow verifies all underlying checks via the Checks API. # # Why this file exists: # GitHub's required-status-checks model is name-based, not workflow-based. @@ -11,27 +11,19 @@ # `pull_request` event is dropped (transient, observed on PR #856), 4/5 # stubs go green and the 5th hangs in "Expected -- Waiting" indefinitely. # -# This workflow eventually replaces the per-test required checks with a -# single `Merge Gate / gate` check that: +# This workflow collapses N separately-required checks into a single +# `Merge Gate / gate` check that: # - dispatches via two redundant triggers (pull_request + # pull_request_target) so a single dropped delivery is recoverable; -# - polls the Checks API for the real Tier 1 check and aggregates; -# - times out cleanly with a clear error message if Tier 1 never fires -# (the bug we are catching); -# - is the SOLE required check after rollout, decoupling branch -# protection from workflow topology. 
-# -# Rollout plan: -# Phase 1 (this PR): workflow runs in shadow mode -- not required. -# Observe behaviour on real PRs for >=1 week. -# Phase 2 (post-merge): flip branch protection to require only -# `Merge Gate / gate` and drop the four stub names. -# Stub workflow can then be deleted. +# - polls the Checks API for ALL underlying required checks; +# - exits red if any check fails, never appears, or never completes; +# - is the SOLE required check, decoupling branch protection from +# workflow topology (Tide / bors pattern). # # Security: # `pull_request_target` is used here for redundancy ONLY. This workflow -# never checks out PR code, never interpolates PR data into `run:`, and -# has read-only token permissions. The classic +# never checks out PR code under that trigger, never interpolates PR data +# into `run:`, and has read-only token permissions. The classic # pull_request_target+checkout(head) exploit is impossible by construction. # See ci-integration-pr-stub.yml for the same security model. @@ -106,12 +98,17 @@ jobs: fi chmod +x .github/scripts/ci/merge_gate_wait.sh - - name: Wait for Tier 1 (Build & Test Linux) + - name: Wait for all required checks env: GH_TOKEN: ${{ github.token }} REPO: ${{ github.repository }} SHA: ${{ github.event.pull_request.head.sha }} - EXPECTED_CHECK: 'Build & Test (Linux)' + # All PR-time checks the gate aggregates. Keep this in sync with + # the underlying workflows: ci.yml emits Build & Test (Linux), + # ci-integration-pr-stub.yml emits the other four. + # NOTE: 'Merge Gate / gate' itself MUST NOT appear here -- it + # would deadlock waiting for itself. 
+ EXPECTED_CHECKS: 'Build & Test (Linux),Build (Linux),Smoke Test (Linux),Integration Tests (Linux),Release Validation (Linux)' TIMEOUT_MIN: '30' POLL_SEC: '30' run: | diff --git a/CHANGELOG.md b/CHANGELOG.md index d33ac4d7..dc0a248f 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -13,6 +13,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 - New `enterprise/governance-guide.md` documentation page: flagship governance reference for CISO / VPE / Platform Tech Lead audiences, covering enforcement points, bypass contract, failure semantics, air-gapped operation, rollout playbook, and known gaps. Trims duplicated content in `governance.md`, `apm-policy.md`, and `integrations/github-rulesets.md`. Adds `templates/apm-policy-starter.yml`. (#851) - `apm install` now supports Azure DevOps AAD bearer-token auth via `az account get-access-token`, with PAT-first fallback for orgs that disable PAT creation. Closes #852 (#856) - New CI safety net: `merge-gate.yml` orchestrator turns dropped `pull_request` webhook deliveries into clear red checks instead of stuck `Expected -- Waiting for status to be reported`. Triggers on both `pull_request` and `pull_request_target` for redundancy. (#865) (PR follow-up to #856 CI flake) +- `merge-gate.yml` now aggregates ALL PR-time required checks (`Build & Test (Linux)` + 4 stubs from `ci-integration-pr-stub.yml`) into a single `Merge Gate / gate` verdict. Branch protection requires only this single check, decoupling the ruleset from CI workflow topology (Tide / bors pattern). ## [0.9.1] - 2026-04-22
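
The EXPECTED_CHECKS handling this patch introduces (split on commas, trim surrounding whitespace, drop empty entries) can be tried in isolation. The following is a minimal sketch rather than the patch's own code: `parse_checks` is a hypothetical helper name, and it trims with bash parameter expansion instead of the `sed` call used in merge_gate_wait.sh. Like the script, it avoids associative arrays, so it should also run on bash 3.2.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the patch): split a comma-separated
# list of check-run names, trim whitespace around each entry, skip empty
# entries, and print one name per line.
parse_checks() {
  local raw_list="$1" item
  local -a parts=()
  # Split on commas only; spaces inside a name are preserved.
  IFS=',' read -ra parts <<< "$raw_list"
  for item in "${parts[@]}"; do
    # Remove leading whitespace, then trailing whitespace.
    item="${item#"${item%%[![:space:]]*}"}"
    item="${item%"${item##*[![:space:]]}"}"
    [ -n "$item" ] && printf '%s\n' "$item"
  done
  return 0
}

# Usage: names containing '&' and parentheses pass through unchanged.
parse_checks 'Build & Test (Linux) ,  Build (Linux),, '
```

Printing one name per line keeps the helper easy to consume with a `while read` loop; the script in the patch instead appends the trimmed names directly to its `checks` array.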