Merged
186 changes: 130 additions & 56 deletions .github/scripts/ci/merge_gate_wait.sh
@@ -1,41 +1,47 @@
#!/usr/bin/env bash
# merge_gate_wait.sh -- poll the GitHub Checks API for an expected required
# check on a given SHA and emit a single pass/fail verdict. Used by
# .github/workflows/merge-gate.yml as the orchestrator's core logic.
# merge_gate_wait.sh -- poll the GitHub Checks API for a list of expected
# required checks on a given SHA and emit a single pass/fail verdict. Used
# by .github/workflows/merge-gate.yml as the orchestrator's core logic.
#
# Why this script exists:
# GitHub's required-status-checks model is name-based, not workflow-based.
# When the underlying workflow fails to dispatch (transient webhook
# delivery failure on `pull_request`), the required check stays in
# delivery failure on 'pull_request'), the required check stays in
# "Expected -- Waiting" forever and the PR is silently stuck. This script
# turns that ambiguous yellow into an unambiguous red after a bounded
# liveness window, so reviewers see a real failure with a real message.
#
# It also lets us collapse N separately-required checks into a single
# required gate (Tide / bors pattern). Branch protection only requires
# "Merge Gate / gate"; this script verifies all underlying checks.
#
# Inputs (environment variables):
# GH_TOKEN required. Token with `checks:read` for the repo.
# GH_TOKEN required. Token with 'checks:read' for the repo.
# REPO required. owner/repo (e.g. microsoft/apm).
# SHA required. Head SHA of the PR.
# EXPECTED_CHECK optional. Check-run name to wait for.
# Default: "Build & Test (Linux)".
# EXPECTED_CHECKS required. Comma-separated list of check-run names to
# wait for. Whitespace around commas is trimmed.
# Example: "Build & Test (Linux),Build (Linux)"
# TIMEOUT_MIN optional. Total wall-clock budget in minutes.
# Default: 30.
# POLL_SEC optional. Poll interval in seconds. Default: 30.
#
# Exit codes:
# 0 expected check completed with conclusion success | skipped | neutral
# 1 expected check completed with a failing conclusion
# 2 expected check never appeared within TIMEOUT_MIN (THE BUG we catch)
# 3 expected check appeared but did not complete within TIMEOUT_MIN
# 0 all expected checks completed with success | skipped | neutral
# 1 at least one expected check completed with a failing conclusion
# 2 at least one expected check never appeared within TIMEOUT_MIN
# (THE BUG we catch -- dropped 'pull_request' webhook)
# 3 at least one expected check appeared but did not complete in time
# 4 invalid arguments / environment

set -euo pipefail

EXPECTED_CHECK="${EXPECTED_CHECK:-Build & Test (Linux)}"
EXPECTED_CHECKS="${EXPECTED_CHECKS:-}"
TIMEOUT_MIN="${TIMEOUT_MIN:-30}"
POLL_SEC="${POLL_SEC:-30}"

if [ -z "${GH_TOKEN:-}" ] || [ -z "${REPO:-}" ] || [ -z "${SHA:-}" ]; then
echo "ERROR: GH_TOKEN, REPO, and SHA are required." >&2
if [ -z "${GH_TOKEN:-}" ] || [ -z "${REPO:-}" ] || [ -z "${SHA:-}" ] || [ -z "$EXPECTED_CHECKS" ]; then
echo "ERROR: GH_TOKEN, REPO, SHA, and EXPECTED_CHECKS are required." >&2
exit 4
fi

@@ -49,68 +55,136 @@ if ! command -v jq >/dev/null 2>&1; then
exit 4
fi

# Parse EXPECTED_CHECKS into an array (split on comma, trim whitespace).
declare -a checks=()
IFS=',' read -ra raw <<< "$EXPECTED_CHECKS"
for c in "${raw[@]}"; do
  trimmed="$(echo "$c" | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//')"
  [ -n "$trimmed" ] && checks+=("$trimmed")
done

if [ "${#checks[@]}" -eq 0 ]; then
  echo "ERROR: EXPECTED_CHECKS parsed to an empty list." >&2
  exit 4
fi

# Per-check state held in two parallel indexed arrays (avoids bash 4+
# associative arrays so the script also works on stock macOS bash 3.2).
# Status values: pending, ok, fail, missing

Copilot AI Apr 23, 2026

The comment says status values include "missing", but check_status is only ever set to pending, ok, or fail. Either update the comment to match reality, or set check_status[i]="missing" when a check has never been observed so the state model stays consistent.

Suggested change
# Status values: pending, ok, fail, missing
# Status values: pending, ok, fail
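
The other option mentioned in the comment above, keeping "missing" as a real state, could look roughly like the sketch below. `classify_check` is a hypothetical helper that is not part of this PR; it assumes the same three observations the script already extracts per poll (`total`, `status`, `conclusion`). If this route were taken, the loop guard that currently skips everything except "pending" would also need to keep polling "missing" entries.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the PR): map one poll observation to a
# state name so that "missing" becomes an explicit, recorded state.
#   $1 = number of check-runs returned for this name
#   $2 = status of the most recent run (e.g. queued, in_progress, completed)
#   $3 = conclusion of the most recent run (e.g. success, failure)
classify_check() {
  local total="$1" status="$2" conclusion="$3"
  if [ "$total" -eq 0 ]; then
    echo "missing"        # never observed on this SHA
    return
  fi
  if [ "$status" != "completed" ]; then
    echo "pending"        # observed, still running
    return
  fi
  case "$conclusion" in
    success|skipped|neutral) echo "ok" ;;
    *) echo "fail" ;;
  esac
}

classify_check 0 "" ""
classify_check 1 in_progress null
classify_check 1 completed success
```

With an explicit "missing" state, the timeout branch could read the state directly instead of inferring it from an empty `check_url` entry.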

declare -a check_status=()
declare -a check_url=()
for _ in "${checks[@]}"; do
  check_status+=("pending")
  check_url+=("")
done

deadline=$(( $(date +%s) + TIMEOUT_MIN * 60 ))
poll_count=0
ever_seen="false"

echo "[merge-gate] waiting for check '${EXPECTED_CHECK}' on ${REPO}@${SHA}"
echo "[merge-gate] waiting for ${#checks[@]} check(s) on ${REPO}@${SHA}"
for c in "${checks[@]}"; do
echo "[merge-gate] - ${c}"
done
echo "[merge-gate] timeout=${TIMEOUT_MIN}m poll=${POLL_SEC}s"

while [ "$(date +%s)" -lt "$deadline" ]; do
poll_count=$((poll_count + 1))
pending_count=0

for i in "${!checks[@]}"; do
c="${checks[i]}"
[ "${check_status[i]}" = "pending" ] || continue
pending_count=$((pending_count + 1))

Comment on lines 91 to +98
Copilot AI Apr 23, 2026

pending_count is incremented for checks that were pending at the start of the poll iteration, even if they become ok during that same iteration. This means the script can sleep an extra POLL_SEC after the last required check completes. Consider recomputing pending count after updating check_status, or decrementing when a check transitions out of pending.
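
A minimal sketch of the "recompute after updating" option, assuming the same `check_status` array the script uses. `count_pending` is a hypothetical helper, not code from this PR; it would be called once after the inner `for` loop, in place of the per-iteration increment.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the PR): count how many of the given
# status values are still "pending". Plain indexed arrays only, so it
# stays compatible with bash 3.2.
count_pending() {
  local n=0 s
  for s in "$@"; do
    if [ "$s" = "pending" ]; then
      n=$((n + 1))
    fi
  done
  echo "$n"
}

# Illustrative state after one poll iteration in which the last
# outstanding check has just completed.
check_status=("ok" "ok" "ok")
pending_count=$(count_pending "${check_status[@]}")

if [ "$pending_count" -eq 0 ]; then
  echo "[merge-gate] all checks completed; no extra sleep needed"
fi
```

Because the count is taken after the states are updated, the gate can exit in the same iteration in which the final check turns green rather than sleeping one more `POLL_SEC`.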

# Filter by check-run name server-side. Most-recent first.
encoded=$(jq -rn --arg n "$c" '$n|@uri')
payload=$(gh api \
-H "Accept: application/vnd.github+json" \
"repos/${REPO}/commits/${SHA}/check-runs?check_name=${encoded}&per_page=10" \
2>/dev/null) || payload='{"check_runs":[]}'

total=$(echo "$payload" | jq '.check_runs | length' 2>/dev/null || echo 0)
case "$total" in ''|*[!0-9]*) total=0 ;; esac

if [ "$total" -eq 0 ]; then
echo "[merge-gate] poll #${poll_count}: '${c}' not yet present"
continue
fi

# Filter by check-run name server-side. Most-recent check-run is first.
payload=$(gh api \
-H "Accept: application/vnd.github+json" \
"repos/${REPO}/commits/${SHA}/check-runs?check_name=$(jq -rn --arg n "$EXPECTED_CHECK" '$n|@uri')&per_page=10" \
2>/dev/null) || payload='{"check_runs":[]}'

total=$(echo "$payload" | jq '.check_runs | length' 2>/dev/null || echo 0)
case "$total" in
''|*[!0-9]*) total=0 ;;
esac

if [ "$total" -gt 0 ]; then
ever_seen="true"
# Take the most recently started run for this name.
status=$(echo "$payload" | jq -r '.check_runs | sort_by(.started_at) | reverse | .[0].status')
conclusion=$(echo "$payload" | jq -r '.check_runs | sort_by(.started_at) | reverse | .[0].conclusion')
url=$(echo "$payload" | jq -r '.check_runs | sort_by(.started_at) | reverse | .[0].html_url')
check_url[i]="$url"

echo "[merge-gate] poll #${poll_count}: status=${status} conclusion=${conclusion}"

if [ "$status" = "completed" ]; then
echo "[merge-gate] tier 1 finished: ${conclusion}"
echo "[merge-gate] details: ${url}"
case "$conclusion" in
success|skipped|neutral)
exit 0
;;
*)
echo "::error title=Tier 1 failed::'${EXPECTED_CHECK}' reported '${conclusion}'. See ${url}"
exit 1
;;
esac
if [ "$status" != "completed" ]; then
echo "[merge-gate] poll #${poll_count}: '${c}' status=${status}"
continue
fi
else
echo "[merge-gate] poll #${poll_count}: '${EXPECTED_CHECK}' not yet present"

case "$conclusion" in
success|skipped|neutral)
check_status[i]="ok"
echo "[merge-gate] poll #${poll_count}: '${c}' OK (${conclusion})"
;;
*)
check_status[i]="fail"
echo "[merge-gate] poll #${poll_count}: '${c}' FAILED (${conclusion})"
echo "::error title=Required check failed::'${c}' reported '${conclusion}'. See ${url}"
# Fail fast: one failed check is enough to block the gate.
exit 1
;;
esac
done

if [ "$pending_count" -eq 0 ]; then
echo "[merge-gate] all ${#checks[@]} check(s) completed successfully"
exit 0
fi

sleep "$POLL_SEC"
done

if [ "$ever_seen" = "false" ]; then
cat <<EOF >&2
::error title=Tier 1 never started::The required check '${EXPECTED_CHECK}' did not appear for SHA ${SHA} within ${TIMEOUT_MIN} minutes.

This usually indicates a transient GitHub Actions webhook delivery failure for the 'pull_request' event. Recovery:
1. Push an empty commit to retrigger: git commit --allow-empty -m 'ci: retrigger' && git push
2. If that fails, close and reopen the PR.
# Timeout reached. Categorize what's missing vs stuck.
missing=()
stuck=()
for i in "${!checks[@]}"; do
  c="${checks[i]}"
  case "${check_status[i]}" in
    pending)
      if [ -z "${check_url[i]}" ]; then
        missing+=("$c")
      else
        stuck+=("$c")
      fi
      ;;
  esac
done

This gate (Merge Gate) catches the failure mode so it surfaces as a clear red check instead of a stuck 'Expected -- Waiting'. See .github/workflows/merge-gate.yml.
EOF
if [ "${#missing[@]}" -gt 0 ]; then
  {
    echo "::error title=Required check never started::The following check(s) did not appear for SHA ${SHA} within ${TIMEOUT_MIN} minutes:"
    for c in "${missing[@]}"; do echo " - ${c}"; done
    echo ""
    echo "This usually indicates a transient GitHub Actions webhook delivery failure. Recovery:"
    echo " 1. Push an empty commit to retrigger: git commit --allow-empty -m 'ci: retrigger' && git push"
    echo " 2. If that fails, close and reopen the PR."
    echo ""
    echo "Merge Gate catches this failure mode so it surfaces as a clear red check instead of a stuck 'Expected -- Waiting'. See .github/workflows/merge-gate.yml."
  } >&2
  exit 2
fi

echo "::error title=Tier 1 timeout::Build & Test (Linux) appeared but did not complete within ${TIMEOUT_MIN} minutes." >&2
{
  echo "::error title=Required check timeout::The following check(s) appeared but did not complete within ${TIMEOUT_MIN} minutes:"
  for i in "${!stuck[@]}"; do
    c="${stuck[i]}"
    # Find the original index to look up the URL.
    for j in "${!checks[@]}"; do
      if [ "${checks[$j]}" = "$c" ]; then
        echo " - ${c} -> ${check_url[$j]}"
        break
      fi
    done
  done
} >&2
exit 3
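
The exit codes documented in the script header (0 through 4) could be consumed by a caller along the lines of the sketch below. `describe_gate_exit` is a hypothetical helper introduced only for illustration; the messages paraphrase the header comments and are not output produced by the script itself.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the PR): turn a merge_gate_wait.sh exit
# code into a short human-readable summary.
describe_gate_exit() {
  case "$1" in
    0) echo "all expected checks passed" ;;
    1) echo "at least one expected check failed" ;;
    2) echo "at least one expected check never appeared" ;;
    3) echo "at least one expected check did not complete in time" ;;
    4) echo "invalid arguments or environment" ;;
    *) echo "unexpected exit code: $1" ;;
  esac
}

describe_gate_exit 2
```

A wrapper could capture the code with `rc=0; .github/scripts/ci/merge_gate_wait.sh || rc=$?` and pass `"$rc"` to the helper, which keeps `set -e` from aborting before the summary is printed.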
39 changes: 18 additions & 21 deletions .github/workflows/merge-gate.yml
@@ -1,6 +1,6 @@
# Merge Gate -- shadow-mode orchestrator that aggregates required PR checks
# into a single verdict and turns "stuck pull_request webhook" failures into
# loud red checks instead of silent yellow "Expected -- Waiting" forever.
# Merge Gate -- single-authority orchestrator that aggregates ALL required
# PR-time checks into one verdict. Branch protection requires only this
# check; this workflow verifies all underlying checks via the Checks API.
#
# Why this file exists:
# GitHub's required-status-checks model is name-based, not workflow-based.
@@ -11,27 +11,19 @@
# `pull_request` event is dropped (transient, observed on PR #856), 4/5
# stubs go green and the 5th hangs in "Expected -- Waiting" indefinitely.
#
# This workflow eventually replaces the per-test required checks with a
# single `Merge Gate / gate` check that:
# This workflow collapses N separately-required checks into a single
# `Merge Gate / gate` check that:
# - dispatches via two redundant triggers (pull_request +
# pull_request_target) so a single dropped delivery is recoverable;
# - polls the Checks API for the real Tier 1 check and aggregates;
# - times out cleanly with a clear error message if Tier 1 never fires
# (the bug we are catching);
# - is the SOLE required check after rollout, decoupling branch
# protection from workflow topology.
#
# Rollout plan:
# Phase 1 (this PR): workflow runs in shadow mode -- not required.
# Observe behaviour on real PRs for >=1 week.
# Phase 2 (post-merge): flip branch protection to require only
# `Merge Gate / gate` and drop the four stub names.
# Stub workflow can then be deleted.
# - polls the Checks API for ALL underlying required checks;
# - exits red if any check fails, never appears, or never completes;
# - is the SOLE required check, decoupling branch protection from
# workflow topology (Tide / bors pattern).
#
# Security:
# `pull_request_target` is used here for redundancy ONLY. This workflow
# never checks out PR code, never interpolates PR data into `run:`, and
# has read-only token permissions. The classic
# never checks out PR code under that trigger, never interpolates PR data
# into `run:`, and has read-only token permissions. The classic
# pull_request_target+checkout(head) exploit is impossible by construction.
# See ci-integration-pr-stub.yml for the same security model.

@@ -106,12 +98,17 @@ jobs:
fi
chmod +x .github/scripts/ci/merge_gate_wait.sh

- name: Wait for Tier 1 (Build & Test Linux)
- name: Wait for all required checks
env:
GH_TOKEN: ${{ github.token }}
REPO: ${{ github.repository }}
SHA: ${{ github.event.pull_request.head.sha }}
EXPECTED_CHECK: 'Build & Test (Linux)'
# All PR-time checks the gate aggregates. Keep this in sync with
# the underlying workflows: ci.yml emits Build & Test (Linux),
# ci-integration-pr-stub.yml emits the other four.
# NOTE: 'Merge Gate / gate' itself MUST NOT appear here -- it
# would deadlock waiting for itself.
EXPECTED_CHECKS: 'Build & Test (Linux),Build (Linux),Smoke Test (Linux),Integration Tests (Linux),Release Validation (Linux)'
TIMEOUT_MIN: '30'
POLL_SEC: '30'
run: |
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -13,6 +13,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- New `enterprise/governance-guide.md` documentation page: flagship governance reference for CISO / VPE / Platform Tech Lead audiences, covering enforcement points, bypass contract, failure semantics, air-gapped operation, rollout playbook, and known gaps. Trims duplicated content in `governance.md`, `apm-policy.md`, and `integrations/github-rulesets.md`. Adds `templates/apm-policy-starter.yml`. (#851)
- `apm install` now supports Azure DevOps AAD bearer-token auth via `az account get-access-token`, with PAT-first fallback for orgs that disable PAT creation. Closes #852 (#856)
- New CI safety net: `merge-gate.yml` orchestrator turns dropped `pull_request` webhook deliveries into clear red checks instead of stuck `Expected -- Waiting for status to be reported`. Triggers on both `pull_request` and `pull_request_target` for redundancy. (#865) (PR follow-up to #856 CI flake)
- `merge-gate.yml` now aggregates ALL PR-time required checks (`Build & Test (Linux)` + 4 stubs from `ci-integration-pr-stub.yml`) into a single `Merge Gate / gate` verdict. Branch protection requires only this single check, decoupling the ruleset from CI workflow topology (Tide / bors pattern).

Copilot AI Apr 23, 2026

Changelog entries are required to end with the PR number per the repo's Keep a Changelog convention. This new line should end with something like "(#<PR_NUMBER>)" (and keep it as a single line entry under [Unreleased]).

Copilot generated this review using guidance from repository custom instructions.

## [0.9.1] - 2026-04-22
