ci: add merge-gate orchestrator to harden against dropped pull_request webhooks#865
ci: add merge-gate orchestrator to harden against dropped pull_request webhooks#865danielmeppiel merged 5 commits intomainfrom
Conversation
PR #856 surfaced a structural CI fragility: required status checks are satisfied by two independent webhook channels (pull_request emits 'Build & Test (Linux)', pull_request_target emits the four Tier 2 stubs). When the pull_request delivery is dropped, 4/5 stubs go green and the 5th hangs in 'Expected -- Waiting' forever -- ambiguous yellow indistinguishable from a slow build. Recovery is folklore. This PR ships two safety nets in shadow mode: * .github/workflows/merge-gate.yml + scripts/ci/merge_gate_wait.sh Single orchestrator that polls the Checks API for 'Build & Test (Linux)' on the PR head SHA and aggregates into one verdict. Triggers on BOTH pull_request and pull_request_target so a single dropped delivery is recoverable; concurrency dedupes. Times out cleanly with a clear error message if Tier 1 never dispatched -- turning the invisible failure into a loud red check. NOT YET REQUIRED -- shadow observation first, ruleset flip after merge. * .github/workflows/watchdog-stuck-prs.yml + scripts/ci/watchdog_scan.sh Cron every 15 min. For open non-draft PRs with no ci.yml run on the head SHA AND non-paths-ignored files, posts one recovery comment. Backstop for any required check that stops dispatching. Tested live (dry-run) against microsoft/apm: watchdog correctly distinguishes stuck PRs (#853, #409) from docs-only PRs (#864, #461, #825). Both scripts shellcheck-clean. merge_gate_wait.sh tested end-to-end against PR #856 head SHA (success path, exit 0) and a non-existent SHA (timeout path, exit 2 with clear error annotation). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Two fixes for the script-fetch step: 1. On 'pull_request' the runner has no secrets and a read-only token, so actions/checkout of PR head is safe -- use it for simplicity. We only need API-fetch under 'pull_request_target' where checkout would be a security risk. 2. On 'pull_request_target', when the script does not yet exist on base (i.e. THIS PR is the one adding it), curl returns 404 and we degrade to a self-bootstrap no-op pass instead of failing. Once the script lands on main, the gate becomes real. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Adds CI-side safety nets to detect and surface the rare failure mode where GitHub drops pull_request webhook deliveries, leaving required checks stuck in an ambiguous "Expected -- Waiting" state.
Changes:
- Introduces a
Merge Gateorchestrator workflow that polls the Checks API for the Tier 1 PR check and fails with a clear error if the check never appears. - Adds a scheduled watchdog workflow + script to scan open PRs and post a one-time recovery comment when the Tier 1 workflow never dispatched.
- Adds a changelog entry describing the new CI safety nets.
Show a summary per file
| File | Description |
|---|---|
CHANGELOG.md |
Adds an Unreleased entry describing the merge gate + watchdog rollout. |
.github/workflows/merge-gate.yml |
New orchestrator workflow that waits for Tier 1 and turns "missing check" into a failing gate. |
.github/scripts/ci/merge_gate_wait.sh |
New polling script for check-runs on a commit SHA with distinct exit codes for failure modes. |
.github/workflows/watchdog-stuck-prs.yml |
New scheduled workflow that runs the PR scan every 15 minutes (and via manual dispatch). |
.github/scripts/ci/watchdog_scan.sh |
New scanning + idempotent comment-posting logic to help contributors recover from stuck CI dispatch. |
Copilot's findings
- Files reviewed: 5/5 changed files
- Comments generated: 5
| non_ignored=$(gh api \ | ||
| -H "Accept: application/vnd.github+json" \ | ||
| "repos/${REPO}/pulls/${number}/files?per_page=300" \ | ||
| --jq '[.[].filename] | map(select( | ||
| (startswith("docs/") | not) | ||
| and . != ".gitignore" | ||
| and . != "LICENSE" | ||
| )) | length' 2>/dev/null || echo "0") |
There was a problem hiding this comment.
The changed-files query only fetches the first page (per_page=300) and doesn't paginate. For PRs touching >300 files, non_ignored can be computed from an incomplete file list and misclassify a non-docs change as docs-only, skipping a truly stuck PR. Consider using gh api --paginate (or GraphQL) to ensure the full file list is considered.
|
|
||
| - New `enterprise/governance-guide.md` documentation page: flagship governance reference for CISO / VPE / Platform Tech Lead audiences, covering enforcement points, bypass contract, failure semantics, air-gapped operation, rollout playbook, and known gaps. Trims duplicated content in `governance.md`, `apm-policy.md`, and `integrations/github-rulesets.md`. Adds `templates/apm-policy-starter.yml`. (#851) | ||
| - `apm install` now supports Azure DevOps AAD bearer-token auth via `az account get-access-token`, with PAT-first fallback for orgs that disable PAT creation. Closes #852 (#856) | ||
| - New CI safety nets: `merge-gate.yml` orchestrator turns dropped `pull_request` webhook deliveries into clear red checks (currently shadow mode), and `watchdog-stuck-prs.yml` posts a recovery comment on PRs whose required check never dispatched. (PR follow-up to #856 CI flake) |
There was a problem hiding this comment.
The changelog entry doesn't follow the repo's stated format (one line per PR ending with (#PR_NUMBER)). Please update this line to end with this PR number, consistent with the other entries under ## [Unreleased] > Added.
| - New CI safety nets: `merge-gate.yml` orchestrator turns dropped `pull_request` webhook deliveries into clear red checks (currently shadow mode), and `watchdog-stuck-prs.yml` posts a recovery comment on PRs whose required check never dispatched. (PR follow-up to #856 CI flake) | |
| - New CI safety nets: `merge-gate.yml` orchestrator turns dropped `pull_request` webhook deliveries into clear red checks (currently shadow mode), and `watchdog-stuck-prs.yml` posts a recovery comment on PRs whose required check never dispatched. (#PR_NUMBER) |
| if: github.event_name == 'pull_request_target' | ||
| env: | ||
| GH_TOKEN: ${{ github.token }} | ||
| run: | | ||
| mkdir -p .github/scripts/ci |
There was a problem hiding this comment.
This curl fetch is a single point of failure for the whole gate. A transient network/5xx from the GitHub API will fail the job and produce a false-negative red check. Consider adding retries (e.g., --retry with a small backoff) and/or a clearer error message when the script cannot be fetched.
| *) | ||
| echo "::error title=Tier 1 failed::Build & Test (Linux) reported '${conclusion}'. See ${url}" | ||
| exit 1 |
There was a problem hiding this comment.
The failure annotation hard-codes Build & Test (Linux) instead of using EXPECTED_CHECK. If the orchestrator is later reused for a different required check name, the message will be misleading; prefer referencing the EXPECTED_CHECK variable consistently in the error text.
| prs=$(gh pr list --repo "$REPO" --state open --limit 200 \ | ||
| --json number,headRefOid,updatedAt,isDraft,title 2>/dev/null || echo '[]') |
There was a problem hiding this comment.
gh pr list failures are fully silenced (2>/dev/null) and treated as an empty PR set, which can hide real watchdog outages (auth regression, API rate limiting, transient GitHub errors). Consider failing the workflow (or at least logging stderr and exiting non-zero) when the PR list cannot be retrieved so maintainers notice the watchdog isn't actually scanning.
| prs=$(gh pr list --repo "$REPO" --state open --limit 200 \ | |
| --json number,headRefOid,updatedAt,isDraft,title 2>/dev/null || echo '[]') | |
| gh_pr_list_stderr=$(mktemp) | |
| if ! prs=$(gh pr list --repo "$REPO" --state open --limit 200 \ | |
| --json number,headRefOid,updatedAt,isDraft,title 2>"$gh_pr_list_stderr"); then | |
| echo "[watchdog] ERROR: failed to list open PRs for ${REPO}" >&2 | |
| cat "$gh_pr_list_stderr" >&2 | |
| rm -f "$gh_pr_list_stderr" | |
| exit 1 | |
| fi | |
| rm -f "$gh_pr_list_stderr" |
Dogfood resultThe gate workflow ran against this PR itself and passed in 1m34s alongside This validates both code paths:
The bootstrap branch is what makes shipping the orchestrator in the same PR that adds it possible. Once this PR merges to Ready for review. |
- merge_gate_wait.sh: use $EXPECTED_CHECK in failure annotation instead of hardcoded 'Build & Test (Linux)' so the orchestrator stays generic. - merge-gate.yml: add curl --retry/--max-time on the script bootstrap fetch so a transient GitHub API blip does not redden the gate. - watchdog_scan.sh: fail loudly with stderr capture if 'gh pr list' errors out, instead of silently treating it as zero PRs (which would hide auth regressions or rate limiting). - watchdog_scan.sh: paginate the changed-files API call so PRs touching >100 files cannot be misclassified as docs-only and skipped. - CHANGELOG: append (#865) to follow the repo convention. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Addressed Copilot review feedback (commit 123e193)All 5 comments adopted:
Re-validated: shellcheck clean, dry-run against microsoft/apm clean, gate + Build & Test green on this commit. |
The watchdog (cron every 15min, posts recovery comments on stuck PRs) was originally justified for the shadow-mode transition window. Since we are flipping to required immediately after this PR merges, that justification disappears. The merge-gate workflow already triggers on both 'pull_request' and 'pull_request_target' with concurrency dedup, so a single dropped webhook still fires the gate. The watchdog only added value for the double-drop case (both webhook channels fail for the same PR), which is vanishingly rare. We can add it back later if we ever observe one. Removes: - .github/workflows/watchdog-stuck-prs.yml - .github/scripts/ci/watchdog_scan.sh Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Dropped the watchdog (commit eac27d4)Per discussion: since we flip the gate to required immediately after merge, the watchdog's transition-period justification disappears. The gate's dual-trigger redundancy ( Removed:
PR title and body updated accordingly. Re-validation: gate + Build & Test still green on |
Phase 2 flip executed
Required checks on
From now on, any PR that hits the dropped- Follow-ups (separate PRs, after ~1 week of clean operation)
Rollback (if needed): |
Correction to my earlier follow-up suggestionIn the flip comment I suggested dropping the 4 stub names ( Re-reading Corrected matrix of required checks on
The only check that could potentially be dropped later is |
The dual-trigger pattern (pull_request + pull_request_target with concurrency cancel-in-progress) shipped in #865 was over-engineered. It produced TWO 'gate' check-runs per SHA -- one SUCCESS, one CANCELLED -- and branch protection's status-check rollup treats CANCELLED as failure, so PRs were silently BLOCKED unless an admin overrode (which masked the bug on #867). GitHub Actions has no primitive for 'either of these events succeeded'. World-class OSS projects (kubernetes, rust, deno, next.js) accept this and use a single trigger. The cost: a dropped 'pull_request' webhook (rare; observed once on PR #856) requires manual recovery. Recovery paths now documented at top of file: - push empty commit - gh workflow run merge-gate.yml -f pr_number=NNN - close + reopen PR Replaces the dual-trigger + bootstrap-fetch dance with a clean two-job flow: resolve-sha (handles workflow_dispatch input or PR head) then gate (sparse checkout + run script). Same script, same exit codes, same EXPECTED_CHECKS env. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* chore: prepare v0.9.2 release Bumps version to 0.9.2 and finalizes CHANGELOG with one-line summaries for each PR merged since 0.9.1. Highlights: - ADO AAD bearer-token auth (#856) - Governance Guide + enterprise docs IA refactor (#851, #858) - Merge Gate orchestrator + single-authority aggregation (#865, #867) - Landing + first-package docs rewrite (#855, #866) - gh-aw imports migration (#864) - Custom-port surfacing fix (#804) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * ci: simplify merge-gate to single pull_request trigger The dual-trigger pattern (pull_request + pull_request_target with concurrency cancel-in-progress) shipped in #865 was over-engineered. It produced TWO 'gate' check-runs per SHA -- one SUCCESS, one CANCELLED -- and branch protection's status-check rollup treats CANCELLED as failure, so PRs were silently BLOCKED unless an admin overrode (which masked the bug on #867). GitHub Actions has no primitive for 'either of these events succeeded'. World-class OSS projects (kubernetes, rust, deno, next.js) accept this and use a single trigger. The cost: a dropped 'pull_request' webhook (rare; observed once on PR #856) requires manual recovery. Recovery paths now documented at top of file: - push empty commit - gh workflow run merge-gate.yml -f pr_number=NNN - close + reopen PR Replaces the dual-trigger + bootstrap-fetch dance with a clean two-job flow: resolve-sha (handles workflow_dispatch input or PR head) then gate (sparse checkout + run script). Same script, same exit codes, same EXPECTED_CHECKS env. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * ci: collapse merge-gate into a single job (one check-run in PR UI) The two-job split (resolve-sha + gate) created two visible check-runs. Inlining the SHA resolution as a step within the gate job leaves only one check-run -- 'Merge Gate / gate' -- on the PR. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Why
PR #856 sat with
Build & Test (Linux)stuck atExpected -- Waiting for status to be reporteddespite the same commit having a green run onmainminutes earlier. Root cause: GitHub silently dropped thepull_requestwebhook delivery for that PR. There was no retrigger mechanism, no failure annotation, nothing actionable in the UI. The PR sat in limbo until I noticed.This is a structural fragility, not a one-off bug. Required-status-checks in branch protection are name-based, not workflow-based. If the workflow that produces a given check name never dispatches, the check stays in
Expectedforever. There is no built-in timeout.What this PR does
Adds one new workflow (
Merge Gate) that turns the failure mode into a clear red check.After this PR:
The gate triggers on both
pull_requestandpull_request_target(withconcurrencydedup). Ifpull_requestdrops,pull_request_targetstill fires the gate. Single dropped webhook no longer creates a stuck PR.Files
.github/workflows/merge-gate.ymlBuild & Test (Linux), fails red if it never appears.github/scripts/ci/merge_gate_wait.sh0ok,1failed,2never appeared,3stuck mid-run,4invalid env)CHANGELOG.mdSecurity review
The orchestrator runs on two events with different security postures:
pull_requestpull_request_targetpull_request_targetis the historically-dangerous trigger. We use it only to fetch the trusted base-ref script via the GitHub API and execute it; we neveractions/checkoutthe PR head on this event. This matches the same pattern used byci-integration-pr-stub.ymlalready in the repo.Self-bootstrap
Chicken-and-egg: this PR adds the script the gate depends on. On the very first run (this PR), the script doesn't exist on
mainyet, so the API fetch onpull_request_targetreturns 404. We handle this with a self-bootstrap fallback: write a no-opexit 0and emit::warning::so the PR can land. Once merged, all future PRs hit the real script.Test results (live)
shellcheck .github/scripts/ci/*.sh0000...0000SHA::errorannotationgatecheck on this PRBuild & Test (Linux)1m19s--retry 5 --retry-delay 3 --retry-connrefused --max-time 30Phase 2: flip plan (post-merge, manual)
After this PR merges, branch protection needs to be updated to make
Merge Gate / gaterequired. I will execute via API once you ack the merge:After ~1 week of clean operation, drop
Build & Test (Linux)from required and letMerge Gate / gatebe the sole required check. The 4 stub names fromci-integration-pr-stub.yml(Build/Smoke/Integration/Release-Validate Linux) can also be cleaned up once the gate covers the same surface.Rollback
Merge Gate / gateto required checks. Workflow becomes a no-op informational check.Out of scope (deliberate)
pull_requestANDpull_request_targetfail simultaneously), which is vanishingly rare. We can add it back if we ever observe one.ci.yml: this PR adds a gate, not a rewrite.ci.ymlkeeps doing its job.