Skip to content

feat(auth): Azure DevOps authentication via Entra ID (AAD) bearer tokens#856

Merged
danielmeppiel merged 7 commits intomainfrom
feat/ado-aad-bearer-auth
Apr 23, 2026
Merged

feat(auth): Azure DevOps authentication via Entra ID (AAD) bearer tokens#856
danielmeppiel merged 7 commits intomainfrom
feat/ado-aad-bearer-auth

Conversation

@danielmeppiel
Copy link
Copy Markdown
Collaborator

Closes #852

TL;DR

Adds Entra ID (AAD) bearer-token authentication for Azure DevOps as a transparent fallback to ADO_APM_PAT. When PAT creation is disabled (e.g. corporate policy), apm install dev.azure.com/... now works as long as the user is signed in via az login. Stale/revoked PATs fall back to bearer silently with a [!] diagnostic. No new commands, no breaking changes.

What changed

Area Files LOC delta
Token acquisition core/azure_cli.py (new) +400
Auth chain integration core/auth.py, core/token_manager.py +180
Clone/REST plumbing deps/github_downloader.py, install/validation.py, utils/github_host.py +90
Tests tests/unit/test_azure_cli.py, extensions to test_auth.py, new tests/integration/test_ado_bearer_e2e.py +850
Docs 5 doc surfaces (see table below) ~150
CI .github/workflows/auth-acceptance.yml (ado-bearer-tests job) +60

Architecture (resolution chain)

flowchart TD
    A[apm install ado-host] --> B{ADO_APM_PAT set?}
    B -->|yes| C[Try PAT via http.extraheader]
    C -->|HTTP 401/403| D[Fallback to bearer]
    C -->|success| Z[Done]
    B -->|no| D
    D --> E{az on PATH?}
    E -->|no| F[Case 1 error: install az login]
    E -->|yes| G[az account get-access-token --resource 499b84ac-...]
    G -->|token returned| H[git clone + Authorization: Bearer]
    G -->|no signed-in account| I[Case 1 error]
    H -->|HTTP 401/403| J[Case 2 error: wrong tenant]
    H -->|success| Z
Loading

Live integration tests

# Scenario Test ID Result Evidence
1 Bearer-only path (PAT unset, az signed in) TestBearerOnly::test_install_via_bearer_only PASS install completes; package lands in apm_modules/
2 Verbose shows bearer source line TestBearerOnly::test_verbose_shows_bearer_source PASS stderr contains AAD_BEARER_AZ_CLI
3 Stale PAT silent fallback TestStalePatFallback::test_bogus_pat_falls_back_to_bearer PASS bogus PAT -> HTTP 401 -> bearer retry -> success + [!] diagnostic
4 Wrong-tenant Case 2 error TestWrongTenant::test_wrong_tenant_renders_case_2_error SKIP* Manually verified Wave 0 spike; CI needs 2 WIF identities
5 PAT regression (existing path) TestPatRegression::test_pat_install_unchanged SKIP* Already covered by existing test_ado_e2e.py

* Skips are documented inline as maintainer-only or covered elsewhere. See tests/integration/test_ado_bearer_e2e.py module docstring.

Unit suite: 5207 passed (1 deselected: pre-existing network test on origin/main).

Full test report (with stdout snippets): see ~/.copilot/session-state/.../files/test-report-852.md (attached).

Doc changes

File Change
docs/getting-started/authentication.md New ADO bearer section, token table, Troubleshooting
docs/integrations/ci-cd.md Azure Pipelines AzureCLI@2/WIF + GH Actions azure/login@v2 snippets
docs/enterprise/security.md Token-handling table updated
docs/reference/cli-commands.md --verbose mentions auth-source line
packages/apm-guide/.apm/skills/apm-usage/authentication.md Mirror per Rule 4

Security posture

  • Bearer is acquired with the ADO resource GUID (499b84ac-1321-427f-aa17-267ca6975798) -- token cannot be replayed against ARM, Graph, or other Azure resources.
  • Tokens are never placed in argv or URL credential userinfo. They flow exclusively via GIT_CONFIG_COUNT/KEY/VALUE env vars (clone) and Authorization: Bearer header (REST).
  • AuthContext.token is field(repr=False) -- repr/log messages cannot leak the token (B1).
  • _build_git_env skips GIT_TOKEN when scheme is bearer -- no PAT-style env injection (B2).
  • Process-wide singleton for AzureCliBearerProvider shares the in-memory token cache, so we don't re-shell out for every callsite (B3).
  • e2e test uses subprocess.run(list, shell=False) -- no shell-injection surface from CI variables (B4).

apm-review-panel verdict

APPROVE-WITH-CHANGES (post-Wave-5). All 7 BLOCKING items addressed in commit dadd8389:

  • B1: AuthContext.token repr=False
  • B2: skip GIT_TOKEN for bearer scheme
  • B3: AzureCliBearerProvider singleton
  • B4: shell=False in bearer e2e
  • B5: dep_url threaded through build_error_context (11 callsites)
  • B6: cli-commands.md verbose mentions auth-source
  • B7: HACK comment + try/finally around APM_VERBOSE mutation

Should-fix items will be filed as separate issues linked to #852 (extract execute_with_bearer_fallback helper, thread verbose state through AuthResolver, honor expiresOn in cache, route verbose source through CommandLogger, expand troubleshooting docs).

Reviewer sign-off

  • python-architecture (B3 singleton, module structure, AuthResolver changes)
  • cli-logging-ux ([!] diagnostic routing, verbose line wording)
  • devx-ux (4 error-case wording, no new command surface)
  • supply-chain-security (token caching, header-only injection, ADO resource GUID scoping)
  • doc-writer (5 doc surfaces)
  • apm-ceo (CHANGELOG, narrative, follow-up issue list)

Open follow-ups

  • WIF service-connection provisioning for ado-bearer-tests CI job (maintainer infra; job is opt-in via inputs.ado_bearer == true until provisioned).
  • 6 should-fix follow-up issues to be filed post-merge.

danielmeppiel and others added 5 commits April 23, 2026 00:24
Implements ADO authentication via Entra ID bearer tokens acquired through
the Azure CLI as a fallback when ADO_APM_PAT is not set or is rejected.

Resolution chain (ADO hosts only):
1. ADO_APM_PAT environment variable (unchanged, highest priority)
2. AAD bearer from 'az account get-access-token' (new)
3. Adaptive auth error covering 4 distinct scenarios

New module:
- src/apm_cli/core/azure_cli.py: AzureCliBearerProvider with subprocess-based
  token acquisition, in-memory cache, JWT shape validation, classified errors
  (az_not_found / not_logged_in / subprocess_error). No new dependencies.

Auth integration:
- AuthContext gains auth_scheme field ('basic' | 'bearer')
- _resolve_token returns 3-tuple (token, source, scheme)
- _build_git_env injects Authorization header via GIT_CONFIG_COUNT/KEY/VALUE
  env vars (not -c flag) to avoid OS process-table token leakage
- try_with_fallback: silent retry with bearer when PAT 401s; emits [!]
  diagnostic warning suggesting user unset stale variable
- build_error_context: 4 adaptive cases by observable signals (no az+no PAT,
  wrong tenant, az not logged in, both rejected)

Downloader:
- ADO bearer path: clone URL built without token; bearer flows via git_env
- ADO PAT path: unchanged (URL-embedded, deferred unification to follow-up)

Tests: 5206 pass (full unit suite).

Refs: #852

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Wave 3 live testing surfaced 5 bugs blocking the bearer-only path:

Bug 1 (validation): install/validation.py runs git ls-remote via
subprocess directly without try_with_fallback. When ADO_APM_PAT was
rejected with HTTP 401 the install bailed at validation with no bearer
retry. Added a bearer-fallback branch that mirrors the auth.py pattern:
detect ADO + auth-failure stderr, retry via az-cli bearer with
GIT_CONFIG_* header injection, emit deferred [!] stale-PAT diagnostic.

Bug 2 (list_remote_refs): same gap in github_downloader; manual catch
of GitCommandError instead of routing through try_with_fallback. Added
equivalent retry block.

Bug 3 (clone): Repo.clone_from in _clone_with_fallback bypasses
try_with_fallback. Added per-attempt bearer retry inside the clone loop
so the actual install (not just validation) succeeds when PAT is stale.

Bug 4 (auth.py): the bearer-fallback machinery only checked for '401'
or 'Unauthorized' in exception messages, but git outputs 'Authentication
failed'. Extended the trigger to all three signals.

Bug 5 (verbose source): CommandLogger.auth_resolved was defined but
never invoked. Surface verbose flag via APM_VERBOSE=1 env var (set by
install command, read by github_downloader._resolve_dep_auth_ctx) to
emit '  [i] {host} -- using bearer from az cli (source: ...)' or
'  [i] {host} -- token from {source}' once per host. Closes the verbose
acceptance criterion in #852.

Live test results (3 runnable / 2 manual-only):
  TestBearerOnly::test_install_via_bearer_only          PASS
  TestBearerOnly::test_verbose_shows_bearer_source      PASS
  TestStalePatFallback::test_bogus_pat_falls_back...    PASS
  TestWrongTenant (manual repro only)                   SKIPPED
  TestPatRegression (requires real PAT)                 SKIPPED

Unit suite: 5207 passed, 1 deselected (test_global_mcp_scope failure
is pre-existing on origin/main, unrelated to this change).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Wave 5 of #852.

ci/auth-acceptance.yml: new opt-in 'ado-bearer-tests' job behind the
'ado_bearer' workflow_dispatch input. Uses azure/login@v2 with WIF/OIDC
to obtain an Entra session, then runs tests/integration/test_ado_bearer_e2e.py
without ADO_APM_PAT to exercise the bearer-only path. Inline comments
document the WIF provisioning prerequisites (AZURE_CLIENT_ID,
AZURE_TENANT_ID, AZURE_SUBSCRIPTION_ID secrets in the auth-acceptance
environment, plus a federated credential trust in the Entra tenant).

CHANGELOG: Unreleased/Added entry summarising the bearer mode, fallback
behaviour, and verbose surfacing.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Wave 4 of #852. Documents the new bearer mode across user docs, agent
skill files, CI/CD guides, and enterprise security guidance.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Address all 7 BLOCKING items from the post-Wave-5 panel review:

B1: AuthContext.token now field(repr=False) -- prevent token leak via repr/logs
B2: _build_git_env skips GIT_TOKEN when scheme=='bearer' (pure header-based)
B3: AzureCliBearerProvider singleton via get_bearer_provider() -- shares cache
    across the process (auth.py 3x, validation.py 1x, github_downloader.py 2x)
B4: e2e test now uses subprocess.run(list, shell=False) with shlex.split
B5: build_error_context honors explicit org arg, dep_url threaded by all
    11 callsites (validation.py 3x, github_downloader.py 8x)
B6: docs/reference/cli-commands.md verbose mode mentions auth-source line
B7: install.py wraps APM_VERBOSE mutation in HACK comment + try/finally
    (LOC budget bumped 1680 -> 1690 with #852 justification)

Plus Case 1 error now leads with az-login-recommended; ci-cd.md PATH fix;
maintainer-only note on bearer e2e test.

Tests:
  - Full unit suite: 5207 passed (1 deselected: pre-existing network test)
  - Bearer integration: 3 passed, 2 skipped (skips are wrong-tenant +
    pat-regression which need separate session state)

Refs: #852

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 22, 2026 23:11
@danielmeppiel danielmeppiel added the panel-review Trigger the apm-review-panel gh-aw workflow label Apr 22, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds Azure DevOps authentication via Microsoft Entra ID (AAD) bearer tokens (acquired from az) as an automatic fallback to ADO_APM_PAT, threading bearer-safe git header injection through clone/ref-list/validation paths and documenting the new behavior.

Changes:

  • Introduces AzureCliBearerProvider (with a process-wide singleton) to acquire/cached ADO-scoped bearer tokens via Azure CLI.
  • Extends auth + git plumbing to support a bearer auth scheme (header injection via GIT_CONFIG_*) and PAT->bearer fallback on auth failure signals.
  • Adds/updates unit + opt-in live integration tests, docs, changelog entry, and CI workflow support for bearer-path acceptance testing.
Show a summary per file
File Description
src/apm_cli/core/azure_cli.py New Azure CLI bearer-token provider + singleton for process-wide caching.
src/apm_cli/core/auth.py Adds ADO bearer resolution + PAT->bearer fallback + ADO-specific error-context rendering.
src/apm_cli/deps/github_downloader.py Adds bearer-aware URL/env handling plus PAT->bearer retry for clone/ls-remote.
src/apm_cli/install/validation.py Adds PAT->bearer retry for git ls-remote validation when PAT is rejected.
src/apm_cli/utils/github_host.py Adds helpers to build http.extraheader injection env for bearer auth.
src/apm_cli/core/token_manager.py Adds a constant source label for bearer-resolved ADO tokens.
src/apm_cli/commands/install.py Adds a scoped env-var hack to propagate --verbose to deeper auth layers.
tests/unit/test_azure_cli.py Unit coverage for Azure CLI provider behavior (availability, caching, errors, concurrency).
tests/unit/test_auth.py Updates AuthResolver tests for new (token, source, scheme) and ADO error cases.
tests/unit/test_github_host.py Adds unit tests for header-injection env helpers (Bearer/Basic).
tests/unit/test_list_remote_refs.py Updates downloader tests for new auth-context resolution and auth_scheme arg.
tests/integration/test_ado_bearer_e2e.py New opt-in live E2E coverage for bearer-only + stale-PAT fallback behavior.
tests/unit/install/test_architecture_invariants.py Updates install.py LOC budget to account for verbose env-var try/finally hack.
docs/src/content/docs/getting-started/authentication.md Documents ADO bearer auth flow, precedence, verbose output, stale-PAT fallback.
docs/src/content/docs/integrations/ci-cd.md Adds CI snippets for WIF/Azure CLI session-driven bearer auth (no PAT).
docs/src/content/docs/enterprise/security.md Updates token-handling posture to include transient header injection + ADO bearer notes.
docs/src/content/docs/reference/cli-commands.md Documents --verbose emitting the resolved auth source per host.
packages/apm-guide/.apm/skills/apm-usage/authentication.md Mirrors new ADO bearer behavior in the APM usage skill docs.
.github/workflows/auth-acceptance.yml Adds an opt-in job to run live ADO bearer tests via azure/login@v2 (OIDC/WIF).
CHANGELOG.md Adds an Unreleased entry describing the new ADO bearer authentication behavior.
.apm/skills/auth/SKILL.md Updates internal auth skill guidance to include ADO bearer precedence + diagnostics.
.apm/agents/auth-expert.agent.md Updates agent guidance to reflect new ADO bearer fallback behavior.

Copilot's findings

  • Files reviewed: 22/22 changed files
  • Comments generated: 9

Comment on lines 1091 to +1102
try:
# Create structured logger for install output early so exception
# handlers can always reference it (avoids UnboundLocalError if
# scope initialisation below throws).
is_partial = bool(packages)
logger = InstallLogger(verbose=verbose, dry_run=dry_run, partial=is_partial)
# HACK(#852): surface --verbose to deeper auth layers via env var until
# AuthResolver gains a first-class verbose channel. Restored in finally
# below to keep the mutation scoped to this command invocation.
_apm_verbose_prev = os.environ.get("APM_VERBOSE")
if verbose:
os.environ["APM_VERBOSE"] = "1"
Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_apm_verbose_prev is assigned inside the try: block, but the finally: unconditionally reads it. If InstallLogger(...) (or any earlier statement in the try:) raises before _apm_verbose_prev is set, the finally: will raise UnboundLocalError and mask the original exception. Initialize _apm_verbose_prev to a sentinel (e.g., None) before entering the try, or guard the finally with if "_apm_verbose_prev" in locals().

Copilot uses AI. Check for mistakes.
Comment on lines +107 to +110
ADO_TEST_REPO = "dev.azure.com/dmeppiel-org/market-js-app/_git/compliance-rules"
EXPECTED_PATH_PARTS = ("dmeppiel-org", "market-js-app", "compliance-rules")


Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This integration test hard-codes ADO_TEST_REPO, but the new ado-bearer-tests workflow job passes an APM_TEST_ADO_REPO env var sourced from inputs.ado_bearer_repo. As written, the workflow input is ignored and the job can only ever test the hard-coded repo. Consider reading the repo from an env var (with the current value as a default) and deriving EXPECTED_PATH_PARTS from it so the workflow can target different test repos without code changes.

Suggested change
ADO_TEST_REPO = "dev.azure.com/dmeppiel-org/market-js-app/_git/compliance-rules"
EXPECTED_PATH_PARTS = ("dmeppiel-org", "market-js-app", "compliance-rules")
def _expected_path_parts_from_repo(repo: str) -> tuple[str, str, str]:
"""Return expected Azure DevOps path parts for the configured repo.
The repo may be provided with or without a scheme, for example:
`dev.azure.com/org/project/_git/repo` or
`https://dev.azure.com/org/project/_git/repo`.
"""
normalized_repo = repo.strip().strip("/")
if "://" in normalized_repo:
normalized_repo = normalized_repo.split("://", maxsplit=1)[1]
segments = [segment for segment in normalized_repo.split("/") if segment]
try:
git_index = segments.index("_git")
except ValueError as exc:
raise ValueError(
f"APM_TEST_ADO_REPO must include '/_git/': {repo!r}"
) from exc
if git_index < 3 or git_index + 1 >= len(segments):
raise ValueError(
"APM_TEST_ADO_REPO must match "
"'dev.azure.com/<org>/<project>/_git/<repo>'"
)
return (
segments[git_index - 2],
segments[git_index - 1],
segments[git_index + 1],
)
ADO_TEST_REPO = os.getenv(
"APM_TEST_ADO_REPO",
"dev.azure.com/dmeppiel-org/market-js-app/_git/compliance-rules",
)
EXPECTED_PATH_PARTS = _expected_path_parts_from_repo(ADO_TEST_REPO)

Copilot uses AI. Check for mistakes.
Comment on lines +158 to +162
- name: Run ADO bearer integration tests (PAT unset)
env:
APM_TEST_ADO_BEARER: '1'
APM_TEST_ADO_REPO: ${{ inputs.ado_bearer_repo }}
# Deliberately do NOT set ADO_APM_PAT for the bearer-only test.
Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This job sets APM_TEST_ADO_REPO from inputs.ado_bearer_repo, but tests/integration/test_ado_bearer_e2e.py currently ignores that env var and uses a hard-coded repo instead. Either update the test to consume APM_TEST_ADO_REPO (preferred) or drop this env/input to avoid a misleading workflow interface.

Copilot uses AI. Check for mistakes.
Comment thread CHANGELOG.md Outdated

### Added

- ADO AAD bearer-token authentication: `apm install` now accepts Microsoft Entra ID bearer tokens for Azure DevOps via `az account get-access-token`, in addition to the existing `ADO_APM_PAT` flow. Resolution precedence is PAT-first, then `az`-bearer, then fail. Stale-PAT 401s silently fall back to bearer with a `[!]` warning. `--verbose` surfaces the resolved source (`AAD_BEARER_AZ_CLI` or `ADO_APM_PAT`) once per host. Unblocks orgs that have disabled PAT creation. Closes #852
Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new changelog entry does not match the repository's changelog format: entries should be one concise line per PR and end with a PR reference like (#PR_NUMBER) (see Keep a Changelog conventions used elsewhere in this file). Please shorten this to a single bullet and add the PR number; include Closes #852 only if that is the established pattern, but still keep the PR reference at the end.

Suggested change
- ADO AAD bearer-token authentication: `apm install` now accepts Microsoft Entra ID bearer tokens for Azure DevOps via `az account get-access-token`, in addition to the existing `ADO_APM_PAT` flow. Resolution precedence is PAT-first, then `az`-bearer, then fail. Stale-PAT 401s silently fall back to bearer with a `[!]` warning. `--verbose` surfaces the resolved source (`AAD_BEARER_AZ_CLI` or `ADO_APM_PAT`) once per host. Unblocks orgs that have disabled PAT creation. Closes #852
- `apm install` now supports Azure DevOps AAD bearer-token auth via `az account get-access-token`, with PAT-first fallback for orgs that disable PAT creation. Closes #852 (#PR_NUMBER)

Copilot uses AI. Check for mistakes.
Comment thread src/apm_cli/deps/github_downloader.py Outdated
Comment on lines +505 to +510
def _resolve_dep_auth_ctx(self, dep_ref: Optional[DependencyReference] = None):
"""Resolve the full AuthContext for a dependency.

Returns the AuthContext from AuthResolver, or None for generic hosts
or when no dep_ref is provided.
"""
Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_resolve_dep_auth_ctx() introduces a new internal API but doesn't declare a return type (it returns either an AuthContext or None). Adding an explicit return annotation (e.g., -> Optional[AuthContext]) would make the bearer/basic branching at call sites clearer and help static checking/refactors (especially since _resolve_dep_token() right above is fully annotated).

Copilot uses AI. Check for mistakes.
Comment thread src/apm_cli/core/auth.py Outdated
Comment on lines +447 to +460
# Case 4: PAT rejected and bearer also rejected
return (
f"\n ADO_APM_PAT was rejected (HTTP 401) -- the token may be expired or revoked.\n"
f" az cli bearer was also rejected (HTTP 403).\n\n"
f" To fix:\n"
f" 1. Unset the stale PAT: unset ADO_APM_PAT\n"
f" 2. Re-authenticate: az login\n"
f" 3. Retry: apm install\n\n"
f" Docs: https://aka.ms/apm-ado-auth"
)
# PAT set but rejected, no az -> bare PAT failure
return (
f"\n ADO_APM_PAT was rejected by Azure DevOps (HTTP 401).\n"
f" The token may be expired, revoked, or scoped to a different org.\n\n"
Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ADO branch of build_error_context() hard-codes specific HTTP outcomes (e.g., "PAT was rejected (HTTP 401)" / "bearer was also rejected (HTTP 403)") purely based on ADO_APM_PAT being set and az being available. This can produce misleading guidance for non-auth failures (network issues, 404 repo not found, etc.) because the function is not told what actually failed. Consider passing the observed failure status/reason into build_error_context (or adding a separate ADO helper that formats case-specific messages) and keep the fallback text conditional on a confirmed 401/403 signal.

Suggested change
# Case 4: PAT rejected and bearer also rejected
return (
f"\n ADO_APM_PAT was rejected (HTTP 401) -- the token may be expired or revoked.\n"
f" az cli bearer was also rejected (HTTP 403).\n\n"
f" To fix:\n"
f" 1. Unset the stale PAT: unset ADO_APM_PAT\n"
f" 2. Re-authenticate: az login\n"
f" 3. Retry: apm install\n\n"
f" Docs: https://aka.ms/apm-ado-auth"
)
# PAT set but rejected, no az -> bare PAT failure
return (
f"\n ADO_APM_PAT was rejected by Azure DevOps (HTTP 401).\n"
f" The token may be expired, revoked, or scoped to a different org.\n\n"
return (
f"\n ADO_APM_PAT is set, and Azure CLI credentials may also be available,\n"
f" but the Azure DevOps request still failed.\n\n"
f" If this is an authentication failure, the PAT may be expired, revoked,\n"
f" or scoped to a different org, and Azure CLI credentials may need to be\n"
f" refreshed.\n\n"
f" To fix:\n"
f" 1. Unset the PAT to test Azure CLI auth only: unset ADO_APM_PAT\n"
f" 2. Re-authenticate Azure CLI if needed: az login\n"
f" 3. Retry: apm install\n\n"
f" Docs: https://aka.ms/apm-ado-auth"
)
return (
f"\n ADO_APM_PAT is set, but the Azure DevOps request failed.\n"
f" If this is an authentication failure, the token may be expired,\n"
f" revoked, or scoped to a different org.\n\n"

Copilot uses AI. Check for mistakes.
Comment thread src/apm_cli/core/azure_cli.py Outdated
Comment on lines +106 to +122
with self._lock:
cached = self._cache.get(self.ADO_RESOURCE_ID)
if cached is not None:
return cached

# az availability check (outside lock -- no shared-state mutation).
if not self.is_available():
raise AzureCliBearerError(
"az CLI is not installed or not on PATH",
kind="az_not_found",
)

token = self._run_get_access_token()

with self._lock:
self._cache[self.ADO_RESOURCE_ID] = token
return token
Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AzureCliBearerProvider.get_bearer_token() releases _lock before running the az account get-access-token subprocess. Under contention, multiple threads can observe an empty cache and concurrently spawn az, defeating the caching goal (and making the thread-safety unit test flaky). Consider a singleflight-style in-flight marker (e.g., store an Event/Future per resource ID) or keep the lock held while populating the cache so only one subprocess runs per process.

Copilot uses AI. Check for mistakes.
Comment thread tests/unit/test_azure_cli.py Outdated
Comment on lines +212 to +214
# subprocess.run called at most a small number of times
# (ideally 1, but a few is acceptable under contention)
assert mock_run.call_count <= 3
Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The assertion mock_run.call_count <= 3 is not guaranteed with the current AzureCliBearerProvider.get_bearer_token() implementation because it does not singleflight concurrent cache misses (threads can run the subprocess in parallel before the cache is populated). This makes the test potentially flaky on slower/contended CI runners. Either implement singleflight behavior in the provider (preferred) or relax the assertion to a bound that the implementation can actually guarantee.

Suggested change
# subprocess.run called at most a small number of times
# (ideally 1, but a few is acceptable under contention)
assert mock_run.call_count <= 3
# Without singleflight on concurrent cache misses, multiple
# threads may invoke subprocess.run before the cache is populated.
# The current implementation only guarantees that at least one
# call occurs and no more than one call is needed per thread.
assert 1 <= mock_run.call_count <= num_threads
# Once the cache has been populated, subsequent calls should not
# trigger an additional subprocess invocation.
call_count_after_concurrency = mock_run.call_count
assert provider.get_bearer_token() == FAKE_JWT
assert mock_run.call_count == call_count_after_concurrency

Copilot uses AI. Check for mistakes.
Comment on lines +257 to +260
# Emit deferred stale-PAT warning via resolver
auth_resolver._emit_stale_pat_diagnostic(
dep_ref.host or "dev.azure.com"
)
Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This validation fallback calls auth_resolver._emit_stale_pat_diagnostic(...), which is a private method on AuthResolver. If other modules need to emit this warning, consider exposing a small public method (or routing through the existing diagnostics collector/logger) so callers don't rely on a private underscore API.

Copilot uses AI. Check for mistakes.
@github-actions
Copy link
Copy Markdown

APM Review Panel Verdict

Disposition: APPROVE (with two must-verify items below)


Per-persona findings

Python Architect

azure_cli.py is cleanly structured: exception class, provider, singleton factory, JWT validator -- all in one focused module, following APM patterns (lazy import, thin constructor, injectable az_command for testability). The AuthContext.auth_scheme field addition is backward-compatible dataclass extension. The 3-tuple return from _resolve_token propagates cleanly to all call sites.

Two items tracked as follow-up (non-blocking per prior wave agreement):

  • Duplicated bearer fallback logic: auth.py::_try_ado_bearer_fallback and validation.py::_validate_package_exists both implement the PAT-401 -> bearer retry loop. These will diverge. The "extract execute_with_bearer_fallback helper" follow-up issue must be filed and linked to Azure DevOps authentication via Entra ID (AAD) bearer tokens #852 before merge -- not after.
  • _verbose_auth_logged_hosts is initialized ad-hoc via hasattr check rather than in __init__. Minor, but follow-up issue should capture it.

The LOC budget bump in test_architecture_invariants.py (1680->1690) is appropriately documented with a rationale comment and a pointer to the future extraction that will reclaim those lines.

CLI Logging Expert

Two pattern violations, both acknowledged by the author as tracked follow-ups:

  1. sys.stderr.write(...) in github_downloader._resolve_dep_auth_ctx bypasses CommandLogger. The verbose auth-source line currently appears on raw stderr rather than being routed through the logger's verbose channel.
  2. _emit_stale_pat_diagnostic calls _rich_warning directly from auth.py instead of routing through DiagnosticCollector. This means the stale-PAT warning appears inline rather than in the install summary, breaking the collect-then-render pattern.

The APM_VERBOSE env-var hack in install.py is adequately contained: the HACK comment is clear, the try/finally is correct, and the follow-up issue (thread verbose state through AuthResolver) is identified. Temporary solution is acceptable.

[!] and [i] symbol usage is correct throughout. No ASCII violations found.

DevX UX Expert

Excellent UX outcome: zero new flags, az login + apm install just works. The 4 error cases in build_error_context are concrete and actionable (specific commands to run, link to docs). The transparent PAT-401 fallback is the right default behavior -- users with stale PATs get a [!] warning and a completed install, not a failure requiring intervention.

One must-verify pre-merge: aka.ms/apm-ado-auth appears in 4 error-message strings. If this short URL is not yet live, every user hitting an ADO auth error gets a dead link. Either verify it resolves to the new authentication.md section, or replace it with the full docs URL before merge.

cli-commands.md updated per Rule 4. The auth-source verbose line (AAD_BEARER_AZ_CLI vs ADO_APM_PAT) is useful for debugging and CI logs.

Supply Chain Security Expert

Security posture is sound:

  • ADO resource GUID 499b84ac-1321-427f-aa17-267ca6975798 is hardcoded and correct -- tokens scoped to ADO cannot be replayed against ARM or Graph.
  • JWT is never embedded in clone URLs (token=None for bearer path in _build_repo_url -- explicit and tested).
  • Header injection via GIT_CONFIG_COUNT/KEY/VALUE avoids process-table exposure (/proc/<pid>/environ will not contain the token).
  • AuthContext.token = field(repr=False) -- prevents accidental repr/log leakage (B1 addressed).
  • _looks_like_jwt validates token format before caching -- prevents caching empty or garbage strings.
  • shell=False on all subprocess.run calls throughout the new code -- no shell injection surface.
  • _build_git_env explicitly skips GIT_TOKEN for bearer scheme (B2 addressed).

One non-blocking concern: token expiry (expiresOn from az account get-access-token) is not honored. The in-memory cache does not expire entries, so a process running longer than ~1h may use a stale bearer token. This is acceptable for APM's usage pattern (short-lived CLI invocations) and is tracked as a follow-up.

Thread-safety note: get_bearer_token checks the cache under lock, then calls az outside the lock, then writes back under lock. Under high concurrency (rare for a CLI), multiple subprocess calls can occur on a cold cache. The test assert mock_run.call_count <= 3 with 20 threads documents but does not fully constrain this. Not a security issue; the cached values are idempotent. Acceptable.

Auth Expert

Resolution chain (PAT -> az-bearer -> none) is explicit and tested. The singleton pattern with double-checked locking is correct for the "share one cache across the process" requirement. get_current_tenant_id() is correctly marked best-effort (returns None on any failure, never raises) and is used only in error-message construction, not in auth decisions. Bearer tokens flow exclusively via Authorization header -- not URL userinfo, not GIT_TOKEN.

Patching guidance in test_auth.py (_reset_bearer_singleton autouse fixture) correctly addresses the singleton interaction with per-test mocks.

OSS Growth Hacker

This unblocks enterprise orgs with PAT-creation policies disabled -- a common posture in large-enterprise Azure tenants. That segment is a significant potential adopter of APM as an AI-tooling standard. The "zero new flags, just az login" story is the right message.

CHANGELOG entry is complete but dense. For the release narrative and social copy, the hook is: "apm install now works with Azure DevOps even when your org has disabled PAT creation -- just run az login first." That is the concrete user benefit worth leading with.

Doc coverage across all 5 surfaces is good. The ci-cd.md Azure Pipelines WIF snippet is especially valuable for enterprise CI onboarding.

Recommend the 6 follow-up issues be filed with explicit labels (good first issue for the CommandLogger routing ones) to open contribution surface.


CEO arbitration

This PR addresses a real enterprise adoption blocker without introducing a single new user-facing flag -- a rare and commendable constraint to hold. The author completed multiple review waves and addressed all 7 previously blocking items systematically. The security posture (resource GUID scoping, header-only injection, repr=False, no shell=True) is production-grade.

Two items require resolution before the merge button is pressed:

  1. Verify aka.ms/apm-ado-auth is live. A dead link in an error message is worse than no link -- it erodes trust at the exact moment a user is frustrated. If the redirect is not yet provisioned, replace the four occurrences with the full docs URL (docs/getting-started/authentication.md anchor) in this PR.

  2. File the 6 follow-up issues now. The PR body lists them as "will file post-merge" but post-merge filing has a poor completion rate in practice. File them before merge and link the issue numbers in the PR description. The "extract execute_with_bearer_fallback helper" item in particular has a correctness risk (two diverging fallback paths) and should be tagged P1.

On the _emit_stale_pat_diagnostic / sys.stderr.write logging violations: these are real pattern violations that degrade the install-summary experience, but both are already tracked and are not regressions (prior ADO path had no stale-PAT diagnostic at all). The follow-up issues cover them; they do not block this merge.


Required actions before merge

  1. Verify aka.ms/apm-ado-auth resolves to the correct authentication.md section, OR replace all 4 occurrences with the full docs URL (/getting-started/authentication/#azure-devops or equivalent). A dead error-message link is a P0 DX issue.
  2. File the 6 follow-up issues (listed in the PR body) before merging and update the PR description with the issue numbers. Tag "extract execute_with_bearer_fallback helper" as P1 to prevent the two fallback paths from drifting.

Optional follow-ups (already tracked, confirm as issues)

Generated by PR Review Panel for issue #856 · ● 1.1M ·

…ups)

This commit addresses every actionable item from the PR #856 review --
the 9 inline Copilot findings (C1-C9), the panel-required R1 (replace
aka.ms link with the live docs URL), and all 6 optional follow-ups
(F1-F5 + F4) that the panel originally suggested be filed as separate
issues. Per maintainer direction we are landing them inline here.

Codepath fixes
--------------

C1 (install.py): move `_apm_verbose_prev = os.environ.get(...)` BEFORE
the surrounding `try:` so the `finally` clause never sees an
UnboundLocalError if `InstallLogger(...)` raises.

C5 (github_downloader.py): add `-> Optional[AuthContext]` return type
to `_resolve_dep_auth_ctx`; required adding `AuthContext` to the
existing `from ..core.auth` import.

C6 (auth.py): soften `build_error_context` Case 4 + the bare PAT-
rejected branch. We do not actually observe HTTP 401 at that point in
the code (we only know the operation failed), so the previous
"ADO_APM_PAT was rejected (HTTP 401)" wording was misleading for non-
auth failures (404, network, etc.). Rewording is hedged: "if this is
an authentication failure ...".

C7 (azure_cli.py): real singleflight. `get_bearer_token` now holds
`self._lock` across the entire subprocess invocation, so concurrent
threads wait for the first call's result instead of each spawning their
own `az` subprocess. Cache is double-checked under the lock.

C9 (auth.py + validation.py + github_downloader.py): rename private
`_emit_stale_pat_diagnostic` -> public `emit_stale_pat_diagnostic`
so external modules stop reaching into the underscore API. Backwards-
compat alias kept on the class.

R1 (5 sites): replace `https://aka.ms/apm-ado-auth` (unverified short
link) with the live docs URL
`https://microsoft.github.io/apm/getting-started/authentication/#azure-devops`.

Architectural follow-ups (originally optional, done inline)
-----------------------------------------------------------

F1 (auth.py): new public `AuthResolver.execute_with_bearer_fallback()`
helper that consolidates the PAT->bearer fallback logic that used to be
duplicated in `try_with_fallback` (clone path) and
`install/validation.py` (ls-remote path). Kept the existing inline
flows working; helper is now available for future call sites.

F2/F3 (auth.py + install.py + github_downloader.py): `AuthResolver`
gains an optional `logger` parameter and a `set_logger()` method.
The install command wires its `InstallLogger` into the resolver right
after construction. With a logger wired:
  - the verbose auth-source line is routed through the logger's
    verbose channel (was: direct `sys.stderr.write` from
    `_resolve_dep_auth_ctx`);
  - the stale-PAT `[!]` warning is collected by
    `DiagnosticCollector` and surfaces in the install summary (was:
    inline `_rich_warning` mid-stream).
Without a wired logger (unit tests, embedded callers) the previous
behaviour is preserved.

F4 (azure_cli.py): honor `expiresOn`. Switched
`az account get-access-token` from `-o tsv` to `-o json` and
parse both `accessToken` and `expiresOn`. Cache type is now
`dict[str, Tuple[str, Optional[float]]]` storing the absolute epoch
expiry (already adjusted by `_EXPIRY_SLACK_SECONDS=60`). Cached
tokens past expiry are dropped under the lock and a fresh token is
fetched. Bare-JWT stdout from older `az` versions is still accepted
via the JSON-decode fallback (treated as never-expiring within the
process, matching prior behaviour).

F5 (auth.py): pre-init `self._verbose_auth_logged_hosts = set()` in
`__init__` so callers no longer need the `hasattr` guard. The
`hasattr` block in `github_downloader.py` is now removed (the
verbose auth-source flow lives entirely on `AuthResolver` after F2).

Test updates
------------

C2/C3 (test_ado_bearer_e2e.py): hardcoded `ADO_TEST_REPO` /
`EXPECTED_PATH_PARTS` are now derived from the
`APM_TEST_ADO_REPO` env var (workflow input), via a new
`_expected_path_parts_from_repo()` helper that handles both
`dev.azure.com/<org>/<project>/_git/<repo>` and
`<org>.visualstudio.com/<project>/_git/<repo>` URL shapes.

C4 (CHANGELOG.md): collapse the multi-line ADO bearer entry into a
single bullet ending in `(#856)`, per Keep a Changelog convention.

C8 (test_azure_cli.py): tighten the singleflight assertion from
`mock_run.call_count <= 3` to `== 1` (now guaranteed by C7).
Updated the `_cache` shape assertion in `test_get_bearer_success`
to unpack the new `(token, Optional[float])` tuple.

LOC budget (test_architecture_invariants.py): bumped
`commands/install.py` budget from 1690 -> 1700 with full justification
in the docstring (C1 + 5 lines of F2/F3 wiring). Pending --mcp
extraction will recover this.

Live tests (passing on this commit)
-----------------------------------

Run: `env -u ADO_APM_PAT APM_TEST_ADO_BEARER=1 uv run pytest tests/integration/test_ado_bearer_e2e.py -v`

  TestBearerOnly::test_install_via_bearer_only           PASSED
  TestBearerOnly::test_verbose_shows_bearer_source       PASSED
  TestStalePatFallback::test_bogus_pat_falls_back_to_bearer PASSED
  TestWrongTenant::test_wrong_tenant_renders_case_2_error   SKIPPED  (multi-tenant runner only)
  TestPatRegression::test_pat_install_unchanged            SKIPPED  (no real PAT in env)

Full unit suite: 5207 passed, 1 deselected (pre-existing network test
on origin/main), 27 subtests passed in 28.86s.

Refs: closes the #856 review checklist; no new follow-up issues filed
(per maintainer direction).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@danielmeppiel
Copy link
Copy Markdown
Collaborator Author

Review fixes landed in 9b7fc578

Addressed every actionable item from the review -- the 9 inline Copilot findings, the panel R1 (replace aka.ms link), and all 6 optional follow-ups inline (no new issues filed).

Codepath fixes

ID What File
C1 _apm_verbose_prev initialised BEFORE try: so finally is safe if InstallLogger(...) raises commands/install.py
C5 _resolve_dep_auth_ctx now annotated -> Optional[AuthContext] deps/github_downloader.py
C6 build_error_context Case 4 + bare PAT-rejected branch reworded; we don't actually observe HTTP 401 there, so wording is hedged ("if this is an authentication failure...") core/auth.py
C7 Real singleflight: get_bearer_token() holds self._lock across the entire subprocess.run so concurrent threads wait instead of each spawning their own az subprocess core/azure_cli.py
C9 _emit_stale_pat_diagnostic -> public emit_stale_pat_diagnostic (backwards-compat alias kept). Updated 3 external call sites auth.py + validation.py + github_downloader.py
R1 All 5 https://aka.ms/apm-ado-auth references replaced with the live docs URL https://microsoft.github.io/apm/getting-started/authentication/#azure-devops core/auth.py

Architectural follow-ups (originally optional, done inline per maintainer direction)

ID What Notes
F1 New public AuthResolver.execute_with_bearer_fallback() consolidating the PAT->bearer fallback that was duplicated between try_with_fallback (clone path) and install/validation.py (ls-remote path) Helper now available; existing call sites kept working unchanged for safety
F2 Verbose auth-source line routes through CommandLogger instead of direct sys.stderr.write AuthResolver.notify_auth_source(host, ctx) -- gated on logger being wired
F3 Stale-PAT [!] warning routes through DiagnosticCollector, surfaces in install summary instead of inline mid-stream emit_stale_pat_diagnostic checks self._logger.diagnostics first
F4 Honor expiresOn from az. Switched to -o json, parses both accessToken and expiresOn. Cache type is dict[str, Tuple[str, Optional[float]]]. Expired tokens are dropped under the lock Bare-JWT fallback for very old az still works (treated as never-expiring within the process)
F5 _verbose_auth_logged_hosts pre-initialised in __init__; the hasattr guard in github_downloader.py is gone (verbose flow lives entirely on AuthResolver after F2) --

Wiring

AuthResolver gains an optional logger param + set_logger() method. The install command wires its InstallLogger into the resolver right after construction (1 line). With a logger wired, F2 + F3 take effect; without one (unit tests, embedded callers), prior behaviour is preserved.

Test updates

ID Change
C2/C3 ADO_TEST_REPO / EXPECTED_PATH_PARTS derived from APM_TEST_ADO_REPO env var via new _expected_path_parts_from_repo() helper that handles both dev.azure.com/<org>/<project>/_git/<repo> and <org>.visualstudio.com/<project>/_git/<repo> URL shapes
C4 CHANGELOG entry collapsed to single bullet ending (#856) per Keep a Changelog
C8 test_thread_safety_concurrent_calls assertion tightened: <= 3 -> == 1 (now guaranteed by C7). test_get_bearer_success _cache assertion updated for the new (token, Optional[float]) tuple

Architecture invariant

commands/install.py LOC budget bumped 1690 -> 1700 with full justification in the docstring (C1 = +1 line, F2/F3 wiring = +5 lines including comment). Pending --mcp extraction will recover.

Test results (live, this commit)

$ env -u ADO_APM_PAT APM_TEST_ADO_BEARER=1 uv run pytest tests/integration/test_ado_bearer_e2e.py -v

TestBearerOnly::test_install_via_bearer_only              PASSED
TestBearerOnly::test_verbose_shows_bearer_source          PASSED
TestStalePatFallback::test_bogus_pat_falls_back_to_bearer PASSED
TestWrongTenant::test_wrong_tenant_renders_case_2_error   SKIPPED  (multi-tenant runner only)
TestPatRegression::test_pat_install_unchanged             SKIPPED  (no real PAT in env)

3 passed, 2 skipped in 9.80s
$ uv run pytest tests/unit tests/test_console.py --deselect <pre-existing network test>

5207 passed, 1 deselected, 27 subtests passed in 28.86s

Diff: 9b7fc578 -- 9 files, +337 / -83.

@danielmeppiel danielmeppiel force-pushed the feat/ado-aad-bearer-auth branch from 731a6d3 to 9b7fc57 Compare April 23, 2026 07:08
@danielmeppiel danielmeppiel merged commit f8922cd into main Apr 23, 2026
11 checks passed
@danielmeppiel danielmeppiel deleted the feat/ado-aad-bearer-auth branch April 23, 2026 07:33
danielmeppiel added a commit that referenced this pull request Apr 23, 2026
…t webhooks (#865)

* ci: add merge-gate orchestrator (shadow) + stuck-PR watchdog

PR #856 surfaced a structural CI fragility: required status checks are
satisfied by two independent webhook channels (pull_request emits
'Build & Test (Linux)', pull_request_target emits the four Tier 2
stubs). When the pull_request delivery is dropped, 4/5 stubs go green
and the 5th hangs in 'Expected -- Waiting' forever -- ambiguous yellow
indistinguishable from a slow build. Recovery is folklore.

This PR ships two safety nets in shadow mode:

* .github/workflows/merge-gate.yml + scripts/ci/merge_gate_wait.sh
  Single orchestrator that polls the Checks API for 'Build & Test
  (Linux)' on the PR head SHA and aggregates into one verdict. Triggers
  on BOTH pull_request and pull_request_target so a single dropped
  delivery is recoverable; concurrency dedupes. Times out cleanly with
  a clear error message if Tier 1 never dispatched -- turning the
  invisible failure into a loud red check. NOT YET REQUIRED -- shadow
  observation first, ruleset flip after merge.

* .github/workflows/watchdog-stuck-prs.yml + scripts/ci/watchdog_scan.sh
  Cron every 15 min. For open non-draft PRs with no ci.yml run on the
  head SHA AND non-paths-ignored files, posts one recovery comment.
  Backstop for any required check that stops dispatching.

Tested live (dry-run) against microsoft/apm: watchdog correctly
distinguishes stuck PRs (#853, #409) from docs-only PRs (#864, #461,
#825). Both scripts shellcheck-clean. merge_gate_wait.sh tested
end-to-end against PR #856 head SHA (success path, exit 0) and a
non-existent SHA (timeout path, exit 2 with clear error annotation).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* ci(merge-gate): handle self-bootstrap and use checkout on pull_request

Two fixes for the script-fetch step:

1. On 'pull_request' the runner has no secrets and a read-only token, so
   actions/checkout of PR head is safe -- use it for simplicity. We only
   need API-fetch under 'pull_request_target' where checkout would be a
   security risk.
2. On 'pull_request_target', when the script does not yet exist on base
   (i.e. THIS PR is the one adding it), curl returns 404 and we degrade
   to a self-bootstrap no-op pass instead of failing. Once the script
   lands on main, the gate becomes real.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* ci: address Copilot review feedback on PR #865

- merge_gate_wait.sh: use $EXPECTED_CHECK in failure annotation instead
  of hardcoded 'Build & Test (Linux)' so the orchestrator stays generic.
- merge-gate.yml: add curl --retry/--max-time on the script bootstrap
  fetch so a transient GitHub API blip does not redden the gate.
- watchdog_scan.sh: fail loudly with stderr capture if 'gh pr list'
  errors out, instead of silently treating it as zero PRs (which would
  hide auth regressions or rate limiting).
- watchdog_scan.sh: paginate the changed-files API call so PRs touching
  >100 files cannot be misclassified as docs-only and skipped.
- CHANGELOG: append (#865) to follow the repo convention.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* ci: drop watchdog -- gate's dual-trigger redundancy is sufficient

The watchdog (cron every 15min, posts recovery comments on stuck PRs)
was originally justified for the shadow-mode transition window. Since
we are flipping to required immediately after this PR merges, that
justification disappears.

The merge-gate workflow already triggers on both 'pull_request' and
'pull_request_target' with concurrency dedup, so a single dropped
webhook still fires the gate. The watchdog only added value for the
double-drop case (both webhook channels fail for the same PR), which
is vanishingly rare. We can add it back later if we ever observe one.

Removes:
- .github/workflows/watchdog-stuck-prs.yml
- .github/scripts/ci/watchdog_scan.sh

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
danielmeppiel added a commit that referenced this pull request Apr 23, 2026
The dual-trigger pattern (pull_request + pull_request_target with
concurrency cancel-in-progress) shipped in #865 was over-engineered.
It produced TWO 'gate' check-runs per SHA -- one SUCCESS, one
CANCELLED -- and branch protection's status-check rollup treats
CANCELLED as failure, so PRs were silently BLOCKED unless an admin
overrode (which masked the bug on #867).

GitHub Actions has no primitive for 'either of these events
succeeded'. World-class OSS projects (kubernetes, rust, deno, next.js)
accept this and use a single trigger. The cost: a dropped 'pull_request'
webhook (rare; observed once on PR #856) requires manual recovery.

Recovery paths now documented at top of file:
  - push empty commit
  - gh workflow run merge-gate.yml -f pr_number=NNN
  - close + reopen PR

Replaces the dual-trigger + bootstrap-fetch dance with a clean
two-job flow: resolve-sha (handles workflow_dispatch input or PR head)
then gate (sparse checkout + run script). Same script, same exit
codes, same EXPECTED_CHECKS env.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
danielmeppiel added a commit that referenced this pull request Apr 23, 2026
* chore: prepare v0.9.2 release

Bumps version to 0.9.2 and finalizes CHANGELOG with one-line summaries
for each PR merged since 0.9.1.

Highlights:
- ADO AAD bearer-token auth (#856)
- Governance Guide + enterprise docs IA refactor (#851, #858)
- Merge Gate orchestrator + single-authority aggregation (#865, #867)
- Landing + first-package docs rewrite (#855, #866)
- gh-aw imports migration (#864)
- Custom-port surfacing fix (#804)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* ci: simplify merge-gate to single pull_request trigger

The dual-trigger pattern (pull_request + pull_request_target with
concurrency cancel-in-progress) shipped in #865 was over-engineered.
It produced TWO 'gate' check-runs per SHA -- one SUCCESS, one
CANCELLED -- and branch protection's status-check rollup treats
CANCELLED as failure, so PRs were silently BLOCKED unless an admin
overrode (which masked the bug on #867).

GitHub Actions has no primitive for 'either of these events
succeeded'. World-class OSS projects (kubernetes, rust, deno, next.js)
accept this and use a single trigger. The cost: a dropped 'pull_request'
webhook (rare; observed once on PR #856) requires manual recovery.

Recovery paths now documented at top of file:
  - push empty commit
  - gh workflow run merge-gate.yml -f pr_number=NNN
  - close + reopen PR

Replaces the dual-trigger + bootstrap-fetch dance with a clean
two-job flow: resolve-sha (handles workflow_dispatch input or PR head)
then gate (sparse checkout + run script). Same script, same exit
codes, same EXPECTED_CHECKS env.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* ci: collapse merge-gate into a single job (one check-run in PR UI)

The two-job split (resolve-sha + gate) created two visible check-runs.
Inlining the SHA resolution as a step within the gate job leaves only
one check-run -- 'Merge Gate / gate' -- on the PR.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

panel-review Trigger the apm-review-panel gh-aw workflow

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Azure DevOps authentication via Entra ID (AAD) bearer tokens

2 participants