feat(auth): Azure DevOps authentication via Entra ID (AAD) bearer tokens#856
feat(auth): Azure DevOps authentication via Entra ID (AAD) bearer tokens#856danielmeppiel merged 7 commits intomainfrom
Conversation
Implements ADO authentication via Entra ID bearer tokens acquired through
the Azure CLI as a fallback when ADO_APM_PAT is not set or is rejected.
Resolution chain (ADO hosts only):
1. ADO_APM_PAT environment variable (unchanged, highest priority)
2. AAD bearer from 'az account get-access-token' (new)
3. Adaptive auth error covering 4 distinct scenarios
New module:
- src/apm_cli/core/azure_cli.py: AzureCliBearerProvider with subprocess-based
token acquisition, in-memory cache, JWT shape validation, classified errors
(az_not_found / not_logged_in / subprocess_error). No new dependencies.
Auth integration:
- AuthContext gains auth_scheme field ('basic' | 'bearer')
- _resolve_token returns 3-tuple (token, source, scheme)
- _build_git_env injects Authorization header via GIT_CONFIG_COUNT/KEY/VALUE
env vars (not -c flag) to avoid OS process-table token leakage
- try_with_fallback: silent retry with bearer when PAT 401s; emits [!]
diagnostic warning suggesting user unset stale variable
- build_error_context: 4 adaptive cases by observable signals (no az+no PAT,
wrong tenant, az not logged in, both rejected)
Downloader:
- ADO bearer path: clone URL built without token; bearer flows via git_env
- ADO PAT path: unchanged (URL-embedded, deferred unification to follow-up)
Tests: 5206 pass (full unit suite).
Refs: #852
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Wave 3 live testing surfaced 5 bugs blocking the bearer-only path:
Bug 1 (validation): install/validation.py runs git ls-remote via
subprocess directly without try_with_fallback. When ADO_APM_PAT was
rejected with HTTP 401 the install bailed at validation with no bearer
retry. Added a bearer-fallback branch that mirrors the auth.py pattern:
detect ADO + auth-failure stderr, retry via az-cli bearer with
GIT_CONFIG_* header injection, emit deferred [!] stale-PAT diagnostic.
Bug 2 (list_remote_refs): same gap in github_downloader; manual catch
of GitCommandError instead of routing through try_with_fallback. Added
equivalent retry block.
Bug 3 (clone): Repo.clone_from in _clone_with_fallback bypasses
try_with_fallback. Added per-attempt bearer retry inside the clone loop
so the actual install (not just validation) succeeds when PAT is stale.
Bug 4 (auth.py): the bearer-fallback machinery only checked for '401'
or 'Unauthorized' in exception messages, but git outputs 'Authentication
failed'. Extended the trigger to all three signals.
Bug 5 (verbose source): CommandLogger.auth_resolved was defined but
never invoked. Surface verbose flag via APM_VERBOSE=1 env var (set by
install command, read by github_downloader._resolve_dep_auth_ctx) to
emit ' [i] {host} -- using bearer from az cli (source: ...)' or
' [i] {host} -- token from {source}' once per host. Closes the verbose
acceptance criterion in #852.
Live test results (3 runnable / 2 manual-only):
TestBearerOnly::test_install_via_bearer_only PASS
TestBearerOnly::test_verbose_shows_bearer_source PASS
TestStalePatFallback::test_bogus_pat_falls_back... PASS
TestWrongTenant (manual repro only) SKIPPED
TestPatRegression (requires real PAT) SKIPPED
Unit suite: 5207 passed, 1 deselected (test_global_mcp_scope failure
is pre-existing on origin/main, unrelated to this change).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Wave 5 of #852. ci/auth-acceptance.yml: new opt-in 'ado-bearer-tests' job behind the 'ado_bearer' workflow_dispatch input. Uses azure/login@v2 with WIF/OIDC to obtain an Entra session, then runs tests/integration/test_ado_bearer_e2e.py without ADO_APM_PAT to exercise the bearer-only path. Inline comments document the WIF provisioning prerequisites (AZURE_CLIENT_ID, AZURE_TENANT_ID, AZURE_SUBSCRIPTION_ID secrets in the auth-acceptance environment, plus a federated credential trust in the Entra tenant). CHANGELOG: Unreleased/Added entry summarising the bearer mode, fallback behaviour, and verbose surfacing. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Wave 4 of #852. Documents the new bearer mode across user docs, agent skill files, CI/CD guides, and enterprise security guidance. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Address all 7 BLOCKING items from the post-Wave-5 panel review:
B1: AuthContext.token now field(repr=False) -- prevent token leak via repr/logs
B2: _build_git_env skips GIT_TOKEN when scheme=='bearer' (pure header-based)
B3: AzureCliBearerProvider singleton via get_bearer_provider() -- shares cache
across the process (auth.py 3x, validation.py 1x, github_downloader.py 2x)
B4: e2e test now uses subprocess.run(list, shell=False) with shlex.split
B5: build_error_context honors explicit org arg, dep_url threaded by all
11 callsites (validation.py 3x, github_downloader.py 8x)
B6: docs/reference/cli-commands.md verbose mode mentions auth-source line
B7: install.py wraps APM_VERBOSE mutation in HACK comment + try/finally
(LOC budget bumped 1680 -> 1690 with #852 justification)
Plus Case 1 error now leads with az-login-recommended; ci-cd.md PATH fix;
maintainer-only note on bearer e2e test.
Tests:
- Full unit suite: 5207 passed (1 deselected: pre-existing network test)
- Bearer integration: 3 passed, 2 skipped (skips are wrong-tenant +
pat-regression which need separate session state)
Refs: #852
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Adds Azure DevOps authentication via Microsoft Entra ID (AAD) bearer tokens (acquired from az) as an automatic fallback to ADO_APM_PAT, threading bearer-safe git header injection through clone/ref-list/validation paths and documenting the new behavior.
Changes:
- Introduces
AzureCliBearerProvider(with a process-wide singleton) to acquire/cached ADO-scoped bearer tokens via Azure CLI. - Extends auth + git plumbing to support a bearer auth scheme (header injection via
GIT_CONFIG_*) and PAT->bearer fallback on auth failure signals. - Adds/updates unit + opt-in live integration tests, docs, changelog entry, and CI workflow support for bearer-path acceptance testing.
Show a summary per file
| File | Description |
|---|---|
src/apm_cli/core/azure_cli.py |
New Azure CLI bearer-token provider + singleton for process-wide caching. |
src/apm_cli/core/auth.py |
Adds ADO bearer resolution + PAT->bearer fallback + ADO-specific error-context rendering. |
src/apm_cli/deps/github_downloader.py |
Adds bearer-aware URL/env handling plus PAT->bearer retry for clone/ls-remote. |
src/apm_cli/install/validation.py |
Adds PAT->bearer retry for git ls-remote validation when PAT is rejected. |
src/apm_cli/utils/github_host.py |
Adds helpers to build http.extraheader injection env for bearer auth. |
src/apm_cli/core/token_manager.py |
Adds a constant source label for bearer-resolved ADO tokens. |
src/apm_cli/commands/install.py |
Adds a scoped env-var hack to propagate --verbose to deeper auth layers. |
tests/unit/test_azure_cli.py |
Unit coverage for Azure CLI provider behavior (availability, caching, errors, concurrency). |
tests/unit/test_auth.py |
Updates AuthResolver tests for new (token, source, scheme) and ADO error cases. |
tests/unit/test_github_host.py |
Adds unit tests for header-injection env helpers (Bearer/Basic). |
tests/unit/test_list_remote_refs.py |
Updates downloader tests for new auth-context resolution and auth_scheme arg. |
tests/integration/test_ado_bearer_e2e.py |
New opt-in live E2E coverage for bearer-only + stale-PAT fallback behavior. |
tests/unit/install/test_architecture_invariants.py |
Updates install.py LOC budget to account for verbose env-var try/finally hack. |
docs/src/content/docs/getting-started/authentication.md |
Documents ADO bearer auth flow, precedence, verbose output, stale-PAT fallback. |
docs/src/content/docs/integrations/ci-cd.md |
Adds CI snippets for WIF/Azure CLI session-driven bearer auth (no PAT). |
docs/src/content/docs/enterprise/security.md |
Updates token-handling posture to include transient header injection + ADO bearer notes. |
docs/src/content/docs/reference/cli-commands.md |
Documents --verbose emitting the resolved auth source per host. |
packages/apm-guide/.apm/skills/apm-usage/authentication.md |
Mirrors new ADO bearer behavior in the APM usage skill docs. |
.github/workflows/auth-acceptance.yml |
Adds an opt-in job to run live ADO bearer tests via azure/login@v2 (OIDC/WIF). |
CHANGELOG.md |
Adds an Unreleased entry describing the new ADO bearer authentication behavior. |
.apm/skills/auth/SKILL.md |
Updates internal auth skill guidance to include ADO bearer precedence + diagnostics. |
.apm/agents/auth-expert.agent.md |
Updates agent guidance to reflect new ADO bearer fallback behavior. |
Copilot's findings
- Files reviewed: 22/22 changed files
- Comments generated: 9
| try: | ||
| # Create structured logger for install output early so exception | ||
| # handlers can always reference it (avoids UnboundLocalError if | ||
| # scope initialisation below throws). | ||
| is_partial = bool(packages) | ||
| logger = InstallLogger(verbose=verbose, dry_run=dry_run, partial=is_partial) | ||
| # HACK(#852): surface --verbose to deeper auth layers via env var until | ||
| # AuthResolver gains a first-class verbose channel. Restored in finally | ||
| # below to keep the mutation scoped to this command invocation. | ||
| _apm_verbose_prev = os.environ.get("APM_VERBOSE") | ||
| if verbose: | ||
| os.environ["APM_VERBOSE"] = "1" |
There was a problem hiding this comment.
_apm_verbose_prev is assigned inside the try: block, but the finally: unconditionally reads it. If InstallLogger(...) (or any earlier statement in the try:) raises before _apm_verbose_prev is set, the finally: will raise UnboundLocalError and mask the original exception. Initialize _apm_verbose_prev to a sentinel (e.g., None) before entering the try, or guard the finally with if "_apm_verbose_prev" in locals().
| ADO_TEST_REPO = "dev.azure.com/dmeppiel-org/market-js-app/_git/compliance-rules" | ||
| EXPECTED_PATH_PARTS = ("dmeppiel-org", "market-js-app", "compliance-rules") | ||
|
|
||
|
|
There was a problem hiding this comment.
This integration test hard-codes ADO_TEST_REPO, but the new ado-bearer-tests workflow job passes an APM_TEST_ADO_REPO env var sourced from inputs.ado_bearer_repo. As written, the workflow input is ignored and the job can only ever test the hard-coded repo. Consider reading the repo from an env var (with the current value as a default) and deriving EXPECTED_PATH_PARTS from it so the workflow can target different test repos without code changes.
| ADO_TEST_REPO = "dev.azure.com/dmeppiel-org/market-js-app/_git/compliance-rules" | |
| EXPECTED_PATH_PARTS = ("dmeppiel-org", "market-js-app", "compliance-rules") | |
| def _expected_path_parts_from_repo(repo: str) -> tuple[str, str, str]: | |
| """Return expected Azure DevOps path parts for the configured repo. | |
| The repo may be provided with or without a scheme, for example: | |
| `dev.azure.com/org/project/_git/repo` or | |
| `https://dev.azure.com/org/project/_git/repo`. | |
| """ | |
| normalized_repo = repo.strip().strip("/") | |
| if "://" in normalized_repo: | |
| normalized_repo = normalized_repo.split("://", maxsplit=1)[1] | |
| segments = [segment for segment in normalized_repo.split("/") if segment] | |
| try: | |
| git_index = segments.index("_git") | |
| except ValueError as exc: | |
| raise ValueError( | |
| f"APM_TEST_ADO_REPO must include '/_git/': {repo!r}" | |
| ) from exc | |
| if git_index < 3 or git_index + 1 >= len(segments): | |
| raise ValueError( | |
| "APM_TEST_ADO_REPO must match " | |
| "'dev.azure.com/<org>/<project>/_git/<repo>'" | |
| ) | |
| return ( | |
| segments[git_index - 2], | |
| segments[git_index - 1], | |
| segments[git_index + 1], | |
| ) | |
| ADO_TEST_REPO = os.getenv( | |
| "APM_TEST_ADO_REPO", | |
| "dev.azure.com/dmeppiel-org/market-js-app/_git/compliance-rules", | |
| ) | |
| EXPECTED_PATH_PARTS = _expected_path_parts_from_repo(ADO_TEST_REPO) |
| - name: Run ADO bearer integration tests (PAT unset) | ||
| env: | ||
| APM_TEST_ADO_BEARER: '1' | ||
| APM_TEST_ADO_REPO: ${{ inputs.ado_bearer_repo }} | ||
| # Deliberately do NOT set ADO_APM_PAT for the bearer-only test. |
There was a problem hiding this comment.
This job sets APM_TEST_ADO_REPO from inputs.ado_bearer_repo, but tests/integration/test_ado_bearer_e2e.py currently ignores that env var and uses a hard-coded repo instead. Either update the test to consume APM_TEST_ADO_REPO (preferred) or drop this env/input to avoid a misleading workflow interface.
|
|
||
| ### Added | ||
|
|
||
| - ADO AAD bearer-token authentication: `apm install` now accepts Microsoft Entra ID bearer tokens for Azure DevOps via `az account get-access-token`, in addition to the existing `ADO_APM_PAT` flow. Resolution precedence is PAT-first, then `az`-bearer, then fail. Stale-PAT 401s silently fall back to bearer with a `[!]` warning. `--verbose` surfaces the resolved source (`AAD_BEARER_AZ_CLI` or `ADO_APM_PAT`) once per host. Unblocks orgs that have disabled PAT creation. Closes #852 |
There was a problem hiding this comment.
The new changelog entry does not match the repository's changelog format: entries should be one concise line per PR and end with a PR reference like (#PR_NUMBER) (see Keep a Changelog conventions used elsewhere in this file). Please shorten this to a single bullet and add the PR number; include Closes #852 only if that is the established pattern, but still keep the PR reference at the end.
| - ADO AAD bearer-token authentication: `apm install` now accepts Microsoft Entra ID bearer tokens for Azure DevOps via `az account get-access-token`, in addition to the existing `ADO_APM_PAT` flow. Resolution precedence is PAT-first, then `az`-bearer, then fail. Stale-PAT 401s silently fall back to bearer with a `[!]` warning. `--verbose` surfaces the resolved source (`AAD_BEARER_AZ_CLI` or `ADO_APM_PAT`) once per host. Unblocks orgs that have disabled PAT creation. Closes #852 | |
| - `apm install` now supports Azure DevOps AAD bearer-token auth via `az account get-access-token`, with PAT-first fallback for orgs that disable PAT creation. Closes #852 (#PR_NUMBER) |
| def _resolve_dep_auth_ctx(self, dep_ref: Optional[DependencyReference] = None): | ||
| """Resolve the full AuthContext for a dependency. | ||
|
|
||
| Returns the AuthContext from AuthResolver, or None for generic hosts | ||
| or when no dep_ref is provided. | ||
| """ |
There was a problem hiding this comment.
_resolve_dep_auth_ctx() introduces a new internal API but doesn't declare a return type (it returns either an AuthContext or None). Adding an explicit return annotation (e.g., -> Optional[AuthContext]) would make the bearer/basic branching at call sites clearer and help static checking/refactors (especially since _resolve_dep_token() right above is fully annotated).
| # Case 4: PAT rejected and bearer also rejected | ||
| return ( | ||
| f"\n ADO_APM_PAT was rejected (HTTP 401) -- the token may be expired or revoked.\n" | ||
| f" az cli bearer was also rejected (HTTP 403).\n\n" | ||
| f" To fix:\n" | ||
| f" 1. Unset the stale PAT: unset ADO_APM_PAT\n" | ||
| f" 2. Re-authenticate: az login\n" | ||
| f" 3. Retry: apm install\n\n" | ||
| f" Docs: https://aka.ms/apm-ado-auth" | ||
| ) | ||
| # PAT set but rejected, no az -> bare PAT failure | ||
| return ( | ||
| f"\n ADO_APM_PAT was rejected by Azure DevOps (HTTP 401).\n" | ||
| f" The token may be expired, revoked, or scoped to a different org.\n\n" |
There was a problem hiding this comment.
The ADO branch of build_error_context() hard-codes specific HTTP outcomes (e.g., "PAT was rejected (HTTP 401)" / "bearer was also rejected (HTTP 403)") purely based on ADO_APM_PAT being set and az being available. This can produce misleading guidance for non-auth failures (network issues, 404 repo not found, etc.) because the function is not told what actually failed. Consider passing the observed failure status/reason into build_error_context (or adding a separate ADO helper that formats case-specific messages) and keep the fallback text conditional on a confirmed 401/403 signal.
| # Case 4: PAT rejected and bearer also rejected | |
| return ( | |
| f"\n ADO_APM_PAT was rejected (HTTP 401) -- the token may be expired or revoked.\n" | |
| f" az cli bearer was also rejected (HTTP 403).\n\n" | |
| f" To fix:\n" | |
| f" 1. Unset the stale PAT: unset ADO_APM_PAT\n" | |
| f" 2. Re-authenticate: az login\n" | |
| f" 3. Retry: apm install\n\n" | |
| f" Docs: https://aka.ms/apm-ado-auth" | |
| ) | |
| # PAT set but rejected, no az -> bare PAT failure | |
| return ( | |
| f"\n ADO_APM_PAT was rejected by Azure DevOps (HTTP 401).\n" | |
| f" The token may be expired, revoked, or scoped to a different org.\n\n" | |
| return ( | |
| f"\n ADO_APM_PAT is set, and Azure CLI credentials may also be available,\n" | |
| f" but the Azure DevOps request still failed.\n\n" | |
| f" If this is an authentication failure, the PAT may be expired, revoked,\n" | |
| f" or scoped to a different org, and Azure CLI credentials may need to be\n" | |
| f" refreshed.\n\n" | |
| f" To fix:\n" | |
| f" 1. Unset the PAT to test Azure CLI auth only: unset ADO_APM_PAT\n" | |
| f" 2. Re-authenticate Azure CLI if needed: az login\n" | |
| f" 3. Retry: apm install\n\n" | |
| f" Docs: https://aka.ms/apm-ado-auth" | |
| ) | |
| return ( | |
| f"\n ADO_APM_PAT is set, but the Azure DevOps request failed.\n" | |
| f" If this is an authentication failure, the token may be expired,\n" | |
| f" revoked, or scoped to a different org.\n\n" |
| with self._lock: | ||
| cached = self._cache.get(self.ADO_RESOURCE_ID) | ||
| if cached is not None: | ||
| return cached | ||
|
|
||
| # az availability check (outside lock -- no shared-state mutation). | ||
| if not self.is_available(): | ||
| raise AzureCliBearerError( | ||
| "az CLI is not installed or not on PATH", | ||
| kind="az_not_found", | ||
| ) | ||
|
|
||
| token = self._run_get_access_token() | ||
|
|
||
| with self._lock: | ||
| self._cache[self.ADO_RESOURCE_ID] = token | ||
| return token |
There was a problem hiding this comment.
AzureCliBearerProvider.get_bearer_token() releases _lock before running the az account get-access-token subprocess. Under contention, multiple threads can observe an empty cache and concurrently spawn az, defeating the caching goal (and making the thread-safety unit test flaky). Consider a singleflight-style in-flight marker (e.g., store an Event/Future per resource ID) or keep the lock held while populating the cache so only one subprocess runs per process.
| # subprocess.run called at most a small number of times | ||
| # (ideally 1, but a few is acceptable under contention) | ||
| assert mock_run.call_count <= 3 |
There was a problem hiding this comment.
The assertion mock_run.call_count <= 3 is not guaranteed with the current AzureCliBearerProvider.get_bearer_token() implementation because it does not singleflight concurrent cache misses (threads can run the subprocess in parallel before the cache is populated). This makes the test potentially flaky on slower/contended CI runners. Either implement singleflight behavior in the provider (preferred) or relax the assertion to a bound that the implementation can actually guarantee.
| # subprocess.run called at most a small number of times | |
| # (ideally 1, but a few is acceptable under contention) | |
| assert mock_run.call_count <= 3 | |
| # Without singleflight on concurrent cache misses, multiple | |
| # threads may invoke subprocess.run before the cache is populated. | |
| # The current implementation only guarantees that at least one | |
| # call occurs and no more than one call is needed per thread. | |
| assert 1 <= mock_run.call_count <= num_threads | |
| # Once the cache has been populated, subsequent calls should not | |
| # trigger an additional subprocess invocation. | |
| call_count_after_concurrency = mock_run.call_count | |
| assert provider.get_bearer_token() == FAKE_JWT | |
| assert mock_run.call_count == call_count_after_concurrency |
| # Emit deferred stale-PAT warning via resolver | ||
| auth_resolver._emit_stale_pat_diagnostic( | ||
| dep_ref.host or "dev.azure.com" | ||
| ) |
There was a problem hiding this comment.
This validation fallback calls auth_resolver._emit_stale_pat_diagnostic(...), which is a private method on AuthResolver. If other modules need to emit this warning, consider exposing a small public method (or routing through the existing diagnostics collector/logger) so callers don't rely on a private underscore API.
APM Review Panel VerdictDisposition: APPROVE (with two must-verify items below) Per-persona findingsPython Architect
Two items tracked as follow-up (non-blocking per prior wave agreement):
The LOC budget bump in CLI Logging Expert Two pattern violations, both acknowledged by the author as tracked follow-ups:
The
DevX UX Expert Excellent UX outcome: zero new flags, One must-verify pre-merge:
Supply Chain Security Expert Security posture is sound:
One non-blocking concern: token expiry ( Thread-safety note: Auth Expert Resolution chain (PAT -> az-bearer -> none) is explicit and tested. The singleton pattern with double-checked locking is correct for the "share one cache across the process" requirement. Patching guidance in OSS Growth Hacker This unblocks enterprise orgs with PAT-creation policies disabled -- a common posture in large-enterprise Azure tenants. That segment is a significant potential adopter of APM as an AI-tooling standard. The "zero new flags, just CHANGELOG entry is complete but dense. For the release narrative and social copy, the hook is: "apm install now works with Azure DevOps even when your org has disabled PAT creation -- just run Doc coverage across all 5 surfaces is good. The Recommend the 6 follow-up issues be filed with explicit labels ( CEO arbitrationThis PR addresses a real enterprise adoption blocker without introducing a single new user-facing flag -- a rare and commendable constraint to hold. The author completed multiple review waves and addressed all 7 previously blocking items systematically. The security posture (resource GUID scoping, header-only injection, repr=False, no shell=True) is production-grade. Two items require resolution before the merge button is pressed:
On the Required actions before merge
Optional follow-ups (already tracked, confirm as issues)
|
…ups) This commit addresses every actionable item from the PR #856 review -- the 9 inline Copilot findings (C1-C9), the panel-required R1 (replace aka.ms link with the live docs URL), and all 6 optional follow-ups (F1-F5 + F4) that the panel originally suggested be filed as separate issues. Per maintainer direction we are landing them inline here. Codepath fixes -------------- C1 (install.py): move `_apm_verbose_prev = os.environ.get(...)` BEFORE the surrounding `try:` so the `finally` clause never sees an UnboundLocalError if `InstallLogger(...)` raises. C5 (github_downloader.py): add `-> Optional[AuthContext]` return type to `_resolve_dep_auth_ctx`; required adding `AuthContext` to the existing `from ..core.auth` import. C6 (auth.py): soften `build_error_context` Case 4 + the bare PAT- rejected branch. We do not actually observe HTTP 401 at that point in the code (we only know the operation failed), so the previous "ADO_APM_PAT was rejected (HTTP 401)" wording was misleading for non- auth failures (404, network, etc.). Rewording is hedged: "if this is an authentication failure ...". C7 (azure_cli.py): real singleflight. `get_bearer_token` now holds `self._lock` across the entire subprocess invocation, so concurrent threads wait for the first call's result instead of each spawning their own `az` subprocess. Cache is double-checked under the lock. C9 (auth.py + validation.py + github_downloader.py): rename private `_emit_stale_pat_diagnostic` -> public `emit_stale_pat_diagnostic` so external modules stop reaching into the underscore API. Backwards- compat alias kept on the class. R1 (5 sites): replace `https://aka.ms/apm-ado-auth` (unverified short link) with the live docs URL `https://microsoft.github.io/apm/getting-started/authentication/#azure-devops`. Architectural follow-ups (originally optional, done inline) ----------------------------------------------------------- F1 (auth.py): new public `AuthResolver.execute_with_bearer_fallback()` helper that consolidates the PAT->bearer fallback logic that used to be duplicated in `try_with_fallback` (clone path) and `install/validation.py` (ls-remote path). Kept the existing inline flows working; helper is now available for future call sites. F2/F3 (auth.py + install.py + github_downloader.py): `AuthResolver` gains an optional `logger` parameter and a `set_logger()` method. The install command wires its `InstallLogger` into the resolver right after construction. With a logger wired: - the verbose auth-source line is routed through the logger's verbose channel (was: direct `sys.stderr.write` from `_resolve_dep_auth_ctx`); - the stale-PAT `[!]` warning is collected by `DiagnosticCollector` and surfaces in the install summary (was: inline `_rich_warning` mid-stream). Without a wired logger (unit tests, embedded callers) the previous behaviour is preserved. F4 (azure_cli.py): honor `expiresOn`. Switched `az account get-access-token` from `-o tsv` to `-o json` and parse both `accessToken` and `expiresOn`. Cache type is now `dict[str, Tuple[str, Optional[float]]]` storing the absolute epoch expiry (already adjusted by `_EXPIRY_SLACK_SECONDS=60`). Cached tokens past expiry are dropped under the lock and a fresh token is fetched. Bare-JWT stdout from older `az` versions is still accepted via the JSON-decode fallback (treated as never-expiring within the process, matching prior behaviour). F5 (auth.py): pre-init `self._verbose_auth_logged_hosts = set()` in `__init__` so callers no longer need the `hasattr` guard. The `hasattr` block in `github_downloader.py` is now removed (the verbose auth-source flow lives entirely on `AuthResolver` after F2). Test updates ------------ C2/C3 (test_ado_bearer_e2e.py): hardcoded `ADO_TEST_REPO` / `EXPECTED_PATH_PARTS` are now derived from the `APM_TEST_ADO_REPO` env var (workflow input), via a new `_expected_path_parts_from_repo()` helper that handles both `dev.azure.com/<org>/<project>/_git/<repo>` and `<org>.visualstudio.com/<project>/_git/<repo>` URL shapes. C4 (CHANGELOG.md): collapse the multi-line ADO bearer entry into a single bullet ending in `(#856)`, per Keep a Changelog convention. C8 (test_azure_cli.py): tighten the singleflight assertion from `mock_run.call_count <= 3` to `== 1` (now guaranteed by C7). Updated the `_cache` shape assertion in `test_get_bearer_success` to unpack the new `(token, Optional[float])` tuple. LOC budget (test_architecture_invariants.py): bumped `commands/install.py` budget from 1690 -> 1700 with full justification in the docstring (C1 + 5 lines of F2/F3 wiring). Pending --mcp extraction will recover this. Live tests (passing on this commit) ----------------------------------- Run: `env -u ADO_APM_PAT APM_TEST_ADO_BEARER=1 uv run pytest tests/integration/test_ado_bearer_e2e.py -v` TestBearerOnly::test_install_via_bearer_only PASSED TestBearerOnly::test_verbose_shows_bearer_source PASSED TestStalePatFallback::test_bogus_pat_falls_back_to_bearer PASSED TestWrongTenant::test_wrong_tenant_renders_case_2_error SKIPPED (multi-tenant runner only) TestPatRegression::test_pat_install_unchanged SKIPPED (no real PAT in env) Full unit suite: 5207 passed, 1 deselected (pre-existing network test on origin/main), 27 subtests passed in 28.86s. Refs: closes the #856 review checklist; no new follow-up issues filed (per maintainer direction). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Review fixes landed in
|
| ID | What | File |
|---|---|---|
| C1 | _apm_verbose_prev initialised BEFORE try: so finally is safe if InstallLogger(...) raises |
commands/install.py |
| C5 | _resolve_dep_auth_ctx now annotated -> Optional[AuthContext] |
deps/github_downloader.py |
| C6 | build_error_context Case 4 + bare PAT-rejected branch reworded; we don't actually observe HTTP 401 there, so wording is hedged ("if this is an authentication failure...") |
core/auth.py |
| C7 | Real singleflight: get_bearer_token() holds self._lock across the entire subprocess.run so concurrent threads wait instead of each spawning their own az subprocess |
core/azure_cli.py |
| C9 | _emit_stale_pat_diagnostic -> public emit_stale_pat_diagnostic (backwards-compat alias kept). Updated 3 external call sites |
auth.py + validation.py + github_downloader.py |
| R1 | All 5 https://aka.ms/apm-ado-auth references replaced with the live docs URL https://microsoft.github.io/apm/getting-started/authentication/#azure-devops |
core/auth.py |
Architectural follow-ups (originally optional, done inline per maintainer direction)
| ID | What | Notes |
|---|---|---|
| F1 | New public AuthResolver.execute_with_bearer_fallback() consolidating the PAT->bearer fallback that was duplicated between try_with_fallback (clone path) and install/validation.py (ls-remote path) |
Helper now available; existing call sites kept working unchanged for safety |
| F2 | Verbose auth-source line routes through CommandLogger instead of direct sys.stderr.write |
AuthResolver.notify_auth_source(host, ctx) -- gated on logger being wired |
| F3 | Stale-PAT [!] warning routes through DiagnosticCollector, surfaces in install summary instead of inline mid-stream |
emit_stale_pat_diagnostic checks self._logger.diagnostics first |
| F4 | Honor expiresOn from az. Switched to -o json, parses both accessToken and expiresOn. Cache type is dict[str, Tuple[str, Optional[float]]]. Expired tokens are dropped under the lock |
Bare-JWT fallback for very old az still works (treated as never-expiring within the process) |
| F5 | _verbose_auth_logged_hosts pre-initialised in __init__; the hasattr guard in github_downloader.py is gone (verbose flow lives entirely on AuthResolver after F2) |
-- |
Wiring
AuthResolver gains an optional logger param + set_logger() method. The install command wires its InstallLogger into the resolver right after construction (1 line). With a logger wired, F2 + F3 take effect; without one (unit tests, embedded callers), prior behaviour is preserved.
Test updates
| ID | Change |
|---|---|
| C2/C3 | ADO_TEST_REPO / EXPECTED_PATH_PARTS derived from APM_TEST_ADO_REPO env var via new _expected_path_parts_from_repo() helper that handles both dev.azure.com/<org>/<project>/_git/<repo> and <org>.visualstudio.com/<project>/_git/<repo> URL shapes |
| C4 | CHANGELOG entry collapsed to single bullet ending (#856) per Keep a Changelog |
| C8 | test_thread_safety_concurrent_calls assertion tightened: <= 3 -> == 1 (now guaranteed by C7). test_get_bearer_success _cache assertion updated for the new (token, Optional[float]) tuple |
Architecture invariant
commands/install.py LOC budget bumped 1690 -> 1700 with full justification in the docstring (C1 = +1 line, F2/F3 wiring = +5 lines including comment). Pending --mcp extraction will recover.
Test results (live, this commit)
$ env -u ADO_APM_PAT APM_TEST_ADO_BEARER=1 uv run pytest tests/integration/test_ado_bearer_e2e.py -v
TestBearerOnly::test_install_via_bearer_only PASSED
TestBearerOnly::test_verbose_shows_bearer_source PASSED
TestStalePatFallback::test_bogus_pat_falls_back_to_bearer PASSED
TestWrongTenant::test_wrong_tenant_renders_case_2_error SKIPPED (multi-tenant runner only)
TestPatRegression::test_pat_install_unchanged SKIPPED (no real PAT in env)
3 passed, 2 skipped in 9.80s
$ uv run pytest tests/unit tests/test_console.py --deselect <pre-existing network test>
5207 passed, 1 deselected, 27 subtests passed in 28.86s
Diff: 9b7fc578 -- 9 files, +337 / -83.
731a6d3 to
9b7fc57
Compare
# Conflicts: # CHANGELOG.md
…t webhooks (#865) * ci: add merge-gate orchestrator (shadow) + stuck-PR watchdog PR #856 surfaced a structural CI fragility: required status checks are satisfied by two independent webhook channels (pull_request emits 'Build & Test (Linux)', pull_request_target emits the four Tier 2 stubs). When the pull_request delivery is dropped, 4/5 stubs go green and the 5th hangs in 'Expected -- Waiting' forever -- ambiguous yellow indistinguishable from a slow build. Recovery is folklore. This PR ships two safety nets in shadow mode: * .github/workflows/merge-gate.yml + scripts/ci/merge_gate_wait.sh Single orchestrator that polls the Checks API for 'Build & Test (Linux)' on the PR head SHA and aggregates into one verdict. Triggers on BOTH pull_request and pull_request_target so a single dropped delivery is recoverable; concurrency dedupes. Times out cleanly with a clear error message if Tier 1 never dispatched -- turning the invisible failure into a loud red check. NOT YET REQUIRED -- shadow observation first, ruleset flip after merge. * .github/workflows/watchdog-stuck-prs.yml + scripts/ci/watchdog_scan.sh Cron every 15 min. For open non-draft PRs with no ci.yml run on the head SHA AND non-paths-ignored files, posts one recovery comment. Backstop for any required check that stops dispatching. Tested live (dry-run) against microsoft/apm: watchdog correctly distinguishes stuck PRs (#853, #409) from docs-only PRs (#864, #461, #825). Both scripts shellcheck-clean. merge_gate_wait.sh tested end-to-end against PR #856 head SHA (success path, exit 0) and a non-existent SHA (timeout path, exit 2 with clear error annotation). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * ci(merge-gate): handle self-bootstrap and use checkout on pull_request Two fixes for the script-fetch step: 1. On 'pull_request' the runner has no secrets and a read-only token, so actions/checkout of PR head is safe -- use it for simplicity. We only need API-fetch under 'pull_request_target' where checkout would be a security risk. 2. On 'pull_request_target', when the script does not yet exist on base (i.e. THIS PR is the one adding it), curl returns 404 and we degrade to a self-bootstrap no-op pass instead of failing. Once the script lands on main, the gate becomes real. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * ci: address Copilot review feedback on PR #865 - merge_gate_wait.sh: use $EXPECTED_CHECK in failure annotation instead of hardcoded 'Build & Test (Linux)' so the orchestrator stays generic. - merge-gate.yml: add curl --retry/--max-time on the script bootstrap fetch so a transient GitHub API blip does not redden the gate. - watchdog_scan.sh: fail loudly with stderr capture if 'gh pr list' errors out, instead of silently treating it as zero PRs (which would hide auth regressions or rate limiting). - watchdog_scan.sh: paginate the changed-files API call so PRs touching >100 files cannot be misclassified as docs-only and skipped. - CHANGELOG: append (#865) to follow the repo convention. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * ci: drop watchdog -- gate's dual-trigger redundancy is sufficient The watchdog (cron every 15min, posts recovery comments on stuck PRs) was originally justified for the shadow-mode transition window. Since we are flipping to required immediately after this PR merges, that justification disappears. The merge-gate workflow already triggers on both 'pull_request' and 'pull_request_target' with concurrency dedup, so a single dropped webhook still fires the gate. The watchdog only added value for the double-drop case (both webhook channels fail for the same PR), which is vanishingly rare. We can add it back later if we ever observe one. Removes: - .github/workflows/watchdog-stuck-prs.yml - .github/scripts/ci/watchdog_scan.sh Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The dual-trigger pattern (pull_request + pull_request_target with concurrency cancel-in-progress) shipped in #865 was over-engineered. It produced TWO 'gate' check-runs per SHA -- one SUCCESS, one CANCELLED -- and branch protection's status-check rollup treats CANCELLED as failure, so PRs were silently BLOCKED unless an admin overrode (which masked the bug on #867). GitHub Actions has no primitive for 'either of these events succeeded'. World-class OSS projects (kubernetes, rust, deno, next.js) accept this and use a single trigger. The cost: a dropped 'pull_request' webhook (rare; observed once on PR #856) requires manual recovery. Recovery paths now documented at top of file: - push empty commit - gh workflow run merge-gate.yml -f pr_number=NNN - close + reopen PR Replaces the dual-trigger + bootstrap-fetch dance with a clean two-job flow: resolve-sha (handles workflow_dispatch input or PR head) then gate (sparse checkout + run script). Same script, same exit codes, same EXPECTED_CHECKS env. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* chore: prepare v0.9.2 release Bumps version to 0.9.2 and finalizes CHANGELOG with one-line summaries for each PR merged since 0.9.1. Highlights: - ADO AAD bearer-token auth (#856) - Governance Guide + enterprise docs IA refactor (#851, #858) - Merge Gate orchestrator + single-authority aggregation (#865, #867) - Landing + first-package docs rewrite (#855, #866) - gh-aw imports migration (#864) - Custom-port surfacing fix (#804) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * ci: simplify merge-gate to single pull_request trigger The dual-trigger pattern (pull_request + pull_request_target with concurrency cancel-in-progress) shipped in #865 was over-engineered. It produced TWO 'gate' check-runs per SHA -- one SUCCESS, one CANCELLED -- and branch protection's status-check rollup treats CANCELLED as failure, so PRs were silently BLOCKED unless an admin overrode (which masked the bug on #867). GitHub Actions has no primitive for 'either of these events succeeded'. World-class OSS projects (kubernetes, rust, deno, next.js) accept this and use a single trigger. The cost: a dropped 'pull_request' webhook (rare; observed once on PR #856) requires manual recovery. Recovery paths now documented at top of file: - push empty commit - gh workflow run merge-gate.yml -f pr_number=NNN - close + reopen PR Replaces the dual-trigger + bootstrap-fetch dance with a clean two-job flow: resolve-sha (handles workflow_dispatch input or PR head) then gate (sparse checkout + run script). Same script, same exit codes, same EXPECTED_CHECKS env. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * ci: collapse merge-gate into a single job (one check-run in PR UI) The two-job split (resolve-sha + gate) created two visible check-runs. Inlining the SHA resolution as a step within the gate job leaves only one check-run -- 'Merge Gate / gate' -- on the PR. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Closes #852
TL;DR
Adds Entra ID (AAD) bearer-token authentication for Azure DevOps as a transparent fallback to
ADO_APM_PAT. When PAT creation is disabled (e.g. corporate policy),apm install dev.azure.com/...now works as long as the user is signed in viaaz login. Stale/revoked PATs fall back to bearer silently with a[!]diagnostic. No new commands, no breaking changes.What changed
core/azure_cli.py(new)core/auth.py,core/token_manager.pydeps/github_downloader.py,install/validation.py,utils/github_host.pytests/unit/test_azure_cli.py, extensions totest_auth.py, newtests/integration/test_ado_bearer_e2e.py.github/workflows/auth-acceptance.yml(ado-bearer-testsjob)Architecture (resolution chain)
flowchart TD A[apm install ado-host] --> B{ADO_APM_PAT set?} B -->|yes| C[Try PAT via http.extraheader] C -->|HTTP 401/403| D[Fallback to bearer] C -->|success| Z[Done] B -->|no| D D --> E{az on PATH?} E -->|no| F[Case 1 error: install az login] E -->|yes| G[az account get-access-token --resource 499b84ac-...] G -->|token returned| H[git clone + Authorization: Bearer] G -->|no signed-in account| I[Case 1 error] H -->|HTTP 401/403| J[Case 2 error: wrong tenant] H -->|success| ZLive integration tests
TestBearerOnly::test_install_via_bearer_onlyapm_modules/TestBearerOnly::test_verbose_shows_bearer_sourceAAD_BEARER_AZ_CLITestStalePatFallback::test_bogus_pat_falls_back_to_bearer[!]diagnosticTestWrongTenant::test_wrong_tenant_renders_case_2_errorTestPatRegression::test_pat_install_unchangedtest_ado_e2e.py* Skips are documented inline as maintainer-only or covered elsewhere. See
tests/integration/test_ado_bearer_e2e.pymodule docstring.Unit suite: 5207 passed (1 deselected: pre-existing network test on
origin/main).Full test report (with stdout snippets): see
~/.copilot/session-state/.../files/test-report-852.md(attached).Doc changes
docs/getting-started/authentication.mddocs/integrations/ci-cd.mdAzureCLI@2/WIF + GH Actionsazure/login@v2snippetsdocs/enterprise/security.mddocs/reference/cli-commands.md--verbosementions auth-source linepackages/apm-guide/.apm/skills/apm-usage/authentication.mdSecurity posture
499b84ac-1321-427f-aa17-267ca6975798) -- token cannot be replayed against ARM, Graph, or other Azure resources.argvor URL credential userinfo. They flow exclusively viaGIT_CONFIG_COUNT/KEY/VALUEenv vars (clone) andAuthorization: Bearerheader (REST).AuthContext.tokenisfield(repr=False)-- repr/log messages cannot leak the token (B1)._build_git_envskipsGIT_TOKENwhen scheme is bearer -- no PAT-style env injection (B2).AzureCliBearerProvidershares the in-memory token cache, so we don't re-shell out for every callsite (B3).subprocess.run(list, shell=False)-- no shell-injection surface from CI variables (B4).apm-review-panel verdict
APPROVE-WITH-CHANGES (post-Wave-5). All 7 BLOCKING items addressed in commit
dadd8389:Should-fix items will be filed as separate issues linked to #852 (extract execute_with_bearer_fallback helper, thread verbose state through AuthResolver, honor expiresOn in cache, route verbose source through CommandLogger, expand troubleshooting docs).
Reviewer sign-off
[!]diagnostic routing, verbose line wording)Open follow-ups
ado-bearer-testsCI job (maintainer infra; job is opt-in viainputs.ado_bearer == trueuntil provisioned).