fix: prevent token refresh storm in CheckAccess v2 and add rollout observability #449
Merged
kodiakhq[bot] merged 21 commits on May 19, 2026
9a9f916 to
6ca059a
Compare
weinong
reviewed
May 8, 2026
weinong
reviewed
May 8, 2026
ed35098 to
6ca059a
Compare
weinong
reviewed
May 9, 2026
| @@ -0,0 +1,24 @@ |
| # Guard |
Contributor
please scrub this file to remove aks internals
Contributor
Author
Removed CCP, OBO service, and Helm chart references from CLAUDE.md and the command file.
weinong
reviewed
May 9, 2026
| ### Prerequisites |
| - Azure subscription: `AKS INT/Staging Test` (`az account set --subscription 'AKS INT/Staging Test'`) |
Contributor
Author
Fixed, + obfuscated the rest of variables.
added 17 commits
May 19, 2026 00:10
…servability
The tokenProviderAdapter.GetToken() accepted tokens with ExpiresOn=0 or
already-expired timestamps, causing the Azure SDK's BearerTokenPolicy to
refresh the token on every request. Under concurrent load this amplified
into ~4x /authz/token calls per checkAccess request, overwhelming the
API server (ICM 793110894).
Changes:
- Validate ExpiresOn in tokenProviderAdapter - reject zero/expired tokens
instead of returning them to the SDK cache
- Store PDP scope in the adapter for future scope-aware token acquisition
- Add api_version label ("v1"/"v2") to all checkAccess Prometheus metrics
for rollout monitoring
- Promote v1/v2 routing decision and v2 batch logs from V(5)/V(7) to V(0)
so they appear in production logs and Kusto
Signed-off-by: Artem Kolomeetc <akolomeetc@microsoft.com>
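To make the failure mode in this commit message concrete, here is a minimal sketch of why a zero or expired `ExpiresOn` defeats caching, and what an expiry check might look like. The `token` and `cache` types are hypothetical simplifications, not the Azure SDK's `BearerTokenPolicy` or Guard's `tokenProviderAdapter`; the only behavior modeled is "refetch when the cached token is not valid now".

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// token is a hypothetical stand-in for an SDK access token.
type token struct {
	Value     string
	ExpiresOn time.Time
}

// validateExpiry rejects tokens whose expiry is unset or not in the future,
// so they are never handed to a caching layer.
func validateExpiry(t token, now time.Time) error {
	if t.ExpiresOn.IsZero() {
		return errors.New("token has no expiry (ExpiresOn is zero)")
	}
	if !t.ExpiresOn.After(now) {
		return fmt.Errorf("token already expired at %s", t.ExpiresOn.Format(time.RFC3339))
	}
	return nil
}

// cache is a simplified model of a bearer-token cache: it refetches whenever
// the cached token is not valid at the time of the request.
type cache struct {
	cached  token
	fetch   func() token
	fetches int
}

func (c *cache) get(now time.Time) token {
	if !c.cached.ExpiresOn.After(now) {
		c.cached = c.fetch()
		c.fetches++
	}
	return c.cached
}

// countFetches sends n requests through a fresh cache and reports how many
// times the underlying token source was called.
func countFetches(n int, now time.Time, fetch func() token) int {
	c := &cache{fetch: fetch}
	for i := 0; i < n; i++ {
		c.get(now)
	}
	return c.fetches
}

// demo returns the fetch counts for a zero-expiry source and a valid source,
// plus whether validateExpiry rejects the zero-expiry token.
func demo() (zeroExpiryFetches, validFetches int, rejected bool) {
	now := time.Now()
	bad := func() token { return token{Value: "t"} } // ExpiresOn left at zero
	good := func() token { return token{Value: "t", ExpiresOn: now.Add(time.Hour)} }
	return countFetches(100, now, bad), countFetches(100, now, good), validateExpiry(bad(), now) != nil
}

func main() {
	zero, valid, rejected := demo()
	fmt.Println("fetches with zero expiry:", zero)
	fmt.Println("fetches with valid expiry:", valid)
	fmt.Println("zero-expiry token rejected:", rejected)
}
```

In this model a token without a usable expiry causes one fetch per request, while a token with a future expiry is fetched once. Rejecting the bad token at the adapter surfaces the problem as an error instead of a silent refresh loop.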
Per-batch logging at V(0) would be too noisy in production. Keep batch start at V(7) and batch success at V(5), matching v1 behavior. The v1/v2 routing decision in CheckAccess() remains at V(0) for rollout tracking.
Signed-off-by: Artem Kolomeetc <akolomeetc@microsoft.com>

- Remove double-counting of checkAccessTotal and checkAccessDuration metrics on io.ReadAll failure path (lines 623-624 already counted the request before the body read)
- Demote token acquisition logs from V(0) to V(5) to avoid log volume issues; rollout signal is covered by api_version Prometheus label
Signed-off-by: Artem Kolomeetc <akolomeetc@microsoft.com>
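The double-counting fix can be illustrated with a small sketch. This is a stdlib-only model, not Guard's code: `counters` is a hypothetical stand-in for labeled Prometheus counters, and recording `checkAccessFailed` on the read-failure path is an assumption. The point it shows is that the request is counted once before the body read, so the error path must not count it again.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"strings"
)

// counters is a hypothetical stand-in for labeled Prometheus counters,
// keyed by metric name plus the api_version label.
type counters map[string]int

func (c counters) inc(metric, apiVersion string) {
	c[metric+`{api_version="`+apiVersion+`"}`]++
}

// failingReader simulates a response body whose read fails.
type failingReader struct{}

func (failingReader) Read([]byte) (int, error) {
	return 0, errors.New("connection reset")
}

// handleResponse counts the request exactly once up front. On a body read
// failure it records only the failure, not a second total.
func handleResponse(c counters, apiVersion string, body io.Reader) error {
	c.inc("checkAccessTotal", apiVersion)
	if _, err := io.ReadAll(body); err != nil {
		c.inc("checkAccessFailed", apiVersion) // assumed failure metric on this path
		return fmt.Errorf("reading response body: %w", err)
	}
	c.inc("checkAccessSucceeded", apiVersion)
	return nil
}

// run processes one successful and one failing response for api_version "v2".
func run() counters {
	c := counters{}
	_ = handleResponse(c, "v2", strings.NewReader(`{"ok":true}`))
	_ = handleResponse(c, "v2", failingReader{})
	return c
}

func main() {
	c := run()
	for _, name := range []string{"checkAccessTotal", "checkAccessSucceeded", "checkAccessFailed"} {
		key := name + `{api_version="v2"}`
		fmt.Println(key, c[key])
	}
}
```

With two requests, the total stays at two even though one of them failed while reading the body.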
Add realistic mock PDP service for testing Guard's CheckAccess v2 flow end-to-end in staging clusters without access to production CCP/OBO infrastructure.
Mock server changes (tests/mock-server/):
- Add v2 PDP endpoint (/checkaccess/v2) with realistic response model including roleAssignment details that Guard parses
- Add OBO authztoken endpoint (/v1/<ccpid>/authztoken) matching production token exchange pattern
- Support dual-mode: HTTP on :8080 (OBO tokens) + HTTPS on :8443 (PDP) since Azure SDK requires TLS for authenticated requests
- Add v2 request/response types matching checkaccess-v2-go-sdk
Staging test scripts (test/staging/):
- deploy-guard-v2-test.sh: deploys Guard with v2 enabled + mock PDP
- test-guard-v2.sh: sends SubjectAccessReview and checks logs
- guard-v2-test.yaml: Guard deployment manifest
- mock-pdp-service.yaml: mock PDP deployment manifest
Signed-off-by: Artem Kolomeetc <akolomeetc@microsoft.com>

Move test/staging/ to tests/staging/ to keep all test infrastructure (k6, mock-server, staging) under a single tests/ directory.
Signed-off-by: Artem Kolomeetc <akolomeetc@microsoft.com>

- CLAUDE.md: project architecture, known gotchas, available commands
- .claude/commands/guard-staging-test.md: e2e test runbook for deploying Guard with CheckAccess v2 against mock PDP in staging
Signed-off-by: Artem Kolomeetc <akolomeetc@microsoft.com>

Remove CCP, OBO service, and Helm chart references from public files. Architecture details moved to CLAUDE.local.md (not committed).
Signed-off-by: Artem Kolomeetc <akolomeetc@microsoft.com>
Signed-off-by: Artem Kolomeetc <akolomeetc@microsoft.com>
V1 and V2 CheckAccess APIs require different token audiences:
- V1: management.core.windows.net (ARM) - routed through ARM proxy
- V2: authorization.azure.net (PDP) - calls PDP directly
The OBO /authztoken endpoint defaults to ARM-audience tokens. When Guard sends these to PDP directly (v2), PDP rejects with 401 because the token audience doesn't match.
Create a separate aksTokenProvider for v2 that includes the PDP resource in the OBO request body. This tells OBO to request a PDP-audience token from AAD instead of the default ARM-audience token. V1 continues using the original provider with ARM-audience tokens.
Requires companion OBO service change (aks-rp PR #15683683) to accept the resource field in TokenRequestV1.
Signed-off-by: Artem Kolomeetc <akolomeetc@microsoft.com>
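As a rough illustration of the per-version request body described in this commit, the sketch below serializes an OBO token request with and without a resource. Only the `resource` field name comes from the commit message; the `tokenRequest` struct, the `oboRequestBody` helper, and the exact resource string (including the `https://` prefix) are assumptions made for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// pdpResource is an assumed audience string for the v2 (PDP) path.
const pdpResource = "https://authorization.azure.net"

// tokenRequest is a hypothetical request body. Leaving Resource empty omits
// the field, which lets the service fall back to its ARM-audience default.
type tokenRequest struct {
	Resource string `json:"resource,omitempty"`
}

// oboRequestBody returns the JSON body for the given CheckAccess API version.
func oboRequestBody(apiVersion string) (string, error) {
	req := tokenRequest{}
	if apiVersion == "v2" {
		req.Resource = pdpResource
	}
	b, err := json.Marshal(req)
	if err != nil {
		return "", fmt.Errorf("marshaling token request: %w", err)
	}
	return string(b), nil
}

func main() {
	for _, v := range []string{"v1", "v2"} {
		body, err := oboRequestBody(v)
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Println(v, body)
	}
}
```

Keeping the v1 body unchanged means the existing ARM-audience path is untouched, which matches the commit's intent of using a separate provider only for v2.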
The mock OBO token handler now parses the request body and logs the resource field, matching the updated aksTokenProvider that sends resource for v2 PDP-audience tokens. Defaults to ARM if not provided.
Signed-off-by: Artem Kolomeetc <akolomeetc@microsoft.com>
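A handler along these lines could implement the parse-log-default behavior described above. It is a sketch under assumptions: the default resource string, the response shape, and the request path are hypothetical and are not taken from the mock server in this PR.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"log"
	"net/http"
	"net/http/httptest"
	"strings"
)

// defaultResource is an assumed ARM audience used when no resource is sent.
const defaultResource = "https://management.core.windows.net/"

type tokenRequest struct {
	Resource string `json:"resource"`
}

// resourceFrom decodes the body and falls back to the ARM default when the
// body is empty, malformed, or has no resource field.
func resourceFrom(body io.Reader) string {
	var req tokenRequest
	if err := json.NewDecoder(body).Decode(&req); err != nil || req.Resource == "" {
		return defaultResource
	}
	return req.Resource
}

// tokenHandler logs the requested resource and echoes it in a hypothetical
// response so callers can see which audience the mock token is for.
func tokenHandler(w http.ResponseWriter, r *http.Request) {
	resource := resourceFrom(r.Body)
	log.Printf("mock OBO token request: resource=%s", resource)
	w.Header().Set("Content-Type", "application/json")
	_ = json.NewEncoder(w).Encode(map[string]string{
		"token":    "mock-token",
		"resource": resource,
	})
}

// call drives the handler in-process and returns the echoed resource.
func call(body string) string {
	req := httptest.NewRequest(http.MethodPost, "/v1/example/authztoken", strings.NewReader(body))
	rec := httptest.NewRecorder()
	tokenHandler(rec, req)
	var out map[string]string
	_ = json.Unmarshal(rec.Body.Bytes(), &out)
	return out["resource"]
}

func main() {
	fmt.Println(call(`{"resource":"https://authorization.azure.net"}`))
	fmt.Println(call(``))
}
```

Treating a missing or unparsable body as "use the default" keeps older v1 callers working without any change on their side.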
Remove production OBO service URL pattern and internal service references from the command file description.
Signed-off-by: Artem Kolomeetc <akolomeetc@microsoft.com>
Signed-off-by: Artem Kolomeetc <akolomeetc@microsoft.com>
Signed-off-by: Artem Kolomeetc <akolomeetc@microsoft.com>
Replace subscription name, resource group, ACR, and email with generic placeholders to remove author identity from the script.
Signed-off-by: Artem Kolomeetc <akolomeetc@microsoft.com>

Add token-proxy (tests/mock-server/token-proxy/) - a thin OBO replacement that gets real PDP-audience tokens from IMDS via a managed identity. This enables testing Guard's full CheckAccess v2 code path against the real Azure PDP endpoint without requiring CCP infrastructure.
Add ADX alert rule (tests/staging/checkaccess-v2-alert.yaml) to monitor Guard CCP logs for CheckAccess v2 failures during rollout.
Signed-off-by: Artem Kolomeetc <akolomeetc@microsoft.com>
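For readers unfamiliar with IMDS-based token acquisition, the sketch below builds the kind of request a token-proxy might send to the Azure Instance Metadata Service. The endpoint, `api-version=2018-02-01`, and the `Metadata: true` header follow the documented managed-identity token API; the helper names and the resource string are assumptions, and this is not the token-proxy code from the PR. The request is only constructed, not sent, because IMDS is reachable only from inside an Azure VM.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

// imdsTokenURL is the managed-identity token endpoint of Azure IMDS.
const imdsTokenURL = "http://169.254.169.254/metadata/identity/oauth2/token"

// newIMDSRequest builds a token request for the given resource (audience).
// clientID is optional and selects a user-assigned managed identity.
func newIMDSRequest(resource, clientID string) (*http.Request, error) {
	q := url.Values{}
	q.Set("api-version", "2018-02-01")
	q.Set("resource", resource)
	if clientID != "" {
		q.Set("client_id", clientID)
	}
	req, err := http.NewRequest(http.MethodGet, imdsTokenURL+"?"+q.Encode(), nil)
	if err != nil {
		return nil, fmt.Errorf("building IMDS request: %w", err)
	}
	// IMDS rejects requests that do not carry this header.
	req.Header.Set("Metadata", "true")
	return req, nil
}

// describe returns the final URL and Metadata header for a resource.
func describe(resource string) (string, string) {
	req, err := newIMDSRequest(resource, "")
	if err != nil {
		return "", ""
	}
	return req.URL.String(), req.Header.Get("Metadata")
}

func main() {
	// The resource value is an assumed PDP audience string.
	u, h := describe("https://authorization.azure.net")
	fmt.Println(u)
	fmt.Println("Metadata:", h)
}
```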
- Add "Testing Guard Against Real Azure PDP" section documenting the token-proxy approach for calling real PDP from Guard
- Fix staging location: westus2 -> eastus2 (VHD provisioning bugs)
- Fix VM size: Standard_DS2_v2 -> standard_d2s_v5 (not allowed in staging)
- Fix TLS flag names: --ca-cert-file -> --tls-ca-file
- Fix PDP endpoint: must include full checkAccess path and api-version (SDK POSTs directly to endpoint URL with no path manipulation)
- Document PDP URL format requirement with 404 vs 200 comparison
Signed-off-by: Artem Kolomeetc <akolomeetc@microsoft.com>
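Because the SDK posts to the configured endpoint as-is, a startup check can catch a bare host before it produces 404s. The following is a minimal sketch of such a check; `checkPDPEndpoint` is a hypothetical helper, not something this PR adds, and the rules it enforces (https, a path ending in `checkAccess`, a non-empty `api-version`) are inferred from the commit message.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"strings"
)

// checkPDPEndpoint verifies that the endpoint carries the full checkAccess
// path and an api-version query parameter, since no path is appended later.
func checkPDPEndpoint(raw string) error {
	u, err := url.Parse(raw)
	if err != nil {
		return fmt.Errorf("parsing PDP endpoint: %w", err)
	}
	if u.Scheme != "https" {
		return errors.New("PDP endpoint must use https")
	}
	if !strings.HasSuffix(strings.ToLower(u.Path), "/checkaccess") {
		return errors.New("PDP endpoint must include the full checkAccess path")
	}
	if u.Query().Get("api-version") == "" {
		return errors.New("PDP endpoint must include an api-version query parameter")
	}
	return nil
}

// endpointOK is a convenience wrapper that reports validity as a bool.
func endpointOK(raw string) bool {
	return checkPDPEndpoint(raw) == nil
}

func main() {
	for _, e := range []string{
		"https://eastus2.authorization.azure.net",
		"https://eastus2.authorization.azure.net/providers/microsoft.authorization/checkAccess?api-version=2021-06-01-preview",
	} {
		fmt.Printf("%s -> valid=%v\n", e, endpointOK(e))
	}
}
```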
Alert rule belongs in a separate monitoring PR, not this feature branch.
Signed-off-by: Artem Kolomeetc <akolomeetc@microsoft.com>

Fixes CI check-license failure for Dockerfile and shell scripts in tests/mock-server/ and tests/staging/.
Signed-off-by: Artem Kolomeetc <akolomeetc@microsoft.com>
021f970 to
ee6c5d5
Compare
added 3 commits
May 19, 2026 10:25
The ltag license checker expects a blank line between the shebang and the license header comment block.
Signed-off-by: Artem Kolomeetc <akolomeetc@microsoft.com>
Signed-off-by: Artem Kolomeetc <akolomeetc@microsoft.com>
Document manual build verification steps for when Docker is unavailable. Covers build, lint, format, unit tests, and license header requirements.
Signed-off-by: Artem Kolomeetc <akolomeetc@microsoft.com>
Summary
- Add `api_version` Prometheus label for v1/v2 rollout monitoring

Root Cause
`tokenProviderAdapter.GetToken()` accepted tokens with `ExpiresOn=0` or already-expired timestamps. The Azure SDK's `BearerTokenPolicy` treated these as expired and called `GetToken()` on every request, generating ~4x `/authz/token` calls per checkAccess request. Under concurrent load this overwhelmed the API server with token acquisition requests, causing 429 throttling on RBAC resources and etcd timeouts (ICM 793110894, eastus2euap).

Changes
- `tokencredential_adapter.go`: reject tokens with a zero or expired `ExpiresOn` instead of caching them; store PDP scope; log every token acquisition at INFO level
- `rbac.go`: add `api_version` label ("v1"/"v2") to `checkAccessTotal`, `checkAccessFailed`, `checkAccessDuration`, `checkAccessSucceeded` metrics; promote v1/v2 routing logs to V(0)
- `checkaccess_v2.go`
- `tokencredential_adapter_test.go`
- `tests/mock-server/token-proxy/main.go`
- `authz/providers/azure/README.md`

Real PDP Validation Results (2026-05-18)
Tested Guard's full CheckAccess v2 code path against the real Azure PDP at `eastus2.authorization.azure.net` using a token-proxy that replaces OBO with IMDS-based token acquisition.

Architecture:
Setup:
- `akolomeetc-v2test` in eastus2 with `--enable-azure-rbac`
- […] `checkAccess/action` per eng.ms setup docs)
- `--azure.pdp-endpoint=https://eastus2.authorization.azure.net/providers/microsoft.authorization/checkAccess?api-version=2021-06-01-preview`

Test results (Guard -> token-proxy -> real PDP):
- `pods/read` in default
- `secrets/delete` in kube-system
- `pods/read` in default
- `pods/get`

Guard logs confirmed full v2 flow:
Token-proxy logs confirmed real token acquisition:
Key finding - PDP endpoint URL format:
The CheckAccess v2 SDK POSTs directly to `r.endpoint` with no path manipulation. `https://eastus2.authorization.azure.net` alone returns 404. The full path `/providers/microsoft.authorization/checkAccess?api-version=2021-06-01-preview` must be included.

Additional validation - direct PDP API calls (bypassing Guard):
- `management.azure.com`
- `management.azure.com`
- `authorization.azure.net`

Test plan
- […] `api_version="v2"` label