Skip to content

fix: prevent token refresh storm in CheckAccess v2 and add rollout observability#449

Merged
kodiakhq[bot] merged 21 commits into
kubeguard:masterfrom
arxhive:fix/checkaccess-v2-token-scope-and-observability
May 19, 2026
Merged

fix: prevent token refresh storm in CheckAccess v2 and add rollout observability#449
kodiakhq[bot] merged 21 commits into
kubeguard:masterfrom
arxhive:fix/checkaccess-v2-token-scope-and-observability

Conversation

@arxhive
Copy link
Copy Markdown
Contributor

@arxhive arxhive commented May 7, 2026

Summary

  • Fix CheckAccess v2 token refresh storm that caused API server overload (ICM 793110894)
  • Add api_version Prometheus label for v1/v2 rollout monitoring
  • Promote v1/v2 routing logs to V(0) for production visibility in Kusto
  • Add token-proxy for testing Guard against real Azure PDP without CCP
  • Update staging test guide with real PDP testing instructions

Root Cause

tokenProviderAdapter.GetToken() accepted tokens with ExpiresOn=0 or already-expired timestamps. The Azure SDK's BearerTokenPolicy treated these as expired and called GetToken() on every request, generating ~4x /authz/token calls per checkAccess request. Under concurrent load this overwhelmed the API server with token acquisition requests, causing 429 throttling on RBAC resources and etcd timeouts (ICM 793110894, eastus2euap).

Changes

File Change
tokencredential_adapter.go Reject tokens with zero/expired ExpiresOn instead of caching them; store PDP scope; log every token acquisition at INFO level
rbac.go Add api_version label ("v1"/"v2") to checkAccessTotal, checkAccessFailed, checkAccessDuration, checkAccessSucceeded metrics; promote v1/v2 routing logs to V(0)
checkaccess_v2.go Pass "v2" label to all metrics; promote batch start/success logs to V(0)
tokencredential_adapter_test.go Add tests for ExpiresOn=0, expired token, and scope storage
tests/mock-server/token-proxy/main.go New: OBO replacement that gets real PDP tokens from IMDS via managed identity
authz/providers/azure/README.md Add real PDP testing section; fix staging config (region, VM size, TLS flags, PDP endpoint URL)

Real PDP Validation Results (2026-05-18)

Tested Guard's full CheckAccess v2 code path against the real Azure PDP at eastus2.authorization.azure.net using a token-proxy that replaces OBO with IMDS-based token acquisition.

Architecture:

SubjectAccessReview -> Guard -> token-proxy (IMDS) -> real PDP

Setup:

  • AKS cluster akolomeetc-v2test in eastus2 with --enable-azure-rbac
  • Kubelet managed identity with Contributor role (provides checkAccess/action per eng.ms setup docs)
  • Token-proxy deployed as OBO replacement, getting real PDP-audience tokens from IMDS
  • Guard deployed with --azure.pdp-endpoint=https://eastus2.authorization.azure.net/providers/microsoft.authorization/checkAccess?api-version=2021-06-01-preview

Test results (Guard -> token-proxy -> real PDP):

Test Action Result PDP Latency
Authorized user (RBAC Cluster Admin) pods/read in default Allowed 14ms
Authorized user (cluster-scoped) secrets/delete in kube-system Allowed 5ms
Unauthorized user (fake OID) pods/read in default NotAllowed 1ms
Non-AAD user (ServiceAccount) pods/get Skipped ("Azure does not have opinion") N/A

Guard logs confirmed full v2 flow:

"Using CheckAccess v2 API"             -> v2 path selected
"Acquiring PDP token"                  -> token-proxy called
"PDP token acquired successfully"      -> real PDP-audience token from IMDS
"CheckAccess v2 request succeeded"     -> real PDP returned decision (5-14ms)

Token-proxy logs confirmed real token acquisition:

[OBO] Token issued: resource=https://authorization.azure.net   -> real PDP token via IMDS

Key finding - PDP endpoint URL format:
The CheckAccess v2 SDK POSTs directly to r.endpoint with no path manipulation. https://eastus2.authorization.azure.net alone returns 404. The full path /providers/microsoft.authorization/checkAccess?api-version=2021-06-01-preview must be included.

Additional validation - direct PDP API calls (bypassing Guard):

Test Token Audience Result
V1 via ARM proxy management.azure.com 200 Allowed
V2 with ARM token management.azure.com 401 Unauthorized (confirms audience enforcement)
V2 with PDP token authorization.azure.net 200 Allowed

Test plan

  • Unit tests pass (26 v2 tests, 6 token adapter tests, 2 AKS token provider tests)
  • Direct PDP API validation (token audience enforcement confirmed)
  • Guard e2e against real PDP via token-proxy (authorized/unauthorized/non-AAD)
  • Prometheus metrics emit with api_version="v2" label
  • CCP Helm chart deployment validation (requires cx-underlay access)

@arxhive arxhive requested a review from a team as a code owner May 7, 2026 23:29
@arxhive arxhive force-pushed the fix/checkaccess-v2-token-scope-and-observability branch 2 times, most recently from 9a9f916 to 6ca059a Compare May 8, 2026 00:27
Comment thread authz/providers/azure/rbac/rbac.go
Comment thread authz/providers/azure/rbac/tokencredential_adapter.go
@arxhive arxhive force-pushed the fix/checkaccess-v2-token-scope-and-observability branch from ed35098 to 6ca059a Compare May 8, 2026 06:14
weinong
weinong previously approved these changes May 9, 2026
Copy link
Copy Markdown
Contributor

@weinong weinong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Comment thread CLAUDE.md
@@ -0,0 +1,24 @@
# Guard
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please scrub this file to remove aks internals

Copy link
Copy Markdown
Contributor Author

@arxhive arxhive May 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed CCP, OBO service, and Helm chart references from CLAUDE.md and the command file.

Comment thread .claude/commands/guard-staging-test.md Outdated

### Prerequisites

- Azure subscription: `AKS INT/Staging Test` (`az account set --subscription 'AKS INT/Staging Test'`)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

scrub it too

Copy link
Copy Markdown
Contributor Author

@arxhive arxhive May 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed, + obfuscated the rest of variables.

@arxhive arxhive requested a review from a team as a code owner May 19, 2026 06:39
Artem Kolomeetc added 17 commits May 19, 2026 00:10
…servability

The tokenProviderAdapter.GetToken() accepted tokens with ExpiresOn=0 or
already-expired timestamps, causing the Azure SDK's BearerTokenPolicy to
refresh the token on every request. Under concurrent load this amplified
into ~4x /authz/token calls per checkAccess request, overwhelming the
API server (ICM 793110894).

Changes:
- Validate ExpiresOn in tokenProviderAdapter - reject zero/expired tokens
  instead of returning them to the SDK cache
- Store PDP scope in the adapter for future scope-aware token acquisition
- Add api_version label ("v1"/"v2") to all checkAccess Prometheus metrics
  for rollout monitoring
- Promote v1/v2 routing decision and v2 batch logs from V(5)/V(7) to V(0)
  so they appear in production logs and Kusto

Signed-off-by: Artem Kolomeetc <akolomeetc@microsoft.com>
Per-batch logging at V(0) would be too noisy in production. Keep batch
start at V(7) and batch success at V(5), matching v1 behavior. The v1/v2
routing decision in CheckAccess() remains at V(0) for rollout tracking.

Signed-off-by: Artem Kolomeetc <akolomeetc@microsoft.com>
- Remove double-counting of checkAccessTotal and checkAccessDuration
  metrics on io.ReadAll failure path (lines 623-624 already counted
  the request before the body read)
- Demote token acquisition logs from V(0) to V(5) to avoid log volume
  issues; rollout signal is covered by api_version Prometheus label

Signed-off-by: Artem Kolomeetc <akolomeetc@microsoft.com>
Add realistic mock PDP service for testing Guard's CheckAccess v2 flow
end-to-end in staging clusters without access to production CCP/OBO
infrastructure.

Mock server changes (tests/mock-server/):
- Add v2 PDP endpoint (/checkaccess/v2) with realistic response model
  including roleAssignment details that Guard parses
- Add OBO authztoken endpoint (/v1/<ccpid>/authztoken) matching
  production token exchange pattern
- Support dual-mode: HTTP on :8080 (OBO tokens) + HTTPS on :8443 (PDP)
  since Azure SDK requires TLS for authenticated requests
- Add v2 request/response types matching checkaccess-v2-go-sdk

Staging test scripts (test/staging/):
- deploy-guard-v2-test.sh: deploys Guard with v2 enabled + mock PDP
- test-guard-v2.sh: sends SubjectAccessReview and checks logs
- guard-v2-test.yaml: Guard deployment manifest
- mock-pdp-service.yaml: mock PDP deployment manifest

Signed-off-by: Artem Kolomeetc <akolomeetc@microsoft.com>
Move test/staging/ to tests/staging/ to keep all test infrastructure
(k6, mock-server, staging) under a single tests/ directory.

Signed-off-by: Artem Kolomeetc <akolomeetc@microsoft.com>
- CLAUDE.md: project architecture, known gotchas, available commands
- .claude/commands/guard-staging-test.md: e2e test runbook for
  deploying Guard with CheckAccess v2 against mock PDP in staging

Signed-off-by: Artem Kolomeetc <akolomeetc@microsoft.com>
Remove CCP, OBO service, and Helm chart references from public files.
Architecture details moved to CLAUDE.local.md (not committed).

Signed-off-by: Artem Kolomeetc <akolomeetc@microsoft.com>
Signed-off-by: Artem Kolomeetc <akolomeetc@microsoft.com>
V1 and V2 CheckAccess APIs require different token audiences:
- V1: management.core.windows.net (ARM) - routed through ARM proxy
- V2: authorization.azure.net (PDP) - calls PDP directly

The OBO /authztoken endpoint defaults to ARM-audience tokens. When
Guard sends these to PDP directly (v2), PDP rejects with 401 because
the token audience doesn't match.

Create a separate aksTokenProvider for v2 that includes the PDP
resource in the OBO request body. This tells OBO to request a
PDP-audience token from AAD instead of the default ARM-audience token.
V1 continues using the original provider with ARM-audience tokens.

Requires companion OBO service change (aks-rp PR #15683683) to accept
the resource field in TokenRequestV1.

Signed-off-by: Artem Kolomeetc <akolomeetc@microsoft.com>
The mock OBO token handler now parses the request body and logs the
resource field, matching the updated aksTokenProvider that sends
resource for v2 PDP-audience tokens. Defaults to ARM if not provided.

Signed-off-by: Artem Kolomeetc <akolomeetc@microsoft.com>
Remove production OBO service URL pattern and internal service
references from the command file description.

Signed-off-by: Artem Kolomeetc <akolomeetc@microsoft.com>
Signed-off-by: Artem Kolomeetc <akolomeetc@microsoft.com>
Signed-off-by: Artem Kolomeetc <akolomeetc@microsoft.com>
Replace subscription name, resource group, ACR, and email with
generic placeholders to remove author identity from the script.

Signed-off-by: Artem Kolomeetc <akolomeetc@microsoft.com>
Add token-proxy (tests/mock-server/token-proxy/) - a thin OBO replacement
that gets real PDP-audience tokens from IMDS via a managed identity. This
enables testing Guard's full CheckAccess v2 code path against the real
Azure PDP endpoint without requiring CCP infrastructure.

Add ADX alert rule (tests/staging/checkaccess-v2-alert.yaml) to monitor
Guard CCP logs for CheckAccess v2 failures during rollout.

Signed-off-by: Artem Kolomeetc <akolomeetc@microsoft.com>
- Add "Testing Guard Against Real Azure PDP" section documenting the
  token-proxy approach for calling real PDP from Guard
- Fix staging location: westus2 -> eastus2 (VHD provisioning bugs)
- Fix VM size: Standard_DS2_v2 -> standard_d2s_v5 (not allowed in staging)
- Fix TLS flag names: --ca-cert-file -> --tls-ca-file
- Fix PDP endpoint: must include full checkAccess path and api-version
  (SDK POSTs directly to endpoint URL with no path manipulation)
- Document PDP URL format requirement with 404 vs 200 comparison

Signed-off-by: Artem Kolomeetc <akolomeetc@microsoft.com>
Alert rule belongs in a separate monitoring PR, not this feature branch.

Signed-off-by: Artem Kolomeetc <akolomeetc@microsoft.com>
weinong
weinong previously approved these changes May 19, 2026
Copy link
Copy Markdown
Contributor

@weinong weinong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@weinong weinong added the automerge Kodiak will auto merge PRs that have this label label May 19, 2026
Fixes CI check-license failure for Dockerfile and shell scripts
in tests/mock-server/ and tests/staging/.

Signed-off-by: Artem Kolomeetc <akolomeetc@microsoft.com>
@arxhive arxhive force-pushed the fix/checkaccess-v2-token-scope-and-observability branch from 021f970 to ee6c5d5 Compare May 19, 2026 16:00
Artem Kolomeetc added 3 commits May 19, 2026 10:25
The ltag license checker expects a blank line between the shebang
and the license header comment block.

Signed-off-by: Artem Kolomeetc <akolomeetc@microsoft.com>
Signed-off-by: Artem Kolomeetc <akolomeetc@microsoft.com>
Document manual build verification steps for when Docker is unavailable.
Covers build, lint, format, unit tests, and license header requirements.

Signed-off-by: Artem Kolomeetc <akolomeetc@microsoft.com>
Copy link
Copy Markdown
Contributor

@weinong weinong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@kodiakhq kodiakhq Bot merged commit 1b6c8e5 into kubeguard:master May 19, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

automerge Kodiak will auto merge PRs that have this label

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants