fix: tolerate degraded aggregated APIServices during API resource discovery #450

Merged
kodiakhq[bot] merged 2 commits into kubeguard:master from arxhive:fix/tolerate-degraded-apiservices
May 12, 2026

Conversation

Contributor

@arxhive commented May 9, 2026

Summary

  • Guard exits fatally on startup when an aggregated APIService returns a zero-length HTTP 200 response during API resource discovery. The result is a crash loop that keeps Guard from becoming ready.
  • Broadened the error handling in fetchApiResources() to tolerate all partial API discovery errors, not just "unreachable" ones, since ServerPreferredResources() still returns valid partial results alongside these errors.
  • Removed the now-unused apiserviceUnreachableError constant.

Context

ICM 792913310: 59 clusters in centralus have Guard crash-looping because the nodeagent.microsoft.com/v1alpha1 aggregated APIService returns zero-length HTTP 200 responses. Guard exits fatally (server.go:261) when API resource discovery fails, which produced 6,365 crashes in 6 hours. This cascades into OverlayManager UpgradeRelease timeouts, since Guard readiness is a precondition for Helm release upgrades.

The existing error handling required the error to match both retrieveServerApiError and apiserviceUnreachableError, but the zero-length response error matches only the first. ServerPreferredResources() returns partial results whenever the retrieveServerApiError prefix is present, so checking for that prefix alone is sufficient and more resilient to future error patterns.

Affected API groups in production:

  • nodeagent.microsoft.com/v1alpha1 (3,677 crashes) - Azure CNI Node Agent
  • extensions.containerapp.microsoft.com/v1alpha1 (202 crashes)
  • containerapp.lgnvk.microsoft.com/v1alph

…covery

Guard fatally crashes on startup when aggregated APIServices return
0-length HTTP 200 responses during API resource discovery. This causes
a crash loop that prevents Guard readiness, which cascades into
OverlayManager UpgradeRelease timeouts across affected clusters.

The existing error handling only tolerated "unreachable" APIService
errors but not empty-body responses. Since ServerPreferredResources()
returns partial results for any individual API group failure, we now
tolerate all errors that contain the partial discovery error prefix
rather than checking for a specific sub-error type.

This fix resolves ICM 792913310 where 59 clusters in centralus had
Guard crash-looping due to nodeagent.microsoft.com/v1alpha1 returning
empty responses, causing 6,365 Guard crashes in 6 hours.

Signed-off-by: Artem Kolomeetc <akolomeetc@microsoft.com>
@arxhive arxhive requested a review from a team as a code owner May 9, 2026 06:11
Contributor

@weinong weinong left a comment

lgtm

@weinong added the automerge label (Kodiak will auto merge PRs that have this label) May 12, 2026
@kodiakhq kodiakhq Bot merged commit a70fda4 into kubeguard:master May 12, 2026
3 checks passed
Labels

automerge Kodiak will auto merge PRs that have this label

2 participants