fix: tolerate degraded aggregated APIServices during API resource discovery by arxhive · Pull Request #450 · kubeguard/guard

arxhive · 2026-05-09T06:11:16Z

Summary

Guard crashes fatally on startup when aggregated APIServices return 0-length HTTP 200 responses during API resource discovery, causing a crash loop that prevents Guard from becoming ready.
Broadened the error handling in fetchApiResources() to tolerate all partial API discovery errors (not just "unreachable" errors), since ServerPreferredResources() always returns valid partial results alongside these errors.
Removed the now-unused apiserviceUnreachableError constant.
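
The control flow described in the summary can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the code from this PR: `discoverFunc`, `fetchResources`, `degraded`, and `unavailable` are hypothetical names, and the value of `partialDiscoveryPrefix` is an assumption modeled on the message client-go uses for partial discovery failures.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// partialDiscoveryPrefix is an ASSUMED value for the prefix that marks a
// partial discovery failure; the PR does not show the constant's contents.
const partialDiscoveryPrefix = "unable to retrieve the complete list of server APIs"

// discoverFunc is a hypothetical stand-in for a discovery call that, like
// ServerPreferredResources() as described above, may return partial results
// together with a non-nil error.
type discoverFunc func() ([]string, error)

// fetchResources keeps the partial results when the error only signals a
// partial discovery failure, and propagates every other error.
func fetchResources(discover discoverFunc) ([]string, error) {
	resources, err := discover()
	if err != nil && !strings.Contains(err.Error(), partialDiscoveryPrefix) {
		return nil, err
	}
	return resources, nil
}

// degraded simulates one healthy API group plus one aggregated APIService
// that answers with an empty body.
func degraded() ([]string, error) {
	return []string{"apps/v1"},
		errors.New(partialDiscoveryPrefix + ": nodeagent.microsoft.com/v1alpha1: 0-length response")
}

// unavailable simulates a failure that is not a partial discovery error.
func unavailable() ([]string, error) {
	return nil, errors.New("connection refused")
}

func main() {
	res, err := fetchResources(degraded)
	fmt.Println(res, err)

	_, err = fetchResources(unavailable)
	fmt.Println(err)
}
```

In this sketch the degraded case yields the healthy groups with a nil error, so startup could proceed, while an unrelated failure is still returned to the caller.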

Context

ICM 792913310 - 59 clusters in centralus have Guard crash-looping because nodeagent.microsoft.com/v1alpha1 aggregated APIService returns 0-length HTTP 200 responses. Guard (server.go:261) fatally exits when API resource discovery fails, causing 6,365 crashes in 6 hours. This cascades into OverlayManager UpgradeRelease timeouts since Guard readiness is a precondition for Helm release upgrades.

The existing error handling tolerated a discovery error only when it matched both retrieveServerApiError and apiserviceUnreachableError; the 0-length response error matched only the first. ServerPreferredResources() returns partial results whenever the retrieveServerApiError prefix is present, so checking for that prefix alone is sufficient and more resilient to future error patterns.
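
To make the difference between the two checks concrete, here is a small side-by-side sketch. It assumes the checks are substring matches on the error message and uses assumed values for both constants (the PR names them but does not show their contents); `toleratedBefore`, `toleratedAfter`, and the sample errors are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// ASSUMED values: the PR names these constants but does not show them.
const (
	retrieveServerApiError     = "unable to retrieve the complete list of server APIs"
	apiserviceUnreachableError = "the server is currently unable to handle the request"
)

// Hypothetical sample errors for an unreachable APIService, an APIService
// returning an empty body, and an unrelated failure.
var (
	errUnreachable = errors.New(retrieveServerApiError +
		": nodeagent.microsoft.com/v1alpha1: " + apiserviceUnreachableError)
	errEmptyBody = errors.New(retrieveServerApiError +
		": nodeagent.microsoft.com/v1alpha1: 0-length response with status code: 200")
	errUnrelated = errors.New("connection refused")
)

// toleratedBefore models the old condition: both substrings must be present.
func toleratedBefore(err error) bool {
	msg := err.Error()
	return strings.Contains(msg, retrieveServerApiError) &&
		strings.Contains(msg, apiserviceUnreachableError)
}

// toleratedAfter models the new condition: the partial-discovery prefix alone.
func toleratedAfter(err error) bool {
	return strings.Contains(err.Error(), retrieveServerApiError)
}

func main() {
	for _, err := range []error{errUnreachable, errEmptyBody, errUnrelated} {
		fmt.Printf("before=%t after=%t\n", toleratedBefore(err), toleratedAfter(err))
	}
}
```

Under these assumptions only the empty-body error changes classification: the old check rejects it and the new check tolerates it, while unrelated errors remain fatal under both. client-go also exposes `discovery.IsGroupDiscoveryFailedError` as a typed alternative to string matching; whether this PR uses it is not shown here.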

Affected API groups in production:

nodeagent.microsoft.com/v1alpha1 (3,677 crashes) - Azure CNI Node Agent
extensions.containerapp.microsoft.com/v1alpha1 (202 crashes)
containerapp.lgnvk.microsoft.com/v1alph

…covery

Guard fatally crashes on startup when aggregated APIServices return 0-length HTTP 200 responses during API resource discovery. This causes a crash loop that prevents Guard readiness, which cascades into OverlayManager UpgradeRelease timeouts across affected clusters.

The existing error handling only tolerated "unreachable" APIService errors but not empty-body responses. Since ServerPreferredResources() returns partial results for any individual API group failure, we now tolerate all errors that contain the partial discovery error prefix rather than checking for a specific sub-error type.

This fix resolves ICM 792913310 where 59 clusters in centralus had Guard crash-looping due to nodeagent.microsoft.com/v1alpha1 returning empty responses, causing 6,365 Guard crashes in 6 hours.

Signed-off-by: Artem Kolomeetc <akolomeetc@microsoft.com>

weinong

lgtm

arxhive requested a review from a team as a code owner May 9, 2026 06:11

weinong approved these changes May 11, 2026

Merge branch 'master' into fix/tolerate-degraded-apiservices

77b627f

weinong added the automerge Kodiak will auto merge PRs that have this label label May 12, 2026

kodiakhq Bot merged commit a70fda4 into kubeguard:master May 12, 2026
3 checks passed

kodiakhq[bot] merged 2 commits into kubeguard:master from arxhive:fix/tolerate-degraded-apiservices
