fix: tolerate degraded aggregated APIServices during API resource discovery #450
Merged
kodiakhq[bot] merged 2 commits on May 12, 2026
…covery

Guard fatally crashes on startup when aggregated APIServices return 0-length HTTP 200 responses during API resource discovery. This causes a crash loop that prevents Guard readiness, which cascades into OverlayManager UpgradeRelease timeouts across affected clusters.

The existing error handling only tolerated "unreachable" APIService errors but not empty-body responses. Since ServerPreferredResources() returns partial results for any individual API group failure, we now tolerate all errors that contain the partial discovery error prefix rather than checking for a specific sub-error type.

This fix resolves ICM 792913310, where 59 clusters in centralus had Guard crash-looping due to nodeagent.microsoft.com/v1alpha1 returning empty responses, causing 6,365 Guard crashes in 6 hours.

Signed-off-by: Artem Kolomeetc <akolomeetc@microsoft.com>
Summary
- `fetchApiResources()` now tolerates all partial API discovery errors (not just "unreachable" errors), since `ServerPreferredResources()` always returns valid partial results alongside these errors
- `apiserviceUnreachableError` constant is no longer needed for the check

Context
ICM 792913310 - 59 clusters in centralus have Guard crash-looping because the `nodeagent.microsoft.com/v1alpha1` aggregated APIService returns 0-length HTTP 200 responses. Guard (server.go:261) fatally exits when API resource discovery fails, causing 6,365 crashes in 6 hours. This cascades into OverlayManager `UpgradeRelease` timeouts since Guard readiness is a precondition for Helm release upgrades.

The existing error handling checked for `retrieveServerApiError` AND `apiserviceUnreachableError`, but the 0-length response error only matched the first condition. The `ServerPreferredResources()` function returns partial results whenever the `retrieveServerApiError` prefix is present, so checking for that prefix alone is sufficient and more resilient to future error patterns.

Affected API groups in production:

- `nodeagent.microsoft.com/v1alpha1` (3,677 crashes) - Azure CNI Node Agent
- `extensions.containerapp.microsoft.com/v1alpha1` (202 crashes)
- containerapp.lgnvk.microsoft.com/v1alph