[Azure] surface underlying ARM error when LRO deployment fails #85
Open
scuba10steve wants to merge 1 commit into
Azure SDK's LRO poller throws a generic "Long running operation failed." with no detail when an ARM deployment terminates in Failed state. The real error (e.g. "does not support availability zones at location 'westus'") lives in each DeploymentOperation.statusMessage. Clouddriver only extracted that on the success polling path; on failure the catch block surfaced only the SDK's generic message.

In createTemplateDeployment, on exception, query the deployment's operations and append each FAILED operation's statusMessage to the rethrown exception's message. Best-effort with bounded retry (3x1s) to ride out the race between LRO terminal-failure and ARM materializing the per-op rows. Filter to recent failures by timestamp window so retries against deterministic deployment names don't attach stale prior-attempt errors. Skip ops with null targetResource (known SDK noise). Preserve ManagementException type when rethrowing so downstream catch blocks keep matching. Narrow catch from Throwable to Exception so JVM Errors propagate cleanly.

Extracts the statusMessage rendering and failure extraction into AzureDeploymentOperation so the existing checkDeploymentOperationStatus poller and the new failure path share the same logic.
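To make the failure path concrete, here is a minimal Java sketch of the timestamp-window filter and the message append described above. It is an illustration under assumptions, not the code in this PR: `FailureMessageSketch`, `FailedOp`, `recent`, and `enrich` are hypothetical names, `FailedOp` is a plain stand-in rather than an Azure SDK type, the sample resource names and the `BadRequest` code are made up, and the ` :: ` and ` | ` separators follow the message format quoted in the PR summary.

```java
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;

class FailureMessageSketch {

    // Hypothetical stand-in for one FAILED deployment operation; not an Azure SDK type.
    record FailedOp(String resourceType, String resourceName,
                    String code, String message, Instant timestamp) {
        String label() {
            return "[" + resourceType + "/" + resourceName + "] " + code + ": " + message;
        }
    }

    // Keep failures at or after `since`; failures with no timestamp are kept defensively.
    static List<FailedOp> recent(List<FailedOp> ops, Instant since) {
        return ops.stream()
                .filter(op -> op.timestamp() == null || !op.timestamp().isBefore(since))
                .collect(Collectors.toList());
    }

    // Append failure details to the original message; return it unchanged when there are none.
    static String enrich(String original, List<FailedOp> failures) {
        if (failures.isEmpty()) {
            return original;
        }
        String details = failures.stream()
                .map(FailedOp::label)
                .collect(Collectors.joining(" | "));
        return original + " :: " + details;
    }

    public static void main(String[] args) {
        Instant start = Instant.parse("2024-01-01T00:00:00Z");
        String type = "Microsoft.Compute/virtualMachineScaleSets";
        List<FailedOp> ops = List.of(
                // Stale failure from an earlier attempt under the same deployment name.
                new FailedOp(type, "app-v001", "InvalidResourceReference",
                        "stale", start.minusSeconds(3600)),
                // Failure from the current attempt.
                new FailedOp(type, "app-v002", "BadRequest",
                        "does not support availability zones at location 'westus'",
                        start.plusSeconds(5)));
        System.out.println(enrich("Long running operation failed.", recent(ops, start)));
    }
}
```

The sketch leaves out the bounded retry and the handling of operations with a missing code or message; both would sit around `recent` and `label` respectively.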
Summary
- Azure SDK's LRO poller throws a generic `AzureException("Long running operation failed.")` with no detail when an ARM deployment terminates in `Failed`. The actual ARM error (e.g. `does not support availability zones at location 'westus'`, `InvalidResourceReference`, `properties.upgradePolicy.mode = null`) lives in each `DeploymentOperation.statusMessage` and was previously only extracted on the success polling path. On failure, clouddriver surfaced only the SDK's generic message to the Spinnaker UI, leaving operators to dig through clouddriver logs.
- `AzureResourceManagerClient.createTemplateDeployment` now catches the exception, queries the deployment's operations, and rethrows with each `FAILED` operation's `statusMessage` appended: `<original> :: [<resourceType>/<resourceName>] <code>: <message> | ...`.
- Null-`targetResource` filter to suppress known SDK noise; `ManagementException` type preserved when rethrowing so downstream `catch (ManagementException)` blocks keep matching; catch narrowed from `Throwable` to `Exception` so JVM `Error`s propagate cleanly; null `e.message` defaults to a sensible class-name string.
- `renderStatusMessage`, `extractFailedResources`, and `filterToRecentFailures` extracted to `AzureDeploymentOperation`. The success-path `checkDeploymentOperationStatus` and the new failure path now share the rendering logic.

Test plan
- `AzureResourceManagerClientSpec` (8 tests): empty/all-succeeded → original message; single FAILED appends statusMessage; multiple FAILED + Succeeded interleaved; missing statusMessage / missing targetResource / phantom null-target ops; stale failures filtered by timestamp window; `wrapAsRichException` preserves `ManagementException` type and wraps non-Management exceptions in `RuntimeException`.
- `AzureDeploymentOperationSpec` (9 tests): `extractFailedResources` skips null-target SDK noise, skips Succeeded, returns empty for null/empty input; `filterToRecentFailures` keeps within window / handles missing timestamps / keeps null-timestamp failures defensively; `renderStatusMessage` handles null / code+message / code-only / message-only / status-only / non-StatusMessage object; `FailedResourceDetail.label()` formatting.
- `./gradlew :clouddriver:clouddriver-azure:test` runs all 17 new tests green; broader azure suite goes 177 → 194 tests with the same 25 pre-existing template/converter failures (verified by stashing this diff and re-running on the unmodified branch: same failures, unrelated).
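
For reviewers who want a picture of the type-preserving rethrow that the `wrapAsRichException` tests cover, the following Java sketch shows one way it could look. Everything here is an assumption for illustration: `RethrowSketch`, `wrap`, and `StubManagementException` are hypothetical, the stub merely stands in for the SDK's `ManagementException` (whose real constructors take an HTTP response), and using the simple class name as the fallback for a null message is a guess at what "a sensible class-name string" means.

```java
class RethrowSketch {

    // Hypothetical stand-in for the SDK's ManagementException; not the real class.
    static class StubManagementException extends RuntimeException {
        StubManagementException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    // Build an enriched exception while keeping the management type, so that
    // downstream catch blocks for that type keep matching.
    static RuntimeException wrap(Exception e, String details) {
        // Assumption: fall back to the simple class name when the message is null.
        String base = e.getMessage() != null ? e.getMessage() : e.getClass().getSimpleName();
        String message = details.isEmpty() ? base : base + " :: " + details;
        if (e instanceof StubManagementException) {
            return new StubManagementException(message, e);
        }
        // Anything else is wrapped in a plain RuntimeException with the cause attached.
        return new RuntimeException(message, e);
    }

    public static void main(String[] args) {
        Exception original = new StubManagementException("Long running operation failed.", null);
        RuntimeException rich = wrap(original, "[type/name] Code: message");
        System.out.println(rich.getClass().getSimpleName() + ": " + rich.getMessage());
    }
}
```

Only `Exception` is accepted by `wrap`, which mirrors the narrowed catch: a JVM `Error` never reaches it and propagates untouched.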