[Azure] surface underlying ARM error when LRO deployment fails #85

Open: scuba10steve wants to merge 1 commit into main from wt/2

@scuba10steve
Member

Summary

  • The Azure SDK's LRO poller throws AzureException("Long running operation failed.") with no detail when an ARM deployment terminates in the Failed state. The actual ARM error (e.g. does not support availability zones at location 'westus', InvalidResourceReference, properties.upgradePolicy.mode = null) lives in each DeploymentOperation.statusMessage, which was previously extracted only on the success polling path. On failure, clouddriver surfaced only the SDK's generic message to the Spinnaker UI, leaving operators to dig through clouddriver logs.
  • AzureResourceManagerClient.createTemplateDeployment now catches the exception, queries the deployment's operations, and rethrows with each FAILED operation's statusMessage appended: <original> :: [<resourceType>/<resourceName>] <code>: <message> | ....
  • Hardening: bounded retry (3×1s) on the operation lookup to ride out the ARM eventual-consistency race; a timestamp window filter so that retries reusing a deterministic deployment name don't surface stale errors from a prior attempt; a null-targetResource filter to suppress known SDK noise; ManagementException type preserved when rethrowing so downstream catch (ManagementException) blocks keep matching; catch narrowed from Throwable to Exception so JVM Errors propagate cleanly; a null e.message defaults to a sensible class-name string.
  • DRY: renderStatusMessage, extractFailedResources, and filterToRecentFailures extracted to AzureDeploymentOperation. The success-path checkDeploymentOperationStatus and the new failure path now share the rendering logic.
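
For readers who want to see the enriched message shape concretely, here is a minimal Java sketch of the `<original> :: [<resourceType>/<resourceName>] <code>: <message> | ...` format described above. It is not the code in this PR: FailureMessageSketch, FailedOp, and enrich are hypothetical names, and falling back to the exception's simple class name for a null message is an assumption about what "a sensible class-name string" means.

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustrative sketch only: class and method names are hypothetical, not clouddriver's.
final class FailureMessageSketch {

  /** Stand-in for the fields pulled from one FAILED deployment operation. */
  static final class FailedOp {
    final String resourceType;
    final String resourceName;
    final String code;
    final String message;

    FailedOp(String resourceType, String resourceName, String code, String message) {
      this.resourceType = resourceType;
      this.resourceName = resourceName;
      this.code = code;
      this.message = message;
    }

    /** Renders "[<resourceType>/<resourceName>] <code>: <message>". */
    String render() {
      return "[" + resourceType + "/" + resourceName + "] " + code + ": " + message;
    }
  }

  /** Appends the failed operations to the original message, or returns it unchanged. */
  static String enrich(Exception original, List<FailedOp> failures) {
    // Assumption: fall back to the simple class name when the exception has no message.
    String base = original.getMessage() != null
        ? original.getMessage()
        : original.getClass().getSimpleName();
    if (failures.isEmpty()) {
      return base; // empty or all-succeeded: keep the original message
    }
    String details = failures.stream()
        .map(FailedOp::render)
        .collect(Collectors.joining(" | "));
    return base + " :: " + details;
  }
}
```

Keeping the original message as the prefix means anything that already matches on the SDK's generic text would continue to match, with the per-resource detail appended after the " :: " separator.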

Test plan

  • New AzureResourceManagerClientSpec (8 tests): empty/all-succeeded → original message; single FAILED appends statusMessage; multiple FAILED + Succeeded interleaved; missing statusMessage / missing targetResource / phantom null-target ops; stale failures filtered by timestamp window; wrapAsRichException preserves ManagementException type and wraps non-Management exceptions in RuntimeException.
  • New AzureDeploymentOperationSpec (9 tests): extractFailedResources skips null-target SDK noise, skips Succeeded, returns empty for null/empty input; filterToRecentFailures keeps within window / handles missing timestamps / keeps null-timestamp failures defensively; renderStatusMessage handles null / code+message / code-only / message-only / status-only / non-StatusMessage object; FailedResourceDetail.label() formatting.
  • ./gradlew :clouddriver:clouddriver-azure:test runs all 17 new tests green; the broader azure suite goes from 177 to 194 tests with the same 25 pre-existing template/converter failures (verified by stashing this diff and re-running on the unmodified branch: same failures, unrelated to this change).

Azure SDK's LRO poller throws a generic "Long running operation failed."
with no detail when an ARM deployment terminates in Failed state. The
real error (e.g. "does not support availability zones at location 'westus'")
lives in each DeploymentOperation.statusMessage. Clouddriver only extracted
that on the success polling path; on failure the catch block surfaced
only the SDK's generic message.

In createTemplateDeployment, on exception, query the deployment's
operations and append each FAILED operation's statusMessage to the
rethrown exception's message. Best-effort with bounded retry (3x1s) to
ride out the race between LRO terminal-failure and ARM materializing
the per-op rows. Filter to recent failures by timestamp window so
retries against deterministic deployment names don't attach stale
prior-attempt errors. Skip ops with null targetResource (known SDK
noise). Preserve ManagementException type when rethrowing so
downstream catch blocks keep matching. Narrow catch from Throwable
to Exception so JVM Errors propagate cleanly.
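
The retry and filtering steps in the paragraph above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: OperationLookupSketch, Op, fetchWithRetry, and recentFailures are hypothetical names; "3x1s" is read here as up to three attempts one second apart; the loop is assumed to stop at the first non-empty result; and the window is modelled as a simple "not before" cutoff.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Illustrative sketch only: names are hypothetical, not clouddriver's.
final class OperationLookupSketch {

  /** Stand-in for the few deployment-operation fields the filter needs. */
  static final class Op {
    final String state;           // e.g. "Failed" or "Succeeded"
    final String targetResource;  // null for the SDK-noise operations
    final Instant timestamp;      // may be null

    Op(String state, String targetResource, Instant timestamp) {
      this.state = state;
      this.targetResource = targetResource;
      this.timestamp = timestamp;
    }
  }

  /** Best-effort lookup: tries up to maxAttempts times, pausing between tries. */
  static <T> List<T> fetchWithRetry(Supplier<List<T>> lookup, int maxAttempts, Duration delay) {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        List<T> result = lookup.get();
        if (result != null && !result.isEmpty()) {
          return result;
        }
      } catch (RuntimeException e) {
        // A failed lookup must not mask the original deployment error; try again.
      }
      if (attempt < maxAttempts) {
        try {
          Thread.sleep(delay.toMillis());
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt(); // restore the flag and give up
          break;
        }
      }
    }
    return List.of();
  }

  /** Keeps failed operations that have a target resource and are not older than notBefore. */
  static List<Op> recentFailures(List<Op> ops, Instant notBefore) {
    List<Op> kept = new ArrayList<>();
    for (Op op : ops) {
      if (!"Failed".equalsIgnoreCase(op.state)) continue;  // skip Succeeded etc.
      if (op.targetResource == null) continue;             // skip null-target noise
      // Failures with no timestamp are kept defensively rather than dropped.
      if (op.timestamp != null && op.timestamp.isBefore(notBefore)) continue;
      kept.add(op);
    }
    return kept;
  }
}
```

A caller in this sketch would record the time just before starting the deployment, pass it as notBefore, and invoke fetchWithRetry with 3 and Duration.ofSeconds(1) from inside the catch block. Returning an empty list on exhaustion leaves the original exception message unchanged.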

Extracts the statusMessage rendering and failure extraction into
AzureDeploymentOperation so the existing checkDeploymentOperationStatus
poller and the new failure path share the same logic.