
Leverage Deployment Stacks for idempotency #30

@arnaudlh

Summary

The current destroy flow in .github/workflows/git-ape-destroy.exampleyml deletes a single resource group (az group delete) and then performs a narrow sweep for subscription-scope Microsoft.Authorization/* resources (role and policy assignments) discovered via az deployment operation sub list.

This works for the single-RG Key Vault template we ship today, but it is not idempotent once a deployment spans more than one resource group, creates subscription/MG-scope resources via nested deployments, or creates soft-deletable services (Key Vault, APIM, Log Analytics, App Configuration, Cognitive Services, Recovery Services, ML workspace, …).

Observed concretely after running @git-ape destroy deployment deploy-20260423-092136 (single-RG Key Vault with purge protection): the RG is gone but the Key Vault remains soft-deleted at subscription scope for 90 days and cannot be purged (purge protection enabled). Re-running the exact same template will fail with VaultAlreadyExists until retention expires — destroy + redeploy is not idempotent.
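The lingering soft-deleted vault can be confirmed (and, when purge protection is off, purged) from the CLI. A minimal sketch — the helper name and vault name are illustrative; the real destroy flow would read the name from state:

```shell
# Sketch: detect whether a Key Vault lingers in the soft-deleted state.
# Soft-deleted vaults live at subscription scope, outside any resource group.
check_soft_deleted() {
  local vault_name="$1"
  local hits
  hits=$(az keyvault list-deleted --resource-type vault \
           --query "length([?name=='${vault_name}'])" -o tsv)
  if [ "${hits}" -gt 0 ]; then
    echo "soft-deleted"
  else
    echo "not-soft-deleted"
  fi
}

# Example (against a real subscription):
#   check_soft_deleted "kv-example"
# Purging only succeeds when purge protection is disabled; with it enabled,
# the vault is stuck until retention (90 days here) expires:
#   az keyvault purge --name "kv-example"
```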

Orphan categories a "delete the RG" strategy can leave behind

| # | Category | Example |
|---|---|---|
| 1 | Soft-deleted data services | Key Vault, APIM, Cognitive Services, App Configuration, Log Analytics workspace, Recovery Services vault, ML workspace |
| 2 | Purge-protected resources | Key Vault with enablePurgeProtection: true |
| 3 | Multiple resource groups | Template creates rg-app + rg-data; only one is tracked in state.resourceGroup |
| 4 | Subscription-scope role assignments created via nested deployments | Not always enumerable through az deployment operation sub list |
| 5 | Subscription-scope policy assignments / definitions / exemptions | Same as above |
| 6 | Management-group-scope resources | Custom policies, role assignments at MG scope |
| 7 | Cross-RG resources from nested deployments | VNet peering in a hub RG, DNS record in a shared DNS RG, secret in a shared KV |
| 8 | Cross-subscription nested deployments | Destroy runs against one subscription only |
| 9 | Tenant / Entra ID objects | App registrations, directory groups |
| 10 | Backup protected items / recovery points in cross-RG Recovery Services vaults | Survive source-RG delete |
| 11 | Subscription-level diagnostic settings | microsoft.insights/diagnosticSettings at sub scope |
| 12 | Subscription budgets & cost alerts | Microsoft.Consumption/budgets |
| 13 | Resource locks | Don't orphan, but block deletion and leave partial state |
| 14 | Remote-side references | Approved Private Endpoint connections on a shared service, remote VNet peerings, DNS records in shared zones |
| 15 | Subscription deployment-history entries | Accumulate toward the 800-per-scope deployment limit |

Proposed approach — two layers

Layer A — Azure Deployment Stacks (primary, for new deployments)

Deployment Stacks natively track every resource in a deployment regardless of scope.

  • Replace az deployment sub create with az stack sub create --action-on-unmanage deleteAll --deny-settings-mode denyDelete in git-ape-deploy.exampleyml.
  • Stack name = deployment id; store it in state.stackId.
  • Destroy becomes a single az stack sub delete --action-on-unmanage deleteAll, covering multi-RG, sub-scope, and MG-scope uniformly.
  • Remaining gaps to handle explicitly: soft-delete purge (1, 2) and remote-side references (14) — stacks don't handle either.
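Wrapped as workflow helpers, the swap amounts to roughly two commands. A sketch with hypothetical function names (the deployment id doubles as the stack name, per the bullet above):

```shell
# Deploy via a subscription-scope deployment stack: the stack records every
# resource it creates, at every scope, so destroy can be a single call.
# Args: <deployment-id> <location> <template-file>
deploy_stack() {
  az stack sub create \
    --name "$1" \
    --location "$2" \
    --template-file "$3" \
    --action-on-unmanage deleteAll \
    --deny-settings-mode denyDelete
}

# Destroy: one call removes everything the stack tracks (multi-RG,
# sub-scope, MG-scope alike). Soft-delete purge still needs Layer B.
# Args: <deployment-id>
destroy_stack() {
  az stack sub delete \
    --name "$1" \
    --action-on-unmanage deleteAll \
    --yes
}
```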

Layer B — State-driven fallback (retrofits existing + legacy deployments)

For pre-stack deployments and cases where stacks can't be used:

  1. Capture-at-deploy: walk the deployment-operation graph recursively (root + every nested op) and emit a flat list of every targetResource.id into state.managedResources[] with {id, type, scope, apiVersion, softDeletable, purgeProtected}. Also populate state.resourceGroups[], state.subscriptions[], state.externalReferences[], state.stackId (nullable).
  2. Destroy algorithm (idempotent):
    1. If stackId present → az stack sub delete; skip to step 7.
    2. Topologically sort managedResources[] (locks → role/policy assignments → children → parents → RGs).
    3. For each resource: az resource show → if 404 mark already-gone; else delete; retry transient.
    4. For each RG in resourceGroups[]: az group delete --yes.
    5. For each softDeletable[] entry: list soft-deleted → purge if purgeProtected=false, else record retained-soft-deleted with expiry date.
    6. Probe externalReferences[] for remote-side leftovers (stale PE connections, peerings, DNS records).
    7. Delete subscription deployment-history entry for the deployment.
    8. Write terminal status with per-resource outcome. Re-runs converge to the same end state.
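Step 3 is the heart of the idempotency claim: a delete that treats "already gone" as success, so re-runs converge. A sketch — the function name and the id-list file are illustrative; the real list would come from state.managedResources[]:

```shell
# Delete one resource by ARM id, idempotently.
# Prints: already-gone | deleted | failed (caller retries transient failures).
delete_resource() {
  local id="$1"
  if ! az resource show --ids "${id}" >/dev/null 2>&1; then
    echo "already-gone"     # show failed (e.g. 404): nothing left to do
  elif az resource delete --ids "${id}" >/dev/null 2>&1; then
    echo "deleted"
  else
    echo "failed"
  fi
}

# Example loop over a pre-sorted id list (locks/assignments first, RGs last):
#   while read -r id; do
#     echo "${id}: $(delete_resource "${id}")"
#   done < sorted-resource-ids.txt
```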

Proposed schema changes

Extend state.json:

```json
{
  "stackId": "string | null",
  "managedResources": [
    {
      "id": "/subscriptions/.../Microsoft.KeyVault/vaults/foo",
      "type": "Microsoft.KeyVault/vaults",
      "scope": "resourceGroup",
      "apiVersion": "2024-11-01",
      "softDeletable": true,
      "purgeProtected": true
    }
  ],
  "resourceGroups": ["rg-app", "rg-data"],
  "subscriptions": ["<subId>"],
  "externalReferences": [
    { "kind": "privateEndpointConnection", "targetResourceId": "..." }
  ]
}
```

Extend metadata.json: rename resourceGroup (string) to resourceGroups (array of strings). Add a scope field allowing subscription | managementGroup.
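A sketch of the affected fields of the extended metadata.json after this change (other fields elided):

```json
{
  "resourceGroups": ["rg-app", "rg-data"],
  "scope": "subscription"
}
```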

Add new status values to docs/DEPLOYMENT_STATE.md: retained-soft-deleted, partially-destroyed.

Implementation phases

  • Phase 1 — Schema & state capture: extend state.json / metadata.json, update DEPLOYMENT_STATE.md, update azure-template-generator.agent.md / deploy agent to walk deployment operations after deploy and populate managedResources[].
  • Phase 2 — Deployment Stacks integration: add deployMethod toggle (default stack) in requirements gathering; stack create in git-ape-deploy.exampleyml; stack-delete branch in git-ape-destroy.exampleyml.
  • Phase 3 — Fallback hardening: extract destroy logic into .github/scripts/destroy.sh (or .ps1) implementing the idempotent algorithm above; add soft-delete purge loop + remote-reference probe.
  • Phase 4 — Validation: fixture deployment with 2 RGs + purge-protected KV + sub-scope role assignment + cross-RG reference; destroy → re-run destroy (must be already-destroyed); stack-vs-fallback parity; soft-delete replay (redeploy succeeds once retention allows).

Out of scope

  • Entra ID / app-registration cleanup (requires Graph permissions; separate issue).
  • Data-plane cleanup (KV secrets, blob contents — gone with control plane).
  • Management-group-scope deployments (noted but deferred).

Open questions for discussion

  1. Stacks opt-in or default? Recommend stack as the default for new deployments, keeping sub-deployment as an explicit fallback. Stacks are GA.
  2. Auto-purge non-protected soft-deleted resources? Recommend yes on destroy (never purge protected); surface both in the summary. Alternative: require an explicit --purge-soft-deleted flag.
  3. Clean up deployment-history entries after destroy? Recommend yes (to stay well below the 800/scope cap).
  4. Scope of this work: single issue or should each phase be split into its own issue once we align on direction?

Reproduction

  1. Deploy the included Key Vault + private endpoint template (.azure/deployments/deploy-20260423-092136).
  2. Run @git-ape destroy deployment deploy-20260423-092136.
  3. Observe: RG is deleted; Key Vault remains soft-deleted at subscription scope; purge protection prevents purge; redeploying with the same name fails until retention expires.

Happy to open a draft PR for Phase 1 (schema + capture) as the foundation once we align on the two-layer direction.
