
Leverage Deployment Stacks for idempotency #30

@arnaudlh

Summary

The current destroy flow in .github/workflows/git-ape-destroy.exampleyml deletes a single resource group (az group delete) and then performs a narrow sweep for subscription-scope Microsoft.Authorization/* resources (role and policy assignments) discovered via az deployment operation sub list.

This works for the single-RG Key Vault template we ship today, but it is not idempotent once a deployment spans more than one resource group, creates subscription/MG-scope resources via nested deployments, or creates soft-deletable services (Key Vault, APIM, Log Analytics, App Configuration, Cognitive Services, Recovery Services, ML workspace, …).

Observed concretely after running @git-ape destroy deployment deploy-20260423-092136 (single-RG Key Vault with purge protection): the RG is gone but the Key Vault remains soft-deleted at subscription scope for 90 days and cannot be purged (purge protection enabled). Re-running the exact same template will fail with VaultAlreadyExists until retention expires — destroy + redeploy is not idempotent.
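The lingering soft-deleted vault can be confirmed (and, when purge protection is off, purged) from the CLI. A minimal sketch — the helper name and vault name are illustrative; the real destroy flow would read the name from state:

```shell
# Sketch: detect whether a Key Vault lingers in the soft-deleted state.
# Soft-deleted vaults live at subscription scope, outside any resource group.
check_soft_deleted() {
  local vault_name="$1"
  local hits
  hits=$(az keyvault list-deleted --resource-type vault \
           --query "length([?name=='${vault_name}'])" -o tsv)
  if [ "${hits}" -gt 0 ]; then
    echo "soft-deleted"
  else
    echo "not-soft-deleted"
  fi
}

# Example (against a real subscription):
#   check_soft_deleted "kv-example"
# Purging only succeeds when purge protection is disabled; with it enabled,
# the vault is stuck until retention (90 days here) expires:
#   az keyvault purge --name "kv-example"
```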

Orphan categories a "delete the RG" strategy can leave behind

| # | Category | Example |
|---|---|---|
| 1 | Soft-deleted data services | Key Vault, APIM, Cognitive Services, App Configuration, Log Analytics workspace, Recovery Services vault, ML workspace |
| 2 | Purge-protected resources | Key Vault with enablePurgeProtection: true |
| 3 | Multiple resource groups | Template creates rg-app + rg-data; only one is tracked in state.resourceGroup |
| 4 | Subscription-scope role assignments created via nested deployments | Not always enumerable through az deployment operation sub list |
| 5 | Subscription-scope policy assignments / definitions / exemptions | Same as above |
| 6 | Management-group-scope resources | Custom policies, role assignments at MG scope |
| 7 | Cross-RG resources from nested deployments | VNet peering in a hub RG, DNS record in a shared DNS RG, secret in a shared KV |
| 8 | Cross-subscription nested deployments | Destroy runs against one subscription only |
| 9 | Tenant / Entra ID objects | App registrations, directory groups |
| 10 | Backup protected items / recovery points in cross-RG Recovery Services vaults | Survive source-RG delete |
| 11 | Subscription-level diagnostic settings | microsoft.insights/diagnosticSettings at sub scope |
| 12 | Subscription budgets & cost alerts | Microsoft.Consumption/budgets |
| 13 | Resource locks | Don't orphan, but block deletion and leave partial state |
| 14 | Remote-side references | Approved Private Endpoint connections on a shared service, remote VNet peerings, DNS records in shared zones |
| 15 | Subscription deployment-history entries | Accumulate toward the 800-per-scope deployment limit |

Proposed approach — two layers

Layer A — Azure Deployment Stacks (primary, for new deployments)

Deployment Stacks natively track every resource in a deployment regardless of scope.

  • Replace az deployment sub create with az stack sub create --action-on-unmanage deleteAll --deny-settings-mode denyDelete in git-ape-deploy.exampleyml.
  • Stack name = deployment id; store it in state.stackId.
  • Destroy becomes a single az stack sub delete --action-on-unmanage deleteAll, covering multi-RG, sub-scope, and MG-scope uniformly.
  • Remaining gaps to handle explicitly: soft-delete purge (1, 2) and remote-side references (14) — stacks don't handle either.
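Wrapped as workflow helpers, the swap amounts to roughly two commands. A sketch with hypothetical function names (the deployment id doubles as the stack name, per the bullet above):

```shell
# Deploy via a subscription-scope deployment stack: the stack records every
# resource it creates, at every scope, so destroy can be a single call.
# Args: <deployment-id> <location> <template-file>
deploy_stack() {
  az stack sub create \
    --name "$1" \
    --location "$2" \
    --template-file "$3" \
    --action-on-unmanage deleteAll \
    --deny-settings-mode denyDelete
}

# Destroy: one call removes everything the stack tracks (multi-RG,
# sub-scope, MG-scope alike). Soft-delete purge still needs Layer B.
# Args: <deployment-id>
destroy_stack() {
  az stack sub delete \
    --name "$1" \
    --action-on-unmanage deleteAll \
    --yes
}
```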

Layer B — State-driven fallback (retrofits existing + legacy deployments)

For pre-stack deployments and cases where stacks can't be used:

  1. Capture-at-deploy: walk the deployment-operation graph recursively (root + every nested op) and emit a flat list of every targetResource.id into state.managedResources[] with {id, type, scope, apiVersion, softDeletable, purgeProtected}. Also populate state.resourceGroups[], state.subscriptions[], state.externalReferences[], state.stackId (nullable).
  2. Destroy algorithm (idempotent):
    1. If stackId present → az stack sub delete; skip to step 7.
    2. Topologically sort managedResources[] (locks → role/policy assignments → children → parents → RGs).
    3. For each resource: az resource show → if 404 mark already-gone; else delete; retry transient.
    4. For each RG in resourceGroups[]: az group delete --yes.
    5. For each softDeletable[] entry: list soft-deleted → purge if purgeProtected=false, else record retained-soft-deleted with expiry date.
    6. Probe externalReferences[] for remote-side leftovers (stale PE connections, peerings, DNS records).
    7. Delete subscription deployment-history entry for the deployment.
    8. Write terminal status with per-resource outcome. Re-runs converge to the same end state.
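Step 3 is the heart of the idempotency claim: a delete that treats "already gone" as success, so re-runs converge. A sketch — the function name and the id-list file are illustrative; the real list would come from state.managedResources[]:

```shell
# Delete one resource by ARM id, idempotently.
# Prints: already-gone | deleted | failed (caller retries transient failures).
delete_resource() {
  local id="$1"
  if ! az resource show --ids "${id}" >/dev/null 2>&1; then
    echo "already-gone"     # show failed (e.g. 404): nothing left to do
  elif az resource delete --ids "${id}" >/dev/null 2>&1; then
    echo "deleted"
  else
    echo "failed"
  fi
}

# Example loop over a pre-sorted id list (locks/assignments first, RGs last):
#   while read -r id; do
#     echo "${id}: $(delete_resource "${id}")"
#   done < sorted-resource-ids.txt
```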

Proposed schema changes

Extend state.json:

```json
{
  "stackId": "string | null",
  "managedResources": [
    {
      "id": "/subscriptions/.../Microsoft.KeyVault/vaults/foo",
      "type": "Microsoft.KeyVault/vaults",
      "scope": "resourceGroup",
      "apiVersion": "2024-11-01",
      "softDeletable": true,
      "purgeProtected": true
    }
  ],
  "resourceGroups": ["rg-app", "rg-data"],
  "subscriptions": ["<subId>"],
  "externalReferences": [
    { "kind": "privateEndpointConnection", "targetResourceId": "..." }
  ]
}
```

Extend metadata.json: rename resourceGroup (string) to resourceGroups (array of strings). Add a scope field allowing subscription | managementGroup.
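A sketch of the affected fields of the extended metadata.json after this change (other fields elided):

```json
{
  "resourceGroups": ["rg-app", "rg-data"],
  "scope": "subscription"
}
```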

Add new status values to docs/DEPLOYMENT_STATE.md: retained-soft-deleted, partially-destroyed.

Implementation phases

  • Phase 1 — Schema & state capture: extend state.json / metadata.json, update DEPLOYMENT_STATE.md, update azure-template-generator.agent.md / deploy agent to walk deployment operations after deploy and populate managedResources[].
  • Phase 2 — Deployment Stacks integration: add deployMethod toggle (default stack) in requirements gathering; stack create in git-ape-deploy.exampleyml; stack-delete branch in git-ape-destroy.exampleyml.
  • Phase 3 — Fallback hardening: extract destroy logic into .github/scripts/destroy.sh (or .ps1) implementing the idempotent algorithm above; add soft-delete purge loop + remote-reference probe.
  • Phase 4 — Validation: fixture deployment with 2 RGs + purge-protected KV + sub-scope role assignment + cross-RG reference; destroy → re-run destroy (must be already-destroyed); stack-vs-fallback parity; soft-delete replay (redeploy succeeds once retention allows).

Out of scope

  • Entra ID / app-registration cleanup (requires Graph permissions; separate issue).
  • Data-plane cleanup (KV secrets, blob contents — gone with control plane).
  • Management-group-scope deployments (noted but deferred).

Open questions for discussion

  1. Stacks opt-in or default? Recommend stack as the default for new deployments, keeping sub-deployment as an explicit fallback. Stacks are GA.
  2. Auto-purge non-protected soft-deleted resources? Recommend yes on destroy (never purge protected); surface both in the summary. Alternative: require an explicit --purge-soft-deleted flag.
  3. Clean up deployment-history entries after destroy? Recommend yes (to stay well below the 800/scope cap).
  4. Scope of this work: single issue or should each phase be split into its own issue once we align on direction?

Reproduction

  1. Deploy the included Key Vault + private endpoint template (.azure/deployments/deploy-20260423-092136).
  2. Run @git-ape destroy deployment deploy-20260423-092136.
  3. Observe: RG is deleted; Key Vault remains soft-deleted at subscription scope; purge protection prevents purge; redeploying with the same name fails until retention expires.

Happy to open a draft PR for Phase 1 (schema + capture) as the foundation once we align on the two-layer direction.
