Companion to AZURE_DEPLOY.md (the Dev guide). This file covers the
Prod variant: Azure Database for PostgreSQL Flexible Server with HA +
backups, private networking, Container Apps Job for managed scheduling,
remote Terraform state, audit logging, and a Dev → Prod data migration
path.
Pre-read: Make sure your Dev environment is working end-to-end first. The Prod IaC follows the same shape — provisioning unfamiliar production Azure resources is the wrong time to debug Docker.
| Layer | Dev | Prod |
|---|---|---|
| PG hosting | Container Apps + postgres:16-alpine + Azure Files | Azure Database for PostgreSQL Flexible Server (managed) |
| PG SLA | None (single container) | 99.9% / 99.95% / 99.99% (HA Disabled / SameZone / ZoneRedundant) |
| Backups | None | PITR 7-35 days, optional geo-redundant |
| Networking | Public ACR + public KV | Private VNet + private endpoints for PG, ACR, KV |
| ACR SKU | Basic ($7/mo) | Premium ($70/mo) — content trust, retention policy, private link |
| Compute | ACI per-run | Container Apps Job in the same VNet (managed scheduling) |
| Image build | Docker on GitHub runner | ACR Tasks remote build (works with private ACR) |
| State | Local or single Azure blob | Remote state with state locking required |
| Audit | Container logs only | PG/KV/ACR diagnostic settings → Log Analytics (90 days) |
| CI/CD gate | None | GitHub production Environment with required reviewers |
| Destroy guard | One click | Typed PROD-DESTROY-CONFIRM + reviewer approval |
| Component | Default | With HA ZoneRedundant + geo-backups |
|---|---|---|
| PG Flexible Server (GP_Standard_D2s_v3, 32 GB) | ~AUD 180/mo | ~AUD 280/mo |
| PG backups (14 days) | ~AUD 5/mo | ~AUD 7/mo (geo) |
| ACR Premium | ~AUD 70/mo | same |
| VNet + private endpoints (3) | ~AUD 25/mo | same |
| Key Vault Premium | ~AUD 1/mo | same |
| Log Analytics (90 d, ~10 GB/mo) | ~AUD 15/mo | same |
| Container Apps env + Job (idle) | ~AUD 5/mo | same |
| Total | ~AUD 300/mo | ~AUD 400/mo |
Cost dials in variables.tf:
pg_sku— drop toB1msfor non-production proofs (~AUD 25/mo, no SLA)ha_mode—Disabledsaves ~AUD 100/mo; raise only when you're readygeo_redundant_backups— adds ~30% to backup cost, enables cross-region restorebackup_retention_days— 7 minimum, 35 maximum, linear cost increase
Production state MUST live in remote Azure storage with state locking.
# Replace REGION and PROD_SUB with your values.
REGION="australiaeast"
SUFFIX=$(openssl rand -hex 3)
az group create --name rg-tfstate --location $REGION
az storage account create \
--name "sttfstateprod$SUFFIX" \
--resource-group rg-tfstate \
--location $REGION \
--sku Standard_LRS \
--encryption-services blob \
--min-tls-version TLS1_2
az storage container create \
--account-name "sttfstateprod$SUFFIX" \
--name tfstate \
--auth-mode login
echo "Now uncomment the backend block in infra/terraform-prod/main.tf"
echo "Set storage_account_name = 'sttfstateprod$SUFFIX'"Edit infra/terraform-prod/main.tf and uncomment the backend "azurerm"
block, replacing the placeholder values.
Use a SEPARATE SP from Dev. Scope it tightly.
APP_NAME="sp-te-github-actions-prod"
REPO="YOUR_GITHUB_ORG/Migration-using-ai"
APP_ID=$(az ad app create --display-name "$APP_NAME" --query appId -o tsv)
az ad sp create --id "$APP_ID"
# Grant Contributor scoped to a future RG (replace name after first apply
# OR grant on subscription temporarily, then narrow scope).
SUB_ID=$(az account show --query id -o tsv)
az role assignment create \
--assignee "$APP_ID" \
--role Contributor \
--scope "/subscriptions/$SUB_ID"
# Required additional role to assign roles to the runner identity:
az role assignment create \
--assignee "$APP_ID" \
--role "User Access Administrator" \
--scope "/subscriptions/$SUB_ID"
# Federated credentials for both main branch and PR validation
az ad app federated-credential create \
--id "$APP_ID" \
--parameters "{
\"name\": \"github-main\",
\"issuer\": \"https://token.actions.githubusercontent.com\",
\"subject\": \"repo:$REPO:environment:production\",
\"audiences\": [\"api://AzureADTokenExchange\"]
}"
echo "AZURE_PROD_CLIENT_ID=$APP_ID"
echo "AZURE_TENANT_ID=$(az account show --query tenantId -o tsv)"
echo "AZURE_PROD_SUBSCRIPTION_ID=$SUB_ID"Note the federated credential uses environment:production (not a
branch) — this binds the SP to GitHub Environment approval.
- Repo -> Settings -> Environments -> New environment: name
production. - Add Required reviewers (at least one other person).
- Add Deployment branch rule:
mainonly. - In environment secrets, add:
AZURE_PROD_CLIENT_IDAZURE_PROD_SUBSCRIPTION_ID(Tenant ID stays in repo-level secrets — same as Dev.)
- GitHub -> Actions -> Azure infra — terraform apply (Prod).
- Click Run workflow -> action:
plan-> Run. Wait for an approving reviewer. - Review the plan output. Re-run with action:
applyonce reviewed. - ~10-15 minutes (PG Flex provisioning + private endpoints are slow).
- GitHub -> Actions -> Azure migration — build, push, run (Prod).
- Click Run workflow:
- action:
deploy(start with just the schema) - change_ticket: e.g.
CHG-2026-001(logged in run summary)
- action:
- Wait for reviewer approval.
- The workflow:
- Builds the image via
az acr build(remote build inside Azure VNet) - Updates the Container Apps Job image tag
- Triggers a single execution of the Job
- Streams logs from Log Analytics
- Exits 0 only if the Job status is
Succeeded
- Builds the image via
# Get the RG name
RG=$(az group list --tag environment=prod \
--query "[?starts_with(name, 'rg-te-prod-')].name | [0]" -o tsv)
# Check job execution history
az containerapp job execution list \
--resource-group "$RG" \
--name job-migration-prod \
-o table
# PG sanity check from any VM in the VNet:
psql "host=<pg_fqdn> port=5432 user=pgadmin dbname=te_prod sslmode=require"
\dt te_prod.*PG, ACR, and KV are all private-endpoint-only. To reach them from your laptop:
- Quickest — Azure Cloud Shell (already in the tenant, can reach private endpoints with peering)
- Standard — deploy a small VM in
snet-endpoints, SSH in, thenpsql/azfrom there - Best for teams — Azure Bastion to a Linux jump-box in the VNet
- Persistent — Site-to-site VPN or ExpressRoute (use existing corporate connectivity)
Add the jump-box to main.tf if you want it as code; the VNet outputs
the subnet IDs you need.
When you've validated everything in Dev and want to push the schema + some seed data to Prod for the first time:
# 1. Dump just the schema from Dev (no data)
pg_dump --schema-only --no-owner --no-privileges \
"host=<dev_pg> user=postgres dbname=te_dev" \
> te_schema.sql
# 2. Apply to Prod from a VM inside the Prod VNet
psql "host=<prod_pg> user=pgadmin dbname=te_prod sslmode=require" \
-v ON_ERROR_STOP=1 -f te_schema.sql
# 3. Trigger the prod migration workflow with action=load to populate
# via the framework's normal load_input_data.sql path.# 1. Dump schema + only the reference tables (not transactional data)
pg_dump --schema-only "..." > schema.sql
pg_dump --data-only --table=te_dev.organisations \
--table=te_dev.programs \
--table=te_dev.classifications \
"..." > reference.sql
# 2. Rename te_dev schema references to te_prod in both files
sed -i 's/te_dev/te_prod/g' schema.sql reference.sql
# 3. Apply both, in order
psql "..." -v ON_ERROR_STOP=1 -f schema.sql
psql "..." -v ON_ERROR_STOP=1 -f reference.sqlFor early prod proof-of-concept only. Real prod should have its own data.
pg_dump "host=<dev_pg> ..." > full_dev.sql
sed -i 's/te_dev/te_prod/g' full_dev.sql
psql "host=<prod_pg> ..." -f full_dev.sqlPG Flex backups run automatically (PITR). To restore:
# 1. Take note of the point in time you want to restore to
# 2. Create a new server from the backup
az postgres flexible-server restore \
--resource-group "$RG" \
--name "psql-te-prod-restored-$(date +%Y%m%d)" \
--source-server psql-te-prod-<suffix> \
--restore-time "2026-06-15T14:30:00Z"
# 3. Update DNS / connection strings to point at the new server
# 4. (Optional) Delete the original server once you're sureIf geo_redundant_backups = true, you can also --restore-time in a
paired region for region-failure DR.
The cleanest rollback for a bad migration:
-
Code-level — re-run the workflow with the previous image tag:
Action: deploy Image tag: <previous-7-char-sha> Change ticket: ROLLBACK-CHG-<original> -
Data-level — PITR restore to a moment before the change.
-
Infra-level —
terraform applyon the previous commit reverts IaC changes. Useterraform refreshfirst to capture drift.
For schema changes, the framework's ON CONFLICT DO NOTHING discipline
means re-running an older deploy is generally safe. New columns added
by the bad migration will linger until you DROP them explicitly.
Prod ships logs/metrics to Log Analytics for:
- PG queries:
AzureDiagnostics | where Category == "PostgreSQLLogs" - PG sessions:
AzureDiagnostics | where Category == "PostgreSQLFlexSessions" - KV access:
AzureDiagnostics | where Category == "AuditEvent" - ACR pulls/pushes:
ContainerRegistryRepositoryEvents | order by TimeGenerated desc - Job runs:
ContainerAppConsoleLogs_CL | where ContainerJobName_s == "job-migration-prod"
Suggested alerts (add via azurerm_monitor_metric_alert):
- PG CPU > 80% for 15 minutes
- PG storage > 85%
- PG failed connections > 10/min
- Migration job failed
- KV secret read failure
- Remote Terraform state with state locking enabled
- Separate Prod service principal, scoped to the prod RG
- GitHub
productionenvironment with ≥ 1 required reviewer -
ha_mode = "ZoneRedundant"once budget permits -
geo_redundant_backups = trueif DR matters - Backups tested via a real restore at least once
- Diagnostic settings retention ≥ 90 days
- Azure Monitor alerts for PG health
- Jump-box VM or Bastion documented for break-glass access
- Runbook for "PG password rotation" (rotate the KV secret, restart Job)
- Quarterly review of role assignments — drop anyone who's left
AZURE_DEPLOY.md— Dev deployment guideinfra/terraform-prod/— Prod IaCinfra/terraform-prod/variables.tf— cost/HA dials.github/workflows/azure-prod-*.yml— Prod CI/CDARCHITECTURE.md— overall framework architectureVCRM.md— verification cross-reference m