Operations Runbook
Day-2 operations guide for maintaining, troubleshooting, and recovering the Netlix Platform.
# Dev cluster
aws eks update-kubeconfig --region eu-central-1 --name netlix-dev
# Staging cluster
aws eks update-kubeconfig --region eu-central-1 --name netlix-staging
# Verify
kubectl get nodes

export VAULT_ADDR="https://netlix-vault-public-vault-....hashicorp.cloud:8200"
export VAULT_NAMESPACE="admin"
# Admin login (dev only — userpass auth)
vault login -method=userpass username=<admin>
# Switch to environment namespace
export VAULT_NAMESPACE="admin/dev"
vault secrets list

# Via URL
open https://argocd.dev.netlix.dev
# Via port-forward (if DNS unavailable)
kubectl port-forward svc/argocd-server -n argocd 8080:443
# Get initial admin password
kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d

The normal flow is fully automated:
- Push code changes to the `dev` branch
- CD pipeline builds, scans, pushes image to GHCR
- CD pipeline updates Kustomize overlay with new image tag
- ArgoCD detects change and syncs
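
When the flow stalls, comparing the tag pinned in the overlay with the commit that was pushed usually shows which step failed. A minimal sketch, assuming the overlay's `kustomization.yaml` uses the `images:` entry layout that `kustomize edit set image` writes; the helper name `overlay_image_tag` is hypothetical:

```shell
# Print the tag pinned for one image name in a kustomization.yaml.
# Assumed layout (as written by `kustomize edit set image`):
#   images:
#   - name: netlix-web
#     newName: ghcr.io/timkrebs/netlix-platform/web
#     newTag: dev-<sha>
overlay_image_tag() {
  image_name="$1"
  kustomization="${2:-kustomization.yaml}"
  awk -v want="$image_name" '
    # Remember which image entry we are inside.
    /^[[:space:]]*-[[:space:]]*name:/ { current = $NF }
    # Print the tag of the matching entry, without surrounding quotes.
    /^[[:space:]]*newTag:/ && current == want { gsub(/"/, "", $NF); print $NF; exit }
  ' "$kustomization"
}

# Usage (run from the repository root):
#   overlay_image_tag netlix-web app/overlays/dev/kustomization.yaml
```

If the printed tag does not match the pushed commit, the CD pipeline did not update the overlay; if it does match, look at the ArgoCD sync instead.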
Manual override (if automation fails):
cd app/overlays/dev
kustomize edit set image netlix-web=ghcr.io/timkrebs/netlix-platform/web:dev-<sha>
git add . && git commit -m "chore: manual image tag update" && git push

gh workflow run promote.yaml -f from=dev -f to=staging
# Review and merge the created PR

gh workflow run promote.yaml -f from=staging -f to=main
# Review and merge the created PR
# Tag the release
git checkout main && git pull
git tag -a v1.x.x -m "release: v1.x.x — description"
git push origin v1.x.x

VSO (Vault Secrets Operator) handles certificate rotation automatically. To force rotation:
# Delete the TLS secret — VSO will recreate it
kubectl delete secret netlix-tls-cert -n consul
# Verify new cert was issued
kubectl get secret netlix-tls-cert -n consul -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -dates

Vault dynamic credentials rotate automatically at 67% TTL. To force rotation:
# Revoke all leases for the database role
export VAULT_NAMESPACE="admin/dev"
vault lease revoke -prefix database/creds/netlix-app
# Pods will get new credentials on next VSO sync cycle

# Temporary manual scaling (reverts on next Terraform apply)
aws eks update-nodegroup-config \
--cluster-name netlix-dev \
--nodegroup-name <node-group-name> \
--scaling-config minSize=3,maxSize=8,desiredSize=5
# Permanent scaling — update deployments.tfdeploy.hcl:
# node_min_size = 3
# node_max_size = 8
# node_desired_size = 5
# Push to trigger Stack update

Symptom: ERR_NAME_NOT_RESOLVED for app.dev.netlix.dev or argocd.dev.netlix.dev
Check Route53 records:
aws route53 list-resource-record-sets --hosted-zone-id Z03825243OZJVWRUDJ5T \
--query "ResourceRecordSets[?Name=='app.dev.netlix.dev.']"

Check ExternalDNS:
kubectl logs -n kube-system -l app.kubernetes.io/name=external-dns --tail=50

Check Ingress:
kubectl get ingress -n consul
kubectl describe ingress -n consul

Flush local DNS cache (macOS):
sudo dscacheutil -flushcache
sudo killall -HUP mDNSResponder

Root cause to check: If the Route53 zone was recreated as a Terraform-managed resource, its nameservers changed and the registrar delegation broke. The zone MUST be a data source. See ADR-003.
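
To confirm or rule out the delegation root cause, compare the nameservers the registrar delegates to with the ones assigned to the hosted zone. A minimal sketch: the helper names are hypothetical, and the domain `netlix.dev` in the usage comment is an assumption about which name is delegated to this zone.

```shell
# Normalize a whitespace-separated nameserver list:
# lowercase, one per line, trailing dot removed, sorted.
normalize_ns() {
  printf '%s\n' "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | tr -s ' \t' '\n\n' \
    | sed -e 's/\.$//' -e '/^$/d' \
    | sort
}

# Succeeds when both lists contain the same nameservers.
ns_lists_match() {
  [ "$(normalize_ns "$1")" = "$(normalize_ns "$2")" ]
}

# Usage (domain is an assumption; zone ID is the one used above):
#   delegated=$(dig +short NS netlix.dev)
#   zone=$(aws route53 get-hosted-zone --id Z03825243OZJVWRUDJ5T \
#            --query 'DelegationSet.NameServers' --output text)
#   ns_lists_match "$delegated" "$zone" || echo "Delegation mismatch: see ADR-003"
```

A mismatch means the registrar still points at the old zone's nameservers, which matches the failure mode described in ADR-003.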
# Check pod status
kubectl get pods -n consul
# Describe failing pod
kubectl describe pod <pod-name> -n consul
# Check events
kubectl get events -n consul --sort-by='.lastTimestamp'
# Check logs
kubectl logs <pod-name> -n consul
kubectl logs <pod-name> -n consul -c envoy-sidecar # Consul sidecar

Common causes:
- Image pull error: Check GHCR credentials and image tag
- OOMKilled: Increase memory limits in deployment
- CrashLoopBackOff: Check application logs
- Pending: Check node capacity and resource quotas
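
The mapping above can be kept as a small lookup for use during triage. This is a sketch only: `triage_hint` is a hypothetical helper, and treating both `ErrImagePull` and `ImagePullBackOff` as the image pull case is an assumption about which reasons the pods report.

```shell
# Map a pod status or container state reason to the first thing to check.
triage_hint() {
  case "$1" in
    ErrImagePull|ImagePullBackOff) echo "Check GHCR credentials and image tag" ;;
    OOMKilled)                     echo "Increase memory limits in deployment" ;;
    CrashLoopBackOff)              echo "Check application logs" ;;
    Pending)                       echo "Check node capacity and resource quotas" ;;
    *)                             echo "No mapping for '$1': review describe output and events" ;;
  esac
}

# Usage:
#   reason=$(kubectl get pod <pod-name> -n consul \
#              -o jsonpath='{.status.containerStatuses[0].state.waiting.reason}')
#   triage_hint "$reason"
```

The waiting reason is empty for running containers; an OOM kill shows up under `lastState.terminated.reason` instead.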
Symptom: Terraform plan/apply fails with 403 Forbidden on Vault operations
Check TFC policy:
export VAULT_NAMESPACE="admin"
vault policy read tfc-policy

Important: The TFC policy MUST be path "*" with full capabilities. Scoped paths do not propagate to child namespaces in HCP Vault. See ADR-005.
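
A quick way to tell whether the policy has drifted from this requirement is to test the policy text for the wildcard stanza. A minimal sketch; `policy_has_wildcard` is a hypothetical helper and it checks only the path, not the capability list.

```shell
# Reads policy HCL on stdin; succeeds if it contains a path "*" stanza.
policy_has_wildcard() {
  grep -Eq '^[[:space:]]*path[[:space:]]+"\*"[[:space:]]*\{'
}

# Usage:
#   export VAULT_NAMESPACE="admin"
#   vault policy read tfc-policy | policy_has_wildcard \
#     || echo "tfc-policy is scoped: see ADR-005"
```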
Emergency restore:
vault login -method=userpass username=<admin>
vault policy write tfc-policy - <<EOF
path "*" {
capabilities = ["create", "read", "update", "delete", "list", "sudo"]
}
EOF

# Check application status
kubectl get application netlix-app -n argocd -o yaml | grep -A10 status
# Check sync result
kubectl get application netlix-app -n argocd -o jsonpath='{.status.sync.status}'
# Force sync
kubectl patch application netlix-app -n argocd --type merge -p '{"operation":{"initiatedBy":{"username":"admin"},"sync":{"revision":"HEAD"}}}'

Check in HCP Terraform UI:
- Navigate to the Stack
- Check the latest deployment run
- Review plan/apply logs for errors
Common issues:
- Ephemeral value errors: Cannot pass `store.varset` values to non-ephemeral component inputs
- Deferred changes: Resources depending on runtime values (e.g., cluster endpoint) may defer to a second apply
- Provider auth failures: Check OIDC token, IAM role trust policy, Vault JWT auth
VPC Rejected Connections alarm:
# Check recent flow log rejections
aws logs filter-log-events \
--log-group-name <flow-log-group> \
--filter-pattern "REJECT" \
--start-time $(date -d '-1 hour' +%s000) \
--limit 20

RDS CPU alarm:
aws cloudwatch get-metric-statistics \
--namespace AWS/RDS \
--metric-name CPUUtilization \
--dimensions Name=DBInstanceIdentifier,Value=netlix-dev \
--start-time $(date -d '-1 hour' -u +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 300 \
--statistics Average

| Capability | Status | Notes |
|---|---|---|
| Infrastructure recreation | Supported | Terraform Stacks can reprovision from code |
| RDS backups | Automated | 7-day retention, point-in-time recovery |
| RDS cross-region replica | Not implemented | Planned (IMPROVEMENTS.md #46) |
| EKS cluster backup (Velero) | Not implemented | Planned (IMPROVEMENTS.md #45) |
| Vault backup | HCP-managed | HCP Vault Dedicated includes automated backups |
| GitOps state | Git repository | All manifests in version control |
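
Before attempting a point-in-time restore, it helps to check that the target time falls inside the 7-day retention window from the table. A minimal sketch using epoch seconds; `within_retention` is a hypothetical helper. The authoritative upper bound is the instance's `LatestRestorableTime`, which can lag the current time by a few minutes.

```shell
# Succeeds when restore_epoch lies within the last retention_days (default 7).
within_retention() {
  restore_epoch="$1"
  now_epoch="$2"
  retention_days="${3:-7}"
  oldest=$(( now_epoch - retention_days * 86400 ))
  [ "$restore_epoch" -ge "$oldest" ] && [ "$restore_epoch" -le "$now_epoch" ]
}

# Usage:
#   within_retention "$restore_epoch" "$(date +%s)" || echo "Outside retention window"
#   aws rds describe-db-instances --db-instance-identifier netlix-dev \
#     --query 'DBInstances[0].LatestRestorableTime' --output text
```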
Point-in-time restore:
aws rds restore-db-instance-to-point-in-time \
--source-db-instance-identifier netlix-dev \
--target-db-instance-identifier netlix-dev-restored \
--restore-time <ISO-8601-timestamp>

From latest snapshot:
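# Sort explicitly: the API does not guarantee that the last list entry is the newest snapshot
aws rds describe-db-snapshots --db-instance-identifier netlix-dev --query "sort_by(DBSnapshots[?Status=='available'], &SnapshotCreateTime)[-1].DBSnapshotIdentifier" --output text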
aws rds restore-db-instance-from-db-snapshot \
--db-instance-identifier netlix-dev-restored \
--db-snapshot-identifier <snapshot-id>

If an environment needs to be rebuilt from scratch:
- Infrastructure: Push to the Stack (or trigger manually in HCP Terraform). All 12 components rebuild in dependency order.
- Vault configuration: Automatically rebuilt by the `vault_config` component (PKI CAs, auth backends, policies, DB engine).
- Kubernetes workloads: ArgoCD syncs from Git after the cluster is ready.
- Datadog: Recreate API key secret and apply DatadogAgent CRD.
- DNS: Route53 zone is a data source (never recreated). ExternalDNS recreates A records from Ingress.
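
Because each rebuild step depends on the previous one being ready, a small polling helper is useful between steps. A minimal sketch; `wait_until` is a hypothetical helper and the attempt counts and delays in the usage comments are placeholders, not tuned values.

```shell
# Run a command up to <attempts> times, sleeping <delay> seconds between tries.
# Succeeds as soon as the command succeeds; fails if every attempt fails.
wait_until() {
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$(( i + 1 ))
    sleep "$delay"
  done
  return 1
}

# Usage:
#   wait_until 60 10 kubectl wait --for=condition=Ready nodes --all --timeout=5s
#   wait_until 30 10 kubectl get application netlix-app -n argocd
```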

- Update `cluster_version` in `deployments.tfdeploy.hcl`
- Test in dev first
- Promote to staging after validation
- Monitor node group rolling update
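
EKS upgrades the control plane one minor version at a time, so it is worth checking the new `cluster_version` against the running version before pushing. A minimal sketch; `is_single_minor_step` is a hypothetical helper and assumes versions in `major.minor` form.

```shell
# Succeeds when <target> is exactly one minor version above <current>.
is_single_minor_step() {
  cur_major=${1%%.*}
  cur_minor=${1#*.}
  new_major=${2%%.*}
  new_minor=${2#*.}
  [ "$cur_major" = "$new_major" ] && [ "$new_minor" -eq $(( cur_minor + 1 )) ]
}

# Usage:
#   current=$(aws eks describe-cluster --name netlix-dev \
#               --query 'cluster.version' --output text)
#   is_single_minor_step "$current" "<new cluster_version>" \
#     || echo "Upgrade must go through each minor version in turn"
```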

- Update `db_engine_version` in `deployments.tfdeploy.hcl`
- For major versions: test with a snapshot restore first
- Apply during low-traffic window (RDS may restart)

- Update chart `version` in the relevant component's `main.tf`
- Run `terraform validate` locally
- Push to dev, verify via ArgoCD
- Promote to staging
#!/bin/bash
echo "=== EKS Nodes ==="
kubectl get nodes
echo "=== Pods (consul namespace) ==="
kubectl get pods -n consul
echo "=== Pods (argocd namespace) ==="
kubectl get pods -n argocd
echo "=== Pods (datadog namespace) ==="
kubectl get pods -n datadog
echo "=== ArgoCD App Status ==="
kubectl get application netlix-app -n argocd -o jsonpath='{.status.sync.status}'
echo ""
echo "=== Ingress ==="
kubectl get ingress -n consul
echo "=== DNS Resolution ==="
nslookup app.dev.netlix.dev
nslookup argocd.dev.netlix.dev
echo "=== App Health ==="
curl -s -o /dev/null -w "%{http_code}" https://app.dev.netlix.dev/health
echo ""
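
The script above only prints output for a human to read. For unattended use, the same checks can be wrapped so the run ends with a pass/fail result. A minimal sketch: `check` and `http_ok` are hypothetical helpers, and treating HTTP 200 as the only healthy response from `/health` is an assumption.

```shell
failures=0

# Run a command quietly and record whether it succeeded.
check() {
  label="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS  $label"
  else
    echo "FAIL  $label"
    failures=$(( failures + 1 ))
  fi
}

# Succeeds only when the URL answers with HTTP 200.
http_ok() {
  [ "$(curl -s -o /dev/null -w '%{http_code}' "$1")" = "200" ]
}

# Usage:
#   check "EKS nodes reachable" kubectl get nodes
#   check "DNS app.dev.netlix.dev" nslookup app.dev.netlix.dev
#   check "App health" http_ok https://app.dev.netlix.dev/health
#   [ "$failures" -eq 0 ]
```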