
Operations Runbook

Tim Krebs edited this page Apr 3, 2026 · 1 revision


Day-2 operations guide for maintaining, troubleshooting, and recovering the Netlix Platform.


Cluster Access

EKS kubeconfig

# Dev cluster
aws eks update-kubeconfig --region eu-central-1 --name netlix-dev

# Staging cluster
aws eks update-kubeconfig --region eu-central-1 --name netlix-staging

# Verify
kubectl get nodes
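Switching between the two clusters can be wrapped in a small shell function so that a mistyped environment fails before any kubeconfig is written. This is only a sketch: `use_cluster` is a hypothetical helper, and it assumes clusters keep the `netlix-<env>` naming and the `eu-central-1` region used in the commands above.

```shell
# Hypothetical helper (not part of the platform tooling).
# Assumes cluster names follow netlix-<env> in eu-central-1.
use_cluster() {
  case "$1" in
    dev|staging) ;;
    *) echo "usage: use_cluster dev|staging" >&2; return 2 ;;
  esac
  aws eks update-kubeconfig --region eu-central-1 --name "netlix-$1"
}
```

After sourcing the function, `use_cluster staging && kubectl get nodes` switches context and verifies access in one step.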

Vault Access

export VAULT_ADDR="https://netlix-vault-public-vault-....hashicorp.cloud:8200"
export VAULT_NAMESPACE="admin"

# Admin login (dev only — userpass auth)
vault login -method=userpass username=<admin>

# Switch to environment namespace
export VAULT_NAMESPACE="admin/dev"
vault secrets list

ArgoCD Access

# Via URL
open https://argocd.dev.netlix.dev

# Via port-forward (if DNS unavailable)
kubectl port-forward svc/argocd-server -n argocd 8080:443

# Get initial admin password
kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d

Common Operations

Deploy a New Application Version

The normal flow is fully automated:

  1. Push code changes to dev branch
  2. CD pipeline builds, scans, pushes image to GHCR
  3. CD pipeline updates Kustomize overlay with new image tag
  4. ArgoCD detects change and syncs

Manual override (if automation fails):

cd app/overlays/dev
kustomize edit set image netlix-web=ghcr.io/timkrebs/netlix-platform/web:dev-<sha>
git add . && git commit -m "chore: manual image tag update" && git push
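For the manual override, the image argument can be assembled by a helper so that a malformed tag is caught before it is committed. The following is a minimal sketch: `image_ref` is a hypothetical function, and the assumption that `<sha>` is a lowercase hexadecimal commit hash is mine, not something the pipeline definition confirms.

```shell
# Hypothetical helper: builds the argument for "kustomize edit set image".
# Assumes tags have the form <env>-<sha> with a lowercase hex sha.
image_ref() {
  # usage: image_ref ENV SHA
  case "$1" in
    dev|staging) ;;
    *) echo "env must be dev or staging" >&2; return 2 ;;
  esac
  case "$2" in
    ''|*[!0-9a-f]*) echo "sha must be lowercase hex" >&2; return 2 ;;
  esac
  echo "netlix-web=ghcr.io/timkrebs/netlix-platform/web:$1-$2"
}
```

Usage from `app/overlays/dev` would then look like `kustomize edit set image "$(image_ref dev <sha>)"`.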

Promote to Staging

gh workflow run promote.yaml -f from=dev -f to=staging
# Review and merge the created PR

Promote to Main (Release)

gh workflow run promote.yaml -f from=staging -f to=main
# Review and merge the created PR

# Tag the release
git checkout main && git pull
git tag -a v1.x.x -m "release: v1.x.x — description"
git push origin v1.x.x

Rotate Vault PKI Certificates

The Vault Secrets Operator (VSO) rotates certificates automatically. To force rotation:

# Delete the TLS secret — VSO will recreate it
kubectl delete secret netlix-tls-cert -n consul

# Verify new cert was issued
kubectl get secret netlix-tls-cert -n consul -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -dates
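Reading the dates by eye works, but a scripted check is easier to reuse. The sketch below relies on the `-checkend` option of `openssl x509`, which exits non-zero when the certificate expires within the given number of seconds. The function name `cert_valid_for` is hypothetical.

```shell
# Hypothetical helper: reads a PEM certificate on stdin.
# Succeeds only if the certificate is still valid $1 seconds from now.
cert_valid_for() {
  openssl x509 -noout -checkend "$1" >/dev/null
}
```

For example, piping the decoded `tls.crt` from the command above into `cert_valid_for 86400` succeeds only when the new certificate has at least a day of validity left.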

Rotate Database Credentials

Vault dynamic credentials rotate automatically once 67% of the lease TTL has elapsed. To force rotation:

# Revoke all leases for the database role
export VAULT_NAMESPACE="admin/dev"
vault lease revoke -prefix database/creds/netlix-app

# Pods will get new credentials on next VSO sync cycle
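To estimate when automatic rotation should occur for a given lease, the 67% rule can be written as simple integer arithmetic. This is a rough sketch: `rotate_after` is a hypothetical name, and the actual timing may differ slightly from this calculation.

```shell
# Hypothetical helper: seconds after issuance at which rotation is expected,
# given a lease TTL in seconds. Integer division rounds down.
rotate_after() {
  echo $(( $1 * 67 / 100 ))
}
```

For a one-hour lease, `rotate_after 3600` prints `2412`, which is roughly 40 minutes after the credentials were issued.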

Scale Nodes

# Temporary manual scaling (reverts on next Terraform apply)
aws eks update-nodegroup-config \
  --cluster-name netlix-dev \
  --nodegroup-name <node-group-name> \
  --scaling-config minSize=3,maxSize=8,desiredSize=5

# Permanent scaling — update deployments.tfdeploy.hcl:
#   node_min_size = 3
#   node_max_size = 8
#   node_desired_size = 5
# Push to trigger Stack update
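Before a manual scaling call, it can help to confirm that the three sizes are consistent, since the desired size has to sit between the minimum and the maximum. A minimal sketch, with `valid_scaling` as a hypothetical helper:

```shell
# Hypothetical helper: succeeds when 0 <= MIN <= DESIRED <= MAX.
valid_scaling() {
  # usage: valid_scaling MIN MAX DESIRED
  [ "$1" -ge 0 ] && [ "$1" -le "$3" ] && [ "$3" -le "$2" ]
}
```

`valid_scaling 3 8 5 && aws eks update-nodegroup-config ...` would then skip the API call when the values are inconsistent.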

Troubleshooting

DNS Resolution Failures

Symptom: ERR_NAME_NOT_RESOLVED for app.dev.netlix.dev or argocd.dev.netlix.dev

Check Route53 records:

aws route53 list-resource-record-sets --hosted-zone-id Z03825243OZJVWRUDJ5T \
  --query "ResourceRecordSets[?Name=='app.dev.netlix.dev.']"

Check ExternalDNS:

kubectl logs -n kube-system -l app.kubernetes.io/name=external-dns --tail=50

Check Ingress:

kubectl get ingress -n consul
kubectl describe ingress -n consul

Flush local DNS cache (macOS):

sudo dscacheutil -flushcache
sudo killall -HUP mDNSResponder

Root cause to check: If the Route53 zone was recreated as a Terraform-managed resource, it received new nameservers and the registrar delegation no longer points at it. The zone MUST be referenced as a data source, never managed as a resource. See ADR-003.
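A broken delegation can be confirmed by comparing the nameservers assigned to the hosted zone with the ones public DNS returns. The sketch below only covers the comparison. `same_nameservers` and `normalize_ns` are hypothetical helpers, and `<zone-domain>` in the usage comment is a placeholder for the zone's actual domain.

```shell
# Hypothetical helpers: compare two nameserver lists, ignoring order,
# letter case, and trailing dots.
normalize_ns() {
  tr '[:upper:]' '[:lower:]' | sed 's/\.$//' | sort
}

same_nameservers() {
  # $1, $2: whitespace-separated nameserver lists (left unquoted to split them)
  a=$(printf '%s\n' $1 | normalize_ns)
  b=$(printf '%s\n' $2 | normalize_ns)
  [ "$a" = "$b" ]
}

# Usage (placeholder domain, not executed here):
#   zone_ns=$(aws route53 get-hosted-zone --id Z03825243OZJVWRUDJ5T \
#     --query 'DelegationSet.NameServers' --output text)
#   public_ns=$(dig +short NS <zone-domain>)
#   same_nameservers "$zone_ns" "$public_ns" || echo "delegation mismatch"
```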

Pods Not Starting

# Check pod status
kubectl get pods -n consul

# Describe failing pod
kubectl describe pod <pod-name> -n consul

# Check events
kubectl get events -n consul --sort-by='.lastTimestamp'

# Check logs
kubectl logs <pod-name> -n consul
kubectl logs <pod-name> -n consul -c envoy-sidecar  # Consul sidecar

Common causes:

  • Image pull error: Check GHCR credentials and image tag
  • OOMKilled: Increase memory limits in deployment
  • CrashLoopBackOff: Check application logs
  • Pending: Check node capacity and resource quotas
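The list above can be turned into a small lookup so that a status copied from `kubectl get pods` maps straight to its first action. This is an illustrative sketch: `triage_hint` is a hypothetical function, and treating `ImagePullBackOff` and `ErrImagePull` as the image pull error statuses is my assumption about how that cause shows up.

```shell
# Hypothetical helper: maps a pod status to the first action from the list above.
triage_hint() {
  case "$1" in
    ImagePullBackOff|ErrImagePull) echo "Check GHCR credentials and image tag" ;;
    OOMKilled)                     echo "Increase memory limits in the deployment" ;;
    CrashLoopBackOff)              echo "Check application logs" ;;
    Pending)                       echo "Check node capacity and resource quotas" ;;
    *)                             echo "No runbook entry for status: $1" ;;
  esac
}
```

For example, `triage_hint CrashLoopBackOff` prints `Check application logs`.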

Vault 403 Permission Denied

Symptom: Terraform plan/apply fails with 403 Forbidden on Vault operations

Check TFC policy:

export VAULT_NAMESPACE="admin"
vault policy read tfc-policy

Important: The TFC policy MUST be path "*" with full capabilities. Scoped paths do not propagate to child namespaces in HCP Vault. See ADR-005.

Emergency restore:

vault login -method=userpass username=<admin>
vault policy write tfc-policy - <<EOF
path "*" {
  capabilities = ["create", "read", "update", "delete", "list", "sudo"]
}
EOF
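To check quickly whether the policy still carries the wildcard path, the output of `vault policy read` can be piped through a pattern match. A minimal sketch, with `policy_is_wildcard` as a hypothetical helper. It only looks for the `path "*"` stanza and does not inspect the capabilities list.

```shell
# Hypothetical helper: reads policy HCL on stdin.
# Succeeds if a path "*" stanza is present.
policy_is_wildcard() {
  grep -Eq '^[[:space:]]*path[[:space:]]+"\*"'
}
```

Usage: `vault policy read tfc-policy | policy_is_wildcard || echo "tfc-policy is scoped; restore it"`.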

ArgoCD Sync Failures

# Check application status
kubectl get application netlix-app -n argocd -o yaml | grep -A10 status

# Check sync result
kubectl get application netlix-app -n argocd -o jsonpath='{.status.sync.status}'

# Force sync
kubectl patch application netlix-app -n argocd --type merge -p '{"operation":{"initiatedBy":{"username":"admin"},"sync":{"revision":"HEAD"}}}'
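After forcing a sync, the status can be polled with the same `jsonpath` query until it reports `Synced`. The loop below is a sketch under assumptions: `wait_for_sync` is a hypothetical helper, and the default of 30 attempts at 10-second intervals is arbitrary.

```shell
# Hypothetical helper: polls the ArgoCD Application until it reports Synced.
wait_for_sync() {
  # usage: wait_for_sync APP [TRIES] [SLEEP_SECONDS]
  app=$1; tries=${2:-30}; pause=${3:-10}
  while [ "$tries" -gt 0 ]; do
    status=$(kubectl get application "$app" -n argocd \
      -o jsonpath='{.status.sync.status}')
    [ "$status" = "Synced" ] && return 0
    tries=$((tries - 1))
    sleep "$pause"
  done
  echo "timed out waiting for $app (last status: $status)" >&2
  return 1
}
```

Usage: `wait_for_sync netlix-app` returns 0 once the application is in sync and 1 on timeout.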

Terraform Stack Failures

Check in HCP Terraform UI:

  1. Navigate to the Stack
  2. Check the latest deployment run
  3. Review plan/apply logs for errors

Common issues:

  • Ephemeral value errors: Cannot pass store.varset values to non-ephemeral component inputs
  • Deferred changes: Resources depending on runtime values (e.g., cluster endpoint) may defer to a second apply
  • Provider auth failures: Check OIDC token, IAM role trust policy, Vault JWT auth

CloudWatch Alarm Firing

VPC Rejected Connections alarm:

# Check recent flow log rejections
# Note: date -d is GNU syntax; on macOS (BSD date) use: date -v-1H +%s000
aws logs filter-log-events \
  --log-group-name <flow-log-group> \
  --filter-pattern "REJECT" \
  --start-time $(date -d '-1 hour' +%s000) \
  --limit 20

RDS CPU alarm:

# Note: date -d is GNU syntax; on macOS (BSD date) use: date -v-1H -u +%Y-%m-%dT%H:%M:%S
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name CPUUtilization \
  --dimensions Name=DBInstanceIdentifier,Value=netlix-dev \
  --start-time $(date -d '-1 hour' -u +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 300 \
  --statistics Average
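The two alarm commands need a "one hour ago" timestamp in different formats: epoch milliseconds for `filter-log-events` and an ISO-8601 UTC string for `get-metric-statistics`. The helpers below are a sketch that aims to work with both GNU and BSD `date`. The names `epoch_ms_ago` and `iso_utc_ago` are hypothetical.

```shell
# Hypothetical helpers for look-back timestamps.

# Epoch milliseconds N seconds ago (plain arithmetic, no date flags needed).
epoch_ms_ago() {
  echo "$(( ($(date +%s) - $1) * 1000 ))"
}

# ISO-8601 UTC timestamp N seconds ago.
# Tries GNU syntax (-d @epoch) first, then falls back to BSD syntax (-r epoch).
iso_utc_ago() {
  t=$(( $(date +%s) - $1 ))
  date -u -d "@$t" +%Y-%m-%dT%H:%M:%S 2>/dev/null ||
    date -u -r "$t" +%Y-%m-%dT%H:%M:%S
}
```

With these, `--start-time "$(epoch_ms_ago 3600)"` and `--start-time "$(iso_utc_ago 3600)"` can replace the `date -d` expressions in the respective commands.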

Disaster Recovery

Current Status

| Capability | Status | Notes |
| --- | --- | --- |
| Infrastructure recreation | Supported | Terraform Stacks can reprovision from code |
| RDS backups | Automated | 7-day retention, point-in-time recovery |
| RDS cross-region replica | Not implemented | Planned (IMPROVEMENTS.md #46) |
| EKS cluster backup (Velero) | Not implemented | Planned (IMPROVEMENTS.md #45) |
| Vault backup | HCP-managed | HCP Vault Dedicated includes automated backups |
| GitOps state | Git repository | All manifests in version control |

RDS Recovery

Point-in-time restore:

aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier netlix-dev \
  --target-db-instance-identifier netlix-dev-restored \
  --restore-time <ISO-8601-timestamp>

From latest snapshot:

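# Sort by creation time first; the API does not guarantee that the last list entry is the newest snapshot
aws rds describe-db-snapshots --db-instance-identifier netlix-dev \
  --query 'sort_by(DBSnapshots, &SnapshotCreateTime)[-1].DBSnapshotIdentifier' --output text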
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier netlix-dev-restored \
  --db-snapshot-identifier <snapshot-id>
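Either restore creates a new instance that takes a while to become available and comes with its own endpoint. A minimal sketch of waiting for it and reading the endpoint follows. It uses the `aws rds wait db-instance-available` waiter. `restored_endpoint` is a hypothetical helper name.

```shell
# Hypothetical helper: blocks until the instance is available,
# then prints its endpoint hostname.
restored_endpoint() {
  # usage: restored_endpoint DB_INSTANCE_IDENTIFIER
  aws rds wait db-instance-available --db-instance-identifier "$1" &&
    aws rds describe-db-instances --db-instance-identifier "$1" \
      --query 'DBInstances[0].Endpoint.Address' --output text
}
```

Usage: `restored_endpoint netlix-dev-restored` prints the hostname to use once the application is pointed at the restored database.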

Full Environment Recreation

If an environment needs to be rebuilt from scratch:

  1. Infrastructure: Push to the Stack (or trigger manually in HCP Terraform). All 12 components rebuild in dependency order.
  2. Vault configuration: Automatically rebuilt by vault_config component (PKI CAs, auth backends, policies, DB engine).
  3. Kubernetes workloads: ArgoCD syncs from Git after the cluster is ready.
  4. Datadog: Recreate API key secret and apply DatadogAgent CRD.
  5. DNS: Route53 zone is a data source (never recreated). ExternalDNS recreates A records from Ingress.

Maintenance Windows

EKS Version Upgrades

  1. Update cluster_version in deployments.tfdeploy.hcl
  2. Test in dev first
  3. Promote to staging after validation
  4. Monitor node group rolling update

RDS Engine Upgrades

  1. Update db_engine_version in deployments.tfdeploy.hcl
  2. For major versions: test with a snapshot restore first
  3. Apply during low-traffic window (RDS may restart)

Helm Chart Updates

  1. Update chart version in the relevant component's main.tf
  2. Run terraform validate locally
  3. Push to dev, verify via ArgoCD
  4. Promote to staging

Health Checks

Quick Health Check Script

#!/bin/bash
echo "=== EKS Nodes ==="
kubectl get nodes

echo "=== Pods (consul namespace) ==="
kubectl get pods -n consul

echo "=== Pods (argocd namespace) ==="
kubectl get pods -n argocd

echo "=== Pods (datadog namespace) ==="
kubectl get pods -n datadog

echo "=== ArgoCD App Status ==="
kubectl get application netlix-app -n argocd -o jsonpath='{.status.sync.status}'
echo ""

echo "=== Ingress ==="
kubectl get ingress -n consul

echo "=== DNS Resolution ==="
nslookup app.dev.netlix.dev
nslookup argocd.dev.netlix.dev

echo "=== App Health ==="
curl -s -o /dev/null -w "%{http_code}" https://app.dev.netlix.dev/health
echo ""
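The script above prints raw output and always exits 0, which makes it awkward to use from cron or CI. One possible variant wraps each probe in a function that records pass or fail. This is only a sketch: `check` and the `failures` counter are hypothetical additions, not part of the original script.

```shell
# Hypothetical wrapper: runs a command quietly and records whether it succeeded.
failures=0

check() {
  # usage: check "label" command [args...]
  label=$1; shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS  $label"
  else
    echo "FAIL  $label"
    failures=$((failures + 1))
  fi
}

# Example usage (not executed here):
#   check "EKS nodes"   kubectl get nodes
#   check "DNS app"     nslookup app.dev.netlix.dev
#   check "App health"  curl -fsS https://app.dev.netlix.dev/health
#   exit "$failures"
```

Exiting with the failure count lets a scheduler treat any non-zero result as an unhealthy environment.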
