Operations Runbook

Day-2 operations guide for maintaining, troubleshooting, and recovering the Netlix Platform.

Cluster Access

EKS kubeconfig

# Dev cluster
aws eks update-kubeconfig --region eu-central-1 --name netlix-dev

# Staging cluster
aws eks update-kubeconfig --region eu-central-1 --name netlix-staging

# Verify
kubectl get nodes

Vault Access

export VAULT_ADDR="https://netlix-vault-public-vault-....hashicorp.cloud:8200"
export VAULT_NAMESPACE="admin"

# Admin login (dev only — userpass auth)
vault login -method=userpass username=<admin>

# Switch to environment namespace
export VAULT_NAMESPACE="admin/dev"
vault secrets list

ArgoCD Access

# Via URL
open https://argocd.dev.netlix.dev

# Via port-forward (if DNS unavailable)
kubectl port-forward svc/argocd-server -n argocd 8080:443

# Get initial admin password
kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d

Common Operations

Deploy a New Application Version

The normal flow is fully automated:

Push code changes to dev branch
CD pipeline builds, scans, pushes image to GHCR
CD pipeline updates Kustomize overlay with new image tag
ArgoCD detects change and syncs

Manual override (if automation fails):

cd app/overlays/dev
kustomize edit set image netlix-web=ghcr.io/timkrebs/netlix-platform/web:dev-<sha>
git add . && git commit -m "chore: manual image tag update" && git push

Promote to Staging

gh workflow run promote.yaml -f from=dev -f to=staging
# Review and merge the created PR

Promote to Main (Release)

gh workflow run promote.yaml -f from=staging -f to=main
# Review and merge the created PR

# Tag the release
git checkout main && git pull
git tag -a v1.x.x -m "release: v1.x.x — description"
git push origin v1.x.x

Rotate Vault PKI Certificates

VSO handles certificate rotation automatically. To force rotation:

# Delete the TLS secret — VSO will recreate it
kubectl delete secret netlix-tls-cert -n consul

# Verify new cert was issued
kubectl get secret netlix-tls-cert -n consul -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -dates

Rotate Database Credentials

Vault dynamic credentials rotate automatically at 67% TTL. To force rotation:

# Revoke all leases for the database role
export VAULT_NAMESPACE="admin/dev"
vault lease revoke -prefix database/creds/netlix-app

# Pods will get new credentials on next VSO sync cycle

Scale Nodes

# Temporary manual scaling (reverts on next Terraform apply)
aws eks update-nodegroup-config \
  --cluster-name netlix-dev \
  --nodegroup-name <node-group-name> \
  --scaling-config minSize=3,maxSize=8,desiredSize=5

# Permanent scaling — update deployments.tfdeploy.hcl:
#   node_min_size = 3
#   node_max_size = 8
#   node_desired_size = 5
# Push to trigger Stack update

Troubleshooting

DNS Resolution Failures

Symptom: ERR_NAME_NOT_RESOLVED for app.dev.netlix.dev or argocd.dev.netlix.dev

Check Route53 records:

aws route53 list-resource-record-sets --hosted-zone-id Z03825243OZJVWRUDJ5T \
  --query "ResourceRecordSets[?Name=='app.dev.netlix.dev.']"

Check ExternalDNS:

kubectl logs -n kube-system -l app.kubernetes.io/name=external-dns --tail=50

Check Ingress:

kubectl get ingress -n consul
kubectl describe ingress -n consul

Flush local DNS cache (macOS):

sudo dscacheutil -flushcache
sudo killall -HUP mDNSResponder

Root cause to check: If Route53 zone was recreated as a Terraform managed resource, nameservers changed and registrar delegation broke. The zone MUST be a data source. See ADR-003.

Pods Not Starting

# Check pod status
kubectl get pods -n consul

# Describe failing pod
kubectl describe pod <pod-name> -n consul

# Check events
kubectl get events -n consul --sort-by='.lastTimestamp'

# Check logs
kubectl logs <pod-name> -n consul
kubectl logs <pod-name> -n consul -c envoy-sidecar  # Consul sidecar

Common causes:

Image pull error: Check GHCR credentials and image tag
OOMKilled: Increase memory limits in deployment
CrashLoopBackOff: Check application logs
Pending: Check node capacity and resource quotas

Vault 403 Permission Denied

Symptom: Terraform plan/apply fails with 403 Forbidden on Vault operations

Check TFC policy:

export VAULT_NAMESPACE="admin"
vault policy read tfc-policy

Important: The TFC policy MUST be path "*" with full capabilities. Scoped paths do not propagate to child namespaces in HCP Vault. See ADR-005.

Emergency restore:

vault login -method=userpass username=<admin>
vault policy write tfc-policy - <<EOF
path "*" {
  capabilities = ["create", "read", "update", "delete", "list", "sudo"]
}
EOF

ArgoCD Sync Failures

# Check application status
kubectl get application netlix-app -n argocd -o yaml | grep -A10 status

# Check sync result
kubectl get application netlix-app -n argocd -o jsonpath='{.status.sync.status}'

# Force sync
kubectl patch application netlix-app -n argocd --type merge -p '{"operation":{"initiatedBy":{"username":"admin"},"sync":{"revision":"HEAD"}}}'

Terraform Stack Failures

Check in HCP Terraform UI:

Navigate to the Stack
Check the latest deployment run
Review plan/apply logs for errors

Common issues:

Ephemeral value errors: Cannot pass store.varset values to non-ephemeral component inputs
Deferred changes: Resources depending on runtime values (e.g., cluster endpoint) may defer to a second apply
Provider auth failures: Check OIDC token, IAM role trust policy, Vault JWT auth

CloudWatch Alarm Firing

VPC Rejected Connections alarm:

# Check recent flow log rejections
aws logs filter-log-events \
  --log-group-name <flow-log-group> \
  --filter-pattern "REJECT" \
  --start-time $(date -d '-1 hour' +%s000) \
  --limit 20

RDS CPU alarm:

aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name CPUUtilization \
  --dimensions Name=DBInstanceIdentifier,Value=netlix-dev \
  --start-time $(date -d '-1 hour' -u +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 300 \
  --statistics Average

Disaster Recovery

Current Status

Capability	Status	Notes
Infrastructure recreation	Supported	Terraform Stacks can reprovision from code
RDS backups	Automated	7-day retention, point-in-time recovery
RDS cross-region replica	Not implemented	Planned (IMPROVEMENTS.md #46)
EKS cluster backup (Velero)	Not implemented	Planned (IMPROVEMENTS.md #45)
Vault backup	HCP-managed	HCP Vault Dedicated includes automated backups
GitOps state	Git repository	All manifests in version control

RDS Recovery

Point-in-time restore:

aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier netlix-dev \
  --target-db-instance-identifier netlix-dev-restored \
  --restore-time <ISO-8601-timestamp>

From latest snapshot:

aws rds describe-db-snapshots --db-instance-identifier netlix-dev --query 'DBSnapshots[-1].DBSnapshotIdentifier'
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier netlix-dev-restored \
  --db-snapshot-identifier <snapshot-id>

Full Environment Recreation

If an environment needs to be rebuilt from scratch:

Infrastructure: Push to the Stack (or trigger manually in HCP Terraform). All 12 components rebuild in dependency order.
Vault configuration: Automatically rebuilt by vault_config component (PKI CAs, auth backends, policies, DB engine).
Kubernetes workloads: ArgoCD syncs from Git after the cluster is ready.
Datadog: Recreate API key secret and apply DatadogAgent CRD.
DNS: Route53 zone is a data source (never recreated). ExternalDNS recreates A records from Ingress.

Maintenance Windows

EKS Version Upgrades

Update cluster_version in deployments.tfdeploy.hcl
Test in dev first
Promote to staging after validation
Monitor node group rolling update

RDS Engine Upgrades

Update db_engine_version in deployments.tfdeploy.hcl
For major versions: test with a snapshot restore first
Apply during low-traffic window (RDS may restart)

Helm Chart Updates

Update chart version in the relevant component's main.tf
Run terraform validate locally
Push to dev, verify via ArgoCD
Promote to staging

Health Checks

Quick Health Check Script

#!/bin/bash
echo "=== EKS Nodes ==="
kubectl get nodes

echo "=== Pods (consul namespace) ==="
kubectl get pods -n consul

echo "=== Pods (argocd namespace) ==="
kubectl get pods -n argocd

echo "=== Pods (datadog namespace) ==="
kubectl get pods -n datadog

echo "=== ArgoCD App Status ==="
kubectl get application netlix-app -n argocd -o jsonpath='{.status.sync.status}'
echo ""

echo "=== Ingress ==="
kubectl get ingress -n consul

echo "=== DNS Resolution ==="
nslookup app.dev.netlix.dev
nslookup argocd.dev.netlix.dev

echo "=== App Health ==="
curl -s -o /dev/null -w "%{http_code}" https://app.dev.netlix.dev/health
echo ""

Operations Runbook

Operations Runbook

Cluster Access

EKS kubeconfig

Vault Access

ArgoCD Access

Common Operations

Deploy a New Application Version

Promote to Staging

Promote to Main (Release)

Rotate Vault PKI Certificates

Rotate Database Credentials

Scale Nodes

Troubleshooting

DNS Resolution Failures

Pods Not Starting

Vault 403 Permission Denied

ArgoCD Sync Failures

Terraform Stack Failures

CloudWatch Alarm Firing

Disaster Recovery

Current Status

RDS Recovery

Full Environment Recreation

Maintenance Windows

EKS Version Upgrades

RDS Engine Upgrades

Helm Chart Updates

Health Checks

Quick Health Check Script

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally