Observability
The platform implements a multi-layer observability stack covering application performance monitoring (Datadog), infrastructure alarms (CloudWatch), network visibility (VPC Flow Logs), and service mesh metrics (Consul/Envoy).
| Layer | Tool | What It Covers |
|---|---|---|
| APM & Traces | Datadog | Request traces, latency, error rates across web/api |
| Log Collection | Datadog Agent | Container stdout/stderr from all pods |
| Infrastructure Metrics | Datadog Agent | CPU, memory, disk, network for nodes and pods |
| Cluster Monitoring | Datadog Cluster Agent | Kubernetes events, deployments, node health |
| Infrastructure Alarms | CloudWatch | RDS health, VPC flow log anomalies |
| Network Visibility | VPC Flow Logs | All network traffic (ACCEPT + REJECT) |
| Service Mesh Metrics | Consul/Envoy | mTLS connections, request routing, circuit breaking |
The Datadog Operator pattern is used for deployment:
Terraform (datadog-operator component)
│
├── kubernetes_namespace_v1 "datadog"
└── helm_release "datadog-operator" (v2.1.0)
        │
        ▼
    Datadog Operator (controller)
        │
        ▼
    DatadogAgent CRD (applied via kubectl)
        │
        ├── Node Agent (DaemonSet) ─── metrics, logs, APM traces
        ├── Cluster Agent ──────────── Kubernetes API, events, orchestrator
        └── Cluster Checks Runner ──── service discovery checks
Step 1: Terraform creates the namespace and operator:
# terraform/components/datadog-operator/main.tf
resource "kubernetes_namespace_v1" "datadog" {
  metadata { name = "datadog" }
}
resource "helm_release" "datadog_operator" {
  name       = "datadog-operator"
  namespace  = "datadog"
  repository = "https://helm.datadoghq.com"
  chart      = "datadog-operator"
  version    = "2.1.0"
}
Step 2: Create the API key secret manually (Stacks ephemeral value constraint):
kubectl create secret generic datadog-secret \
  --namespace datadog \
  --from-literal api-key=<YOUR_DATADOG_API_KEY>
Step 3: Apply the DatadogAgent CRD:
kubectl apply -f kubernetes/monitoring/datadog-agent.yaml
Why manual steps? HCP Terraform Stacks treats all store.varset values as ephemeral. They cannot flow into kubernetes_secret resources that persist to state. Additionally, kubernetes_manifest resources cause null value errors during Stacks deferred changes because the CRD namespace doesn't exist at plan time.
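
For reference, the object that Step 2 creates can be written out as a manifest. The following is a minimal Python sketch, assuming the secret name `datadog-secret`, the namespace `datadog`, and the key `api-key` from the command above. `build_secret_manifest` is a hypothetical helper, not part of the repository, and the key value shown is a placeholder. Kubernetes stores Secret `data` values base64-encoded, which `kubectl create secret generic` handles for you.

```python
import base64
import json


def build_secret_manifest(api_key: str) -> dict:
    """Return a Secret manifest equivalent to the kubectl command in Step 2."""
    # Secret "data" values must be base64-encoded strings.
    encoded = base64.b64encode(api_key.encode("utf-8")).decode("ascii")
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": "datadog-secret", "namespace": "datadog"},
        "type": "Opaque",
        "data": {"api-key": encoded},
    }


if __name__ == "__main__":
    # Placeholder value only; never commit a real API key.
    print(json.dumps(build_secret_manifest("example-key"), indent=2))
```

Because `kubectl apply -f -` accepts JSON, the printed manifest could be piped straight to the cluster, which still keeps the key out of Terraform state.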
Reference manifest at kubernetes/monitoring/datadog-agent.yaml:
| Feature | Configuration |
|---|---|
| Cluster Name | netlix-dev |
| Site | datadoghq.eu |
| Cluster Agent | Enabled with orchestrator explorer |
| Cluster Checks | Enabled for service discovery |
| APM | Enabled with auto-instrumentation |
| Auto-instrumentation | Java, Python, JS, PHP, .NET, Ruby via libVersions |
| Log Collection | Container logs enabled |
| Prometheus Scrape | Enabled (scrapes Envoy metrics on port 20200) |
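
To make the table concrete, here is a rough Python sketch of a DatadogAgent resource with the same features switched on. It is not a copy of kubernetes/monitoring/datadog-agent.yaml, which remains authoritative. The field names follow the operator's `v2alpha1` schema as I understand it and should be checked against the installed CRD. The resource name `datadog` and the helper `build_datadog_agent` are assumptions, and `libVersions` is left out because the source does not list the library versions.

```python
import json


def build_datadog_agent() -> dict:
    """Sketch of a DatadogAgent resource mirroring the feature table above."""
    return {
        "apiVersion": "datadoghq.com/v2alpha1",
        "kind": "DatadogAgent",
        # Resource name is an assumption; the namespace comes from Step 1.
        "metadata": {"name": "datadog", "namespace": "datadog"},
        "spec": {
            "global": {
                "clusterName": "netlix-dev",
                "site": "datadoghq.eu",
                # Points at the secret created manually in Step 2.
                "credentials": {
                    "apiSecret": {"secretName": "datadog-secret", "keyName": "api-key"}
                },
            },
            "features": {
                "orchestratorExplorer": {"enabled": True},
                "clusterChecks": {"enabled": True, "useClusterChecksRunners": True},
                "apm": {"enabled": True, "instrumentation": {"enabled": True}},
                "logCollection": {"enabled": True, "containerCollectAll": True},
                "prometheusScrape": {"enabled": True},
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_datadog_agent(), indent=2))
```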
Both web and api services include Datadog unified service tags for correlated metrics, traces, and logs:
Pod template labels:
labels:
  tags.datadoghq.com/service: "web"      # or "api"
  tags.datadoghq.com/env: "dev"          # via overlay patch
  tags.datadoghq.com/version: "latest"   # via overlay patch
Environment variables:
env:
  - name: DD_SERVICE
    value: "web"
  - name: DD_ENV
    value: "dev"
  - name: DD_VERSION
    value: "latest"
These tags enable:
- Correlated views across metrics, traces, and logs in Datadog
- Service catalog population
- Deployment tracking
- Error tracking by service and version
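
Correlation only works when the pod labels and the `DD_*` environment variables agree, and the two can drift apart when overlays patch one but not the other. The sketch below shows one way to check a rendered pod template for that drift. `find_tag_mismatches` is a hypothetical helper; it assumes plain `value` entries as in the example above and would report a mismatch for variables populated through `valueFrom`.

```python
TAG_PREFIX = "tags.datadoghq.com/"
ENV_FOR_TAG = {"service": "DD_SERVICE", "env": "DD_ENV", "version": "DD_VERSION"}


def find_tag_mismatches(labels: dict, env: list) -> list:
    """Return (tag, label_value, env_value) for every tag that is missing or differs."""
    env_map = {item["name"]: item.get("value") for item in env}
    mismatches = []
    for tag, var in ENV_FOR_TAG.items():
        label_value = labels.get(TAG_PREFIX + tag)
        env_value = env_map.get(var)
        if label_value is None or label_value != env_value:
            mismatches.append((tag, label_value, env_value))
    return mismatches


if __name__ == "__main__":
    labels = {
        "tags.datadoghq.com/service": "web",
        "tags.datadoghq.com/env": "dev",
        "tags.datadoghq.com/version": "latest",
    }
    env = [
        {"name": "DD_SERVICE", "value": "web"},
        {"name": "DD_ENV", "value": "dev"},
        {"name": "DD_VERSION", "value": "1.2.3"},  # deliberately out of sync
    ]
    print(find_tag_mismatches(labels, env))
```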
# Check Datadog pods
kubectl get pods -n datadog
# Expected pods:
# datadog-agent-<hash> (DaemonSet - one per node)
# datadog-cluster-agent-<hash> (Deployment)
# datadog-operator-<hash> (Deployment)
# Check agent status
kubectl exec -it -n datadog $(kubectl get pod -n datadog -l app.kubernetes.io/name=datadog-agent-deployment -o name | head -1) -- agent status
The Terraform monitoring component creates CloudWatch infrastructure:
terraform/components/monitoring/
├── main.tf # SNS topic, metric filters, alarms
├── variables.tf # environment, project, RDS/EKS identifiers
├── outputs.tf # SNS topic ARN
└── versions.tf
Topic: {project}-{environment}-alerts
Subscribers: alert_email (if configured)
| Alarm | Metric | Threshold | Period | Action |
|---|---|---|---|---|
| VPC Rejected Connections | Custom (metric filter on flow logs) | > 100 | 5 min | SNS |
| RDS CPU Utilization | CPUUtilization | > 80% | 5 min | SNS |
| RDS Free Storage | FreeStorageSpace | < 5 GB | 5 min | SNS |
| RDS Database Connections | DatabaseConnections | > 80% of max | 5 min | SNS |
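
The thresholds in this table mix units, so it can help to see them as plain comparisons. The Python sketch below is illustrative only and is not how CloudWatch evaluates alarms. It assumes that "5 GB" means 5 GiB (CloudWatch reports `FreeStorageSpace` in bytes) and that the caller supplies the instance's connection limit, since `DatabaseConnections` is an absolute count and "80% of max" has to be turned into a number before it can serve as a threshold. `breached_alarms` is a hypothetical name.

```python
GIB = 1024 ** 3  # FreeStorageSpace is reported in bytes


def breached_alarms(cpu_percent, free_storage_bytes, connections,
                    max_connections, rejected_connections):
    """Return the names of alarms whose threshold is crossed for one 5-minute period."""
    breached = []
    if rejected_connections > 100:
        breached.append("VPC Rejected Connections")
    if cpu_percent > 80:
        breached.append("RDS CPU Utilization")
    if free_storage_bytes < 5 * GIB:
        breached.append("RDS Free Storage")
    # The percentage must be resolved against the instance's connection limit.
    if connections > 0.8 * max_connections:
        breached.append("RDS Database Connections")
    return breached


if __name__ == "__main__":
    print(breached_alarms(cpu_percent=85, free_storage_bytes=4 * GIB,
                          connections=50, max_connections=100,
                          rejected_connections=10))
```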
VPC Flow Logs capture all network traffic to CloudWatch Logs:
VPC → Flow Log → CloudWatch Log Group → Metric Filter (REJECT) → Alarm → SNS
The metric filter extracts REJECT actions from flow logs and increments a custom CloudWatch metric. When rejected connections exceed 100 in a 5-minute window, the alarm fires and sends an SNS notification.
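
The counting that the metric filter and alarm perform can be approximated offline. The Python sketch below assumes the default (version 2) flow log format, in which `start` is the 11th space-separated field and `action` is the 13th; the source does not say which format the flow logs use, so adjust the indexes for a custom format. Function names are hypothetical, and the fixed 300-second buckets only approximate how CloudWatch aligns alarm periods.

```python
from collections import Counter

START_FIELD = 10    # "start" (epoch seconds) in the default v2 format
ACTION_FIELD = 12   # "action": ACCEPT or REJECT
WINDOW_SECONDS = 300
THRESHOLD = 100


def rejects_per_window(records):
    """Count REJECT records per 5-minute window, keyed by window start time."""
    counts = Counter()
    for line in records:
        fields = line.split()
        # Skip short lines and anything that is not a REJECT (including NODATA rows).
        if len(fields) < 14 or fields[ACTION_FIELD] != "REJECT":
            continue
        start = int(fields[START_FIELD])
        counts[start - start % WINDOW_SECONDS] += 1
    return counts


def alarming_windows(records):
    """Return the window start times in which rejects exceed the threshold."""
    return sorted(w for w, n in rejects_per_window(records).items() if n > THRESHOLD)


if __name__ == "__main__":
    template = "2 123456789012 eni-0abc 10.0.1.5 10.0.2.9 44321 5432 6 1 40 {} {} REJECT OK"
    sample = [template.format(1700000100 + i, 1700000160 + i) for i in range(101)]
    print(alarming_windows(sample))
```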
Both web and api pods run Envoy sidecars that expose Prometheus metrics on port 20200:
| Metric | Description |
|---|---|
| envoy_http_downstream_rq_total | Total requests received |
| envoy_http_downstream_rq_xx | Requests by status code class (2xx, 4xx, 5xx) |
| envoy_http_downstream_rq_time | Request latency histogram |
| envoy_cluster_upstream_cx_active | Active upstream connections |
| envoy_cluster_upstream_rq_retry | Upstream retry count |
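
As an illustration of how these metrics can be read, the Python sketch below parses a small sample of Prometheus text output and derives the share of 5xx responses from `envoy_http_downstream_rq_xx`. It assumes the status class is carried in an `envoy_response_code_class` label; confirm this against the real output on port 20200. The parser is deliberately simple (it does not handle escaped quotes or braces inside label values), and the function names are hypothetical.

```python
import re

SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Parse Prometheus text exposition into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match:
            labels = dict(LABEL_RE.findall(match.group("labels") or ""))
            samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def server_error_ratio(text):
    """Share of 5xx responses among all responses counted so far."""
    total = errors = 0.0
    for name, labels, value in parse_samples(text):
        if name == "envoy_http_downstream_rq_xx":
            total += value
            if labels.get("envoy_response_code_class") == "5":
                errors += value
    return errors / total if total else 0.0


SAMPLE = """\
# TYPE envoy_http_downstream_rq_xx counter
envoy_http_downstream_rq_xx{envoy_response_code_class="2"} 950
envoy_http_downstream_rq_xx{envoy_response_code_class="4"} 30
envoy_http_downstream_rq_xx{envoy_response_code_class="5"} 20
"""

if __name__ == "__main__":
    print(server_error_ratio(SAMPLE))
```

These counters are cumulative since the sidecar started, so the result is a lifetime ratio. A rate over a time window requires the difference between two scrapes, which is what the Datadog Agent and backend compute.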
Datadog Agent scrapes Envoy metrics via Prometheus annotations on the pod:
annotations:
  prometheus.io/scrape: "true"
  prometheus.io/port: "20200"
  prometheus.io/path: "/metrics"
| Dashboard | Key Widgets |
|---|---|
| Service Overview | Request rate, error rate, latency (p50/p95/p99) per service |
| Infrastructure | Node CPU/memory, pod restarts, OOMKills |
| Database | Connection count, query latency, replication lag |
| Network | VPC flow log rejections, DNS resolution time |
| Monitor | Condition | Severity |
|---|---|---|
| High error rate | 5xx > 5% of requests over 5 min | Critical |
| High latency | p99 > 2s over 10 min | Warning |
| Pod restarts | > 3 restarts in 15 min | Critical |
| OOMKill | Any OOMKill event | Critical |
| Node not ready | Node NotReady > 5 min | Critical |
| Certificate expiry | < 7 days to expiry | Warning |
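
The arithmetic behind the first three monitor conditions can be sketched in Python. This is not how Datadog evaluates monitors (that happens server-side from monitor queries); it only illustrates the comparisons. It assumes the caller has already selected samples from the relevant window (5, 10, or 15 minutes) and uses a nearest-rank percentile with a whole-number `pct`. All function names are hypothetical.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile for a whole-number pct between 1 and 100."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def evaluate_monitors(status_codes, latencies_s, restarts):
    """Return {monitor name: severity} for each condition that currently holds."""
    triggered = {}
    errors = sum(1 for code in status_codes if 500 <= code <= 599)
    if status_codes and errors / len(status_codes) > 0.05:
        triggered["High error rate"] = "Critical"
    if latencies_s and percentile(latencies_s, 99) > 2.0:
        triggered["High latency"] = "Warning"
    if restarts > 3:
        triggered["Pod restarts"] = "Critical"
    return triggered


if __name__ == "__main__":
    codes = [200] * 90 + [500] * 10          # 10% server errors
    latencies = [0.1] * 98 + [3.0] * 2       # slowest 2% take 3 s
    print(evaluate_monitors(codes, latencies, restarts=1))
```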
┌───────────────┐     ┌────────────────┐     ┌─────────────────┐
│ Application   │────▶│ Datadog Agent  │────▶│ Datadog Cloud   │
│ Pods          │     │ (DaemonSet)    │     │ (datadoghq.eu)  │
│               │     │                │     │                 │
│ APM traces    │     │ Collects:      │     │ Dashboards      │
│ DD_* env vars │     │  - traces      │     │ Monitors        │
│ stdout logs   │     │  - metrics     │     │ APM             │
└───────────────┘     │  - logs        │     │ Logs            │
                      └────────────────┘     └─────────────────┘
┌───────────────┐     ┌────────────────┐     ┌─────────────────┐
│ VPC/RDS/EKS   │────▶│ CloudWatch     │────▶│ SNS             │
│               │     │                │     │                 │
│ Flow logs     │     │ Metric         │     │ Email alerts    │
│ RDS metrics   │     │ filters        │     │                 │
│ EKS metrics   │     │ Alarms         │     │                 │
└───────────────┘     └────────────────┘     └─────────────────┘
┌───────────────┐     ┌────────────────┐
│ Envoy         │────▶│ Prometheus     │────▶ Scraped by Datadog Agent
│ Sidecars      │     │ :20200         │
│               │     │ /metrics       │
└───────────────┘     └────────────────┘