Observability

Tim Krebs edited this page Apr 3, 2026 · 1 revision

The platform implements a multi-layer observability stack covering application performance monitoring (Datadog), infrastructure alarms (CloudWatch), network visibility (VPC Flow Logs), and service mesh metrics (Consul/Envoy).


Observability Stack

| Layer | Tool | What It Covers |
| --- | --- | --- |
| APM & Traces | Datadog | Request traces, latency, error rates across web/api |
| Log Collection | Datadog Agent | Container stdout/stderr from all pods |
| Infrastructure Metrics | Datadog Agent | CPU, memory, disk, network for nodes and pods |
| Cluster Monitoring | Datadog Cluster Agent | Kubernetes events, deployments, node health |
| Infrastructure Alarms | CloudWatch | RDS health, VPC flow log anomalies |
| Network Visibility | VPC Flow Logs | All network traffic (ACCEPT + REJECT) |
| Service Mesh Metrics | Consul/Envoy | mTLS connections, request routing, circuit breaking |

Datadog

Architecture

The Datadog Operator pattern is used for deployment:

Terraform (datadog-operator component)
    │
    ├── kubernetes_namespace_v1 "datadog"
    └── helm_release "datadog-operator" (v2.1.0)
            │
            ▼
Datadog Operator (controller)
    │
    ▼
DatadogAgent CRD (applied via kubectl)
    │
    ├── Node Agent (DaemonSet) ─── metrics, logs, APM traces
    ├── Cluster Agent ──────────── Kubernetes API, events, orchestrator
    └── Cluster Checks Runner ──── service discovery checks

Deployment

Step 1: Terraform creates the namespace and operator:

# terraform/components/datadog-operator/main.tf
resource "kubernetes_namespace_v1" "datadog" {
  metadata { name = "datadog" }
}

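resource "helm_release" "datadog_operator" {
  name       = "datadog-operator"
  # Referencing the namespace resource (instead of the literal "datadog")
  # makes Terraform create the namespace before installing the chart.
  namespace  = kubernetes_namespace_v1.datadog.metadata[0].name
  repository = "https://helm.datadoghq.com"
  chart      = "datadog-operator"
  version    = "2.1.0"
}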

Step 2: Create the API key secret manually (Stacks ephemeral value constraint):

kubectl create secret generic datadog-secret \
  --namespace datadog \
  --from-literal api-key=<YOUR_DATADOG_API_KEY>

Step 3: Apply the DatadogAgent CRD:

kubectl apply -f kubernetes/monitoring/datadog-agent.yaml

Why manual steps? HCP Terraform Stacks treats all store.varset values as ephemeral, so they cannot flow into kubernetes_secret resources, which persist to state. In addition, kubernetes_manifest resources cause null value errors during Stacks deferred changes because the CRD namespace doesn't exist at plan time.

DatadogAgent Configuration

Reference manifest at kubernetes/monitoring/datadog-agent.yaml:

| Feature | Configuration |
| --- | --- |
| Cluster Name | netlix-dev |
| Site | datadoghq.eu |
| Cluster Agent | Enabled with orchestrator explorer |
| Cluster Checks | Enabled for service discovery |
| APM | Enabled with auto-instrumentation |
| Auto-instrumentation | Java, Python, JS, PHP, .NET, Ruby via libVersions |
| Log Collection | Container logs enabled |
| Prometheus Scrape | Enabled (scrapes Envoy metrics on port 20200) |

Unified Service Tags

Both web and api services include Datadog unified service tags for correlated metrics, traces, and logs:

Pod template labels:

labels:
  tags.datadoghq.com/service: "web"    # or "api"
  tags.datadoghq.com/env: "dev"        # via overlay patch
  tags.datadoghq.com/version: "latest" # via overlay patch

Environment variables:

env:
  - name: DD_SERVICE
    value: "web"
  - name: DD_ENV
    value: "dev"
  - name: DD_VERSION
    value: "latest"

These tags enable:

  • Correlated views across metrics, traces, and logs in Datadog
  • Service catalog population
  • Deployment tracking
  • Error tracking by service and version

Verification

# Check Datadog pods
kubectl get pods -n datadog

# Expected pods:
# datadog-agent-<hash>         (DaemonSet - one per node)
# datadog-cluster-agent-<hash> (Deployment)
# datadog-operator-<hash>      (Deployment)

# Check agent status
kubectl exec -it -n datadog $(kubectl get pod -n datadog -l app.kubernetes.io/name=datadog-agent-deployment -o name | head -1) -- agent status

CloudWatch Monitoring

Monitoring Component

The Terraform monitoring component creates CloudWatch infrastructure:

terraform/components/monitoring/
├── main.tf       # SNS topic, metric filters, alarms
├── variables.tf  # environment, project, RDS/EKS identifiers
├── outputs.tf    # SNS topic ARN
└── versions.tf

SNS Notification Topic

Topic: {project}-{environment}-alerts
Subscribers: alert_email (if configured)
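
As a rough illustration of the naming scheme and the optional subscriber, the topic might be declared along these lines. This is a sketch, not the component's actual code: the resource labels and the empty-string default for alert_email are assumptions, and only the variable names project, environment, and alert_email come from this page.

```hcl
# Sketch only: resource labels and variable defaults are assumptions.
variable "project" {
  type = string
}

variable "environment" {
  type = string
}

variable "alert_email" {
  type    = string
  default = "" # empty means "no email subscriber"
}

resource "aws_sns_topic" "alerts" {
  name = "${var.project}-${var.environment}-alerts"
}

# Create the email subscription only when an address is configured.
resource "aws_sns_topic_subscription" "email" {
  count     = var.alert_email != "" ? 1 : 0
  topic_arn = aws_sns_topic.alerts.arn
  protocol  = "email"
  endpoint  = var.alert_email
}

output "sns_topic_arn" {
  value = aws_sns_topic.alerts.arn
}
```

Email subscriptions stay pending until the recipient confirms them from the message AWS sends, so alerts do not arrive until that confirmation is done.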

CloudWatch Alarms

| Alarm | Metric | Threshold | Period | Action |
| --- | --- | --- | --- | --- |
| VPC Rejected Connections | Custom (metric filter on flow logs) | > 100 | 5 min | SNS |
| RDS CPU Utilization | CPUUtilization | > 80% | 5 min | SNS |
| RDS Free Storage | FreeStorageSpace | < 5 GB | 5 min | SNS |
| RDS Database Connections | DatabaseConnections | > 80% of max | 5 min | SNS |
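
To make the table concrete, two of the RDS alarms could be expressed roughly as follows. Treat this as a sketch under assumptions: the variable names, alarm names, the Average statistic, the single evaluation period, and reading "5 GB" as 5 GiB are all illustrative choices that this page does not confirm.

```hcl
# Sketch only: names, statistic, and evaluation_periods are assumptions.
variable "rds_instance_id" {
  type = string # hypothetical: the RDS DB instance identifier
}

variable "alerts_topic_arn" {
  type = string # hypothetical: stands in for the alerts SNS topic ARN
}

resource "aws_cloudwatch_metric_alarm" "rds_cpu" {
  alarm_name          = "rds-cpu-utilization"
  namespace           = "AWS/RDS"
  metric_name         = "CPUUtilization"
  statistic           = "Average"
  period              = 300 # seconds (5 min)
  evaluation_periods  = 1
  comparison_operator = "GreaterThanThreshold"
  threshold           = 80 # percent
  alarm_actions       = [var.alerts_topic_arn]

  dimensions = {
    DBInstanceIdentifier = var.rds_instance_id
  }
}

resource "aws_cloudwatch_metric_alarm" "rds_free_storage" {
  alarm_name          = "rds-free-storage"
  namespace           = "AWS/RDS"
  metric_name         = "FreeStorageSpace"
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 1
  comparison_operator = "LessThanThreshold"
  # FreeStorageSpace is reported in bytes, so the threshold must be too.
  threshold     = 5 * 1024 * 1024 * 1024
  alarm_actions = [var.alerts_topic_arn]

  dimensions = {
    DBInstanceIdentifier = var.rds_instance_id
  }
}
```

The connections alarm is the odd one out: CloudWatch reports DatabaseConnections as an absolute count, so "80% of max" has to be converted into a number based on the instance's max_connections setting.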

VPC Flow Log Analysis

VPC Flow Logs record metadata for all accepted and rejected network traffic in the VPC and deliver it to a CloudWatch Logs log group:

VPC → Flow Log → CloudWatch Log Group → Metric Filter (REJECT) → Alarm → SNS

The metric filter extracts REJECT actions from flow logs and increments a custom CloudWatch metric. When rejected connections exceed 100 in a 5-minute window, the alarm fires and sends an SNS notification.
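
The filter-plus-alarm pair described above might look something like this in Terraform. It is a sketch: the variable names, metric name, and Custom/VPC namespace are hypothetical, and the filter pattern assumes the default (version 2) flow log format with its 14 space-delimited fields.

```hcl
# Sketch only: names and the metric namespace are assumptions.
variable "flow_log_group_name" {
  type = string # hypothetical: log group receiving the VPC flow logs
}

variable "alerts_topic_arn" {
  type = string # hypothetical: stands in for the alerts SNS topic ARN
}

resource "aws_cloudwatch_log_metric_filter" "vpc_rejects" {
  name           = "vpc-rejected-connections"
  log_group_name = var.flow_log_group_name

  # Space-delimited pattern for the default flow log format;
  # only records whose action field is REJECT match.
  pattern = "[version, account, eni, source, destination, srcport, destport, protocol, packets, bytes, windowstart, windowend, action=\"REJECT\", flowlogstatus]"

  metric_transformation {
    name      = "RejectedConnections"
    namespace = "Custom/VPC"
    value     = "1" # each matching record adds 1 to the metric
  }
}

resource "aws_cloudwatch_metric_alarm" "vpc_rejects" {
  alarm_name          = "vpc-rejected-connections"
  namespace           = "Custom/VPC"
  metric_name         = "RejectedConnections"
  statistic           = "Sum"
  period              = 300 # seconds (5 min)
  evaluation_periods  = 1
  comparison_operator = "GreaterThanThreshold"
  threshold           = 100
  treat_missing_data  = "notBreaching" # no rejects means no datapoints
  alarm_actions       = [var.alerts_topic_arn]
}
```

Setting treat_missing_data to notBreaching is a design choice in this sketch: a filter that matches nothing emits no datapoints, and without it the alarm would sit in INSUFFICIENT_DATA during quiet periods.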


Service Mesh Metrics

Consul/Envoy Prometheus Metrics

Both web and api pods run Envoy sidecars that expose Prometheus metrics on port 20200:

| Metric | Description |
| --- | --- |
| envoy_http_downstream_rq_total | Total requests received |
| envoy_http_downstream_rq_xx | Requests by status code class (2xx, 4xx, 5xx) |
| envoy_http_downstream_rq_time | Request latency histogram |
| envoy_cluster_upstream_cx_active | Active upstream connections |
| envoy_cluster_upstream_rq_retry | Upstream retry count |

Scraping Configuration

The Datadog Agent discovers and scrapes the Envoy metrics endpoint through Prometheus annotations on each pod:

annotations:
  prometheus.io/scrape: "true"
  prometheus.io/port: "20200"
  prometheus.io/path: "/metrics"

Dashboards & Alerts Checklist

Recommended Datadog Dashboards

| Dashboard | Key Widgets |
| --- | --- |
| Service Overview | Request rate, error rate, latency (p50/p95/p99) per service |
| Infrastructure | Node CPU/memory, pod restarts, OOMKills |
| Database | Connection count, query latency, replication lag |
| Network | VPC flow log rejections, DNS resolution time |

Recommended Datadog Monitors

| Monitor | Condition | Severity |
| --- | --- | --- |
| High error rate | 5xx > 5% of requests over 5 min | Critical |
| High latency | p99 > 2s over 10 min | Warning |
| Pod restarts | > 3 restarts in 15 min | Critical |
| OOMKill | Any OOMKill event | Critical |
| Node not ready | Node NotReady > 5 min | Critical |
| Certificate expiry | < 7 days to expiry | Warning |
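
If these monitors are managed as code, the first row could be sketched with the Datadog Terraform provider as below. Several details are assumptions rather than facts about this platform: the trace metric names (trace.http.request.errors and trace.http.request.hits) depend on the tracing library and framework in use, trace error counts only approximate the 5xx condition, and the monitor name, message, and tags are illustrative.

```hcl
# Sketch only: metric names, monitor name, and tags are assumptions.
terraform {
  required_providers {
    datadog = {
      source = "DataDog/datadog"
    }
  }
}

# Credentials are read from the DD_API_KEY and DD_APP_KEY environment variables.
provider "datadog" {
  api_url = "https://api.datadoghq.eu/" # EU site, matching datadoghq.eu
}

resource "datadog_monitor" "web_high_error_rate" {
  name    = "High error rate on web"
  type    = "query alert"
  message = "Error rate on the web service exceeded 5% of requests over 5 minutes."

  # Ratio of errored requests to all requests over the last 5 minutes.
  query = "sum(last_5m):sum:trace.http.request.errors{service:web,env:dev}.as_count() / sum:trace.http.request.hits{service:web,env:dev}.as_count() > 0.05"

  monitor_thresholds {
    critical = 0.05 # must match the threshold in the query
  }

  tags = ["service:web", "env:dev"]
}
```

The service and env filters rely on the unified service tags described earlier, which is one practical reason to keep those labels and DD_* variables consistent across workloads.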

Observability Data Flow

┌───────────────┐     ┌───────────────┐     ┌───────────────┐
│ Application   │────▶│ Datadog Agent │────▶│ Datadog Cloud │
│ Pods          │     │ (DaemonSet)   │     │ (datadoghq.eu)│
│               │     │               │     │               │
│ APM traces    │     │ Collects:     │     │ Dashboards    │
│ DD_* env vars │     │ - traces      │     │ Monitors      │
│ stdout logs   │     │ - metrics     │     │ APM           │
└───────────────┘     │ - logs        │     │ Logs          │
                      └───────────────┘     └───────────────┘

┌───────────────┐     ┌───────────────┐     ┌───────────────┐
│ VPC/RDS/EKS   │────▶│ CloudWatch    │────▶│ SNS           │
│               │     │               │     │               │
│ Flow logs     │     │ Metric        │     │ Email alerts  │
│ RDS metrics   │     │ filters       │     │               │
│ EKS metrics   │     │ Alarms        │     │               │
└───────────────┘     └───────────────┘     └───────────────┘

┌───────────────┐     ┌───────────────┐
│ Envoy         │────▶│ Prometheus    │────▶ Scraped by Datadog Agent
│ Sidecars      │     │ :20200        │
│               │     │ /metrics      │
└───────────────┘     └───────────────┘