Observability

The platform implements a multi-layer observability stack covering application performance monitoring (Datadog), infrastructure alarms (CloudWatch), network visibility (VPC Flow Logs), and service mesh metrics (Consul/Envoy).

Observability Stack

| Layer | Tool | What It Covers |
| --- | --- | --- |
| APM & Traces | Datadog | Request traces, latency, error rates across web/api |
| Log Collection | Datadog Agent | Container stdout/stderr from all pods |
| Infrastructure Metrics | Datadog Agent | CPU, memory, disk, network for nodes and pods |
| Cluster Monitoring | Datadog Cluster Agent | Kubernetes events, deployments, node health |
| Infrastructure Alarms | CloudWatch | RDS health, VPC flow log anomalies |
| Network Visibility | VPC Flow Logs | All network traffic (ACCEPT + REJECT) |
| Service Mesh Metrics | Consul/Envoy | mTLS connections, request routing, circuit breaking |

Datadog

Architecture

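Datadog is deployed with the Datadog Operator. Terraform installs the operator, and the operator reconciles a DatadogAgent resource into the agent workloads: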

Terraform (datadog-operator component)
    │
    ├── kubernetes_namespace_v1 "datadog"
    └── helm_release "datadog-operator" (v2.1.0)
            │
            ▼
Datadog Operator (controller)
    │
    ▼
DatadogAgent custom resource (applied via kubectl)
    │
    ├── Node Agent (DaemonSet) ─── metrics, logs, APM traces
    ├── Cluster Agent ──────────── Kubernetes API, events, orchestrator
    └── Cluster Checks Runner ──── service discovery checks

Deployment

Step 1: Terraform creates the namespace and operator:

# terraform/components/datadog-operator/main.tf
resource "kubernetes_namespace_v1" "datadog" {
  metadata { name = "datadog" }
}

resource "helm_release" "datadog_operator" {
  name       = "datadog-operator"
  namespace  = "datadog"
  repository = "https://helm.datadoghq.com"
  chart      = "datadog-operator"
  version    = "2.1.0"
}

Step 2: Create the API key secret manually (required by the Stacks ephemeral value constraint described under "Why manual steps?"):

kubectl create secret generic datadog-secret \
  --namespace datadog \
  --from-literal=api-key="<YOUR_DATADOG_API_KEY>"

Step 3: Apply the DatadogAgent custom resource:

kubectl apply -f kubernetes/monitoring/datadog-agent.yaml

Why manual steps? HCP Terraform Stacks treats all store.varset values as ephemeral, so they cannot flow into kubernetes_secret resources, which persist to state. In addition, kubernetes_manifest resources produce null value errors during Stacks deferred changes because the CRD's namespace does not exist at plan time.

DatadogAgent Configuration

Reference manifest at kubernetes/monitoring/datadog-agent.yaml:

| Feature | Configuration |
| --- | --- |
| Cluster Name | `netlix-dev` |
| Site | `datadoghq.eu` |
| Cluster Agent | Enabled with orchestrator explorer |
| Cluster Checks | Enabled for service discovery |
| APM | Enabled with auto-instrumentation |
| Auto-instrumentation | Java, Python, JS, PHP, .NET, Ruby via `libVersions` |
| Log Collection | Container logs enabled |
| Prometheus Scrape | Enabled (scrapes Envoy metrics on port 20200) |

Unified Service Tags

Both web and api services include Datadog unified service tags for correlated metrics, traces, and logs:

Pod template labels:

labels:
  tags.datadoghq.com/service: "web"    # or "api"
  tags.datadoghq.com/env: "dev"        # via overlay patch
  tags.datadoghq.com/version: "latest" # via overlay patch

Environment variables:

env:
  - name: DD_SERVICE
    value: "web"
  - name: DD_ENV
    value: "dev"
  - name: DD_VERSION
    value: "latest"

These tags enable:

Correlated views across metrics, traces, and logs in Datadog
Service catalog population
Deployment tracking
Error tracking by service and version
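
Correlation only works when the pod labels and the DD_* environment variables carry the same values. The following is a minimal sketch of a consistency check, assuming the labels and env vars have already been read into plain dictionaries. The helper name `check_unified_tags` and the sample data are hypothetical and not part of the repository.

```python
# Hypothetical consistency check; the helper and sample data are illustrative.
TAG_KEYS = ("service", "env", "version")


def check_unified_tags(labels, env):
    """Compare tags.datadoghq.com/* labels with the matching DD_* env vars.

    Returns a list of human-readable problems; an empty list means consistent.
    """
    problems = []
    for key in TAG_KEYS:
        label_value = labels.get(f"tags.datadoghq.com/{key}")
        env_value = env.get(f"DD_{key.upper()}")
        if label_value is None or env_value is None:
            problems.append(f"{key}: label or env var is missing")
        elif label_value != env_value:
            problems.append(f"{key}: label={label_value!r}, env={env_value!r}")
    return problems


pod_labels = {
    "tags.datadoghq.com/service": "web",
    "tags.datadoghq.com/env": "dev",
    "tags.datadoghq.com/version": "latest",
}
container_env = {"DD_SERVICE": "web", "DD_ENV": "dev", "DD_VERSION": "latest"}

print(check_unified_tags(pod_labels, container_env))  # prints []
```

A check like this could run in CI against rendered overlays, so that a patch changing only the label or only the env var is caught before deployment.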

Verification

# Check Datadog pods
kubectl get pods -n datadog

# Expected pods:
# datadog-agent-<hash>         (DaemonSet - one per node)
# datadog-cluster-agent-<hash> (Deployment)
# datadog-operator-<hash>      (Deployment)

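# Check agent status on one node agent pod (the component label excludes the cluster agent)
kubectl exec -it -n datadog "$(kubectl get pod -n datadog -l app.kubernetes.io/name=datadog-agent-deployment,agent.datadoghq.com/component=agent -o name | head -1)" -- agent status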

CloudWatch Monitoring

Monitoring Component

The Terraform monitoring component creates CloudWatch infrastructure:

terraform/components/monitoring/
├── main.tf       # SNS topic, metric filters, alarms
├── variables.tf  # environment, project, RDS/EKS identifiers
├── outputs.tf    # SNS topic ARN
└── versions.tf

SNS Notification Topic

Topic: {project}-{environment}-alerts
Subscribers: alert_email (if configured)

CloudWatch Alarms

| Alarm | Metric | Threshold | Period | Action |
| --- | --- | --- | --- | --- |
| VPC Rejected Connections | Custom (metric filter on flow logs) | > 100 | 5 min | SNS |
| RDS CPU Utilization | `CPUUtilization` | > 80% | 5 min | SNS |
| RDS Free Storage | `FreeStorageSpace` | < 5 GB | 5 min | SNS |
| RDS Database Connections | `DatabaseConnections` | > 80% of max | 5 min | SNS |
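
To make the RDS thresholds concrete, here is a rough sketch that evaluates them against a single set of readings. It is a simplification: real CloudWatch alarms evaluate a statistic over each 5-minute period, and how the component derives the connection maximum is not stated in this page. The function name, the `max_connections` input, and the reading of "5 GB" as 5 GiB are all assumptions.

```python
GIB = 1024 ** 3  # assumption: "5 GB" is read as 5 GiB


def rds_alarm_states(cpu_percent, free_storage_bytes, connections, max_connections):
    """Return {alarm name: True if breaching} for the three RDS alarms."""
    return {
        "RDS CPU Utilization": cpu_percent > 80,
        # CloudWatch reports FreeStorageSpace in bytes, so convert the threshold.
        "RDS Free Storage": free_storage_bytes < 5 * GIB,
        # Integer form of "connections > 80% of max" avoids float rounding.
        "RDS Database Connections": connections * 100 > 80 * max_connections,
    }


states = rds_alarm_states(
    cpu_percent=85.0,
    free_storage_bytes=4 * GIB,
    connections=50,
    max_connections=100,  # hypothetical limit for illustration
)
for alarm, breaching in states.items():
    print(f"{alarm}: {'ALARM' if breaching else 'OK'}")
```

With these sample readings the CPU and storage alarms breach while the connection alarm does not.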

VPC Flow Log Analysis

VPC Flow Logs record metadata for all network traffic (both ACCEPT and REJECT) and deliver it to a CloudWatch Logs log group:

VPC → Flow Log → CloudWatch Log Group → Metric Filter (REJECT) → Alarm → SNS

The metric filter matches flow log records whose action is REJECT and increments a custom CloudWatch metric for each match. When the number of rejected connections exceeds 100 within a 5-minute window, the alarm fires and sends an SNS notification.
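
The filter-and-threshold logic can be approximated offline, for example when replaying exported flow logs. The sketch below assumes the default (version 2) flow log record format, where `start` is the 11th field and `action` is the 13th, and it buckets by the record's `start` time, which only approximates how CloudWatch aggregates the metric. The function names and sample records are hypothetical.

```python
from collections import Counter

WINDOW_SECONDS = 300  # 5-minute alarm period
THRESHOLD = 100       # alarm fires when REJECT count is strictly greater


def rejects_per_window(lines):
    """Count REJECT records per 5-minute window, keyed by window start (epoch s)."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) != 14:
            continue  # not a default-format record
        start, action = fields[10], fields[12]
        if action == "REJECT":
            window = int(start) // WINDOW_SECONDS * WINDOW_SECONDS
            counts[window] += 1
    return counts


def breaching_windows(counts, threshold=THRESHOLD):
    """Return the sorted window starts whose REJECT count exceeds the threshold."""
    return sorted(w for w, n in counts.items() if n > threshold)


def sample_record(start, action):
    # Fields: version account-id interface-id srcaddr dstaddr srcport dstport
    #         protocol packets bytes start end action log-status
    return (f"2 123456789012 eni-0abc 10.0.1.5 10.0.2.9 49152 5432 6 1 40 "
            f"{start} {start + 60} {action} OK")


records = [sample_record(1700000100 + i, "REJECT") for i in range(101)]
records.append(sample_record(1700000100, "ACCEPT"))

counts = rejects_per_window(records)
print(breaching_windows(counts))  # prints [1700000100]
```

Exactly 100 rejections in a window would not breach, because the alarm condition is a strict "greater than".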

Service Mesh Metrics

Consul/Envoy Prometheus Metrics

Both web and api pods run Envoy sidecars that expose Prometheus metrics on port 20200:

| Metric | Description |
| --- | --- |
| `envoy_http_downstream_rq_total` | Total requests received |
| `envoy_http_downstream_rq_xx` | Requests by status code class (2xx, 4xx, 5xx) |
| `envoy_http_downstream_rq_time` | Request latency histogram |
| `envoy_cluster_upstream_cx_active` | Active upstream connections |
| `envoy_cluster_upstream_rq_retry` | Upstream retry count |

Scraping Configuration

Datadog Agent scrapes Envoy metrics via Prometheus annotations on the pod:

annotations:
  prometheus.io/scrape: "true"
  prometheus.io/port: "20200"
  prometheus.io/path: "/metrics"
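
As an illustration of what the scrape returns, the sketch below parses a small Prometheus text payload and computes the share of 5xx responses from `envoy_http_downstream_rq_xx`. The sample payload, its label values, and the function names are assumptions made for this example. The counters are cumulative, so this yields a lifetime ratio rather than a rate over a time window.

```python
import re
from collections import defaultdict

# Illustrative payload; real output from :20200/metrics contains many more series.
SAMPLE = """\
# TYPE envoy_http_downstream_rq_xx counter
envoy_http_downstream_rq_xx{envoy_response_code_class="2",envoy_http_conn_manager_prefix="public_listener"} 940
envoy_http_downstream_rq_xx{envoy_response_code_class="4",envoy_http_conn_manager_prefix="public_listener"} 20
envoy_http_downstream_rq_xx{envoy_response_code_class="5",envoy_http_conn_manager_prefix="public_listener"} 40
"""

SERIES = re.compile(r"^envoy_http_downstream_rq_xx\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)")
CODE_CLASS = re.compile(r'envoy_response_code_class="(\d)"')


def requests_by_class(text):
    """Sum envoy_http_downstream_rq_xx samples per status class, e.g. {'2xx': 940.0}."""
    totals = defaultdict(float)
    for line in text.splitlines():
        series = SERIES.match(line.strip())
        if not series:
            continue  # comments and other metrics
        code_class = CODE_CLASS.search(series.group("labels"))
        if code_class:
            totals[code_class.group(1) + "xx"] += float(series.group("value"))
    return dict(totals)


def error_ratio(totals):
    """Fraction of requests that returned 5xx; 0.0 when there is no traffic."""
    total = sum(totals.values())
    return totals.get("5xx", 0.0) / total if total else 0.0


totals = requests_by_class(SAMPLE)
print(totals)
print(f"5xx ratio: {error_ratio(totals):.2%}")
```

In Datadog the same ratio would normally be computed from the rate of these counters over a time window instead of from raw totals.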

Dashboards & Alerts Checklist

Recommended Datadog Dashboards

| Dashboard | Key Widgets |
| --- | --- |
| Service Overview | Request rate, error rate, latency (p50/p95/p99) per service |
| Infrastructure | Node CPU/memory, pod restarts, OOMKills |
| Database | Connection count, query latency, replication lag |
| Network | VPC flow log rejections, DNS resolution time |

Recommended Datadog Monitors

| Monitor | Condition | Severity |
| --- | --- | --- |
| High error rate | 5xx > 5% of requests over 5 min | Critical |
| High latency | p99 > 2s over 10 min | Warning |
| Pod restarts | > 3 restarts in 15 min | Critical |
| OOMKill | Any OOMKill event | Critical |
| Node not ready | Node NotReady > 5 min | Critical |
| Certificate expiry | < 7 days to expiry | Warning |
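
The p99 condition is easy to misread, so here is a small sketch of what it measures. It uses the nearest-rank percentile definition over a list of latency samples in seconds. Datadog computes percentiles with its own aggregation, so this illustrates the concept and does not reproduce the monitor's evaluation. The function name and sample data are hypothetical.

```python
LATENCY_LIMIT_SECONDS = 2.0  # "p99 > 2s" from the monitor table


def percentile_nearest_rank(samples, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 1 <= p <= 100:
        raise ValueError("p must be between 1 and 100")
    ordered = sorted(samples)
    rank = (p * len(ordered) + 99) // 100  # ceil(p * n / 100) in integer math
    return ordered[rank - 1]


# 100 requests each: one slow outlier versus two slow outliers.
one_slow = [0.2] * 99 + [3.0]
two_slow = [0.2] * 98 + [3.0, 3.0]

for name, samples in (("one_slow", one_slow), ("two_slow", two_slow)):
    p99 = percentile_nearest_rank(samples, 99)
    print(name, p99, "breach" if p99 > LATENCY_LIMIT_SECONDS else "ok")
```

With 100 samples, a single slow request does not move p99, while two slow requests do. This is why the monitor pairs the percentile with a 10-minute window instead of reacting to individual outliers.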

Observability Data Flow

┌──────────────┐     ┌──────────────┐     ┌───────────────┐
│ Application  │────▶│ Datadog Agent│────▶│ Datadog Cloud │
│ Pods         │     │ (DaemonSet)  │     │ (datadoghq.eu)│
│              │     │              │     │               │
│ APM traces   │     │ Collects:    │     │ Dashboards    │
│ DD_* env vars│     │ - traces     │     │ Monitors      │
│ stdout logs  │     │ - metrics    │     │ APM           │
└──────────────┘     │ - logs       │     │ Logs          │
                     └──────────────┘     └───────────────┘

┌──────────────┐     ┌──────────────┐     ┌───────────────┐
│ VPC/RDS/EKS  │────▶│  CloudWatch  │────▶│     SNS       │
│              │     │              │     │               │
│ Flow logs    │     │ Metric       │     │ Email alerts  │
│ RDS metrics  │     │ filters      │     │               │
│ EKS metrics  │     │ Alarms       │     │               │
└──────────────┘     └──────────────┘     └───────────────┘

┌──────────────┐     ┌──────────────┐
│ Envoy        │────▶│ Prometheus   │────▶ Scraped by Datadog Agent
│ Sidecars     │     │ :20200       │
│              │     │ /metrics     │
└──────────────┘     └──────────────┘
