Monitoring

Chris Zinda edited this page Mar 6, 2026 · 1 revision

The lab includes a full observability stack for PKI infrastructure monitoring: metrics collection, log aggregation, dashboards, and alerting.

Stack Overview

| Component | Port | Purpose |
|---|---|---|
| Prometheus | 9090 | Metrics collection and alerting |
| Grafana | 3000 | Dashboards and visualization |
| PKI Exporter | 9091 | Scrapes Dogtag CAs, exposes metrics |
| Loki | 3100 | Log aggregation |
| Promtail | 9080 | Log shipping agent |

Prometheus

URL: http://localhost:9090

Prometheus scrapes the PKI Exporter every 15 seconds and evaluates the alert rules on the same 15-second interval.

Configuration

global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - /etc/prometheus/alert-rules.yml

scrape_configs:
  - job_name: "pki-exporter"
    scrape_interval: 15s
    static_configs:
      - targets: ["pki-exporter.cert-lab.local:9091"]

Configuration file: configs/prometheus/prometheus.yml
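One way to confirm from a script that the scrape is working is to ask the Prometheus HTTP API for pki_ca_up. The following is a minimal sketch and not part of the lab: the helper names build_query_url and parse_vector are hypothetical, and the sample body is hand-written to match the shape of an /api/v1/query instant-vector response (the values are illustrative, not real lab output).

```python
import json
from urllib.parse import urlencode

PROMETHEUS_URL = "http://localhost:9090"  # URL given on this page


def build_query_url(expr: str, base: str = PROMETHEUS_URL) -> str:
    """Return the instant-query URL for a PromQL expression."""
    return f"{base}/api/v1/query?{urlencode({'query': expr})}"


def parse_vector(body: str) -> dict:
    """Map (pki_type, ca_level) to the float value of each returned series."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error')}")
    values = {}
    for series in payload["data"]["result"]:
        labels = series["metric"]
        _timestamp, value = series["value"]  # [unix_time, "value as string"]
        key = (labels.get("pki_type"), labels.get("ca_level"))
        values[key] = float(value)
    return values


# Hand-written sample response (assumption: illustrative values only).
SAMPLE_BODY = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "pki_ca_up", "pki_type": "rsa", "ca_level": "root"},
             "value": [1736955000, "1"]},
            {"metric": {"__name__": "pki_ca_up", "pki_type": "pqc", "ca_level": "iot"},
             "value": [1736955000, "0"]},
        ],
    },
})

health = parse_vector(SAMPLE_BODY)
# CAs whose pki_ca_up value is 0, i.e. reported as down.
down = sorted(key for key, value in health.items() if value == 0)
print(build_query_url("pki_ca_up"))
print(down)
```

With the lab running, the body could be fetched with urllib.request.urlopen(build_query_url("pki_ca_up")).read().decode() and passed to parse_vector in place of SAMPLE_BODY.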

PKI Exporter

URL: http://localhost:9091/metrics

The PKI Exporter is a FastAPI application (containers/pki-exporter/app.py) that scrapes all 9 Dogtag CAs across the three PKI hierarchies and exposes the results as Prometheus-format metrics.

Exported Metrics

| Metric | Labels | Description |
|---|---|---|
| pki_ca_up | pki_type, ca_level | CA health: 1=up, 0=down |
| pki_certificates_total | pki_type, ca_level, status | Certificate counts by status |
| pki_certificates_expiring_total | pki_type, ca_level, window | Certs expiring within time window |
| pki_ocsp_response_seconds | pki_type, ca_level | Built-in OCSP query latency (seconds) |
| pki_ocsp_responder_up | pki_type | Dedicated OCSP responder health |
| pki_ocsp_responder_response_seconds | pki_type | Dedicated OCSP responder latency (seconds) |
| pki_crl_last_update_timestamp | pki_type, ca_level | CRL last update as Unix timestamp |
| pki_crl_next_update_timestamp | pki_type, ca_level | CRL next update as Unix timestamp |
| pki_crl_entries_total | pki_type, ca_level | Revoked entries in the CRL |
| pki_issuance_total | pki_type | Total certs issued (perf test) |
| pki_revocation_total | pki_type | Total certs revoked (perf test) |
| pki_issuance_rate | pki_type | Issuance throughput (certs/second) |
| pki_revocation_rate | pki_type | Revocation throughput (certs/second) |
| pki_issuance_duration_seconds | pki_type, quantile | Issuance latency percentiles |

Label Values

| Label | Values |
|---|---|
| pki_type | rsa, ecc, pqc |
| ca_level | root, intermediate, iot |
| status | valid, revoked, expired |
| window | 7d, 30d, 90d |
| quantile | 0.5, 0.95, 0.99 |
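To illustrate how these metrics and labels look on the wire, the sketch below parses a few lines in the Prometheus text exposition format. The sample lines and their values are invented for illustration, the gauge type in the TYPE comment is an assumption, and parse_sample is a hypothetical helper that only handles simple label values without escaped quotes.

```python
import re

# metric_name{label="value",...} value
SAMPLE_LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?'
    r'\s+(?P<value>\S+)'
)
LABEL_PAIR = re.compile(r'(\w+)="([^"]*)"')


def parse_sample(line: str):
    """Return (name, labels, value) for a sample line, or None for comments/blanks."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = SAMPLE_LINE.match(line)
    if match is None:
        raise ValueError(f"unparseable sample line: {line!r}")
    labels = dict(LABEL_PAIR.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))


# Invented sample of what /metrics might return (assumption, not lab output).
SAMPLE_TEXT = """\
# HELP pki_ca_up CA health: 1=up, 0=down
# TYPE pki_ca_up gauge
pki_ca_up{pki_type="rsa",ca_level="root"} 1
pki_ca_up{pki_type="pqc",ca_level="iot"} 0
pki_certificates_expiring_total{pki_type="ecc",ca_level="intermediate",window="7d"} 3
"""

samples = [s for s in map(parse_sample, SAMPLE_TEXT.splitlines()) if s is not None]

# Collect the (pki_type, ca_level) pairs whose pki_ca_up sample is 0.
down_cas = [
    (labels["pki_type"], labels["ca_level"])
    for name, labels, value in samples
    if name == "pki_ca_up" and value == 0
]
print(down_cas)
```

For anything beyond a quick check, a maintained parser such as the one in the prometheus_client package is a safer choice than a hand-rolled regex.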

Grafana

URL: http://localhost:3000

Credentials: username admin; the password is the value of ADMIN_PASSWORD in .env

Grafana is auto-provisioned with the PKI dashboard and datasources. See the Grafana Dashboards wiki page for full panel documentation.

Datasources

| Datasource | Type | URL |
|---|---|---|
| Prometheus | prometheus | http://prometheus.cert-lab.local:9090 |
| Loki | loki | http://loki.cert-lab.local:3100 |

Loki

URL: http://localhost:3100

Loki provides log aggregation for PKI audit logs, EDA event logs, and security events. Logs can be queried through Grafana's Explore interface using LogQL.

Example Queries

# All Dogtag audit logs for RSA PKI
{job="dogtag_audit", pki_type="rsa"}

# EDA event logs
{job="eda_events"}

# Security events from Kafka
{job="security_events"}

# Filter by CA level
{job="dogtag_audit", ca_level="iot"}

# Search for revocation events
{job="dogtag_audit"} |= "CERT_STATUS_CHANGE"
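The same LogQL queries can also be sent to Loki's HTTP API outside Grafana. The sketch below only builds a /loki/api/v1/query_range URL and sends nothing. The helper names, the one-hour lookback, and the limit of 100 are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

LOKI_URL = "http://localhost:3100"  # URL given on this page


def to_unix_ns(moment: datetime) -> int:
    """Convert a timezone-aware datetime to nanoseconds since the Unix epoch."""
    return int(moment.timestamp()) * 1_000_000_000


def build_query_range_url(
    logql: str,
    end: datetime,
    lookback: timedelta = timedelta(hours=1),  # assumption: illustrative window
    limit: int = 100,                          # assumption: illustrative limit
    base: str = LOKI_URL,
) -> str:
    """Build a query_range URL covering [end - lookback, end]."""
    params = {
        "query": logql,
        "start": to_unix_ns(end - lookback),
        "end": to_unix_ns(end),
        "limit": limit,
        "direction": "backward",  # newest entries first
    }
    return f"{base}/loki/api/v1/query_range?{urlencode(params)}"


REVOCATION_QUERY = '{job="dogtag_audit"} |= "CERT_STATUS_CHANGE"'
url = build_query_range_url(
    REVOCATION_QUERY,
    end=datetime(2025, 1, 15, 16, 0, tzinfo=timezone.utc),
)
print(url)
```

urlencode percent-encodes the braces, quotes, and pipe in the LogQL expression, so the query string can be passed exactly as it would be typed into Grafana's Explore view.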

Promtail

Listen port: 9080 (HTTP)

Promtail ships logs from the host filesystem to Loki. Configuration file: configs/promtail/promtail-config.yml

Scrape Jobs

| Job Name | Description | Log Path |
|---|---|---|
| dogtag_audit | Dogtag PKI CA audit logs | /var/log/pki/{rsa,ecc,pq}/*/*.log |
| eda_logs | Event-Driven Ansible event logs | /var/log/eda/*.log |
| security_events | Kafka security event logs | /var/log/security-events/*.log |

Labels

Each log stream is labeled for filtering in Grafana/LogQL:

| Label | Applied To | Values |
|---|---|---|
| job | All jobs | dogtag_audit, eda_events, security_events |
| pki_type | dogtag_audit | rsa, ecc, pqc |
| ca_level | dogtag_audit | root, intermediate, iot |
| level | All jobs | Extracted from log line |
| source | dogtag_audit | Extracted from audit log |
| event | dogtag_audit | Extracted from audit log |

Pipeline Stages

The dogtag_audit job parses structured Dogtag audit log lines:

[2025-01-15T10:30:00.000-0500][AuditService][INFO][CERT_STATUS_CHANGE] ...

Regex pattern:

^\[(?P<timestamp>[^\]]+)\]\[(?P<source>[^\]]+)\]\[(?P<level>[^\]]+)\]\[(?P<event>[^\]]+)\](?P<message>.*)

The extracted source, level, and event fields are promoted to labels. The timestamp field is parsed with the Go reference-time layout 2006-01-02T15:04:05.000-0700.
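The pattern can be checked outside Promtail. The sketch below applies the same regex to the sample line using Python's re module. Promtail itself uses Go's RE2 engine, but this particular pattern behaves the same in both. The strptime format string is a Python approximation of the Go layout, and parse_audit_line is a hypothetical helper name.

```python
import re
from datetime import datetime

# Same pattern as the Promtail pipeline stage, split across two lines for readability.
AUDIT_LINE = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\]\[(?P<source>[^\]]+)\]"
    r"\[(?P<level>[^\]]+)\]\[(?P<event>[^\]]+)\](?P<message>.*)"
)

# Python counterpart of the Go layout 2006-01-02T15:04:05.000-0700.
PY_TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%f%z"


def parse_audit_line(line: str):
    """Return the extracted fields as a dict, or None if the line does not match."""
    match = AUDIT_LINE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["timestamp"] = datetime.strptime(fields["timestamp"], PY_TIME_FORMAT)
    return fields


# Sample line from this page; the trailing "..." is elided message text.
line = "[2025-01-15T10:30:00.000-0500][AuditService][INFO][CERT_STATUS_CHANGE] ..."
parsed = parse_audit_line(line)

# The three fields that the pipeline promotes to labels.
stream_labels = {key: parsed[key] for key in ("source", "level", "event")}
print(stream_labels)
print(parsed["timestamp"].isoformat())
```

Lines that do not start with the four bracketed fields return None here; how Promtail treats such lines depends on the pipeline configuration, which this page does not show.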

Alert Rules

Alert rules are defined in configs/prometheus/alert-rules.yml and loaded by Prometheus.

Alert Groups

pki_ocsp_alerts

| Alert | Expression | For | Severity | Description |
|---|---|---|---|---|
| OCSPResponseSlow | pki_ocsp_response_seconds > 2 | 1m | warning | Built-in OCSP response exceeds 2 seconds |
| OCSPResponseCritical | pki_ocsp_response_seconds > 5 | 1m | critical | Built-in OCSP response exceeds 5 seconds |
| OCSPResponderDown | pki_ocsp_responder_up == 0 | 2m | critical | Dedicated OCSP responder unreachable |
| OCSPResponderSlow | pki_ocsp_responder_response_seconds > 2 | 1m | warning | Dedicated OCSP responder exceeds 2 seconds |

pki_ca_alerts

| Alert | Expression | For | Severity | Description |
|---|---|---|---|---|
| CADown | pki_ca_up == 0 | 2m | critical | CA unreachable for more than 2 minutes |
| CertificatesExpiringSoon | pki_certificates_expiring_total{window="7d"} > 0 | 5m | warning | Certificates expiring within 7 days |
| HighRevocationRate | pki_revocation_rate > 10 | 1m | warning | Revocation rate exceeds 10 certs/second |

pki_ct_log_alerts

| Alert | Expression | For | Severity | Description |
|---|---|---|---|---|
| CTLogDown | pki_ct_log_up == 0 | 2m | warning | Certificate Transparency log unreachable |
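The For column interacts with the 15-second evaluation interval: an alert is pending from the first evaluation where its expression is true and only fires once the expression has stayed true for the full duration. The sketch below is a simplified model of that behavior, not Prometheus code. alert_states is a hypothetical helper, and it ignores missed scrapes, staleness, and any other rule options.

```python
from typing import Callable, List, Sequence


def alert_states(
    samples: Sequence[float],
    condition: Callable[[float], bool],
    for_seconds: int,
    interval_seconds: int = 15,  # evaluation_interval from prometheus.yml
) -> List[str]:
    """Return 'inactive', 'pending', or 'firing' for each evaluation step."""
    states = []
    active_since = None  # time at which the condition first became true
    for step, value in enumerate(samples):
        now = step * interval_seconds
        if condition(value):
            if active_since is None:
                active_since = now
            held_for = now - active_since
            states.append("firing" if held_for >= for_seconds else "pending")
        else:
            active_since = None  # any false evaluation resets the timer
            states.append("inactive")
    return states


# CADown: pki_ca_up == 0 with For = 2m. One healthy sample, then nine down samples.
ca_up = [1] + [0] * 9
cadown = alert_states(ca_up, lambda v: v == 0, for_seconds=120)
print(cadown)

# A brief recovery resets the pending timer, so this sequence never fires.
flapping = alert_states([0, 0, 1, 0], lambda v: v == 0, for_seconds=120)
print(flapping)
```

In this model CADown first fires on the ninth consecutive down evaluation, 120 seconds after the first one, which is why a CA that flaps for less than two minutes does not page anyone.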
