Monitoring
The lab includes a full observability stack for PKI infrastructure monitoring: metrics collection, log aggregation, dashboards, and alerting.
| Component | Port | Purpose |
|---|---|---|
| Prometheus | 9090 | Metrics collection and alerting |
| Grafana | 3000 | Dashboards and visualization |
| PKI Exporter | 9091 | Scrapes Dogtag CAs, exposes metrics |
| Loki | 3100 | Log aggregation |
| Promtail | 9080 | Log shipping agent |
Prometheus
URL: http://localhost:9090
Prometheus scrapes the PKI Exporter every 15 seconds and evaluates alert rules at the same interval.

Configuration file: configs/prometheus/prometheus.yml

    global:
      scrape_interval: 15s
      evaluation_interval: 15s

    rule_files:
      - /etc/prometheus/alert-rules.yml

    scrape_configs:
      - job_name: "pki-exporter"
        scrape_interval: 15s
        static_configs:
          - targets: ["pki-exporter.cert-lab.local:9091"]
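
To check from a script that the scrape job is healthy, the Prometheus HTTP API can be queried directly. The following is a minimal sketch, assuming the stack is running and Prometheus is reachable at the localhost URL given above. The helper names (`build_query_url`, `parse_vector`, `query`) are illustrative and not part of the lab.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

PROMETHEUS_URL = "http://localhost:9090"  # lab URL from this page


def build_query_url(expr: str, base_url: str = PROMETHEUS_URL) -> str:
    """Return the instant-query URL (/api/v1/query) for a PromQL expression."""
    return f"{base_url}/api/v1/query?{urlencode({'query': expr})}"


def parse_vector(payload: dict) -> list[tuple[dict, float]]:
    """Extract (labels, value) pairs from an instant-vector API response."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    # Each result looks like {"metric": {...labels...}, "value": [ts, "1"]}.
    return [(s["metric"], float(s["value"][1])) for s in payload["data"]["result"]]


def query(expr: str) -> list[tuple[dict, float]]:
    """Run an instant query against Prometheus (requires the stack to be up)."""
    with urlopen(build_query_url(expr), timeout=5) as resp:
        return parse_vector(json.load(resp))
```

Calling `query('up{job="pki-exporter"}')` should return one sample per scrape target, with value 1.0 when the last scrape succeeded.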
PKI Exporter
URL: http://localhost:9091/metrics
A FastAPI application (containers/pki-exporter/app.py) that scrapes all 9 Dogtag CAs across the three PKI hierarchies and exposes Prometheus-format metrics.
| Metric | Labels | Description |
|---|---|---|
| `pki_ca_up` | `pki_type`, `ca_level` | CA health: 1=up, 0=down |
| `pki_certificates_total` | `pki_type`, `ca_level`, `status` | Certificate counts by status |
| `pki_certificates_expiring_total` | `pki_type`, `ca_level`, `window` | Certs expiring within time window |
| `pki_ocsp_response_seconds` | `pki_type`, `ca_level` | Built-in OCSP query latency (seconds) |
| `pki_ocsp_responder_up` | `pki_type` | Dedicated OCSP responder health |
| `pki_ocsp_responder_response_seconds` | `pki_type` | Dedicated OCSP responder latency (seconds) |
| `pki_crl_last_update_timestamp` | `pki_type`, `ca_level` | CRL last update as Unix timestamp |
| `pki_crl_next_update_timestamp` | `pki_type`, `ca_level` | CRL next update as Unix timestamp |
| `pki_crl_entries_total` | `pki_type`, `ca_level` | Revoked entries in the CRL |
| `pki_issuance_total` | `pki_type` | Total certs issued (perf test) |
| `pki_revocation_total` | `pki_type` | Total certs revoked (perf test) |
| `pki_issuance_rate` | `pki_type` | Issuance throughput (certs/second) |
| `pki_revocation_rate` | `pki_type` | Revocation throughput (certs/second) |
| `pki_issuance_duration_seconds` | `pki_type`, `quantile` | Issuance latency percentiles (seconds) |

| Label | Values |
|---|---|
| `pki_type` | `rsa`, `ecc`, `pqc` |
| `ca_level` | `root`, `intermediate`, `iot` |
| `status` | `valid`, `revoked`, `expired` |
| `window` | `7d`, `30d`, `90d` |
| `quantile` | `0.5`, `0.95`, `0.99` |
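
The metrics above are served in the Prometheus text exposition format, so they can also be inspected without Prometheus. The sketch below parses that format with the standard library and lists CAs reporting `pki_ca_up` of 0. It is a simplified parser (it ignores optional sample timestamps and escaped quotes in label values), and the sample text is made up for illustration rather than captured from the exporter.

```python
import re

# name{label="value",...} value   (labels are optional)
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text: str) -> list[tuple[str, dict[str, str], float]]:
    """Parse exposition text into (metric name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE comments
            continue
        m = SAMPLE_RE.match(line)
        if m is None:
            continue
        labels = dict(LABEL_RE.findall(m.group("labels") or ""))
        samples.append((m.group("name"), labels, float(m.group("value"))))
    return samples


def down_cas(text: str) -> list[dict[str, str]]:
    """Return the label sets of every CA whose pki_ca_up gauge is 0."""
    return [
        labels
        for name, labels, value in parse_metrics(text)
        if name == "pki_ca_up" and value == 0
    ]


# Hypothetical sample; real output comes from http://localhost:9091/metrics
SAMPLE = """\
# TYPE pki_ca_up gauge
pki_ca_up{pki_type="rsa",ca_level="root"} 1
pki_ca_up{pki_type="pqc",ca_level="iot"} 0
pki_certificates_total{pki_type="rsa",ca_level="root",status="valid"} 42
"""

print(down_cas(SAMPLE))
```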
Grafana
URL: http://localhost:3000
Credentials: admin / value of ADMIN_PASSWORD from .env
Grafana is auto-provisioned with the PKI dashboard and datasources. See the Grafana Dashboards wiki page for full panel documentation.
| Datasource | Type | URL |
|---|---|---|
| Prometheus | prometheus | http://prometheus.cert-lab.local:9090 |
| Loki | loki | http://loki.cert-lab.local:3100 |
Loki
URL: http://localhost:3100
Loki provides log aggregation for PKI audit logs, EDA event logs, and security events. Logs can be queried through Grafana's Explore interface using LogQL.

    # All Dogtag audit logs for RSA PKI
    {job="dogtag_audit", pki_type="rsa"}

    # EDA event logs
    {job="eda_events"}

    # Security events from Kafka
    {job="security_events"}

    # Filter by CA level
    {job="dogtag_audit", ca_level="iot"}

    # Search for revocation events
    {job="dogtag_audit"} |= "CERT_STATUS_CHANGE"
Promtail
Listen port: 9080 (HTTP)
Promtail ships logs from the host filesystem to Loki. Configuration file: configs/promtail/promtail-config.yml
| Job Name | Description | Log Path |
|---|---|---|
| `dogtag_audit` | Dogtag PKI CA audit logs | `/var/log/pki/{rsa,ecc,pq}/*/*.log` |
| `eda_logs` | Event-Driven Ansible event logs | `/var/log/eda/*.log` |
| `security_events` | Kafka security event logs | `/var/log/security-events/*.log` |
Each log stream is labeled for filtering in Grafana/LogQL:
| Label | Applied To | Values |
|---|---|---|
| `job` | All jobs | `dogtag_audit`, `eda_events`, `security_events` |
| `pki_type` | `dogtag_audit` | `rsa`, `ecc`, `pqc` |
| `ca_level` | `dogtag_audit` | `root`, `intermediate`, `iot` |
| `level` | All jobs | Extracted from log line |
| `source` | `dogtag_audit` | Extracted from audit log |
| `event` | `dogtag_audit` | Extracted from audit log |
The dogtag_audit job parses structured Dogtag audit log lines:

    [2025-01-15T10:30:00.000-0500][AuditService][INFO][CERT_STATUS_CHANGE] ...

Regex pattern:

    ^\[(?P<timestamp>[^\]]+)\]\[(?P<source>[^\]]+)\]\[(?P<level>[^\]]+)\]\[(?P<event>[^\]]+)\](?P<message>.*)

Extracted fields (source, level, event) are promoted to labels. The timestamp field is parsed with the Go reference-time layout 2006-01-02T15:04:05.000-0700.
Alert rules are defined in configs/prometheus/alert-rules.yml and loaded by Prometheus.
| Alert | Expression | For | Severity | Description |
|---|---|---|---|---|
| OCSPResponseSlow | `pki_ocsp_response_seconds > 2` | 1m | warning | Built-in OCSP response exceeds 2 seconds |
| OCSPResponseCritical | `pki_ocsp_response_seconds > 5` | 1m | critical | Built-in OCSP response exceeds 5 seconds |
| OCSPResponderDown | `pki_ocsp_responder_up == 0` | 2m | critical | Dedicated OCSP responder unreachable |
| OCSPResponderSlow | `pki_ocsp_responder_response_seconds > 2` | 1m | warning | Dedicated OCSP responder exceeds 2 seconds |

| Alert | Expression | For | Severity | Description |
|---|---|---|---|---|
| CADown | `pki_ca_up == 0` | 2m | critical | CA unreachable for more than 2 minutes |
| CertificatesExpiringSoon | `pki_certificates_expiring_total{window="7d"} > 0` | 5m | warning | Certificates expiring within 7 days |
| HighRevocationRate | `pki_revocation_rate > 10` | 1m | warning | Revocation rate exceeds 10 certs/second |

| Alert | Expression | For | Severity | Description |
|---|---|---|---|---|
| CTLogDown | `pki_ct_log_up == 0` | 2m | warning | Certificate Transparency log unreachable |
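
The For column means the expression must stay true across consecutive evaluations for that long before the alert fires; until then it is only pending. The sketch below is a simplified model of that behaviour, not Prometheus code, using the CADown rule (`pki_ca_up == 0` for 2m) with the 15-second evaluation interval configured above. The function name and the sample series are illustrative.

```python
from typing import Callable, Iterable, Optional


def first_firing_time(
    samples: Iterable[tuple[int, float]],
    condition: Callable[[float], bool],
    for_seconds: int,
) -> Optional[int]:
    """Return the evaluation time at which the alert would first fire, or None.

    samples: (evaluation time in seconds, metric value) pairs in time order.
    """
    active_since = None
    for ts, value in samples:
        if condition(value):
            if active_since is None:
                active_since = ts  # condition became true: alert is pending
            if ts - active_since >= for_seconds:
                return ts  # held for the full duration: alert fires
        else:
            active_since = None  # condition cleared: pending state resets
    return None


# CA goes down at t=15s and stays down; evaluations happen every 15s.
outage = [(t, 1.0 if t < 15 else 0.0) for t in range(0, 300, 15)]
print(first_firing_time(outage, lambda v: v == 0, for_seconds=120))  # 135

# CA flaps: down for 60s, up once, then down again. The pending timer restarts.
flap = [(t, 1.0 if t == 75 else 0.0) for t in range(15, 300, 15)]
print(first_firing_time(flap, lambda v: v == 0, for_seconds=120))  # 210
```

In this model a brief recovery resets the timer, which is why short outages under the For duration never page.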