This runbook defines minimum operational controls for high-assurance deployment of pqmsg-server.
Primary goals:
- sustain service availability under abuse and partial failure,
- preserve security-event visibility and response capability,
- recover to a defined recovery point objective (RPO) and recovery time objective (RTO).
flowchart LR
A[pqmsg-server replicas] --> B[Prometheus]
A --> C[Loki / Audit Logs]
A -->|PII-scrubbed| C
B --> D[11 Alert Rules]
D --> AM[Alertmanager]
AM --> E[On-call Escalation]
A --> F[PostgreSQL]
A --> G[Redis Rate Limiter]
A -->|circuit breaker| PN[FCM / APNS Push]
F --> H[Encrypted Backups]
Production baselines:
- availability SLO: >= 99.9% per rolling 30 days,
- 95th percentile request latency target: <= 250ms for authenticated API paths,
- authenticated reject anomaly target: sustained spikes investigated within incident window.
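The availability SLO above implies a fixed downtime budget per window. The sketch below works through that arithmetic; the function names are hypothetical and not part of pqmsg-server.

```python
from datetime import timedelta


def error_budget(slo: float, window: timedelta) -> timedelta:
    """Downtime allowed by an availability SLO over a rolling window."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    return window * (1.0 - slo)


def budget_remaining(slo: float, window: timedelta, downtime: timedelta) -> timedelta:
    """Budget left after downtime already observed in the window (may be negative)."""
    return error_budget(slo, window) - downtime


# 99.9% over a rolling 30 days, as stated in the baselines above.
budget = error_budget(0.999, timedelta(days=30))
print(f"allowed downtime: {budget.total_seconds() / 60:.1f} minutes")
```

At 99.9% over 30 days the budget is 43.2 minutes, so a single incident that runs to the full 60-minute RTO would exhaust it.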
Prometheus alert rules are defined in:
observability/prometheus/alert-rules.yml
Current baseline alerts (11 rules):
- sustained 5xx ratio above 2% for 10m (critical),
- auth reject spike for signature/replay/skew events (high),
- sustained rate-limit reject spike (high),
- sustained high in-flight request pressure (medium),
- push circuit breaker open — FCM/APNS (critical),
- signed prekey staleness — rotation failures (high),
- PQ prekey pool depletion — last-resort bundle served (high),
- device revocation spike (high),
- PQ ratchet stall — no progress while messages flow (high),
- nonce replay burst — active attack signal (critical),
- registration spike — bot activity (medium).
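Several of these rules are "sustained" conditions: the signal must stay above a threshold for a hold period before the alert fires. The sketch below illustrates that logic for the 5xx rule (2% for 10m). It is not how Prometheus evaluates rules internally, and the `Sample` type and function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    ts: float        # sample time in seconds
    total: int       # requests observed in the sample interval
    errors_5xx: int  # 5xx responses observed in the same interval


def sustained_breach(samples, threshold=0.02, hold_seconds=600):
    """True if the 5xx ratio stayed above threshold for at least hold_seconds.

    Any sample at or below the threshold resets the hold timer.
    """
    breach_start = None
    for s in sorted(samples, key=lambda x: x.ts):
        ratio = s.errors_5xx / s.total if s.total else 0.0
        if ratio > threshold:
            if breach_start is None:
                breach_start = s.ts
            if s.ts - breach_start >= hold_seconds:
                return True
        else:
            breach_start = None
    return False
```

The reset on a single healthy sample is the reason a flapping error rate may never fire a sustained rule; whether that is acceptable depends on the thresholds chosen in alert-rules.yml.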
Escalation routing is defined in:
observability/alertmanager/alertmanager.yml
Receiver mapping:
- oncall-critical: severity=critical,
- oncall-high: severity=high,
- oncall-standard: severity in {medium, low}.
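The authoritative routing lives in alertmanager.yml. The mapping can also be mirrored in a small lookup, for example to state drill expectations. This is a sketch only: the function name is hypothetical, and raising on an unknown severity is an assumption, since the runbook does not say how unlabeled alerts are routed.

```python
# Severity label -> receiver name, as listed in the receiver mapping above.
RECEIVERS = {
    "critical": "oncall-critical",
    "high": "oncall-high",
    "medium": "oncall-standard",
    "low": "oncall-standard",
}


def receiver_for(labels: dict) -> str:
    """Return the receiver expected for an alert's label set."""
    severity = str(labels.get("severity", "")).lower()
    if severity not in RECEIVERS:
        # Assumption: treat unknown or missing severities as a configuration error.
        raise ValueError(f"no receiver mapped for severity {severity!r}")
    return RECEIVERS[severity]
```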
Production deployment must bind SMTP settings and receiver mailbox targets:
- ALERT_EMAIL_SMARTHOST,
- ALERT_EMAIL_FROM,
- ALERT_EMAIL_CRITICAL_TO,
- ALERT_EMAIL_HIGH_TO,
- ALERT_EMAIL_STANDARD_TO.
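A deployment preflight could refuse to proceed when any of these settings is unset. A minimal sketch, assuming the values are supplied as process environment variables; the function name is hypothetical.

```python
import os

REQUIRED_ALERT_SETTINGS = (
    "ALERT_EMAIL_SMARTHOST",
    "ALERT_EMAIL_FROM",
    "ALERT_EMAIL_CRITICAL_TO",
    "ALERT_EMAIL_HIGH_TO",
    "ALERT_EMAIL_STANDARD_TO",
)


def missing_alert_settings(env=None):
    """Return the required setting names that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ALERT_SETTINGS if not (env.get(name) or "").strip()]
```

An empty result means every mailbox target is bound; a non-empty one names exactly what is missing.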
Optional deployment-governance escalation input:
- PQMSG_ALERTMANAGER_API_URL: for GitHub promotion/rollback workflows that should submit incident alerts directly to Alertmanager when rollout governance fails.
- PQMSG_INCIDENT_ISSUE_REPO: for GitHub promotion/rollback workflows that should publish the same incident into a durable GitHub issue record (current repo or dedicated incident repo).
Severity levels:
- SEV-1: confidentiality/integrity risk, active exploitation, or full service outage.
- SEV-2: major degradation or sustained security control failure.
- SEV-3: localized degradation or non-exploited hardening gap.
Acknowledgement targets:
- SEV-1: acknowledge within 15m; mitigation owner assigned immediately.
- SEV-2: acknowledge within 30m.
- SEV-3: acknowledge within 4h.
Incident response steps:
- confirm signal validity from Prometheus metrics + audit log evidence,
- isolate scope (affected users/devices/regions/build version),
- execute mitigation playbook (rate-limit hardening, credential rotation, rollback),
- preserve forensic artifacts (request IDs, audit JSONL, deployment digest),
- publish incident timeline and closure criteria.
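The acknowledgement windows above (15m, 30m, 4h) translate directly into deadlines. A small sketch, with hypothetical function names, that a paging or reporting script might use:

```python
from datetime import datetime, timedelta, timezone

# Acknowledgement windows per severity, as stated in this runbook.
ACK_WINDOWS = {
    "SEV-1": timedelta(minutes=15),
    "SEV-2": timedelta(minutes=30),
    "SEV-3": timedelta(hours=4),
}


def ack_deadline(severity: str, opened_at: datetime) -> datetime:
    """Latest time by which the incident must be acknowledged."""
    return opened_at + ACK_WINDOWS[severity]


def ack_breached(severity: str, opened_at: datetime, now: datetime) -> bool:
    """True once the acknowledgement window has fully elapsed."""
    return now > ack_deadline(severity, opened_at)
```

Using timezone-aware timestamps (for example `datetime.now(timezone.utc)`) avoids ambiguity when on-call staff span regions.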
Run an escalation drill at least once per release cycle:
- ensure observability stack is running (docker-compose.observability.yml),
- execute: ./scripts/security/alert_drill.sh or ./scripts/security/alert_drill.ps1,
- verify alert fan-out in sink logs:
  - query Mailpit API (http://127.0.0.1:8025/api/v1/messages) in local stack,
- attach drill evidence (timestamp + captured output) to release record.
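The fan-out check in the drill can be scripted against the local Mailpit endpoint named above. The sketch below assumes the response is a JSON object with a `messages` list whose entries carry `Subject` and a `To` list of `{"Address": ...}` objects. Verify that shape against the running Mailpit version before relying on it. The function names and the example mailbox addresses are hypothetical.

```python
import json
import urllib.request

MAILPIT_MESSAGES_URL = "http://127.0.0.1:8025/api/v1/messages"


def fetch_messages(url=MAILPIT_MESSAGES_URL, timeout=5):
    """Fetch the message listing from the local Mailpit sink."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def recipients_seen(payload, subject_contains=""):
    """Collect recipient addresses of messages whose subject matches the filter."""
    seen = set()
    for msg in payload.get("messages") or []:
        subject = msg.get("Subject") or ""
        if subject_contains.lower() not in subject.lower():
            continue
        for rcpt in msg.get("To") or []:
            address = rcpt.get("Address")
            if address:
                seen.add(address.lower())
    return seen


def missing_fanout(payload, expected, subject_contains=""):
    """Return expected mailboxes that received no matching message."""
    wanted = {addr.lower() for addr in expected}
    return sorted(wanted - recipients_seen(payload, subject_contains))


# Example use during a drill (requires the local observability stack):
#   payload = fetch_messages()
#   print(missing_fanout(payload, ["critical@example.test", "high@example.test"]))
```

An empty list from `missing_fanout` indicates every expected receiver mailbox got at least one drill message; the captured output can be attached as drill evidence.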
Promotion/rollback failure governance now also emits structured incident handoff records and, when PQMSG_ALERTMANAGER_API_URL is configured in the GitHub Environment, submits Alertmanager-compatible incident alerts automatically from the workflow failure path.
The resulting bundle now includes a submission record showing whether the Alertmanager handoff was skipped, attempted, delivered, or failed, so on-call can distinguish governance failure from escalation-delivery failure.
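The handoff and its submission record can be pictured as follows. This is a sketch under stated assumptions, not the workflow's actual code: it assumes PQMSG_ALERTMANAGER_API_URL holds the full alert-submission endpoint (for Alertmanager's v2 API that path is `/api/v2/alerts`), the label names and alert name are hypothetical, and the intermediate "attempted" state is omitted for brevity.

```python
import json
import urllib.request
from datetime import datetime, timezone


def build_alert(environment, operation, summary):
    """Build one Alertmanager-style alert object (label names are illustrative)."""
    return {
        "labels": {
            "alertname": "PqmsgDeploymentGovernanceFailure",  # hypothetical name
            "severity": "critical",  # assumption: governance failures page as critical
            "environment": environment,
            "operation": operation,
        },
        "annotations": {"summary": summary},
        "startsAt": datetime.now(timezone.utc).isoformat(),
    }


def http_post_json(url, payload, timeout=10):
    """POST a JSON payload and return the HTTP status code."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status


def submit_alert(api_url, alert, post=http_post_json):
    """Submit the alert and return a record of what happened."""
    if not api_url:
        return {"status": "skipped", "detail": "no Alertmanager URL configured"}
    try:
        code = post(api_url, [alert])  # the endpoint accepts a list of alerts
    except OSError as exc:  # urllib's URLError and HTTPError derive from OSError
        return {"status": "failed", "detail": str(exc)}
    if 200 <= code < 300:
        return {"status": "delivered", "detail": f"HTTP {code}"}
    return {"status": "failed", "detail": f"HTTP {code}"}
```

Keeping the outcome in a record separate from the incident itself is what lets on-call tell "the rollout failed" apart from "the rollout failed and the page never went out".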
When PQMSG_INCIDENT_ISSUE_REPO is configured, the workflow also creates or updates a GitHub issue for the incident, records that publication outcome in the evidence bundle, and applies the shared pqmsg-* incident labels so on-call can filter by environment, deployment mode, operation, and open/resolved state.
Successful remediation runs now use the same deployment scope to comment on and close older open incident issues, so the issue tracker reflects both incident creation and incident resolution.
The final uploaded bundle manifest digest is also commented back onto the incident issue thread, so operators can match the issue directly to the exact evidence bundle contents that were archived.
Those workflow-driven issue comments are deduplicated by hidden markers, so retries update evidence safely without piling duplicate comments onto the same incident.
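Marker-based deduplication is commonly done by embedding an HTML comment in the comment body and updating the existing comment when that marker is already present. The sketch below models the idea in memory; the marker format, the `IssueThread` type, and the function names are hypothetical, and a real workflow would call the GitHub API instead of mutating a list.

```python
from dataclasses import dataclass, field


@dataclass
class IssueThread:
    comments: list = field(default_factory=list)  # comment bodies, oldest first


def marker_for(kind: str, key: str) -> str:
    """Hidden marker that identifies one logical comment on an issue."""
    return f"<!-- pqmsg-evidence:{kind}:{key} -->"


def upsert_comment(thread: IssueThread, kind: str, key: str, body: str) -> str:
    """Create the comment, or replace it in place if its marker already exists."""
    marker = marker_for(kind, key)
    text = f"{marker}\n{body}"
    for index, existing in enumerate(thread.comments):
        if marker in existing:
            thread.comments[index] = text
            return "updated"
    thread.comments.append(text)
    return "created"
```

Because the marker is derived from stable inputs (here a kind and a key such as a run identifier), a retried workflow resolves to the same comment rather than appending a new one.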
Recovery objectives:
- RPO: <= 15 minutes,
- RTO: <= 60 minutes.
Data classes:
- PostgreSQL persistent state: mandatory backup,
- Redis limiter state: ephemeral and reconstructable,
- audit logs: append-only retention with off-host copy.
Backup cadence:
- PostgreSQL base backup: daily,
- WAL/archive incremental: every 5-15m,
- audit JSONL export: near-real-time or every 15m.
Restore validation cadence:
- restore test to isolated environment: monthly,
- full disaster simulation (database + node replacement): quarterly.
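The backup cadence can be checked against the recovery objectives with simple arithmetic. The sketch below makes a simplifying assumption: worst-case data loss equals the longest archive or export interval, ignoring upload lag. The function name is hypothetical.

```python
from datetime import timedelta

RPO = timedelta(minutes=15)
RTO = timedelta(minutes=60)


def recovery_findings(wal_interval, audit_export_interval, measured_restore):
    """List ways the configured cadence or last restore drill misses RPO/RTO."""
    findings = []
    if wal_interval > RPO:
        findings.append(f"WAL archive interval {wal_interval} exceeds RPO {RPO}")
    if audit_export_interval > RPO:
        findings.append(f"audit export interval {audit_export_interval} exceeds RPO {RPO}")
    if measured_restore > RTO:
        findings.append(f"restore drill took {measured_restore}, exceeds RTO {RTO}")
    return findings
```

Under this simplification a 15m WAL interval meets the 15-minute RPO with no margin, so any archive delay would breach it; the lower end of the 5-15m range leaves headroom.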
Maintain these artifacts per release cycle:
- alert history export and resolved incident records,
- backup success logs and restore drill output,
- on-call handoff log with unresolved risks,
- deployment manifest digest and rollback artifact mapping.
If a hardened non-production deployment still uses SQLite, rotate SQLCipher keys offline:
- stop all writers,
- take a backup or snapshot,
- run the offline rotation tool from SQLITE_KEY_ROTATION.md,
- update the secret/config to the new key only,
- restart and validate /health.
Do not treat SQLite key rotation as an online procedure. The supported operational model is maintenance-window rotation with a verified rollback path.