Problem
The app already has structured logs and queues, but operators also need metrics and traces to diagnose latency, queue stalls, webhook failures, auth anomalies, and AI cost drivers.
Proposed solution
Add OpenTelemetry instrumentation and a Prometheus-compatible metrics endpoint. Track request latency, response status, API rate-limit events, queue depth, job duration, webhook delivery outcomes, storage operations, database query latency, and AI token usage.
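To make the request-latency part of the proposal concrete, here is a minimal sketch of a cumulative histogram rendered in the Prometheus text exposition format. It uses only the Python standard library so the output shape is visible; a real implementation would more likely rely on an OpenTelemetry or Prometheus client library. The metric name `http_request_duration_seconds`, the bucket boundaries, and the `LatencyHistogram` class are assumptions for illustration, not part of this proposal.

```python
from bisect import bisect_left

# Hypothetical bucket upper bounds in seconds; the issue does not specify any.
BUCKETS = (0.05, 0.1, 0.25, 0.5, 1.0, 2.5)


def escape(value):
    """Escape a label value as the Prometheus text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


class LatencyHistogram:
    """Histogram keyed by (route pattern, method, status)."""

    def __init__(self, name, buckets=BUCKETS):
        self.name = name
        self.buckets = tuple(sorted(buckets))
        # key -> (per-bucket counts plus one overflow slot, sum of observations)
        self.series = {}

    def observe(self, route, method, status, seconds):
        key = (route, method, str(status))
        counts, total = self.series.get(key, ([0] * (len(self.buckets) + 1), 0.0))
        # Index of the first bucket whose upper bound is >= the observation;
        # values above every bound land in the final overflow (+Inf) slot.
        counts[bisect_left(self.buckets, seconds)] += 1
        self.series[key] = (counts, total + seconds)

    def render(self):
        lines = [f"# TYPE {self.name} histogram"]
        for (route, method, status), (counts, total) in sorted(self.series.items()):
            labels = (
                f'route="{escape(route)}",method="{escape(method)}",'
                f'status="{escape(status)}"'
            )
            running = 0  # Prometheus buckets are cumulative
            for bound, count in zip(self.buckets, counts):
                running += count
                lines.append(f'{self.name}_bucket{{{labels},le="{bound}"}} {running}')
            running += counts[-1]
            lines.append(f'{self.name}_bucket{{{labels},le="+Inf"}} {running}')
            lines.append(f"{self.name}_sum{{{labels}}} {total}")
            lines.append(f"{self.name}_count{{{labels}}} {running}")
        return "\n".join(lines) + "\n"
```

Labelling by route pattern (for example `/users/:id`) rather than the raw URL keeps label cardinality bounded, which matters because every distinct label combination becomes its own time series.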
Acceptance criteria
- Metrics endpoint can be enabled for self-hosted deployments.
- HTTP request metrics include route pattern, method, status, and latency buckets.
- Queue metrics include counts of waiting, active, failed, completed, and stalled jobs, plus job duration.
- Webhook metrics include attempts, successes, failures, retries, and latency.
- AI metrics include model, feature, token count, failure count, and cost estimate where available.
- Tracing can be exported to an OTLP collector.
- Sensitive payloads, tokens, emails, and secrets are not included in spans or metric labels.
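The last criterion could be enforced at the point where labels and span attributes are recorded. The sketch below shows one possible approach, assuming an allowlist for metric labels and a key-name denylist plus email redaction for span attributes. The names `ALLOWED_LABELS`, `safe_labels`, and `scrub_span_attributes`, and both regular expressions, are hypothetical; the patterns are heuristics and would not catch every sensitive value.

```python
import re

# Hypothetical allowlist: only low-cardinality, non-sensitive keys become labels.
ALLOWED_LABELS = {"route", "method", "status", "queue", "model", "feature", "outcome"}

# Assumed heuristics, not an exhaustive detector.
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")
SENSITIVE_KEY_RE = re.compile(
    r"(token|secret|password|authorization|api[_-]?key|email)", re.IGNORECASE
)


def safe_labels(attributes):
    """Keep only allowlisted labels; replace email-like values wholesale."""
    safe = {}
    for key, value in attributes.items():
        if key not in ALLOWED_LABELS:
            continue
        text = str(value)
        safe[key] = "redacted" if EMAIL_RE.search(text) else text
    return safe


def scrub_span_attributes(attributes):
    """Drop attributes with credential-like keys; mask emails inside values."""
    scrubbed = {}
    for key, value in attributes.items():
        if SENSITIVE_KEY_RE.search(key):
            continue
        # Values are stringified here for simplicity; a real tracer would
        # preserve numeric and boolean attribute types.
        scrubbed[key] = EMAIL_RE.sub("[redacted]", str(value))
    return scrubbed
```

An allowlist is the stricter choice for metric labels because a new, unreviewed key is dropped by default, while span attributes are usually too varied for an allowlist to be practical.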