Backend: add ingest lag alerting thresholds and degraded-mode auto-recovery to lag monitor #1290
Merged
The app exposed a single flat /health that always returned status: ok, with no distinction between liveness (process up) and readiness (dependencies healthy). Orchestrators could therefore route traffic to instances that were up but unable to serve.

Split the signal into two probes, mounted at the root and unauthenticated so orchestrators can reach them:

- /health, /livez: cheap, dependency-free liveness check (always 200).
- /readyz: readiness check that probes DB connectivity (pingDatabase), ingest lag (lagMonitor), and webhook queue health, honours maintenance mode, and returns 503 when not ready. Reuses the SubStatus / degradation pattern from the monitoring route: "degraded" stays in rotation, "unavailable" fails readiness.

Add a pingDatabase() helper (SELECT 1 round-trip), readiness.test.ts covering DB-down, critical lag, maintenance mode, partial failure and queue-saturation edge cases plus information-leak checks, and docs/health.md documenting probe semantics.

Probe responses expose only coarse status enums: no hostnames, versions, ledger numbers, or error messages.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
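The readiness rule in this commit message ("degraded" stays in rotation, "unavailable" fails readiness) can be illustrated with a minimal sketch. This is not the code from the PR: `evaluateReadiness`, `ReadinessInput`, `ReadinessResult`, and the `"ok"` sub-status are hypothetical names, and treating maintenance mode as a 503 is an assumption about what "honours maintenance mode" means.

```typescript
// Illustrative sketch only. "degraded" and "unavailable" come from the commit
// message; "ok" and every other name here are assumptions.
type SubStatus = "ok" | "degraded" | "unavailable";

interface ReadinessInput {
  database: SubStatus;
  ingestLag: SubStatus;
  webhookQueue: SubStatus;
  maintenanceMode: boolean;
}

interface ReadinessResult {
  httpStatus: 200 | 503;
  status: "ready" | "degraded" | "not_ready";
  // Coarse enums only: no hostnames, versions, or error messages.
  checks: { database: SubStatus; ingestLag: SubStatus; webhookQueue: SubStatus };
}

function evaluateReadiness(input: ReadinessInput): ReadinessResult {
  const checks = {
    database: input.database,
    ingestLag: input.ingestLag,
    webhookQueue: input.webhookQueue,
  };
  const subs: SubStatus[] = [checks.database, checks.ingestLag, checks.webhookQueue];

  // Assumption: maintenance mode takes the instance out of rotation.
  // Any "unavailable" dependency also fails readiness.
  if (input.maintenanceMode || subs.some((s) => s === "unavailable")) {
    return { httpStatus: 503, status: "not_ready", checks };
  }

  // "degraded" is reported but still answers 200, so the instance stays in rotation.
  const status = subs.some((s) => s === "degraded") ? "degraded" : "ready";
  return { httpStatus: 200, status, checks };
}
```

Keeping the decision in a pure function like this makes the partial-failure cases easy to unit test without standing up a database or a queue.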
…dency-probes' of https://github.com/mikkyvans0-source/quicklendx-protocol into Backend--Add-health/readiness/liveness-split-with-dependency-probes
…olds and add corresponding unit tests and documentation
@mikkyvans0-source Great news! 🎉 Based on an automated assessment of this PR, the linked Wave issue(s) no longer count against your application limits. You can now already apply to more issues while waiting for a review of this PR. Keep up the great work! 🚀
What was done
backend/src/services/lagMonitor.ts, extended with:
- New config: LAG_HYSTERESIS_MARGIN (default 3) and LAG_RECOVERY_POLLS (default 3) env vars plus constructor args, alongside the existing warn/critical thresholds. Added setHysteresis() with validation and getters (hysteresisMargin, recoveryPolls, effectiveLevel).
- Hysteresis state machine (observe() / poll()): escalation is immediate; de-escalation requires the lag to stay at or below the recovery threshold (threshold − margin) for N consecutive polls, stepping down one level at a time (critical → warn → none) so the warn guard window is never skipped.
- Alert events on transitions only, never per poll. Each transition logs one structured JSON line (type: "LAG_ALERT"), bumps in-process counters (getAlertMetrics()), and notifies onAlert() subscribers. Payloads carry operational fields only (lag, thresholds, level, timestamp) and no secrets.
- Fail-safe reads: a missing/throwing/negative current-ledger read is treated as critical, so a bad read can never clear a degraded state or open the write guard.
- Preserved contracts: getLagStatus() stays instantaneous and side-effect-free, so /status, the readiness probe, and degradedGuard are unaffected. Added getEffectiveStatus() for guards wanting hysteresis-aware auto-recovery.
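The hysteresis behaviour described above can be sketched roughly as follows. This is a simplified illustration, not the actual lagMonitor.ts: the class `HysteresisSketch`, the `LagAlert` shape, and the `alerts` array are hypothetical, a bad read is modelled as a `null` lag, thresholds are assumed to be inclusive (`>=`), and N is assumed to be the `recoveryPolls` setting. Alerts are collected in memory here rather than logged.

```typescript
type LagLevel = "none" | "warn" | "critical";

interface HysteresisConfig {
  warnThreshold: number;
  criticalThreshold: number;
  hysteresisMargin: number; // cf. LAG_HYSTERESIS_MARGIN
  recoveryPolls: number;    // cf. LAG_RECOVERY_POLLS
}

interface LagAlert {
  type: "LAG_ALERT";
  from: LagLevel;
  to: LagLevel;
  lag: number | null;
  timestamp: string;
}

const RANK: Record<LagLevel, number> = { none: 0, warn: 1, critical: 2 };

// A read is usable only if it is a finite, non-negative number.
function isValidLag(lag: number | null): lag is number {
  return lag !== null && isFinite(lag) && lag >= 0;
}

class HysteresisSketch {
  private level: LagLevel = "none";
  private calmPolls = 0;
  readonly alerts: LagAlert[] = [];

  constructor(private readonly cfg: HysteresisConfig) {}

  get effectiveLevel(): LagLevel {
    return this.level;
  }

  // Instantaneous classification; bad reads fail safe to "critical".
  private classify(lag: number | null): LagLevel {
    if (!isValidLag(lag)) return "critical";
    if (lag >= this.cfg.criticalThreshold) return "critical";
    if (lag >= this.cfg.warnThreshold) return "warn";
    return "none";
  }

  observe(lag: number | null): LagLevel {
    const instant = this.classify(lag);

    // Escalation is immediate.
    if (RANK[instant] > RANK[this.level]) {
      this.transition(instant, lag);
      return this.level;
    }
    if (this.level === "none") return this.level;

    // De-escalation: lag must sit at or below (threshold - margin) for
    // recoveryPolls consecutive polls, then drop exactly one level.
    const threshold =
      this.level === "critical" ? this.cfg.criticalThreshold : this.cfg.warnThreshold;
    const recoveryThreshold = threshold - this.cfg.hysteresisMargin;

    if (isValidLag(lag) && lag <= recoveryThreshold) {
      this.calmPolls += 1;
      if (this.calmPolls >= this.cfg.recoveryPolls) {
        this.transition(this.level === "critical" ? "warn" : "none", lag);
      }
    } else {
      this.calmPolls = 0; // any breach or bad read restarts the recovery count
    }
    return this.level;
  }

  // Alerts are recorded on transitions only, never per poll.
  private transition(to: LagLevel, lag: number | null): void {
    this.alerts.push({
      type: "LAG_ALERT",
      from: this.level,
      to,
      lag,
      timestamp: new Date().toISOString(),
    });
    this.level = to;
    this.calmPolls = 0;
  }
}
```

With warn 10, critical 20, margin 3, and 3 recovery polls, a lag that drops from 25 to 18 stays critical, because 18 is above the recovery threshold of 17. Only three consecutive polls at or below 17 step the level down to warn, and a further three at or below 7 are needed to reach none.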
backend/src/tests/lagMonitor.alerts.test.ts (new): 29 tests covering the required edge cases: flapping around a threshold, sustained breach, rapid recovery, missing/corrupt reads, transition-only alerting, payload safety, and metrics/subscription lifecycle.
backend/docs/observability.md (new) — documents thresholds, hysteresis/auto-recovery, alert payloads, polling, metrics, and edge cases.
Closes #1069
Test results
lagMonitor suites: 59/59 pass (30 existing + 29 new).
Full suite: no regressions — 20 failed suites / 31 failed tests both before and after my change (all pre-existing, e.g. a kycService.ts parse error). My change added 1 passing suite (+29 passing tests).