Backend add ingest lag alerting thresholds and degraded mode auto recovery to lag monitor by mikkyvans0-source · Pull Request #1290 · QuickLendX/quicklendx-protocol

mikkyvans0-source · 2026-06-02T11:34:10Z

What was done
backend/src/services/lagMonitor.ts — extended with:

New config: LAG_HYSTERESIS_MARGIN (default 3) and LAG_RECOVERY_POLLS (default 3) env vars + constructor args, alongside the existing warn/critical thresholds. Added setHysteresis() with validation and getters (hysteresisMargin, recoveryPolls, effectiveLevel).
Hysteresis state machine (observe() / poll()): escalation is immediate; de-escalation requires the lag to stay at/below the recovery threshold (threshold − margin) for N consecutive polls, stepping down one level at a time (critical → warn → none) so the warn guard window is never skipped.
Alert events on transitions only — never per poll. Each transition logs one structured JSON line (type: "LAG_ALERT"), bumps in-process counters (getAlertMetrics()), and notifies onAlert() subscribers. Payloads carry operational fields only (lag, thresholds, level, timestamp) — no secrets.
Fail-safe reads: a missing/throwing/negative current-ledger read is treated as critical, so a bad read can never clear a degraded state or open the write guard.
Preserved contracts: getLagStatus() stays instantaneous and side-effect-free, so /status, the readiness probe, and degradedGuard are unaffected. Added getEffectiveStatus() for guards wanting hysteresis-aware auto-recovery.
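
To make the hysteresis and fail-safe rules above concrete, here is a minimal sketch of the described behaviour. It is not the code from lagMonitor.ts. The class name `HysteresisSketch`, the `transitions` log, the `>=` threshold comparison, and the use of `null` for a failed read are all assumptions made for illustration.

```typescript
type Level = "none" | "warn" | "critical";

interface HysteresisConfig {
  warnThreshold: number;
  criticalThreshold: number;
  hysteresisMargin: number; // recovery threshold = threshold - margin
  recoveryPolls: number;    // consecutive calm polls needed to step down
}

interface Transition { from: Level; to: Level; lag: number | null; }

const RANK: Record<Level, number> = { none: 0, warn: 1, critical: 2 };

// Hypothetical stand-in for the monitor's observe() state machine.
class HysteresisSketch {
  private cfg: HysteresisConfig;
  private level: Level = "none";
  private calmPolls = 0;
  // One entry per level change, never one per poll.
  readonly transitions: Transition[] = [];

  constructor(cfg: HysteresisConfig) {
    this.cfg = cfg;
  }

  get effectiveLevel(): Level {
    return this.level;
  }

  // Instantaneous classification. A missing or invalid read counts as critical.
  private classify(lag: number | null): Level {
    if (lag === null || !Number.isFinite(lag) || lag < 0) return "critical";
    if (lag >= this.cfg.criticalThreshold) return "critical";
    if (lag >= this.cfg.warnThreshold) return "warn";
    return "none";
  }

  observe(lag: number | null): Level {
    const instant = this.classify(lag);

    // Escalation is immediate.
    if (RANK[instant] > RANK[this.level]) {
      this.moveTo(instant, lag);
      return this.level;
    }
    if (this.level === "none") return this.level;

    // De-escalation: lag must sit at or below (threshold - margin)
    // for recoveryPolls consecutive polls, then drop exactly one level.
    const threshold = this.level === "critical"
      ? this.cfg.criticalThreshold
      : this.cfg.warnThreshold;
    const calm = lag !== null && Number.isFinite(lag) && lag >= 0
      && lag <= threshold - this.cfg.hysteresisMargin;
    this.calmPolls = calm ? this.calmPolls + 1 : 0;

    if (this.calmPolls >= this.cfg.recoveryPolls) {
      this.moveTo(this.level === "critical" ? "warn" : "none", lag);
    }
    return this.level;
  }

  private moveTo(to: Level, lag: number | null): void {
    this.transitions.push({ from: this.level, to, lag });
    this.level = to;
    this.calmPolls = 0; // every level change restarts the recovery window
  }
}
```

Because the calm-poll counter resets on every level change, a monitor that has just stepped down from critical must earn a full recovery window again before leaving warn, which is how the sketch keeps the warn guard window from being skipped.
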
backend/src/tests/lagMonitor.alerts.test.ts (new) — 28 tests covering the required edge cases: flapping around a threshold, sustained breach, rapid recovery, missing/corrupt reads, transition-only alerting, payload safety, and metrics/subscription lifecycle.

backend/docs/observability.md (new) — documents thresholds, hysteresis/auto-recovery, alert payloads, polling, metrics, and edge cases.
Closes #1069
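
For readers wiring up the new env vars, the following sketch shows one way the LAG_HYSTERESIS_MARGIN and LAG_RECOVERY_POLLS settings could be parsed and validated. Only the variable names and the defaults of 3 come from the description above. The helper names `readInt` and `loadHysteresis`, and the specific validation rules (margin must be a non-negative integer below the warn threshold, recovery polls at least 1), are assumptions and may differ from what setHysteresis() actually enforces.

```typescript
interface HysteresisSettings {
  hysteresisMargin: number;
  recoveryPolls: number;
}

// Hypothetical helper: parse an integer env value, falling back when unset.
function readInt(raw: string | undefined, fallback: number): number {
  if (raw === undefined || raw.trim() === "") return fallback;
  const n = Number(raw);
  if (!Number.isInteger(n)) {
    throw new RangeError(`expected an integer, got "${raw}"`);
  }
  return n;
}

// Hypothetical loader. The env object is passed in so the function stays testable.
function loadHysteresis(
  env: Record<string, string | undefined>,
  warnThreshold: number,
): HysteresisSettings {
  const hysteresisMargin = readInt(env["LAG_HYSTERESIS_MARGIN"], 3);
  const recoveryPolls = readInt(env["LAG_RECOVERY_POLLS"], 3);

  // Assumed rule: the margin must leave a non-negative recovery threshold.
  if (hysteresisMargin < 0 || hysteresisMargin >= warnThreshold) {
    throw new RangeError("LAG_HYSTERESIS_MARGIN must be in [0, warnThreshold)");
  }
  // Assumed rule: at least one calm poll is required before stepping down.
  if (recoveryPolls < 1) {
    throw new RangeError("LAG_RECOVERY_POLLS must be at least 1");
  }
  return { hysteresisMargin, recoveryPolls };
}
```

Rejecting bad values at startup, rather than silently falling back, keeps a typo in deployment config from quietly disabling the recovery window.
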

Test results
lagMonitor suites: 59/59 pass (30 existing + 29 new).
Full suite: no regressions — 20 failed suites / 31 failed tests both before and after my change (all pre-existing, e.g. a kycService.ts parse error). My change added 1 passing suite (+29 passing tests).

The app exposed a single flat /health that always returned status: ok, with no distinction between liveness (process up) and readiness (dependencies healthy). Orchestrators could therefore route traffic to instances that were up but unable to serve.

Split the signal into two probes, mounted at the root and unauthenticated so orchestrators can reach them:

- /health, /livez: cheap, dependency-free liveness check (always 200).
- /readyz: readiness check that probes DB connectivity (pingDatabase), ingest lag (lagMonitor), and webhook queue health, honours maintenance mode, and returns 503 when not ready.

Reuses the SubStatus / degradation pattern from the monitoring route: "degraded" stays in rotation, "unavailable" fails readiness.

Add a pingDatabase() helper (SELECT 1 round-trip), readiness.test.ts covering DB-down, critical lag, maintenance mode, partial failure and queue-saturation edge cases plus information-leak checks, and docs/health.md documenting probe semantics.

Probe responses expose only coarse status enums: no hostnames, versions, ledger numbers, or error messages.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
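
The readiness rule in this commit message ("degraded" stays in rotation, "unavailable" fails readiness, maintenance mode fails readiness) can be sketched as a pure decision function. This is an illustration under assumptions, not the route code: the function name `decideReadiness`, the response body shape, and the "ready" / "not_ready" labels are hypothetical, and the three SubStatus values are inferred from the message rather than copied from the repository.

```typescript
// Assumed shape of the SubStatus enum mentioned in the commit message.
type SubStatus = "ok" | "degraded" | "unavailable";

interface ReadinessResult {
  httpStatus: 200 | 503;
  body: {
    status: "ready" | "not_ready";
    // Coarse enums only: no hostnames, versions, ledger numbers, or errors.
    checks: Record<string, SubStatus>;
  };
}

// Hypothetical aggregator: takes already-evaluated dependency statuses.
function decideReadiness(
  checks: Record<string, SubStatus>,
  maintenanceMode: boolean,
): ReadinessResult {
  // "degraded" is tolerated; any "unavailable" dependency fails readiness.
  const allUsable = Object.values(checks).every((s) => s !== "unavailable");
  const ready = !maintenanceMode && allUsable;
  return {
    httpStatus: ready ? 200 : 503,
    body: { status: ready ? "ready" : "not_ready", checks: { ...checks } },
  };
}

// Example: a degraded queue keeps the instance in rotation,
// while an unreachable database takes it out.
const degradedQueue = decideReadiness(
  { database: "ok", ingestLag: "ok", webhookQueue: "degraded" },
  false,
);
const databaseDown = decideReadiness(
  { database: "unavailable", ingestLag: "ok", webhookQueue: "ok" },
  false,
);
```

Keeping the decision separate from the probes themselves means the DB-down, critical-lag, and maintenance-mode cases can be unit tested without a live database or HTTP server.
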

drips-wave · 2026-06-02T11:35:39Z

@mikkyvans0-source Great news! 🎉 Based on an automated assessment of this PR, the linked Wave issue(s) no longer count against your application limits.

You can now already apply to more issues while waiting for a review of this PR. Keep up the great work! 🚀

mikkyvans0-source and others added 4 commits June 2, 2026 12:02

Merge branch 'Backend--Add-health/readiness/liveness-split-with-dependency-probes' of https://github.com/mikkyvans0-source/quicklendx-protocol into Backend--Add-health/readiness/liveness-split-with-dependency-probes (914818a)

feat: implement LagMonitor service with hysteresis-based alert thresholds and add corresponding unit tests and documentation (88741b3)

Baskarayelu merged commit 88741b3 into QuickLendX:main Jun 3, 2026
2 of 4 checks passed

Baskarayelu merged 4 commits into QuickLendX:main from mikkyvans0-source:Backend--Add-ingest-lag-alerting-thresholds-and-degraded-mode-auto-recovery-to-lagMonitor
