Backend add ingest lag alerting thresholds and degraded mode auto recovery to lag monitor #1290

Merged
Baskarayelu merged 4 commits into QuickLendX:main from mikkyvans0-source:Backend--Add-ingest-lag-alerting-thresholds-and-degraded-mode-auto-recovery-to-lagMonitor on Jun 3, 2026

@mikkyvans0-source (Contributor) commented Jun 2, 2026

What was done
backend/src/services/lagMonitor.ts was extended with:

- New config: LAG_HYSTERESIS_MARGIN (default 3) and LAG_RECOVERY_POLLS (default 3), settable via env vars or constructor args, alongside the existing warn/critical thresholds. Added setHysteresis() with validation, plus getters (hysteresisMargin, recoveryPolls, effectiveLevel).
- Hysteresis state machine (observe() / poll()): escalation is immediate. De-escalation requires the lag to stay at or below the recovery threshold (threshold − margin) for N consecutive polls (N = LAG_RECOVERY_POLLS) and steps down one level at a time (critical → warn → none), so the warn guard window is never skipped.
- Alert events on transitions only, never per poll. Each transition logs one structured JSON line (type: "LAG_ALERT"), bumps in-process counters (getAlertMetrics()), and notifies onAlert() subscribers. Payloads carry operational fields only (lag, thresholds, level, timestamp) and no secrets.
- Fail-safe reads: a missing, throwing, or negative current-ledger read is treated as critical, so a bad read can never clear a degraded state or open the write guard.
- Preserved contracts: getLagStatus() stays instantaneous and side-effect-free, so /status, the readiness probe, and degradedGuard are unaffected. Added getEffectiveStatus() for guards that want hysteresis-aware auto-recovery.
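
For reviewers who want the hysteresis rules in one place, here is a minimal standalone sketch of the behaviour described above. It is not the code in lagMonitor.ts: the class name HysteresisSketch, the config shape, the use of `>=` for threshold comparison, and modelling a failed read as `null` are all assumptions made for illustration.

```typescript
type LagLevel = "none" | "warn" | "critical";

// Hypothetical config shape; field names mirror the getters named in the PR.
interface HysteresisConfig {
  warnThreshold: number;
  criticalThreshold: number;
  hysteresisMargin: number; // cf. LAG_HYSTERESIS_MARGIN (default 3)
  recoveryPolls: number; // cf. LAG_RECOVERY_POLLS (default 3)
}

// Severity rank used to compare levels.
const RANK: Record<LagLevel, number> = { none: 0, warn: 1, critical: 2 };

class HysteresisSketch {
  private readonly cfg: HysteresisConfig;
  private level: LagLevel = "none";
  private calmPolls = 0;

  constructor(cfg: HysteresisConfig) {
    this.cfg = cfg;
  }

  // Instantaneous classification. Using ">=" here is an assumption.
  private classify(lag: number): LagLevel {
    if (lag >= this.cfg.criticalThreshold) return "critical";
    if (lag >= this.cfg.warnThreshold) return "warn";
    return "none";
  }

  // Feed one poll result. null models a missing or failed read (assumption).
  observe(lag: number | null): LagLevel {
    if (lag === null || !isFinite(lag) || lag < 0) {
      // Fail-safe: a bad read escalates and discards recovery progress.
      this.level = "critical";
      this.calmPolls = 0;
      return this.level;
    }
    const raw = this.classify(lag);
    if (RANK[raw] > RANK[this.level]) {
      // Escalation is immediate.
      this.level = raw;
      this.calmPolls = 0;
      return this.level;
    }
    if (this.level === "none") return this.level;

    // De-escalation: lag must sit at or below (threshold - margin)
    // for recoveryPolls consecutive polls, then step down one level.
    const threshold =
      this.level === "critical" ? this.cfg.criticalThreshold : this.cfg.warnThreshold;
    if (lag <= threshold - this.cfg.hysteresisMargin) {
      this.calmPolls += 1;
      if (this.calmPolls >= this.cfg.recoveryPolls) {
        this.level = this.level === "critical" ? "warn" : "none";
        this.calmPolls = 0;
      }
    } else {
      this.calmPolls = 0; // any poll above the recovery threshold restarts the count
    }
    return this.level;
  }
}
```

With thresholds of 10 (warn) and 20 (critical) and the default margin and poll count, a lag that oscillates between 9 and 11 stays at warn in this sketch, because 9 is above the recovery threshold of 7. That is the flapping case the tests are described as covering.
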
backend/src/tests/lagMonitor.alerts.test.ts (new): 29 tests covering the required edge cases: flapping around a threshold, sustained breach, rapid recovery, missing/corrupt reads, transition-only alerting, payload safety, and metrics/subscription lifecycle.
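
The transition-only alerting and subscription lifecycle that these tests exercise can be sketched roughly as follows. This is an illustration under assumptions, not the shipped implementation: AlertEmitterSketch, record(), the `from`/`to` payload fields, and the in-memory logLines array (standing in for the structured logger) are hypothetical. Only the LAG_ALERT type tag and the onAlert() and getAlertMetrics() names come from the description.

```typescript
type LagLevel = "none" | "warn" | "critical";

// Hypothetical payload: operational fields only, no secrets.
interface LagAlert {
  type: "LAG_ALERT";
  from: LagLevel;
  to: LagLevel;
  lag: number | null;
  warnThreshold: number;
  criticalThreshold: number;
  timestamp: string;
}

type AlertListener = (alert: LagAlert) => void;

class AlertEmitterSketch {
  private listeners: AlertListener[] = [];
  private counts: Record<LagLevel, number> = { none: 0, warn: 0, critical: 0 };
  private last: LagLevel = "none";
  // Stand-in for the structured logger: one JSON line per transition.
  readonly logLines: string[] = [];

  // Subscribe; the returned function unsubscribes.
  onAlert(listener: AlertListener): () => void {
    this.listeners.push(listener);
    return () => {
      this.listeners = this.listeners.filter((l) => l !== listener);
    };
  }

  // Called once per poll with the effective level; emits only on change.
  record(
    next: LagLevel,
    lag: number | null,
    warnThreshold: number,
    criticalThreshold: number
  ): LagAlert | null {
    if (next === this.last) return null; // same level: no alert, no log line
    const alert: LagAlert = {
      type: "LAG_ALERT",
      from: this.last,
      to: next,
      lag,
      warnThreshold,
      criticalThreshold,
      timestamp: new Date().toISOString(),
    };
    this.last = next;
    this.counts[next] += 1;
    this.logLines.push(JSON.stringify(alert));
    this.listeners.forEach((l) => l(alert));
    return alert;
  }

  // Copy of the in-process transition counters, keyed by target level.
  getAlertMetrics(): Record<LagLevel, number> {
    return { none: this.counts.none, warn: this.counts.warn, critical: this.counts.critical };
  }
}
```

Returning the unsubscribe function from onAlert() is one common way to keep the subscription lifecycle testable; the actual API may expose this differently.
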

backend/docs/observability.md (new) — documents thresholds, hysteresis/auto-recovery, alert payloads, polling, metrics, and edge cases.
Closes #1069

Test results
lagMonitor suites: 59/59 pass (30 existing + 29 new).
Full suite: no regressions — 20 failed suites / 31 failed tests both before and after my change (all pre-existing, e.g. a kycService.ts parse error). My change added 1 passing suite (+29 passing tests).

mikkyvans0-source and others added 4 commits June 2, 2026 12:02
The app exposed a single flat /health that always returned status: ok, with
no distinction between liveness (process up) and readiness (dependencies
healthy). Orchestrators could therefore route traffic to instances that were
up but unable to serve.

Split the signal into two probes, mounted at the root and unauthenticated so
orchestrators can reach them:

- /health, /livez  — cheap, dependency-free liveness check (always 200).
- /readyz          — readiness check that probes DB connectivity
  (pingDatabase), ingest lag (lagMonitor), and webhook queue health, honours
  maintenance mode, and returns 503 when not ready. Reuses the SubStatus /
  degradation pattern from the monitoring route: "degraded" stays in rotation,
  "unavailable" fails readiness.
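
As a rough illustration of the /readyz rule above (degraded stays in rotation, unavailable or maintenance mode fails readiness), the aggregation could look like the sketch below. The function name evaluateReadiness, the "ok" status value, the field names, and the "ready"/"not_ready" strings are assumptions. Only "degraded", "unavailable", and the 503 response come from this message.

```typescript
// "degraded" and "unavailable" are named above; "ok" is an assumed third value.
type SubStatus = "ok" | "degraded" | "unavailable";

// Hypothetical per-dependency results fed into the readiness decision.
interface ReadinessChecks {
  database: SubStatus;
  ingestLag: SubStatus;
  webhookQueue: SubStatus;
}

interface ReadinessResult {
  httpStatus: 200 | 503;
  status: "ready" | "not_ready";
  checks: ReadinessChecks; // coarse enums only, no hostnames or error text
}

function evaluateReadiness(
  checks: ReadinessChecks,
  maintenanceMode: boolean
): ReadinessResult {
  // Only "unavailable" fails readiness; "degraded" keeps the instance in rotation.
  const anyUnavailable =
    checks.database === "unavailable" ||
    checks.ingestLag === "unavailable" ||
    checks.webhookQueue === "unavailable";
  const ready = !maintenanceMode && !anyUnavailable;
  return {
    httpStatus: ready ? 200 : 503,
    status: ready ? "ready" : "not_ready",
    checks: {
      database: checks.database,
      ingestLag: checks.ingestLag,
      webhookQueue: checks.webhookQueue,
    },
  };
}
```

Keeping the decision a pure function of the sub-statuses makes partial-failure cases straightforward to unit test without a live database or queue.
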

Add a pingDatabase() helper (SELECT 1 round-trip), readiness.test.ts covering
DB-down, critical lag, maintenance mode, partial failure and queue-saturation
edge cases plus information-leak checks, and docs/health.md documenting probe
semantics. Probe responses expose only coarse status enums — no hostnames,
versions, ledger numbers, or error messages.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…dency-probes' of https://github.com/mikkyvans0-source/quicklendx-protocol into Backend--Add-health/readiness/liveness-split-with-dependency-probes
…olds and add corresponding unit tests and documentation
drips-wave Bot commented Jun 2, 2026

@mikkyvans0-source Great news! 🎉 Based on an automated assessment of this PR, the linked Wave issue(s) no longer count against your application limits.

You can now apply to more issues while waiting for this PR to be reviewed. Keep up the great work! 🚀

Learn more about application limits

@Baskarayelu merged commit 88741b3 into QuickLendX:main on Jun 3, 2026
2 of 4 checks passed
Development

Successfully merging this pull request may close these issues.

Backend: Add ingest-lag alerting thresholds and degraded-mode auto-recovery to lagMonitor

2 participants