QuickLendX · Baskarayelu · Jun 3, 2026 · Jun 2, 2026 · Jun 2, 2026 · Jun 2, 2026
diff --git a/backend/docs/health.md b/backend/docs/health.md
@@ -0,0 +1,132 @@
+# Health, Liveness, and Readiness Probes
+
+The backend exposes two distinct kinds of health signal. They answer different
+questions and orchestrators (Kubernetes, ECS, Nomad, …) act on them
+differently. Conflating them — as a single flat `/health` that always returns
+`ok` — causes traffic to be routed to instances that are up but unable to serve.
+
+All probes are mounted at the **root** of the app (not under `/api/v1`) and are
+**unauthenticated**, because orchestrators probe them without credentials.
+
+| Endpoint   | Kind      | Cost  | Checks dependencies | Healthy | Unhealthy |
+| ---------- | --------- | ----- | ------------------- | ------- | --------- |
+| `/health`  | Liveness  | cheap | no                  | `200`   | —         |
+| `/livez`   | Liveness  | cheap | no                  | `200`   | —         |
+| `/readyz`  | Readiness | real  | yes                 | `200`   | `503`     |
+
+## Liveness — `/health`, `/livez`
+
+> "Is the process up and able to serve an HTTP request at all?"
+
+Liveness is cheap and **dependency-free**. It returns `200` whenever the event
+loop can service a request. A failing liveness probe instructs the orchestrator
+to **restart** the container, so it must never consult downstream dependencies —
+a transient database blip should not trigger a restart loop.
+
+`/health` is retained for backward compatibility; `/livez` is the conventional
+alias. They are identical.
+
+```json
+{ "status": "ok", "timestamp": "2026-06-02T12:00:00.000Z" }
+```
+
+## Readiness — `/readyz`
+
+> "Should this instance receive traffic right now?"
+
+Readiness probes real dependencies. A failing readiness probe pulls the instance
+out of the load-balancer rotation **without restarting it**, so it can recover
+and rejoin once its dependencies are healthy again.
+
+It returns:
+
+- `200` with `status: "ready"` when the instance can serve traffic.
+- `503` with `status: "not_ready"` when a hard dependency is unavailable.
+- `503` with `status: "maintenance"` when maintenance mode is enabled.
+
+```json
+{
+  "status": "ready",
+  "database": "ok",
+  "ingest": "ok",
+  "webhookQueue": "ok",
+  "timestamp": "2026-06-02T12:00:00.000Z"
+}
+```
+
+### Sub-status semantics
+
+Each dependency reports a coarse `SubStatus`, the same pattern used by
+`/api/v1/admin/monitoring`:
+
+- `ok` — healthy.
+- `degraded` — serving but impaired. **Does not** fail readiness.
+- `unavailable` — could not be reached / unusable. **Fails** readiness (`503`).
+
+| Sub-status     | Probe                          | `degraded` when                            | `unavailable` when                          |
+| -------------- | ------------------------------ | ------------------------------------------ | ------------------------------------------- |
+| `database`     | `pingDatabase()` (`SELECT 1`)  | —                                          | connection cannot open or execute           |
+| `ingest`       | `lagMonitor.getLagStatus()`    | lag ≥ warn threshold, < critical threshold | lag ≥ critical threshold, or probe throws   |
+| `webhookQueue` | `webhookQueueService.getStats()` | queue is saturated (`size ≥ capacity`)   | the queue's backing store is unreachable    |
+
+Ingest lag thresholds are governed by `LagMonitor` and configurable via
+`LAG_WARN_THRESHOLD` / `LAG_CRITICAL_THRESHOLD`. See [reliability.md](./reliability.md).
+
+The instance is **not ready** (`503`) if *any* sub-status is `unavailable`.
+`degraded` sub-statuses are surfaced for observability but keep the instance in
+rotation: a slightly stale index or a back-pressured queue is still serviceable.
+
+### Maintenance mode
+
+When `statusService.isMaintenanceEnabled()` is true, `/readyz` short-circuits
+**before** probing any dependency and returns `503` with `status: "maintenance"`.
+The instance is intentionally not serving and should be pulled from rotation.
+Liveness is unaffected — the process is healthy, just drained.
+
+## Edge-case behaviour
+
+| Scenario                     | `/health`, `/livez` | `/readyz`                                       |
+| ---------------------------- | ------------------- | ----------------------------------------------- |
+| All healthy                  | `200 ok`            | `200 ready`                                     |
+| Database down                | `200 ok`            | `503 not_ready`, `database: unavailable`        |
+| Warn-level lag               | `200 ok`            | `200 ready`, `ingest: degraded`                 |
+| Critical lag                 | `200 ok`            | `503 not_ready`, `ingest: unavailable`          |
+| Lag probe throws             | `200 ok`            | `503 not_ready`, `ingest: unavailable`          |
+| Queue store unreachable      | `200 ok`            | `503 not_ready`, `webhookQueue: unavailable`    |
+| Queue saturated              | `200 ok`            | `200 ready`, `webhookQueue: degraded`           |
+| Maintenance mode             | `200 ok`            | `503 maintenance`                               |
+| Partial failure (one dep)    | `200 ok`            | `503 not_ready` (failing dep `unavailable`, rest `ok`) |
+
+## Security
+
+These probes are unauthenticated, so their responses are deliberately minimal.
+They expose only the coarse status enums above and a timestamp. They do **not**
+leak:
+
+- internal hostnames or connection strings,
+- application or dependency versions,
+- absolute ledger numbers or ingest-lag values,
+- queue depths or capacities,
+- underlying exception messages (dependency errors are caught and collapsed to
+  `unavailable`).
+
+Richer, sensitive diagnostics (queue depths, invariant counters, cursor
+positions, versions) remain behind API-key auth at
+[`/api/v1/admin/monitoring`](./admin-monitoring.md).
+
+## Orchestrator configuration (Kubernetes example)
+
+```yaml
+livenessProbe:
+  httpGet:
+    path: /livez
+    port: 3000
+  initialDelaySeconds: 5
+  periodSeconds: 10
+readinessProbe:
+  httpGet:
+    path: /readyz
+    port: 3000
+  initialDelaySeconds: 5
+  periodSeconds: 5
+```
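Review note on `backend/docs/health.md`: the readiness rules documented above (maintenance short-circuits before any probe, `unavailable` fails readiness, `degraded` does not) can be sketched as a pure function. This is a minimal sketch under stated assumptions, not the PR's `routes/health` implementation, which this diff does not show. `evaluateReadiness`, `SubStatuses`, and the maintenance body shape (no sub-status fields) are hypothetical names and choices made for illustration.

```typescript
// Hypothetical types and helper; only the status strings come from health.md.
type SubStatus = "ok" | "degraded" | "unavailable";

interface SubStatuses {
  database: SubStatus;
  ingest: SubStatus;
  webhookQueue: SubStatus;
}

interface ProbedBody extends SubStatuses {
  status: "ready" | "not_ready";
  timestamp: string;
}

// Assumption: the maintenance response carries no sub-status fields.
interface MaintenanceBody {
  status: "maintenance";
  timestamp: string;
}

interface ReadinessResult {
  httpStatus: number;
  body: ProbedBody | MaintenanceBody;
}

function evaluateReadiness(
  maintenance: boolean,
  probe: () => SubStatuses,
  now: Date = new Date(),
): ReadinessResult {
  const timestamp = now.toISOString();

  // Maintenance short-circuits before any dependency is probed.
  if (maintenance) {
    return { httpStatus: 503, body: { status: "maintenance", timestamp } };
  }

  const subs = probe();

  // Only `unavailable` fails readiness; `degraded` keeps the instance in rotation.
  const failed = [subs.database, subs.ingest, subs.webhookQueue].some(
    (s) => s === "unavailable",
  );
  const status: "ready" | "not_ready" = failed ? "not_ready" : "ready";
  const body: ProbedBody = {
    status,
    database: subs.database,
    ingest: subs.ingest,
    webhookQueue: subs.webhookQueue,
    timestamp,
  };
  return { httpStatus: failed ? 503 : 200, body };
}
```

Passing the probe as a callback keeps the short-circuit testable: under maintenance the callback is never invoked, which matches the documented ordering.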
diff --git a/backend/docs/observability.md b/backend/docs/observability.md
@@ -0,0 +1,177 @@
+# Observability — Ingest Lag Alerting
+
+This document describes how the QuickLendX backend turns indexer-lag threshold
+breaches into **alerts** and how **degraded mode auto-recovery** works. It
+complements [reliability.md](./reliability.md) (which covers how degraded mode
+gates writes) and [logging.md](./logging.md) (log redaction policy).
+
+The alerting logic lives in
+[`src/services/lagMonitor.ts`](../src/services/lagMonitor.ts).
+
+---
+
+## Overview
+
+`LagMonitor` computes indexer lag (in ledgers) as
+`current_ledger - last_indexed_ledger`. Two thresholds classify the lag into a
+**level**:
+
+| Level      | Condition                                  | Effect                                   |
+| ---------- | ------------------------------------------ | ---------------------------------------- |
+| `none`     | `lag < warnThreshold`                      | Healthy. All endpoints available.        |
+| `warn`     | `warnThreshold <= lag < criticalThreshold` | Degraded. Write endpoints gated (503).   |
+| `critical` | `lag >= criticalThreshold`                 | Critically degraded. All writes blocked. |
+
+The level is consumed by:
+
+- **`GET /api/v1/status`** — surfaces the current level to clients.
+- **`degradedGuard`** middleware — gates write/sensitive endpoints.
+
+Prior to this feature, threshold breaches were silent (no operator signal) and
+recovery was implicit (a single good reading immediately re-opened the write
+guard, allowing it to flap). This feature adds **alerts on transitions** and
+**hysteresis-backed auto-recovery**.
+
+---
+
+## Thresholds & configuration
+
+All four parameters are configurable via environment variables. Defaults assume a
+~5s ledger cadence: 10 ledgers is roughly 50s of lag, 50 ledgers roughly 250s.
+
+| Env var                   | Default | Meaning                                                              |
+| ------------------------- | ------- | -------------------------------------------------------------------- |
+| `LAG_WARN_THRESHOLD`      | `10`    | Lag (ledgers) at which the system becomes degraded (`warn`).         |
+| `LAG_CRITICAL_THRESHOLD`  | `50`    | Lag (ledgers) at which the system becomes critically degraded.       |
+| `LAG_HYSTERESIS_MARGIN`   | `3`     | Ledgers **below** a threshold the lag must fall to before recovering.|
+| `LAG_RECOVERY_POLLS`      | `3`     | Consecutive recovered polls required before a degraded level clears. |
+
+Non-numeric or empty values fall back to the defaults. `recoveryPolls` is
+clamped to `>= 1` and `hysteresisMargin` to `>= 0`.
+
+Thresholds can also be set at runtime in tests/bootstrap via
+`setThresholds(warn, critical)` and `setHysteresis(margin, polls)`.
+
+---
+
+## Hysteresis & auto-recovery
+
+To stop the monitor flapping when lag hovers around a threshold, the monitor
+tracks an **effective level** separately from the **instantaneous level**
+computed from the raw lag:
+
+- **Escalation is immediate.** As soon as the raw lag reaches a higher level
+  (e.g. `lag >= criticalThreshold`), the effective level jumps there. This is
+  the fail-safe direction — a breach gates writes without delay.
+- **De-escalation is sustained.** To clear a level, the raw lag must fall to or
+  below the **recovery threshold** (`threshold - hysteresisMargin`) and stay there
+  for `recoveryPolls` consecutive polls. A single breach anywhere in that
+  window resets the streak. Recovery steps down **one level at a time**
+  (`critical → warn → none`) so the `warn` write-guard window is never skipped.
+
+Recovery thresholds with the defaults:
+
+- Recover out of `critical` → `warn` when `lag <= 50 - 3 = 47` for 3 polls.
+- Recover out of `warn` → `none` when `lag <= 10 - 3 = 7` for 3 polls.
+
+`getLagStatus()` (used by `GET /api/v1/status` and `degradedGuard`) reports the
+effective level. It **escalates immediately** when called but **never auto-clears**:
+only the scheduled `poll()` path performs de-escalation. This means the many guard
+and status calls per interval can raise the level but can never lower it.
+
+---
+
+## Alert events
+
+Alerts are emitted **only on transitions** of the effective level — never on
+every poll. Each transition:
+
+1. Logs a single structured JSON line (`type: "LAG_ALERT"`), at `WARN` for
+   escalations and `INFO` for recoveries.
+2. Increments in-process counters (see [Metrics](#metrics)).
+3. Notifies any subscribers registered via `onAlert(listener)`.
+
+### Alert payload
+
+```jsonc
+{
+  "from": "warn",          // level moved away from
+  "to": "critical",        // level moved to
+  "direction": "escalation", // or "recovery"
+  "lag": 62,               // raw lag at transition (ledgers)
+  "warnThreshold": 10,
+  "criticalThreshold": 50,
+  "at": "2026-06-02T12:00:00.000Z"
+}
+```
+
+> **Security:** Alert payloads carry **only operational fields** — lag,
+> thresholds, level, timestamp. They never include request bodies, wallet
+> data, auth tokens, or any other secrets. The logged line uses the same
+> fixed shape, so no caller-supplied data can leak into log sinks.
+
+### Subscribing
+
+```ts
+import { lagMonitor } from "../services/lagMonitor";
+
+const unsubscribe = lagMonitor.onAlert((event) => {
+  // forward to PagerDuty / Slack / metrics exporter, etc.
+});
+```
+
+A throwing subscriber is isolated and never breaks the monitor.
+
+---
+
+## Polling
+
+`poll()` reads a fresh lag value, advances the hysteresis state machine, and
+emits any resulting transition alert. Schedule it on a fixed cadence (mirroring
+the invariant scheduler pattern):
+
+```ts
+import { lagMonitor } from "../services/lagMonitor";
+
+setInterval(() => {
+  void lagMonitor.poll();
+}, 5000);
+```
+
+Do **not** call `poll()` per request. Request paths should call
+`getLagStatus()`, which can escalate the effective level but never clears it.
+
+### Missing / corrupt current-ledger reads
+
+If the current-ledger read throws, or yields a non-finite or negative lag, the
+monitor **fails safe to `critical`** (it returns `lag = criticalThreshold`).
+An unknown reading must never silently clear a degraded state or open the write
+guard.
+
+---
+
+## Metrics
+
+`getAlertMetrics()` returns a defensive copy of the in-process counters,
+suitable for exposure via the monitoring endpoint or a scraper:
+
+| Field                      | Meaning                                                  |
+| -------------------------- | -------------------------------------------------------- |
+| `escalations`              | Total escalation transitions observed.                   |
+| `recoveries`               | Total recovery transitions observed.                     |
+| `transitionsTo`            | Transition count by destination level (`none`/`warn`/`critical`). |
+| `currentLevel`             | Current effective level.                                 |
+| `consecutiveRecoveryPolls` | Consecutive polls the lag has been within recovery range. |
+
+---
+
+## Edge cases (covered by tests)
+
+See [`src/tests/lagMonitor.alerts.test.ts`](../src/tests/lagMonitor.alerts.test.ts):
+
+- **Flapping** around a threshold produces no spurious transitions; a single
+  good poll never clears a degraded state.
+- **Sustained breach** holds the level with no duplicate alerts.
+- **Rapid recovery** still drains one level per recovery window, preserving the
+  `warn` guard window.
+- **Missing current-ledger read** fails safe to `critical`.
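Review note on `backend/docs/observability.md`: the hysteresis rules documented above (immediate escalation, sustained one-level-at-a-time recovery, fail-safe on bad reads) can be sketched as a small pure state machine. This is a sketch under assumptions, not the code in `src/services/lagMonitor.ts`, which this diff does not show. `pollStep`, `safeLag`, `LagConfig`, and `LagState` are hypothetical names, and the sketch assumes that any poll above the recovery threshold resets the streak.

```typescript
// Hypothetical names; thresholds and rules follow the prose in observability.md.
type Level = "none" | "warn" | "critical";

interface LagConfig {
  warn: number;
  critical: number;
  margin: number; // hysteresis margin, in ledgers
  recoveryPolls: number; // consecutive recovered polls required
}

interface LagState {
  level: Level; // effective level
  streak: number; // consecutive polls within recovery range
}

const RANK: Record<Level, number> = { none: 0, warn: 1, critical: 2 };

// An unknown reading fails safe: treat it as lag = criticalThreshold.
function safeLag(raw: number, c: LagConfig): number {
  return isFinite(raw) && raw >= 0 ? raw : c.critical;
}

// Instantaneous level computed from the raw lag alone.
function instantaneous(lag: number, c: LagConfig): Level {
  if (lag >= c.critical) return "critical";
  if (lag >= c.warn) return "warn";
  return "none";
}

// One scheduled poll: escalate immediately, de-escalate only after a streak.
function pollStep(state: LagState, rawLag: number, c: LagConfig): LagState {
  const lag = safeLag(rawLag, c);
  const raw = instantaneous(lag, c);

  // Escalation is immediate and resets any recovery streak.
  if (RANK[raw] > RANK[state.level]) return { level: raw, streak: 0 };
  if (state.level === "none") return { level: "none", streak: 0 };

  // Recovery threshold for the level currently held.
  const threshold = state.level === "critical" ? c.critical : c.warn;
  if (lag > threshold - c.margin) return { level: state.level, streak: 0 };

  const streak = state.streak + 1;
  if (streak < c.recoveryPolls) return { level: state.level, streak };

  // Step down exactly one level and start a fresh streak.
  return { level: state.level === "critical" ? "warn" : "none", streak: 0 };
}
```

With the documented defaults (10, 50, 3, 3), a drop from critical lag straight to a healthy lag takes six polls to reach `none` in this sketch: three to reach `warn`, then three more to clear it.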
diff --git a/backend/src/app.ts b/backend/src/app.ts
@@ -9,6 +9,7 @@ import { csrfMiddleware } from "./middleware/csrf";
 import { corsOptionsDelegate, webhookCorsOptions } from "./config/cors";
 import v1Routes from "./routes/v1";
 import webhookRoutes from "./routes/webhooks";
+import healthRoutes from "./routes/health";
 import { requestLogger } from "./middleware/request-logger";
 
 const app = express();
@@ -47,14 +48,9 @@ app.use("/api/webhooks", cors(webhookCorsOptions), webhookRoutes);
 app.use(csrfMiddleware);
 app.use("/api/v1", v1Routes);
 
-// Health check (root level as well if needed)
-app.get("/health", (req, res) => {
-  res.json({
-    status: "ok",
-    version: "1.0.0",
-    timestamp: new Date().toISOString(),
-  });
-});
+// Liveness (/health, /livez) and readiness (/readyz) probes.
+// Mounted at the root and left unauthenticated so orchestrators can probe them.
+app.use(healthRoutes);
 
 // 404 handler
 app.use((req, res) => {

diff --git a/backend/src/lib/database.ts b/backend/src/lib/database.ts
@@ -77,6 +77,27 @@ export function getStatementCacheStats() {
   };
 }
 
+/**
+ * Probe database connectivity with a trivial round-trip query.
+ *
+ * Used by the readiness endpoint to verify the SQLite connection can both
+ * open and execute. Returns true on success, false on any failure (a locked,
+ * corrupt, or unopenable database). Never throws so callers can branch on the
+ * boolean without their own try/catch.
+ *
+ * The query (`SELECT 1`) is constant and parameter-free, so it carries no
+ * user input and leaks no schema details.
+ */
+export function pingDatabase(): boolean {
+  try {
+    const db = getDatabase();
+    const row = db.prepare("SELECT 1 AS ok").get();
+    return row?.ok === 1;
+  } catch {
+    return false;
+  }
+}
+
 /**
  * Close the database connection and clear the statement cache.
  * Ensures clean shutdown and prevents memory leaks.
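Review note on `pingDatabase()`: because it never throws, a caller can map its boolean straight to a sub-status, whereas the lag and queue probes described in `health.md` may throw and need a wrapper that collapses errors to `unavailable`. The sketch below illustrates that mapping under assumptions. The wrapper names, the `getLevel` accessor, and the `{ size, capacity }` stats shape are hypothetical and are not taken from the PR's source.

```typescript
// Hypothetical wrappers; only the sub-status names and rules come from health.md.
type SubStatus = "ok" | "degraded" | "unavailable";
type LagLevel = "none" | "warn" | "critical";

// A never-throwing boolean probe maps directly: there is no `degraded` state.
function databaseSubStatus(ping: () => boolean): SubStatus {
  return ping() ? "ok" : "unavailable";
}

// The lag probe may throw; the error is swallowed and reported as `unavailable`
// so no exception message can reach the unauthenticated response.
function ingestSubStatus(getLevel: () => LagLevel): SubStatus {
  try {
    const level = getLevel();
    if (level === "critical") return "unavailable";
    return level === "warn" ? "degraded" : "ok";
  } catch {
    return "unavailable";
  }
}

// Assumed stats shape: { size, capacity }. Saturation is `degraded`; a throw
// (for example an unreachable backing store) is `unavailable`.
function queueSubStatus(
  getStats: () => { size: number; capacity: number },
): SubStatus {
  try {
    const { size, capacity } = getStats();
    return size >= capacity ? "degraded" : "ok";
  } catch {
    return "unavailable";
  }
}
```

Keeping the try/catch inside each wrapper means the route handler only ever sees the three coarse enum values, which is what the Security section of `health.md` requires.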