QuickLendX · Baskarayelu · Jun 3, 2026 · Jun 2, 2026 · Jun 2, 2026 · Jun 2, 2026
diff --git a/backend/docs/health.md b/backend/docs/health.md
@@ -0,0 +1,132 @@
+# Health, Liveness, and Readiness Probes
+
+The backend exposes two distinct kinds of health signal. They answer different
+questions and orchestrators (Kubernetes, ECS, Nomad, …) act on them
+differently. Conflating them — as a single flat `/health` that always returns
+`ok` — causes traffic to be routed to instances that are up but unable to serve.
+
+All probes are mounted at the **root** of the app (not under `/api/v1`) and are
+**unauthenticated**, because orchestrators probe them without credentials.
+
+| Endpoint   | Kind      | Cost  | Checks dependencies | Healthy | Unhealthy |
+| ---------- | --------- | ----- | ------------------- | ------- | --------- |
+| `/health`  | Liveness  | cheap | no                  | `200`   | —         |
+| `/livez`   | Liveness  | cheap | no                  | `200`   | —         |
+| `/readyz`  | Readiness | real  | yes                 | `200`   | `503`     |
+
+## Liveness — `/health`, `/livez`
+
+> "Is the process up and able to serve an HTTP request at all?"
+
+Liveness is cheap and **dependency-free**. It returns `200` whenever the event
+loop can service a request. A failing liveness probe instructs the orchestrator
+to **restart** the container, so it must never consult downstream dependencies —
+a transient database blip should not trigger a restart loop.
+
+`/health` is retained for backward compatibility; `/livez` is the conventional
+alias. They are identical.
+
+```json
+{ "status": "ok", "timestamp": "2026-06-02T12:00:00.000Z" }
+```
+
+## Readiness — `/readyz`
+
+> "Should this instance receive traffic right now?"
+
+Readiness probes real dependencies. A failing readiness probe pulls the instance
+out of the load-balancer rotation **without restarting it**, so it can recover
+and rejoin once its dependencies are healthy again.
+
+It returns:
+
+- `200` with `status: "ready"` when the instance can serve traffic.
+- `503` with `status: "not_ready"` when a hard dependency is unavailable.
+- `503` with `status: "maintenance"` when maintenance mode is enabled.
+
+```json
+{
+  "status": "ready",
+  "database": "ok",
+  "ingest": "ok",
+  "webhookQueue": "ok",
+  "timestamp": "2026-06-02T12:00:00.000Z"
+}
+```
+
+### Sub-status semantics
+
+Each dependency reports a coarse `SubStatus`, the same pattern used by
+`/api/v1/admin/monitoring`:
+
+- `ok` — healthy.
+- `degraded` — serving but impaired. **Does not** fail readiness.
+- `unavailable` — could not be reached / unusable. **Fails** readiness (`503`).
+
+| Field          | Probe                            | `degraded` when                            | `unavailable` when                        |
+| -------------- | -------------------------------- | ------------------------------------------ | ----------------------------------------- |
+| `database`     | `pingDatabase()` (`SELECT 1`)    | never                                      | connection cannot open or execute         |
+| `ingest`       | `lagMonitor.getLagStatus()`      | lag ≥ warn threshold, < critical threshold | lag ≥ critical threshold, or probe throws |
+| `webhookQueue` | `webhookQueueService.getStats()` | queue is saturated (`size ≥ capacity > 0`) | `getStats()` throws (store unreachable)   |
+
+Ingest lag thresholds are governed by `LagMonitor` and configurable via
+`LAG_WARN_THRESHOLD` / `LAG_CRITICAL_THRESHOLD`. See [reliability.md](./reliability.md).
+
+The instance is **not ready** (`503`) if *any* sub-status is `unavailable`.
+`degraded` sub-statuses are surfaced for observability but keep the instance in
+rotation: a slightly stale index or a back-pressured queue is still serviceable.
+
+### Maintenance mode
+
+When `statusService.isMaintenanceEnabled()` is true, `/readyz` short-circuits
+**before** probing any dependency and returns `503` with `status: "maintenance"`,
+pulling the intentionally drained instance from rotation. The sub-status fields
+are unprobed `ok` placeholders. Liveness is unaffected: the process is healthy.
+
+## Edge-case behaviour
+
+| Scenario                     | `/health`, `/livez` | `/readyz`                                       |
+| ---------------------------- | ------------------- | ----------------------------------------------- |
+| All healthy                  | `200 ok`            | `200 ready`                                     |
+| Database down                | `200 ok`            | `503 not_ready`, `database: unavailable`        |
+| Warn-level lag               | `200 ok`            | `200 ready`, `ingest: degraded`                 |
+| Critical lag                 | `200 ok`            | `503 not_ready`, `ingest: unavailable`          |
+| Lag probe throws             | `200 ok`            | `503 not_ready`, `ingest: unavailable`          |
+| Queue store unreachable      | `200 ok`            | `503 not_ready`, `webhookQueue: unavailable`    |
+| Queue saturated              | `200 ok`            | `200 ready`, `webhookQueue: degraded`           |
+| Maintenance mode             | `200 ok`            | `503 maintenance`                               |
+| Partial failure (one dep)    | `200 ok`            | `503 not_ready` (failing dep `unavailable`, rest `ok`) |
+
+## Security
+
+These probes are unauthenticated, so their responses are deliberately minimal.
+They expose only the coarse status enums above and a timestamp. They do **not**
+leak:
+
+- internal hostnames or connection strings,
+- application or dependency versions,
+- absolute ledger numbers or ingest-lag values,
+- queue depths or capacities,
+- underlying exception messages (dependency errors are caught and collapsed to
+  `unavailable`).
+
+Richer, sensitive diagnostics (queue depths, invariant counters, cursor
+positions, versions) remain behind API-key auth at
+[`/api/v1/admin/monitoring`](./admin-monitoring.md).
+
+## Orchestrator configuration (Kubernetes example)
+
+```yaml
+livenessProbe:
+  httpGet:
+    path: /livez
+    port: 3000
+  initialDelaySeconds: 5
+  periodSeconds: 10
+readinessProbe:
+  httpGet:
+    path: /readyz
+    port: 3000
+  initialDelaySeconds: 5
+  periodSeconds: 5
+```
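Note on health.md: the sub-status table can be read as two small classification rules. The sketch below is illustrative only and is not part of this change. It assumes lag and thresholds are plain numbers in the same unit, whereas the actual route relies on the `isDegraded` / `isCritical` flags reported by `LagMonitor`. The names `classifyIngest` and `classifyQueue` are hypothetical.

```typescript
type SubStatus = "ok" | "degraded" | "unavailable";

// Assumption: lag, warn and critical are numbers in the same unit.
// In the actual change LagMonitor makes this decision and exposes it
// through isDegraded / isCritical.
function classifyIngest(lag: number, warn: number, critical: number): SubStatus {
  if (lag >= critical) return "unavailable"; // too stale to trust: fails readiness
  if (lag >= warn) return "degraded"; // impaired but still serviceable
  return "ok";
}

// Saturation is back-pressure, not unreadiness, so a full queue is only
// "degraded". A non-positive capacity never reports saturation, matching
// the `capacity > 0` guard in the route.
function classifyQueue(size: number, capacity: number): SubStatus {
  return capacity > 0 && size >= capacity ? "degraded" : "ok";
}
```

Keeping these rules as pure functions would let the threshold boundaries be unit-tested without starting the server.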
diff --git a/backend/src/app.ts b/backend/src/app.ts
@@ -9,6 +9,7 @@ import { csrfMiddleware } from "./middleware/csrf";
 import { corsOptionsDelegate, webhookCorsOptions } from "./config/cors";
 import v1Routes from "./routes/v1";
 import webhookRoutes from "./routes/webhooks";
+import healthRoutes from "./routes/health";
 import { requestLogger } from "./middleware/request-logger";
 
 const app = express();
@@ -47,14 +48,9 @@ app.use("/api/webhooks", cors(webhookCorsOptions), webhookRoutes);
 app.use(csrfMiddleware);
 app.use("/api/v1", v1Routes);
 
-// Health check (root level as well if needed)
-app.get("/health", (req, res) => {
-  res.json({
-    status: "ok",
-    version: "1.0.0",
-    timestamp: new Date().toISOString(),
-  });
-});
+// Liveness (/health, /livez) and readiness (/readyz) probes.
+// Mounted at the root and left unauthenticated so orchestrators can probe them.
+app.use(healthRoutes);
 
 // 404 handler
 app.use((req, res) => {

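Note on app.ts: the removed handler returned a `version` field, which the new liveness payload drops in line with the Security section of health.md. A regression guard for that could look roughly like the sketch below. `ALLOWED_KEYS` and `unexpectedKeys` are hypothetical names introduced for illustration, and the allow-list is simply read off the response bodies shown in this change.

```typescript
// Keys that the unauthenticated probes are expected to expose, taken from
// the liveness and readiness response bodies in this change.
const ALLOWED_KEYS: readonly string[] = [
  "status",
  "database",
  "ingest",
  "webhookQueue",
  "timestamp",
];

// Returns every key in a probe payload that is not on the allow-list.
// An empty result means nothing unexpected is exposed.
function unexpectedKeys(payload: Record<string, unknown>): string[] {
  return Object.keys(payload).filter((key) => ALLOWED_KEYS.indexOf(key) === -1);
}
```

Run against the old `/health` body, this would flag `version`; run against the new bodies, it should return an empty list.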
diff --git a/backend/src/lib/database.ts b/backend/src/lib/database.ts
@@ -77,6 +77,27 @@ export function getStatementCacheStats() {
   };
 }
 
+/**
+ * Probe database connectivity with a trivial round-trip query.
+ *
+ * Used by the readiness endpoint to verify the SQLite connection can both
+ * open and execute. Returns true on success, false on any failure (a locked,
+ * corrupt, or unopenable database). Never throws so callers can branch on the
+ * boolean without their own try/catch.
+ *
+ * The query (`SELECT 1`) is constant and parameter-free, so it carries no
+ * user input and leaks no schema details.
+ */
+export function pingDatabase(): boolean {
+  try {
+    const db = getDatabase();
+    const row = db.prepare("SELECT 1 AS ok").get() as { ok?: number } | undefined;
+    return row?.ok === 1;
+  } catch {
+    return false;
+  }
+}
+
 /**
  * Close the database connection and clear the statement cache.
  * Ensures clean shutdown and prevents memory leaks.

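Note on database.ts: the "never throws" contract of `pingDatabase()` is easy to exercise if the database handle is injected. The sketch below assumes a minimal handle shape with `prepare(sql).get()`. `DatabaseLike`, `StatementLike`, and `pingWith` are hypothetical names, not part of this change or of any SQLite driver.

```typescript
// Hypothetical minimal shapes, just enough for the ping round-trip.
interface StatementLike {
  get(): unknown;
}
interface DatabaseLike {
  prepare(sql: string): StatementLike;
}

// Same logic as pingDatabase(), but the handle comes from a caller-supplied
// getter so a test can pass a fake. Any failure, including a throwing getter,
// collapses to false.
function pingWith(getDb: () => DatabaseLike): boolean {
  try {
    const row = getDb().prepare("SELECT 1 AS ok").get() as { ok?: unknown } | undefined;
    return row?.ok === 1;
  } catch {
    return false;
  }
}
```

A fake whose getter throws stands in for a locked or unopenable database, and a fake whose `get()` returns `undefined` covers the empty-result case.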
diff --git a/backend/src/routes/health.ts b/backend/src/routes/health.ts
@@ -0,0 +1,131 @@
+/**
+ * Liveness and readiness probes.
+ *
+ * These endpoints are intentionally mounted at the root of the app (not under
+ * /api/v1) and are unauthenticated, because container orchestrators (Kubernetes,
+ * ECS, Nomad, …) probe them without credentials.
+ *
+ * Two distinct concerns:
+ *
+ *   GET /health, GET /livez  — Liveness. "Is the process up and able to serve
+ *     an HTTP request at all?" Cheap and dependency-free. A failing liveness
+ *     probe tells the orchestrator to restart the container, so it must NOT
+ *     consult downstream dependencies — a transient DB blip should not trigger
+ *     a restart loop.
+ *
+ *   GET /readyz — Readiness. "Should this instance receive traffic right now?"
+ *     Probes real dependencies (DB connectivity, ingest lag, webhook queue) and
+ *     returns 503 when any hard dependency is unavailable or when maintenance
+ *     mode is enabled. A failing readiness probe pulls the instance out of the
+ *     load-balancer rotation without restarting it.
+ *
+ * Security: responses expose only coarse status enums per sub-system. They do
+ * not leak internal hostnames, versions, queue depths, ledger numbers, or error
+ * messages to unauthenticated callers. The richer, authenticated diagnostics
+ * remain under /api/v1/admin/monitoring.
+ */
+
+import { Router, Request, Response } from "express";
+import { pingDatabase } from "../lib/database";
+import { statusService } from "../services/statusService";
+import { lagMonitor } from "../services/lagMonitor";
+import { webhookQueueService } from "../services/webhookQueueService";
+
+const router = Router();
+
+/**
+ * Coarse per-dependency status, mirroring the SubStatus pattern used by
+ * /api/v1/admin/monitoring. "degraded" means serving but impaired (does not
+ * fail readiness); "unavailable" means the dependency could not be reached
+ * (fails readiness).
+ */
+type SubStatus = "ok" | "degraded" | "unavailable";
+
+type ReadyStatus = "ready" | "not_ready" | "maintenance";
+
+// ---------------------------------------------------------------------------
+// Liveness
+// ---------------------------------------------------------------------------
+
+function liveness(_req: Request, res: Response): void {
+  res.json({
+    status: "ok",
+    timestamp: new Date().toISOString(),
+  });
+}
+
+// Keep the historical /health path as a liveness check, and add the
+// conventional /livez alias.
+router.get("/health", liveness);
+router.get("/livez", liveness);
+
+// ---------------------------------------------------------------------------
+// Readiness
+// ---------------------------------------------------------------------------
+
+router.get("/readyz", async (_req: Request, res: Response) => {
+  // Maintenance mode short-circuits readiness: the instance is intentionally
+  // not serving, so it should be pulled from rotation regardless of deps.
+  if (statusService.isMaintenanceEnabled()) {
+    res.status(503).json({
+      status: "maintenance" as ReadyStatus,
+      database: "ok" as SubStatus,
+      ingest: "ok" as SubStatus,
+      webhookQueue: "ok" as SubStatus,
+      timestamp: new Date().toISOString(),
+    });
+    return;
+  }
+
+  // --- Database connectivity (hard dependency) ---------------------------
+  let database: SubStatus = "ok";
+  if (!pingDatabase()) {
+    database = "unavailable";
+  }
+
+  // --- Ingest lag --------------------------------------------------------
+  // Reuse the LagMonitor degradation logic. "warn" lag is degraded but still
+  // serviceable; "critical" lag means the indexed view is too stale to trust,
+  // so we treat it as unavailable for readiness.
+  let ingest: SubStatus = "ok";
+  try {
+    const lag = await lagMonitor.getLagStatus();
+    if (lag.isCritical) {
+      ingest = "unavailable";
+    } else if (lag.isDegraded) {
+      ingest = "degraded";
+    }
+  } catch {
+    ingest = "unavailable";
+  }
+
+  // --- Webhook queue health (hard dependency on its backing store) -------
+  // A throw here means the queue's store is unreachable. Saturation (queue at
+  // capacity) is back-pressure, not unreadiness, so it is reported as degraded.
+  let webhookQueue: SubStatus = "ok";
+  try {
+    const stats = webhookQueueService.getStats();
+    if (stats.capacity > 0 && stats.size >= stats.capacity) {
+      webhookQueue = "degraded";
+    }
+  } catch {
+    webhookQueue = "unavailable";
+  }
+
+  const unavailable =
+    database === "unavailable" ||
+    ingest === "unavailable" ||
+    webhookQueue === "unavailable";
+
+  const status: ReadyStatus = unavailable ? "not_ready" : "ready";
+
+  res.status(unavailable ? 503 : 200).json({
+    status,
+    database,
+    ingest,
+    webhookQueue,
+    timestamp: new Date().toISOString(),
+  });
+});
+
+export default router;
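
Note on health.ts: the ordering in the `/readyz` handler (maintenance first, then probes, then aggregation) can be summarised as a pure function. This is a sketch under stated assumptions, not the implementation. `Probes`, `Verdict`, and `evaluateReadiness` are hypothetical names, and the probes are synchronous thunks here even though the real route awaits `lagMonitor.getLagStatus()`.

```typescript
type SubStatus = "ok" | "degraded" | "unavailable";
type ReadyStatus = "ready" | "not_ready" | "maintenance";

// Hypothetical probe bundle: each thunk returns an already-classified status.
interface Probes {
  database: () => SubStatus;
  ingest: () => SubStatus;
  webhookQueue: () => SubStatus;
}

interface Verdict {
  code: 200 | 503;
  status: ReadyStatus;
}

function evaluateReadiness(maintenance: boolean, probes: Probes): Verdict {
  // Maintenance wins before any probe runs.
  if (maintenance) return { code: 503, status: "maintenance" };

  const results: SubStatus[] = [
    probes.database(),
    probes.ingest(),
    probes.webhookQueue(),
  ];
  // Only "unavailable" fails readiness; "degraded" keeps the instance in rotation.
  const unavailable = results.some((s) => s === "unavailable");
  return unavailable
    ? { code: 503, status: "not_ready" }
    : { code: 200, status: "ready" };
}
```

Passing counting thunks makes it possible to check that no dependency is touched while maintenance mode is on.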