From def6bce7c469fae5abcd3e3c3f8b99396328231b Mon Sep 17 00:00:00 2001
From: mikkyvans0-source
Date: Tue, 2 Jun 2026 12:02:49 +0100
Subject: [PATCH 1/2] feat: split liveness and readiness probes with dependency checks
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The app exposed a single flat /health that always returned status: ok,
with no distinction between liveness (process up) and readiness
(dependencies healthy). Orchestrators could therefore route traffic to
instances that were up but unable to serve.

Split the signal into two probes, mounted at the root and unauthenticated
so orchestrators can reach them:

- /health, /livez — cheap, dependency-free liveness check (always 200).
- /readyz — readiness check that probes DB connectivity (pingDatabase),
  ingest lag (lagMonitor), and webhook queue health, honours maintenance
  mode, and returns 503 when not ready. Reuses the SubStatus / degradation
  pattern from the monitoring route: "degraded" stays in rotation,
  "unavailable" fails readiness.

Add a pingDatabase() helper (SELECT 1 round-trip), readiness.test.ts
covering DB-down, critical lag, maintenance mode, partial failure and
queue-saturation edge cases plus information-leak checks, and
docs/health.md documenting probe semantics.

Probe responses expose only coarse status enums — no hostnames, versions,
ledger numbers, or error messages.

Co-Authored-By: Claude Opus 4.8
---
 backend/docs/health.md              | 132 ++++++++++++++
 backend/src/app.ts                  |  12 +-
 backend/src/lib/database.ts         |  21 +++
 backend/src/routes/health.ts        | 131 ++++++++++++++
 backend/src/tests/readiness.test.ts | 257 ++++++++++++++++++++++++++++
 5 files changed, 545 insertions(+), 8 deletions(-)
 create mode 100644 backend/docs/health.md
 create mode 100644 backend/src/routes/health.ts
 create mode 100644 backend/src/tests/readiness.test.ts

diff --git a/backend/docs/health.md b/backend/docs/health.md
new file mode 100644
index 00000000..a9d2df7b
--- /dev/null
+++ b/backend/docs/health.md
@@ -0,0 +1,132 @@
+# Health, Liveness, and Readiness Probes
+
+The backend exposes two distinct kinds of health signal. They answer different
+questions and orchestrators (Kubernetes, ECS, Nomad, …) act on them
+differently. Conflating them — as a single flat `/health` that always returns
+`ok` — causes traffic to be routed to instances that are up but unable to serve.
+
+All probes are mounted at the **root** of the app (not under `/api/v1`) and are
+**unauthenticated**, because orchestrators probe them without credentials.
+
+| Endpoint | Kind | Cost | Checks dependencies | Healthy | Unhealthy |
+| ---------- | --------- | ----- | ------------------- | ------- | --------- |
+| `/health` | Liveness | cheap | no | `200` | — |
+| `/livez` | Liveness | cheap | no | `200` | — |
+| `/readyz` | Readiness | real | yes | `200` | `503` |
+
+## Liveness — `/health`, `/livez`
+
+> "Is the process up and able to serve an HTTP request at all?"
+
+Liveness is cheap and **dependency-free**. It returns `200` whenever the event
+loop can service a request. A failing liveness probe instructs the orchestrator
+to **restart** the container, so it must never consult downstream dependencies —
+a transient database blip should not trigger a restart loop.
+
+`/health` is retained for backward compatibility; `/livez` is the conventional
+alias. They are identical.
+
+```json
+{ "status": "ok", "timestamp": "2026-06-02T12:00:00.000Z" }
+```
+
+## Readiness — `/readyz`
+
+> "Should this instance receive traffic right now?"
+
+Readiness probes real dependencies. A failing readiness probe pulls the instance
+out of the load-balancer rotation **without restarting it**, so it can recover
+and rejoin once its dependencies are healthy again.
+
+It returns:
+
+- `200` with `status: "ready"` when the instance can serve traffic.
+- `503` with `status: "not_ready"` when a hard dependency is unavailable.
+- `503` with `status: "maintenance"` when maintenance mode is enabled.
+
+```json
+{
+  "status": "ready",
+  "database": "ok",
+  "ingest": "ok",
+  "webhookQueue": "ok",
+  "timestamp": "2026-06-02T12:00:00.000Z"
+}
+```
+
+### Sub-status semantics
+
+Each dependency reports a coarse `SubStatus`, the same pattern used by
+`/api/v1/admin/monitoring`:
+
+- `ok` — healthy.
+- `degraded` — serving but impaired. **Does not** fail readiness.
+- `unavailable` — could not be reached / unusable. **Fails** readiness (`503`).
+
+| Sub-status | Probe | `degraded` when | `unavailable` when |
+| -------------- | ------------------------------ | ------------------------------------------ | ------------------------------------------- |
+| `database` | `pingDatabase()` (`SELECT 1`) | — | connection cannot open or execute |
+| `ingest` | `lagMonitor.getLagStatus()` | lag ≥ warn threshold, < critical threshold | lag ≥ critical threshold, or probe throws |
+| `webhookQueue` | `webhookQueueService.getStats()` | queue is saturated (`size ≥ capacity`) | the queue's backing store is unreachable |
+
+Ingest lag thresholds are governed by `LagMonitor` and configurable via
+`LAG_WARN_THRESHOLD` / `LAG_CRITICAL_THRESHOLD`. See [reliability.md](./reliability.md).
+
+The instance is **not ready** (`503`) if *any* sub-status is `unavailable`.
+`degraded` sub-statuses are surfaced for observability but keep the instance in
+rotation: a slightly stale index or a back-pressured queue is still serviceable.
+
+### Maintenance mode
+
+When `statusService.isMaintenanceEnabled()` is true, `/readyz` short-circuits
+**before** probing any dependency and returns `503` with `status: "maintenance"`.
+The instance is intentionally not serving and should be pulled from rotation.
+Liveness is unaffected — the process is healthy, just drained.
+
+## Edge-case behaviour
+
+| Scenario | `/health`, `/livez` | `/readyz` |
+| ---------------------------- | ------------------- | ----------------------------------------------- |
+| All healthy | `200 ok` | `200 ready` |
+| Database down | `200 ok` | `503 not_ready`, `database: unavailable` |
+| Warn-level lag | `200 ok` | `200 ready`, `ingest: degraded` |
+| Critical lag | `200 ok` | `503 not_ready`, `ingest: unavailable` |
+| Lag probe throws | `200 ok` | `503 not_ready`, `ingest: unavailable` |
+| Queue store unreachable | `200 ok` | `503 not_ready`, `webhookQueue: unavailable` |
+| Queue saturated | `200 ok` | `200 ready`, `webhookQueue: degraded` |
+| Maintenance mode | `200 ok` | `503 maintenance` |
+| Partial failure (one dep) | `200 ok` | `503 not_ready` (failing dep `unavailable`, rest `ok`) |
+
+## Security
+
+These probes are unauthenticated, so their responses are deliberately minimal.
+They expose only the coarse status enums above and a timestamp. They do **not**
+leak:
+
+- internal hostnames or connection strings,
+- application or dependency versions,
+- absolute ledger numbers or ingest-lag values,
+- queue depths or capacities,
+- underlying exception messages (dependency errors are caught and collapsed to
+  `unavailable`).
+
+Richer, sensitive diagnostics (queue depths, invariant counters, cursor
+positions, versions) remain behind API-key auth at
+[`/api/v1/admin/monitoring`](./admin-monitoring.md).
+
+## Orchestrator configuration (Kubernetes example)
+
+```yaml
+livenessProbe:
+  httpGet:
+    path: /livez
+    port: 3000
+  initialDelaySeconds: 5
+  periodSeconds: 10
+readinessProbe:
+  httpGet:
+    path: /readyz
+    port: 3000
+  initialDelaySeconds: 5
+  periodSeconds: 5
+```
diff --git a/backend/src/app.ts b/backend/src/app.ts
index 08350047..99d3fb48 100644
--- a/backend/src/app.ts
+++ b/backend/src/app.ts
@@ -9,6 +9,7 @@ import { csrfMiddleware } from "./middleware/csrf";
 import { corsOptionsDelegate, webhookCorsOptions } from "./config/cors";
 import v1Routes from "./routes/v1";
 import webhookRoutes from "./routes/webhooks";
+import healthRoutes from "./routes/health";
 import { requestLogger } from "./middleware/request-logger";
 
 const app = express();
@@ -47,14 +48,9 @@ app.use("/api/webhooks", cors(webhookCorsOptions), webhookRoutes);
 app.use(csrfMiddleware);
 app.use("/api/v1", v1Routes);
 
-// Health check (root level as well if needed)
-app.get("/health", (req, res) => {
-  res.json({
-    status: "ok",
-    version: "1.0.0",
-    timestamp: new Date().toISOString(),
-  });
-});
+// Liveness (/health, /livez) and readiness (/readyz) probes.
+// Mounted at the root and left unauthenticated so orchestrators can probe them.
+app.use(healthRoutes);
 
 // 404 handler
 app.use((req, res) => {
diff --git a/backend/src/lib/database.ts b/backend/src/lib/database.ts
index 979cb367..6e444921 100644
--- a/backend/src/lib/database.ts
+++ b/backend/src/lib/database.ts
@@ -77,6 +77,27 @@ export function getStatementCacheStats() {
   };
 }
 
+/**
+ * Probe database connectivity with a trivial round-trip query.
+ *
+ * Used by the readiness endpoint to verify the SQLite connection can both
+ * open and execute. Returns true on success, false on any failure (a locked,
+ * corrupt, or unopenable database). Never throws so callers can branch on the
+ * boolean without their own try/catch.
+ *
+ * The query (`SELECT 1`) is constant and parameter-free, so it carries no
+ * user input and leaks no schema details.
+ */
+export function pingDatabase(): boolean {
+  try {
+    const db = getDatabase();
+    const row = db.prepare("SELECT 1 AS ok").get();
+    return row?.ok === 1;
+  } catch {
+    return false;
+  }
+}
+
 /**
  * Close the database connection and clear the statement cache.
  * Ensures clean shutdown and prevents memory leaks.
diff --git a/backend/src/routes/health.ts b/backend/src/routes/health.ts
new file mode 100644
index 00000000..a1da1199
--- /dev/null
+++ b/backend/src/routes/health.ts
@@ -0,0 +1,131 @@
+/**
+ * Liveness and readiness probes.
+ *
+ * These endpoints are intentionally mounted at the root of the app (not under
+ * /api/v1) and are unauthenticated, because container orchestrators (Kubernetes,
+ * ECS, Nomad, …) probe them without credentials.
+ *
+ * Two distinct concerns:
+ *
+ *   GET /health, GET /livez — Liveness. "Is the process up and able to serve
+ *   an HTTP request at all?" Cheap and dependency-free. A failing liveness
+ *   probe tells the orchestrator to restart the container, so it must NOT
+ *   consult downstream dependencies — a transient DB blip should not trigger
+ *   a restart loop.
+ *
+ *   GET /readyz — Readiness. "Should this instance receive traffic right now?"
+ *   Probes real dependencies (DB connectivity, ingest lag, webhook queue) and
+ *   returns 503 when any hard dependency is unavailable or when maintenance
+ *   mode is enabled. A failing readiness probe pulls the instance out of the
+ *   load-balancer rotation without restarting it.
+ *
+ * Security: responses expose only coarse status enums per sub-system. They do
+ * not leak internal hostnames, versions, queue depths, ledger numbers, or error
+ * messages to unauthenticated callers. The richer, authenticated diagnostics
+ * remain under /api/v1/admin/monitoring.
+ */
+
+import { Router, Request, Response } from "express";
+import { pingDatabase } from "../lib/database";
+import { statusService } from "../services/statusService";
+import { lagMonitor } from "../services/lagMonitor";
+import { webhookQueueService } from "../services/webhookQueueService";
+
+const router = Router();
+
+/**
+ * Coarse per-dependency status, mirroring the SubStatus pattern used by
+ * /api/v1/admin/monitoring. "degraded" means serving but impaired (does not
+ * fail readiness); "unavailable" means the dependency could not be reached
+ * (fails readiness).
+ */
+type SubStatus = "ok" | "degraded" | "unavailable";
+
+type ReadyStatus = "ready" | "not_ready" | "maintenance";
+
+// ---------------------------------------------------------------------------
+// Liveness
+// ---------------------------------------------------------------------------
+
+function liveness(_req: Request, res: Response): void {
+  res.json({
+    status: "ok",
+    timestamp: new Date().toISOString(),
+  });
+}
+
+// Keep the historical /health path as a liveness check, and add the
+// conventional /livez alias.
+router.get("/health", liveness);
+router.get("/livez", liveness);
+
+// ---------------------------------------------------------------------------
+// Readiness
+// ---------------------------------------------------------------------------
+
+router.get("/readyz", async (_req: Request, res: Response) => {
+  // Maintenance mode short-circuits readiness: the instance is intentionally
+  // not serving, so it should be pulled from rotation regardless of deps.
+  if (statusService.isMaintenanceEnabled()) {
+    res.status(503).json({
+      status: "maintenance" as ReadyStatus,
+      database: "ok" as SubStatus,
+      ingest: "ok" as SubStatus,
+      webhookQueue: "ok" as SubStatus,
+      timestamp: new Date().toISOString(),
+    });
+    return;
+  }
+
+  // --- Database connectivity (hard dependency) ---------------------------
+  let database: SubStatus = "ok";
+  if (!pingDatabase()) {
+    database = "unavailable";
+  }
+
+  // --- Ingest lag --------------------------------------------------------
+  // Reuse the LagMonitor degradation logic. "warn" lag is degraded but still
+  // serviceable; "critical" lag means the indexed view is too stale to trust,
+  // so we treat it as unavailable for readiness.
+  let ingest: SubStatus = "ok";
+  try {
+    const lag = await lagMonitor.getLagStatus();
+    if (lag.isCritical) {
+      ingest = "unavailable";
+    } else if (lag.isDegraded) {
+      ingest = "degraded";
+    }
+  } catch {
+    ingest = "unavailable";
+  }
+
+  // --- Webhook queue health (hard dependency on its backing store) -------
+  // A throw here means the queue's store is unreachable. Saturation (queue at
+  // capacity) is back-pressure, not unreadiness, so it is reported as degraded.
+  let webhookQueue: SubStatus = "ok";
+  try {
+    const stats = webhookQueueService.getStats();
+    if (stats.capacity > 0 && stats.size >= stats.capacity) {
+      webhookQueue = "degraded";
+    }
+  } catch {
+    webhookQueue = "unavailable";
+  }
+
+  const unavailable =
+    database === "unavailable" ||
+    ingest === "unavailable" ||
+    webhookQueue === "unavailable";
+
+  const status: ReadyStatus = unavailable ? "not_ready" : "ready";
+
+  res.status(unavailable ? 503 : 200).json({
+    status,
+    database,
+    ingest,
+    webhookQueue,
+    timestamp: new Date().toISOString(),
+  });
+});
+
+export default router;
diff --git a/backend/src/tests/readiness.test.ts b/backend/src/tests/readiness.test.ts
new file mode 100644
index 00000000..68dec63f
--- /dev/null
+++ b/backend/src/tests/readiness.test.ts
@@ -0,0 +1,257 @@
+/**
+ * Liveness and readiness probe tests.
+ *
+ * Covers:
+ *   - Liveness (/health, /livez) is cheap, always 200, dependency-free.
+ *   - Readiness (/readyz) probes DB connectivity, ingest lag, and the webhook
+ *     queue, and honours maintenance mode.
+ *   - Edge cases: DB down, high (critical) lag, maintenance mode, partial
+ *     dependency failure, queue saturation.
+ *   - Security: probes do not leak internal hostnames, versions, or error
+ *     details to unauthenticated callers.
+ */
+
+import express from "express";
+import supertest from "supertest";
+import healthRoutes from "../routes/health";
+import { statusService } from "../services/statusService";
+import { lagMonitor } from "../services/lagMonitor";
+import { webhookQueueService } from "../services/webhookQueueService";
+import * as database from "../lib/database";
+
+// Mount the health router the same way app.ts does: at the root, with no auth.
+// Probes are unauthenticated, so no X-API-Key header is sent anywhere here.
+// (We mount the router in isolation rather than importing the full app so the
+// probe behaviour is exercised independently of the rest of the route graph.)
+const app = express();
+app.use(express.json());
+app.use(healthRoutes);
+
+const HEALTHY_QUEUE_STATS = {
+  depth: 0,
+  size: 0,
+  capacity: 5000,
+  overflowCount: 0,
+  pendingCount: 0,
+  successCount: 0,
+  failureCount: 0,
+  oldestTimestamp: null,
+};
+
+beforeEach(() => {
+  // Healthy baseline: maintenance off, lag well under the warn threshold.
+  statusService.setMaintenanceMode(false);
+  statusService.updateLastIndexedLedger(100000);
+  statusService.setMockCurrentLedger(100002); // lag = 2
+
+  // The test database has no webhook_queue schema, so stub the queue stats to
+  // a healthy value by default. Individual tests override this to exercise the
+  // saturated / unavailable paths. This keeps the suite focused on probe logic
+  // rather than queue persistence (covered by webhookQueue.persist.test.ts).
+  jest
+    .spyOn(webhookQueueService, "getStats")
+    .mockReturnValue(HEALTHY_QUEUE_STATS as any);
+});
+
+afterEach(() => {
+  statusService.setMaintenanceMode(false);
+  statusService.setMockCurrentLedger(null);
+  jest.restoreAllMocks();
+});
+
+// ---------------------------------------------------------------------------
+// Liveness
+// ---------------------------------------------------------------------------
+
+describe("Liveness probe", () => {
+  it("GET /health returns 200 with status ok", async () => {
+    const res = await supertest(app).get("/health");
+    expect(res.status).toBe(200);
+    expect(res.body.status).toBe("ok");
+    expect(res.body.timestamp).toMatch(/^\d{4}-\d{2}-\d{2}T/);
+  });
+
+  it("GET /livez returns 200 with status ok", async () => {
+    const res = await supertest(app).get("/livez");
+    expect(res.status).toBe(200);
+    expect(res.body.status).toBe("ok");
+  });
+
+  it("liveness stays up even when a dependency is down", async () => {
+    jest.spyOn(database, "pingDatabase").mockReturnValue(false);
+    const res = await supertest(app).get("/health");
+    expect(res.status).toBe(200);
+    expect(res.body.status).toBe("ok");
+  });
+
+  it("liveness is dependency-free (does not call pingDatabase)", async () => {
+    const spy = jest.spyOn(database, "pingDatabase");
+    await supertest(app).get("/livez");
+    expect(spy).not.toHaveBeenCalled();
+  });
+});
+
+// ---------------------------------------------------------------------------
+// Readiness — happy path
+// ---------------------------------------------------------------------------
+
+describe("Readiness probe — ready", () => {
+  it("GET /readyz returns 200 when all dependencies are healthy", async () => {
+    const res = await supertest(app).get("/readyz");
+    expect(res.status).toBe(200);
+    expect(res.body.status).toBe("ready");
+    expect(res.body.database).toBe("ok");
+    expect(res.body.ingest).toBe("ok");
+    expect(res.body.webhookQueue).toBe("ok");
+    expect(res.body.timestamp).toMatch(/^\d{4}-\d{2}-\d{2}T/);
+  });
+
+  it("stays ready with warn-level (degraded) lag", async () => {
+    statusService.setMockCurrentLedger(100020); // lag = 20, >= warn(10), < critical(50)
+    const res = await supertest(app).get("/readyz");
+    expect(res.status).toBe(200);
+    expect(res.body.status).toBe("ready");
+    expect(res.body.ingest).toBe("degraded");
+  });
+});
+
+// ---------------------------------------------------------------------------
+// Readiness — maintenance mode
+// ---------------------------------------------------------------------------
+
+describe("Readiness probe — maintenance mode", () => {
+  it("returns 503 with maintenance status when maintenance is enabled", async () => {
+    statusService.setMaintenanceMode(true);
+    const res = await supertest(app).get("/readyz");
+    expect(res.status).toBe(503);
+    expect(res.body.status).toBe("maintenance");
+  });
+
+  it("short-circuits before probing dependencies", async () => {
+    statusService.setMaintenanceMode(true);
+    const spy = jest.spyOn(database, "pingDatabase");
+    await supertest(app).get("/readyz");
+    expect(spy).not.toHaveBeenCalled();
+  });
+});
+
+// ---------------------------------------------------------------------------
+// Readiness — dependency failures
+// ---------------------------------------------------------------------------
+
+describe("Readiness probe — DB down", () => {
+  it("returns 503 not_ready when the database is unreachable", async () => {
+    jest.spyOn(database, "pingDatabase").mockReturnValue(false);
+    const res = await supertest(app).get("/readyz");
+    expect(res.status).toBe(503);
+    expect(res.body.status).toBe("not_ready");
+    expect(res.body.database).toBe("unavailable");
+  });
+});
+
+describe("Readiness probe — high lag", () => {
+  it("returns 503 not_ready when ingest lag is critical", async () => {
+    statusService.setMockCurrentLedger(100100); // lag = 100, >= critical(50)
+    const res = await supertest(app).get("/readyz");
+    expect(res.status).toBe(503);
+    expect(res.body.status).toBe("not_ready");
+    expect(res.body.ingest).toBe("unavailable");
+  });
+
+  it("returns 503 not_ready when the lag probe throws", async () => {
+    jest.spyOn(lagMonitor, "getLagStatus").mockRejectedValue(new Error("rpc down"));
+    const res = await supertest(app).get("/readyz");
+    expect(res.status).toBe(503);
+    expect(res.body.ingest).toBe("unavailable");
+  });
+});
+
+describe("Readiness probe — webhook queue", () => {
+  it("returns 503 not_ready when the queue store is unreachable", async () => {
+    jest.spyOn(webhookQueueService, "getStats").mockImplementation(() => {
+      throw new Error("queue store unavailable");
+    });
+    const res = await supertest(app).get("/readyz");
+    expect(res.status).toBe(503);
+    expect(res.body.status).toBe("not_ready");
+    expect(res.body.webhookQueue).toBe("unavailable");
+  });
+
+  it("stays ready (degraded) when the queue is saturated", async () => {
+    jest.spyOn(webhookQueueService, "getStats").mockReturnValue({
+      depth: 5000,
+      size: 5000,
+      capacity: 5000,
+      overflowCount: 3,
+      pendingCount: 5000,
+      successCount: 0,
+      failureCount: 0,
+      oldestTimestamp: null,
+    } as any);
+    const res = await supertest(app).get("/readyz");
+    expect(res.status).toBe(200);
+    expect(res.body.status).toBe("ready");
+    expect(res.body.webhookQueue).toBe("degraded");
+  });
+});
+
+describe("Readiness probe — partial dependency failure", () => {
+  it("a single unavailable dependency fails readiness while others stay ok", async () => {
+    jest.spyOn(database, "pingDatabase").mockReturnValue(false);
+    statusService.setMockCurrentLedger(100002); // lag = 2, healthy
+    const res = await supertest(app).get("/readyz");
+    expect(res.status).toBe(503);
+    expect(res.body.status).toBe("not_ready");
+    expect(res.body.database).toBe("unavailable");
+    expect(res.body.ingest).toBe("ok");
+    expect(res.body.webhookQueue).toBe("ok");
+  });
+
+  it("degraded lag plus DB down still reports both sub-statuses", async () => {
+    jest.spyOn(database, "pingDatabase").mockReturnValue(false);
+    statusService.setMockCurrentLedger(100020); // lag = 20, degraded
+    const res = await supertest(app).get("/readyz");
+    expect(res.status).toBe(503);
+    expect(res.body.database).toBe("unavailable");
+    expect(res.body.ingest).toBe("degraded");
+  });
+});
+
+// ---------------------------------------------------------------------------
+// Security — no information leakage to unauthenticated callers
+// ---------------------------------------------------------------------------
+
+describe("Readiness probe — does not leak internal details", () => {
+  const sub = ["ok", "degraded", "unavailable"];
+
+  it("readiness response contains only coarse status fields", async () => {
+    const res = await supertest(app).get("/readyz");
+    expect(Object.keys(res.body).sort()).toEqual(
+      ["database", "ingest", "status", "timestamp", "webhookQueue"].sort()
+    );
+    // No version, hostname, ledger numbers, queue depths, or error strings.
+    expect(res.body).not.toHaveProperty("version");
+    expect(res.body).not.toHaveProperty("host");
+    expect(res.body).not.toHaveProperty("lag");
+    expect(res.body).not.toHaveProperty("error");
+    expect(sub).toContain(res.body.database);
+    expect(sub).toContain(res.body.ingest);
+    expect(sub).toContain(res.body.webhookQueue);
+  });
+
+  it("does not surface the underlying error message when a dependency throws", async () => {
+    jest
+      .spyOn(lagMonitor, "getLagStatus")
+      .mockRejectedValue(new Error("postgres://secret-host:5432 refused"));
+    const res = await supertest(app).get("/readyz");
+    const serialized = JSON.stringify(res.body);
+    expect(serialized).not.toContain("secret-host");
+    expect(serialized).not.toContain("postgres");
+  });
+
+  it("liveness response contains only status and timestamp", async () => {
+    const res = await supertest(app).get("/health");
+    expect(Object.keys(res.body).sort()).toEqual(["status", "timestamp"].sort());
+    expect(res.body).not.toHaveProperty("version");
+  });
+});

From 9c7e079eeb33b1898ba3cbf3eebdf53d64e4e159 Mon Sep 17 00:00:00 2001
From: mikkyvans0-source
Date: Tue, 2 Jun 2026 12:02:49 +0100
Subject: [PATCH 2/2] feat: split liveness and readiness probes with dependency checks
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The app exposed a single flat /health that always returned status: ok,
with no distinction between liveness (process up) and readiness
(dependencies healthy). Orchestrators could therefore route traffic to
instances that were up but unable to serve.

Split the signal into two probes, mounted at the root and unauthenticated
so orchestrators can reach them:

- /health, /livez — cheap, dependency-free liveness check (always 200).
- /readyz — readiness check that probes DB connectivity (pingDatabase),
  ingest lag (lagMonitor), and webhook queue health, honours maintenance
  mode, and returns 503 when not ready.
  Reuses the SubStatus / degradation pattern from the monitoring route:
  "degraded" stays in rotation, "unavailable" fails readiness.

Add a pingDatabase() helper (SELECT 1 round-trip), readiness.test.ts
covering DB-down, critical lag, maintenance mode, partial failure and
queue-saturation edge cases plus information-leak checks, and
docs/health.md documenting probe semantics.

Probe responses expose only coarse status enums — no hostnames, versions,
ledger numbers, or error messages.
---
 backend/docs/health.md              | 132 ++++++++++++++
 backend/src/app.ts                  |  12 +-
 backend/src/lib/database.ts         |  21 +++
 backend/src/routes/health.ts        | 131 ++++++++++++++
 backend/src/tests/readiness.test.ts | 257 ++++++++++++++++++++++++++++
 5 files changed, 545 insertions(+), 8 deletions(-)
 create mode 100644 backend/docs/health.md
 create mode 100644 backend/src/routes/health.ts
 create mode 100644 backend/src/tests/readiness.test.ts

diff --git a/backend/docs/health.md b/backend/docs/health.md
new file mode 100644
index 00000000..a9d2df7b
--- /dev/null
+++ b/backend/docs/health.md
@@ -0,0 +1,132 @@
+# Health, Liveness, and Readiness Probes
+
+The backend exposes two distinct kinds of health signal. They answer different
+questions and orchestrators (Kubernetes, ECS, Nomad, …) act on them
+differently. Conflating them — as a single flat `/health` that always returns
+`ok` — causes traffic to be routed to instances that are up but unable to serve.
+
+All probes are mounted at the **root** of the app (not under `/api/v1`) and are
+**unauthenticated**, because orchestrators probe them without credentials.
+
+| Endpoint | Kind | Cost | Checks dependencies | Healthy | Unhealthy |
+| ---------- | --------- | ----- | ------------------- | ------- | --------- |
+| `/health` | Liveness | cheap | no | `200` | — |
+| `/livez` | Liveness | cheap | no | `200` | — |
+| `/readyz` | Readiness | real | yes | `200` | `503` |
+
+## Liveness — `/health`, `/livez`
+
+> "Is the process up and able to serve an HTTP request at all?"
+
+Liveness is cheap and **dependency-free**. It returns `200` whenever the event
+loop can service a request. A failing liveness probe instructs the orchestrator
+to **restart** the container, so it must never consult downstream dependencies —
+a transient database blip should not trigger a restart loop.
+
+`/health` is retained for backward compatibility; `/livez` is the conventional
+alias. They are identical.
+
+```json
+{ "status": "ok", "timestamp": "2026-06-02T12:00:00.000Z" }
+```
+
+## Readiness — `/readyz`
+
+> "Should this instance receive traffic right now?"
+
+Readiness probes real dependencies. A failing readiness probe pulls the instance
+out of the load-balancer rotation **without restarting it**, so it can recover
+and rejoin once its dependencies are healthy again.
+
+It returns:
+
+- `200` with `status: "ready"` when the instance can serve traffic.
+- `503` with `status: "not_ready"` when a hard dependency is unavailable.
+- `503` with `status: "maintenance"` when maintenance mode is enabled.
+
+```json
+{
+  "status": "ready",
+  "database": "ok",
+  "ingest": "ok",
+  "webhookQueue": "ok",
+  "timestamp": "2026-06-02T12:00:00.000Z"
+}
+```
+
+### Sub-status semantics
+
+Each dependency reports a coarse `SubStatus`, the same pattern used by
+`/api/v1/admin/monitoring`:
+
+- `ok` — healthy.
+- `degraded` — serving but impaired. **Does not** fail readiness.
+- `unavailable` — could not be reached / unusable. **Fails** readiness (`503`).
+
+| Sub-status | Probe | `degraded` when | `unavailable` when |
+| -------------- | ------------------------------ | ------------------------------------------ | ------------------------------------------- |
+| `database` | `pingDatabase()` (`SELECT 1`) | — | connection cannot open or execute |
+| `ingest` | `lagMonitor.getLagStatus()` | lag ≥ warn threshold, < critical threshold | lag ≥ critical threshold, or probe throws |
+| `webhookQueue` | `webhookQueueService.getStats()` | queue is saturated (`size ≥ capacity`) | the queue's backing store is unreachable |
+
+Ingest lag thresholds are governed by `LagMonitor` and configurable via
+`LAG_WARN_THRESHOLD` / `LAG_CRITICAL_THRESHOLD`. See [reliability.md](./reliability.md).
+
+The instance is **not ready** (`503`) if *any* sub-status is `unavailable`.
+`degraded` sub-statuses are surfaced for observability but keep the instance in
+rotation: a slightly stale index or a back-pressured queue is still serviceable.
+
+### Maintenance mode
+
+When `statusService.isMaintenanceEnabled()` is true, `/readyz` short-circuits
+**before** probing any dependency and returns `503` with `status: "maintenance"`.
+The instance is intentionally not serving and should be pulled from rotation.
+Liveness is unaffected — the process is healthy, just drained.
+
+## Edge-case behaviour
+
+| Scenario | `/health`, `/livez` | `/readyz` |
+| ---------------------------- | ------------------- | ----------------------------------------------- |
+| All healthy | `200 ok` | `200 ready` |
+| Database down | `200 ok` | `503 not_ready`, `database: unavailable` |
+| Warn-level lag | `200 ok` | `200 ready`, `ingest: degraded` |
+| Critical lag | `200 ok` | `503 not_ready`, `ingest: unavailable` |
+| Lag probe throws | `200 ok` | `503 not_ready`, `ingest: unavailable` |
+| Queue store unreachable | `200 ok` | `503 not_ready`, `webhookQueue: unavailable` |
+| Queue saturated | `200 ok` | `200 ready`, `webhookQueue: degraded` |
+| Maintenance mode | `200 ok` | `503 maintenance` |
+| Partial failure (one dep) | `200 ok` | `503 not_ready` (failing dep `unavailable`, rest `ok`) |
+
+## Security
+
+These probes are unauthenticated, so their responses are deliberately minimal.
+They expose only the coarse status enums above and a timestamp. They do **not**
+leak:
+
+- internal hostnames or connection strings,
+- application or dependency versions,
+- absolute ledger numbers or ingest-lag values,
+- queue depths or capacities,
+- underlying exception messages (dependency errors are caught and collapsed to
+  `unavailable`).
+
+Richer, sensitive diagnostics (queue depths, invariant counters, cursor
+positions, versions) remain behind API-key auth at
+[`/api/v1/admin/monitoring`](./admin-monitoring.md).
+
+## Orchestrator configuration (Kubernetes example)
+
+```yaml
+livenessProbe:
+  httpGet:
+    path: /livez
+    port: 3000
+  initialDelaySeconds: 5
+  periodSeconds: 10
+readinessProbe:
+  httpGet:
+    path: /readyz
+    port: 3000
+  initialDelaySeconds: 5
+  periodSeconds: 5
+```
diff --git a/backend/src/app.ts b/backend/src/app.ts
index 08350047..99d3fb48 100644
--- a/backend/src/app.ts
+++ b/backend/src/app.ts
@@ -9,6 +9,7 @@ import { csrfMiddleware } from "./middleware/csrf";
 import { corsOptionsDelegate, webhookCorsOptions } from "./config/cors";
 import v1Routes from "./routes/v1";
 import webhookRoutes from "./routes/webhooks";
+import healthRoutes from "./routes/health";
 import { requestLogger } from "./middleware/request-logger";
 
 const app = express();
@@ -47,14 +48,9 @@ app.use("/api/webhooks", cors(webhookCorsOptions), webhookRoutes);
 app.use(csrfMiddleware);
 app.use("/api/v1", v1Routes);
 
-// Health check (root level as well if needed)
-app.get("/health", (req, res) => {
-  res.json({
-    status: "ok",
-    version: "1.0.0",
-    timestamp: new Date().toISOString(),
-  });
-});
+// Liveness (/health, /livez) and readiness (/readyz) probes.
+// Mounted at the root and left unauthenticated so orchestrators can probe them.
+app.use(healthRoutes);
 
 // 404 handler
 app.use((req, res) => {
diff --git a/backend/src/lib/database.ts b/backend/src/lib/database.ts
index 979cb367..6e444921 100644
--- a/backend/src/lib/database.ts
+++ b/backend/src/lib/database.ts
@@ -77,6 +77,27 @@ export function getStatementCacheStats() {
   };
 }
 
+/**
+ * Probe database connectivity with a trivial round-trip query.
+ *
+ * Used by the readiness endpoint to verify the SQLite connection can both
+ * open and execute. Returns true on success, false on any failure (a locked,
+ * corrupt, or unopenable database). Never throws so callers can branch on the
+ * boolean without their own try/catch.
+ * + * The query (`SELECT 1`) is constant and parameter-free, so it carries no + * user input and leaks no schema details. + */ +export function pingDatabase(): boolean { + try { + const db = getDatabase(); + const row = db.prepare("SELECT 1 AS ok").get(); + return row?.ok === 1; + } catch { + return false; + } +} + /** * Close the database connection and clear the statement cache. * Ensures clean shutdown and prevents memory leaks. diff --git a/backend/src/routes/health.ts b/backend/src/routes/health.ts new file mode 100644 index 00000000..a1da1199 --- /dev/null +++ b/backend/src/routes/health.ts @@ -0,0 +1,131 @@ +/** + * Liveness and readiness probes. + * + * These endpoints are intentionally mounted at the root of the app (not under + * /api/v1) and are unauthenticated, because container orchestrators (Kubernetes, + * ECS, Nomad, …) probe them without credentials. + * + * Two distinct concerns: + * + * GET /health, GET /livez — Liveness. "Is the process up and able to serve + * an HTTP request at all?" Cheap and dependency-free. A failing liveness + * probe tells the orchestrator to restart the container, so it must NOT + * consult downstream dependencies — a transient DB blip should not trigger + * a restart loop. + * + * GET /readyz — Readiness. "Should this instance receive traffic right now?" + * Probes real dependencies (DB connectivity, ingest lag, webhook queue) and + * returns 503 when any hard dependency is unavailable or when maintenance + * mode is enabled. A failing readiness probe pulls the instance out of the + * load-balancer rotation without restarting it. + * + * Security: responses expose only coarse status enums per sub-system. They do + * not leak internal hostnames, versions, queue depths, ledger numbers, or error + * messages to unauthenticated callers. The richer, authenticated diagnostics + * remain under /api/v1/admin/monitoring. 
+ */ + +import { Router, Request, Response } from "express"; +import { pingDatabase } from "../lib/database"; +import { statusService } from "../services/statusService"; +import { lagMonitor } from "../services/lagMonitor"; +import { webhookQueueService } from "../services/webhookQueueService"; + +const router = Router(); + +/** + * Coarse per-dependency status, mirroring the SubStatus pattern used by + * /api/v1/admin/monitoring. "degraded" means serving but impaired (does not + * fail readiness); "unavailable" means the dependency could not be reached + * (fails readiness). + */ +type SubStatus = "ok" | "degraded" | "unavailable"; + +type ReadyStatus = "ready" | "not_ready" | "maintenance"; + +// --------------------------------------------------------------------------- +// Liveness +// --------------------------------------------------------------------------- + +function liveness(_req: Request, res: Response): void { + res.json({ + status: "ok", + timestamp: new Date().toISOString(), + }); +} + +// Keep the historical /health path as a liveness check, and add the +// conventional /livez alias. +router.get("/health", liveness); +router.get("/livez", liveness); + +// --------------------------------------------------------------------------- +// Readiness +// --------------------------------------------------------------------------- + +router.get("/readyz", async (_req: Request, res: Response) => { + // Maintenance mode short-circuits readiness: the instance is intentionally + // not serving, so it should be pulled from rotation regardless of deps. 
+ if (statusService.isMaintenanceEnabled()) { + res.status(503).json({ + status: "maintenance" as ReadyStatus, + database: "ok" as SubStatus, + ingest: "ok" as SubStatus, + webhookQueue: "ok" as SubStatus, + timestamp: new Date().toISOString(), + }); + return; + } + + // --- Database connectivity (hard dependency) --------------------------- + let database: SubStatus = "ok"; + if (!pingDatabase()) { + database = "unavailable"; + } + + // --- Ingest lag -------------------------------------------------------- + // Reuse the LagMonitor degradation logic. "warn" lag is degraded but still + // serviceable; "critical" lag means the indexed view is too stale to trust, + // so we treat it as unavailable for readiness. + let ingest: SubStatus = "ok"; + try { + const lag = await lagMonitor.getLagStatus(); + if (lag.isCritical) { + ingest = "unavailable"; + } else if (lag.isDegraded) { + ingest = "degraded"; + } + } catch { + ingest = "unavailable"; + } + + // --- Webhook queue health (hard dependency on its backing store) ------- + // A throw here means the queue's store is unreachable. Saturation (queue at + // capacity) is back-pressure, not unreadiness, so it is reported as degraded. + let webhookQueue: SubStatus = "ok"; + try { + const stats = webhookQueueService.getStats(); + if (stats.capacity > 0 && stats.size >= stats.capacity) { + webhookQueue = "degraded"; + } + } catch { + webhookQueue = "unavailable"; + } + + const unavailable = + database === "unavailable" || + ingest === "unavailable" || + webhookQueue === "unavailable"; + + const status: ReadyStatus = unavailable ? "not_ready" : "ready"; + + res.status(unavailable ? 
503 : 200).json({ + status, + database, + ingest, + webhookQueue, + timestamp: new Date().toISOString(), + }); +}); + +export default router; diff --git a/backend/src/tests/readiness.test.ts b/backend/src/tests/readiness.test.ts new file mode 100644 index 00000000..68dec63f --- /dev/null +++ b/backend/src/tests/readiness.test.ts @@ -0,0 +1,257 @@ +/** + * Liveness and readiness probe tests. + * + * Covers: + * - Liveness (/health, /livez) is cheap, always 200, dependency-free. + * - Readiness (/readyz) probes DB connectivity, ingest lag, and the webhook + * queue, and honours maintenance mode. + * - Edge cases: DB down, high (critical) lag, maintenance mode, partial + * dependency failure, queue saturation. + * - Security: probes do not leak internal hostnames, versions, or error + * details to unauthenticated callers. + */ + +import express from "express"; +import supertest from "supertest"; +import healthRoutes from "../routes/health"; +import { statusService } from "../services/statusService"; +import { lagMonitor } from "../services/lagMonitor"; +import { webhookQueueService } from "../services/webhookQueueService"; +import * as database from "../lib/database"; + +// Mount the health router the same way app.ts does: at the root, with no auth. +// Probes are unauthenticated, so no X-API-Key header is sent anywhere here. +// (We mount the router in isolation rather than importing the full app so the +// probe behaviour is exercised independently of the rest of the route graph.) +const app = express(); +app.use(express.json()); +app.use(healthRoutes); + +const HEALTHY_QUEUE_STATS = { + depth: 0, + size: 0, + capacity: 5000, + overflowCount: 0, + pendingCount: 0, + successCount: 0, + failureCount: 0, + oldestTimestamp: null, +}; + +beforeEach(() => { + // Healthy baseline: maintenance off, lag well under the warn threshold. 
+ statusService.setMaintenanceMode(false); + statusService.updateLastIndexedLedger(100000); + statusService.setMockCurrentLedger(100002); // lag = 2 + + // The test database has no webhook_queue schema, so stub the queue stats to + // a healthy value by default. Individual tests override this to exercise the + // saturated / unavailable paths. This keeps the suite focused on probe logic + // rather than queue persistence (covered by webhookQueue.persist.test.ts). + jest + .spyOn(webhookQueueService, "getStats") + .mockReturnValue(HEALTHY_QUEUE_STATS as any); +}); + +afterEach(() => { + statusService.setMaintenanceMode(false); + statusService.setMockCurrentLedger(null); + jest.restoreAllMocks(); +}); + +// --------------------------------------------------------------------------- +// Liveness +// --------------------------------------------------------------------------- + +describe("Liveness probe", () => { + it("GET /health returns 200 with status ok", async () => { + const res = await supertest(app).get("/health"); + expect(res.status).toBe(200); + expect(res.body.status).toBe("ok"); + expect(res.body.timestamp).toMatch(/^\d{4}-\d{2}-\d{2}T/); + }); + + it("GET /livez returns 200 with status ok", async () => { + const res = await supertest(app).get("/livez"); + expect(res.status).toBe(200); + expect(res.body.status).toBe("ok"); + }); + + it("liveness stays up even when a dependency is down", async () => { + jest.spyOn(database, "pingDatabase").mockReturnValue(false); + const res = await supertest(app).get("/health"); + expect(res.status).toBe(200); + expect(res.body.status).toBe("ok"); + }); + + it("liveness is dependency-free (does not call pingDatabase)", async () => { + const spy = jest.spyOn(database, "pingDatabase"); + await supertest(app).get("/livez"); + expect(spy).not.toHaveBeenCalled(); + }); +}); + +// --------------------------------------------------------------------------- +// Readiness — happy path +// 
--------------------------------------------------------------------------- + +describe("Readiness probe — ready", () => { + it("GET /readyz returns 200 when all dependencies are healthy", async () => { + const res = await supertest(app).get("/readyz"); + expect(res.status).toBe(200); + expect(res.body.status).toBe("ready"); + expect(res.body.database).toBe("ok"); + expect(res.body.ingest).toBe("ok"); + expect(res.body.webhookQueue).toBe("ok"); + expect(res.body.timestamp).toMatch(/^\d{4}-\d{2}-\d{2}T/); + }); + + it("stays ready with warn-level (degraded) lag", async () => { + statusService.setMockCurrentLedger(100020); // lag = 20, >= warn(10), < critical(50) + const res = await supertest(app).get("/readyz"); + expect(res.status).toBe(200); + expect(res.body.status).toBe("ready"); + expect(res.body.ingest).toBe("degraded"); + }); +}); + +// --------------------------------------------------------------------------- +// Readiness — maintenance mode +// --------------------------------------------------------------------------- + +describe("Readiness probe — maintenance mode", () => { + it("returns 503 with maintenance status when maintenance is enabled", async () => { + statusService.setMaintenanceMode(true); + const res = await supertest(app).get("/readyz"); + expect(res.status).toBe(503); + expect(res.body.status).toBe("maintenance"); + }); + + it("short-circuits before probing dependencies", async () => { + statusService.setMaintenanceMode(true); + const spy = jest.spyOn(database, "pingDatabase"); + await supertest(app).get("/readyz"); + expect(spy).not.toHaveBeenCalled(); + }); +}); + +// --------------------------------------------------------------------------- +// Readiness — dependency failures +// --------------------------------------------------------------------------- + +describe("Readiness probe — DB down", () => { + it("returns 503 not_ready when the database is unreachable", async () => { + jest.spyOn(database, 
"pingDatabase").mockReturnValue(false); + const res = await supertest(app).get("/readyz"); + expect(res.status).toBe(503); + expect(res.body.status).toBe("not_ready"); + expect(res.body.database).toBe("unavailable"); + }); +}); + +describe("Readiness probe — high lag", () => { + it("returns 503 not_ready when ingest lag is critical", async () => { + statusService.setMockCurrentLedger(100100); // lag = 100, >= critical(50) + const res = await supertest(app).get("/readyz"); + expect(res.status).toBe(503); + expect(res.body.status).toBe("not_ready"); + expect(res.body.ingest).toBe("unavailable"); + }); + + it("returns 503 not_ready when the lag probe throws", async () => { + jest.spyOn(lagMonitor, "getLagStatus").mockRejectedValue(new Error("rpc down")); + const res = await supertest(app).get("/readyz"); + expect(res.status).toBe(503); + expect(res.body.ingest).toBe("unavailable"); + }); +}); + +describe("Readiness probe — webhook queue", () => { + it("returns 503 not_ready when the queue store is unreachable", async () => { + jest.spyOn(webhookQueueService, "getStats").mockImplementation(() => { + throw new Error("queue store unavailable"); + }); + const res = await supertest(app).get("/readyz"); + expect(res.status).toBe(503); + expect(res.body.status).toBe("not_ready"); + expect(res.body.webhookQueue).toBe("unavailable"); + }); + + it("stays ready (degraded) when the queue is saturated", async () => { + jest.spyOn(webhookQueueService, "getStats").mockReturnValue({ + depth: 5000, + size: 5000, + capacity: 5000, + overflowCount: 3, + pendingCount: 5000, + successCount: 0, + failureCount: 0, + oldestTimestamp: null, + } as any); + const res = await supertest(app).get("/readyz"); + expect(res.status).toBe(200); + expect(res.body.status).toBe("ready"); + expect(res.body.webhookQueue).toBe("degraded"); + }); +}); + +describe("Readiness probe — partial dependency failure", () => { + it("a single unavailable dependency fails readiness while others stay ok", async () => { + 
jest.spyOn(database, "pingDatabase").mockReturnValue(false); + statusService.setMockCurrentLedger(100002); // lag = 2, healthy + const res = await supertest(app).get("/readyz"); + expect(res.status).toBe(503); + expect(res.body.status).toBe("not_ready"); + expect(res.body.database).toBe("unavailable"); + expect(res.body.ingest).toBe("ok"); + expect(res.body.webhookQueue).toBe("ok"); + }); + + it("degraded lag plus DB down still reports both sub-statuses", async () => { + jest.spyOn(database, "pingDatabase").mockReturnValue(false); + statusService.setMockCurrentLedger(100020); // lag = 20, degraded + const res = await supertest(app).get("/readyz"); + expect(res.status).toBe(503); + expect(res.body.database).toBe("unavailable"); + expect(res.body.ingest).toBe("degraded"); + }); +}); + +// --------------------------------------------------------------------------- +// Security — no information leakage to unauthenticated callers +// --------------------------------------------------------------------------- + +describe("Readiness probe — does not leak internal details", () => { + const sub = ["ok", "degraded", "unavailable"]; + + it("readiness response contains only coarse status fields", async () => { + const res = await supertest(app).get("/readyz"); + expect(Object.keys(res.body).sort()).toEqual( + ["database", "ingest", "status", "timestamp", "webhookQueue"].sort() + ); + // No version, hostname, ledger numbers, queue depths, or error strings. 
+ expect(res.body).not.toHaveProperty("version"); + expect(res.body).not.toHaveProperty("host"); + expect(res.body).not.toHaveProperty("lag"); + expect(res.body).not.toHaveProperty("error"); + expect(sub).toContain(res.body.database); + expect(sub).toContain(res.body.ingest); + expect(sub).toContain(res.body.webhookQueue); + }); + + it("does not surface the underlying error message when a dependency throws", async () => { + jest + .spyOn(lagMonitor, "getLagStatus") + .mockRejectedValue(new Error("postgres://secret-host:5432 refused")); + const res = await supertest(app).get("/readyz"); + const serialized = JSON.stringify(res.body); + expect(serialized).not.toContain("secret-host"); + expect(serialized).not.toContain("postgres"); + }); + + it("liveness response contains only status and timestamp", async () => { + const res = await supertest(app).get("/health"); + expect(Object.keys(res.body).sort()).toEqual(["status", "timestamp"].sort()); + expect(res.body).not.toHaveProperty("version"); + }); +});