diff --git a/backend/docs/health.md b/backend/docs/health.md
new file mode 100644
index 00000000..a9d2df7b
--- /dev/null
+++ b/backend/docs/health.md
@@ -0,0 +1,132 @@
+# Health, Liveness, and Readiness Probes
+
+The backend exposes two distinct kinds of health signal. They answer different
+questions and orchestrators (Kubernetes, ECS, Nomad, …) act on them
+differently. Conflating them — as a single flat `/health` that always returns
+`ok` — causes traffic to be routed to instances that are up but unable to serve.
+
+All probes are mounted at the **root** of the app (not under `/api/v1`) and are
+**unauthenticated**, because orchestrators probe them without credentials.
+
+| Endpoint   | Kind      | Cost  | Checks dependencies | Healthy | Unhealthy |
+| ---------- | --------- | ----- | ------------------- | ------- | --------- |
+| `/health`  | Liveness  | cheap | no                  | `200`   | —         |
+| `/livez`   | Liveness  | cheap | no                  | `200`   | —         |
+| `/readyz`  | Readiness | real  | yes                 | `200`   | `503`     |
+
+## Liveness — `/health`, `/livez`
+
+> "Is the process up and able to serve an HTTP request at all?"
+
+Liveness is cheap and **dependency-free**. It returns `200` whenever the event
+loop can service a request. A failing liveness probe instructs the orchestrator
+to **restart** the container, so it must never consult downstream dependencies —
+a transient database blip should not trigger a restart loop.
+
+`/health` is retained for backward compatibility; `/livez` is the conventional
+alias. They are identical.
+
+```json
+{ "status": "ok", "timestamp": "2026-06-02T12:00:00.000Z" }
+```
+
+## Readiness — `/readyz`
+
+> "Should this instance receive traffic right now?"
+
+Readiness probes real dependencies. A failing readiness probe pulls the instance
+out of the load-balancer rotation **without restarting it**, so it can recover
+and rejoin once its dependencies are healthy again.
+
+It returns:
+
+- `200` with `status: "ready"` when the instance can serve traffic.
+- `503` with `status: "not_ready"` when a hard dependency is unavailable.
+- `503` with `status: "maintenance"` when maintenance mode is enabled.
+
+```json
+{
+  "status": "ready",
+  "database": "ok",
+  "ingest": "ok",
+  "webhookQueue": "ok",
+  "timestamp": "2026-06-02T12:00:00.000Z"
+}
+```
+
+### Sub-status semantics
+
+Each dependency reports a coarse `SubStatus`, the same pattern used by
+`/api/v1/admin/monitoring`:
+
+- `ok` — healthy.
+- `degraded` — serving but impaired. **Does not** fail readiness.
+- `unavailable` — could not be reached / unusable. **Fails** readiness (`503`).
+
+| Sub-status     | Probe                          | `degraded` when                            | `unavailable` when                          |
+| -------------- | ------------------------------ | ------------------------------------------ | ------------------------------------------- |
+| `database`     | `pingDatabase()` (`SELECT 1`)  | —                                          | connection cannot open or execute           |
+| `ingest`       | `lagMonitor.getLagStatus()`    | lag ≥ warn threshold, < critical threshold | lag ≥ critical threshold, or probe throws   |
+| `webhookQueue` | `webhookQueueService.getStats()` | queue is saturated (`size ≥ capacity`)     | the queue's backing store is unreachable    |
+
+Ingest lag thresholds are governed by `LagMonitor` and configurable via
+`LAG_WARN_THRESHOLD` / `LAG_CRITICAL_THRESHOLD`. See [reliability.md](./reliability.md).
+
+The instance is **not ready** (`503`) if *any* sub-status is `unavailable`.
+`degraded` sub-statuses are surfaced for observability but keep the instance in
+rotation: a slightly stale index or a back-pressured queue is still serviceable.
+
+### Maintenance mode
+
+When `statusService.isMaintenanceEnabled()` is true, `/readyz` short-circuits
+**before** probing any dependency and returns `503` with `status: "maintenance"`.
+The instance is intentionally not serving and should be pulled from rotation.
+Liveness is unaffected — the process is healthy, just drained.
+
+## Edge-case behaviour
+
+| Scenario                     | `/health`, `/livez` | `/readyz`                                       |
+| ---------------------------- | ------------------- | ----------------------------------------------- |
+| All healthy                  | `200 ok`            | `200 ready`                                     |
+| Database down                | `200 ok`            | `503 not_ready`, `database: unavailable`        |
+| Warn-level lag               | `200 ok`            | `200 ready`, `ingest: degraded`                 |
+| Critical lag                 | `200 ok`            | `503 not_ready`, `ingest: unavailable`          |
+| Lag probe throws             | `200 ok`            | `503 not_ready`, `ingest: unavailable`          |
+| Queue store unreachable      | `200 ok`            | `503 not_ready`, `webhookQueue: unavailable`    |
+| Queue saturated              | `200 ok`            | `200 ready`, `webhookQueue: degraded`           |
+| Maintenance mode             | `200 ok`            | `503 maintenance`                               |
+| Partial failure (one dep)    | `200 ok`            | `503 not_ready` (failing dep `unavailable`, rest `ok`) |
+
+## Security
+
+These probes are unauthenticated, so their responses are deliberately minimal.
+They expose only the coarse status enums above and a timestamp. They do **not**
+leak:
+
+- internal hostnames or connection strings,
+- application or dependency versions,
+- absolute ledger numbers or ingest-lag values,
+- queue depths or capacities,
+- underlying exception messages (dependency errors are caught and collapsed to
+  `unavailable`).
+
+Richer, sensitive diagnostics (queue depths, invariant counters, cursor
+positions, versions) remain behind API-key auth at
+[`/api/v1/admin/monitoring`](./admin-monitoring.md).
+
+## Orchestrator configuration (Kubernetes example)
+
+```yaml
+livenessProbe:
+  httpGet:
+    path: /livez
+    port: 3000
+  initialDelaySeconds: 5
+  periodSeconds: 10
+readinessProbe:
+  httpGet:
+    path: /readyz
+    port: 3000
+  initialDelaySeconds: 5
+  periodSeconds: 5
+```
diff --git a/backend/src/app.ts b/backend/src/app.ts
index 08350047..99d3fb48 100644
--- a/backend/src/app.ts
+++ b/backend/src/app.ts
@@ -9,6 +9,7 @@ import { csrfMiddleware } from "./middleware/csrf";
 import { corsOptionsDelegate, webhookCorsOptions } from "./config/cors";
 import v1Routes from "./routes/v1";
 import webhookRoutes from "./routes/webhooks";
+import healthRoutes from "./routes/health";
 import { requestLogger } from "./middleware/request-logger";
 
 const app = express();
@@ -47,14 +48,9 @@ app.use("/api/webhooks", cors(webhookCorsOptions), webhookRoutes);
 app.use(csrfMiddleware);
 app.use("/api/v1", v1Routes);
 
-// Health check (root level as well if needed)
-app.get("/health", (req, res) => {
-  res.json({
-    status: "ok",
-    version: "1.0.0",
-    timestamp: new Date().toISOString(),
-  });
-});
+// Liveness (/health, /livez) and readiness (/readyz) probes.
+// Mounted at the root and left unauthenticated so orchestrators can probe them.
+app.use(healthRoutes);
 
 // 404 handler
 app.use((req, res) => {
diff --git a/backend/src/lib/database.ts b/backend/src/lib/database.ts
index 979cb367..6e444921 100644
--- a/backend/src/lib/database.ts
+++ b/backend/src/lib/database.ts
@@ -77,6 +77,27 @@ export function getStatementCacheStats() {
   };
 }
 
+/**
+ * Probe database connectivity with a trivial round-trip query.
+ *
+ * Used by the readiness endpoint to verify the SQLite connection can both
+ * open and execute. Returns true on success, false on any failure (a locked,
+ * corrupt, or unopenable database). Never throws so callers can branch on the
+ * boolean without their own try/catch.
+ *
+ * The query (`SELECT 1`) is constant and parameter-free, so it carries no
+ * user input and leaks no schema details.
+ */
+export function pingDatabase(): boolean {
+  try {
+    const db = getDatabase();
+    const row = db.prepare("SELECT 1 AS ok").get();
+    return row?.ok === 1;
+  } catch {
+    return false;
+  }
+}
+
 /**
  * Close the database connection and clear the statement cache.
  * Ensures clean shutdown and prevents memory leaks.
diff --git a/backend/src/routes/health.ts b/backend/src/routes/health.ts
new file mode 100644
index 00000000..a1da1199
--- /dev/null
+++ b/backend/src/routes/health.ts
@@ -0,0 +1,131 @@
+/**
+ * Liveness and readiness probes.
+ *
+ * These endpoints are intentionally mounted at the root of the app (not under
+ * /api/v1) and are unauthenticated, because container orchestrators (Kubernetes,
+ * ECS, Nomad, …) probe them without credentials.
+ *
+ * Two distinct concerns:
+ *
+ *   GET /health, GET /livez — Liveness. "Is the process up and able to serve
+ *     an HTTP request at all?" Cheap and dependency-free. A failing liveness
+ *     probe tells the orchestrator to restart the container, so it must NOT
+ *     consult downstream dependencies — a transient DB blip should not trigger
+ *     a restart loop.
+ *
+ *   GET /readyz — Readiness. "Should this instance receive traffic right now?"
+ *     Probes real dependencies (DB connectivity, ingest lag, webhook queue) and
+ *     returns 503 when any hard dependency is unavailable or when maintenance
+ *     mode is enabled. A failing readiness probe pulls the instance out of the
+ *     load-balancer rotation without restarting it.
+ *
+ * Security: responses expose only coarse status enums per sub-system. They do
+ * not leak internal hostnames, versions, queue depths, ledger numbers, or error
+ * messages to unauthenticated callers. The richer, authenticated diagnostics
+ * remain under /api/v1/admin/monitoring.
+ */
+
+import { Router, Request, Response } from "express";
+import { pingDatabase } from "../lib/database";
+import { statusService } from "../services/statusService";
+import { lagMonitor } from "../services/lagMonitor";
+import { webhookQueueService } from "../services/webhookQueueService";
+
+const router = Router();
+
+/**
+ * Coarse per-dependency status, mirroring the SubStatus pattern used by
+ * /api/v1/admin/monitoring. "degraded" means serving but impaired (does not
+ * fail readiness); "unavailable" means the dependency could not be reached
+ * (fails readiness).
+ */
+type SubStatus = "ok" | "degraded" | "unavailable";
+
+type ReadyStatus = "ready" | "not_ready" | "maintenance";
+
+// ---------------------------------------------------------------------------
+// Liveness
+// ---------------------------------------------------------------------------
+
+function liveness(_req: Request, res: Response): void {
+  res.json({
+    status: "ok",
+    timestamp: new Date().toISOString(),
+  });
+}
+
+// Keep the historical /health path as a liveness check, and add the
+// conventional /livez alias.
+router.get("/health", liveness);
+router.get("/livez", liveness);
+
+// ---------------------------------------------------------------------------
+// Readiness
+// ---------------------------------------------------------------------------
+
+router.get("/readyz", async (_req: Request, res: Response) => {
+  // Maintenance mode short-circuits readiness: the instance is intentionally
+  // not serving, so it should be pulled from rotation regardless of deps.
+  if (statusService.isMaintenanceEnabled()) {
+    res.status(503).json({
+      status: "maintenance" as ReadyStatus,
+      database: "ok" as SubStatus,
+      ingest: "ok" as SubStatus,
+      webhookQueue: "ok" as SubStatus,
+      timestamp: new Date().toISOString(),
+    });
+    return;
+  }
+
+  // --- Database connectivity (hard dependency) ---------------------------
+  let database: SubStatus = "ok";
+  if (!pingDatabase()) {
+    database = "unavailable";
+  }
+
+  // --- Ingest lag --------------------------------------------------------
+  // Reuse the LagMonitor degradation logic. "warn" lag is degraded but still
+  // serviceable; "critical" lag means the indexed view is too stale to trust,
+  // so we treat it as unavailable for readiness.
+  let ingest: SubStatus = "ok";
+  try {
+    const lag = await lagMonitor.getLagStatus();
+    if (lag.isCritical) {
+      ingest = "unavailable";
+    } else if (lag.isDegraded) {
+      ingest = "degraded";
+    }
+  } catch {
+    ingest = "unavailable";
+  }
+
+  // --- Webhook queue health (hard dependency on its backing store) -------
+  // A throw here means the queue's store is unreachable. Saturation (queue at
+  // capacity) is back-pressure, not unreadiness, so it is reported as degraded.
+  let webhookQueue: SubStatus = "ok";
+  try {
+    const stats = webhookQueueService.getStats();
+    if (stats.capacity > 0 && stats.size >= stats.capacity) {
+      webhookQueue = "degraded";
+    }
+  } catch {
+    webhookQueue = "unavailable";
+  }
+
+  const unavailable =
+    database === "unavailable" ||
+    ingest === "unavailable" ||
+    webhookQueue === "unavailable";
+
+  const status: ReadyStatus = unavailable ? "not_ready" : "ready";
+
+  res.status(unavailable ? 503 : 200).json({
+    status,
+    database,
+    ingest,
+    webhookQueue,
+    timestamp: new Date().toISOString(),
+  });
+});
+
+export default router;
diff --git a/backend/src/tests/readiness.test.ts b/backend/src/tests/readiness.test.ts
new file mode 100644
index 00000000..68dec63f
--- /dev/null
+++ b/backend/src/tests/readiness.test.ts
@@ -0,0 +1,257 @@
+/**
+ * Liveness and readiness probe tests.
+ *
+ * Covers:
+ *   - Liveness (/health, /livez) is cheap, always 200, dependency-free.
+ *   - Readiness (/readyz) probes DB connectivity, ingest lag, and the webhook
+ *     queue, and honours maintenance mode.
+ *   - Edge cases: DB down, high (critical) lag, maintenance mode, partial
+ *     dependency failure, queue saturation.
+ *   - Security: probes do not leak internal hostnames, versions, or error
+ *     details to unauthenticated callers.
+ */
+
+import express from "express";
+import supertest from "supertest";
+import healthRoutes from "../routes/health";
+import { statusService } from "../services/statusService";
+import { lagMonitor } from "../services/lagMonitor";
+import { webhookQueueService } from "../services/webhookQueueService";
+import * as database from "../lib/database";
+
+// Mount the health router the same way app.ts does: at the root, with no auth.
+// Probes are unauthenticated, so no X-API-Key header is sent anywhere here.
+// (We mount the router in isolation rather than importing the full app so the
+// probe behaviour is exercised independently of the rest of the route graph.)
+const app = express();
+app.use(express.json());
+app.use(healthRoutes);
+
+const HEALTHY_QUEUE_STATS = {
+  depth: 0,
+  size: 0,
+  capacity: 5000,
+  overflowCount: 0,
+  pendingCount: 0,
+  successCount: 0,
+  failureCount: 0,
+  oldestTimestamp: null,
+};
+
+beforeEach(() => {
+  // Healthy baseline: maintenance off, lag well under the warn threshold.
+  statusService.setMaintenanceMode(false);
+  statusService.updateLastIndexedLedger(100000);
+  statusService.setMockCurrentLedger(100002); // lag = 2
+
+  // The test database has no webhook_queue schema, so stub the queue stats to
+  // a healthy value by default. Individual tests override this to exercise the
+  // saturated / unavailable paths. This keeps the suite focused on probe logic
+  // rather than queue persistence (covered by webhookQueue.persist.test.ts).
+  jest
+    .spyOn(webhookQueueService, "getStats")
+    .mockReturnValue(HEALTHY_QUEUE_STATS as any);
+});
+
+afterEach(() => {
+  statusService.setMaintenanceMode(false);
+  statusService.setMockCurrentLedger(null);
+  jest.restoreAllMocks();
+});
+
+// ---------------------------------------------------------------------------
+// Liveness
+// ---------------------------------------------------------------------------
+
+describe("Liveness probe", () => {
+  it("GET /health returns 200 with status ok", async () => {
+    const res = await supertest(app).get("/health");
+    expect(res.status).toBe(200);
+    expect(res.body.status).toBe("ok");
+    expect(res.body.timestamp).toMatch(/^\d{4}-\d{2}-\d{2}T/);
+  });
+
+  it("GET /livez returns 200 with status ok", async () => {
+    const res = await supertest(app).get("/livez");
+    expect(res.status).toBe(200);
+    expect(res.body.status).toBe("ok");
+  });
+
+  it("liveness stays up even when a dependency is down", async () => {
+    jest.spyOn(database, "pingDatabase").mockReturnValue(false);
+    const res = await supertest(app).get("/health");
+    expect(res.status).toBe(200);
+    expect(res.body.status).toBe("ok");
+  });
+
+  it("liveness is dependency-free (does not call pingDatabase)", async () => {
+    const spy = jest.spyOn(database, "pingDatabase");
+    await supertest(app).get("/livez");
+    expect(spy).not.toHaveBeenCalled();
+  });
+});
+
+// ---------------------------------------------------------------------------
+// Readiness — happy path
+// ---------------------------------------------------------------------------
+
+describe("Readiness probe — ready", () => {
+  it("GET /readyz returns 200 when all dependencies are healthy", async () => {
+    const res = await supertest(app).get("/readyz");
+    expect(res.status).toBe(200);
+    expect(res.body.status).toBe("ready");
+    expect(res.body.database).toBe("ok");
+    expect(res.body.ingest).toBe("ok");
+    expect(res.body.webhookQueue).toBe("ok");
+    expect(res.body.timestamp).toMatch(/^\d{4}-\d{2}-\d{2}T/);
+  });
+
+  it("stays ready with warn-level (degraded) lag", async () => {
+    statusService.setMockCurrentLedger(100020); // lag = 20, >= warn(10), < critical(50)
+    const res = await supertest(app).get("/readyz");
+    expect(res.status).toBe(200);
+    expect(res.body.status).toBe("ready");
+    expect(res.body.ingest).toBe("degraded");
+  });
+});
+
+// ---------------------------------------------------------------------------
+// Readiness — maintenance mode
+// ---------------------------------------------------------------------------
+
+describe("Readiness probe — maintenance mode", () => {
+  it("returns 503 with maintenance status when maintenance is enabled", async () => {
+    statusService.setMaintenanceMode(true);
+    const res = await supertest(app).get("/readyz");
+    expect(res.status).toBe(503);
+    expect(res.body.status).toBe("maintenance");
+  });
+
+  it("short-circuits before probing dependencies", async () => {
+    statusService.setMaintenanceMode(true);
+    const spy = jest.spyOn(database, "pingDatabase");
+    await supertest(app).get("/readyz");
+    expect(spy).not.toHaveBeenCalled();
+  });
+});
+
+// ---------------------------------------------------------------------------
+// Readiness — dependency failures
+// ---------------------------------------------------------------------------
+
+describe("Readiness probe — DB down", () => {
+  it("returns 503 not_ready when the database is unreachable", async () => {
+    jest.spyOn(database, "pingDatabase").mockReturnValue(false);
+    const res = await supertest(app).get("/readyz");
+    expect(res.status).toBe(503);
+    expect(res.body.status).toBe("not_ready");
+    expect(res.body.database).toBe("unavailable");
+  });
+});
+
+describe("Readiness probe — high lag", () => {
+  it("returns 503 not_ready when ingest lag is critical", async () => {
+    statusService.setMockCurrentLedger(100100); // lag = 100, >= critical(50)
+    const res = await supertest(app).get("/readyz");
+    expect(res.status).toBe(503);
+    expect(res.body.status).toBe("not_ready");
+    expect(res.body.ingest).toBe("unavailable");
+  });
+
+  it("returns 503 not_ready when the lag probe throws", async () => {
+    jest.spyOn(lagMonitor, "getLagStatus").mockRejectedValue(new Error("rpc down"));
+    const res = await supertest(app).get("/readyz");
+    expect(res.status).toBe(503);
+    expect(res.body.ingest).toBe("unavailable");
+  });
+});
+
+describe("Readiness probe — webhook queue", () => {
+  it("returns 503 not_ready when the queue store is unreachable", async () => {
+    jest.spyOn(webhookQueueService, "getStats").mockImplementation(() => {
+      throw new Error("queue store unavailable");
+    });
+    const res = await supertest(app).get("/readyz");
+    expect(res.status).toBe(503);
+    expect(res.body.status).toBe("not_ready");
+    expect(res.body.webhookQueue).toBe("unavailable");
+  });
+
+  it("stays ready (degraded) when the queue is saturated", async () => {
+    jest.spyOn(webhookQueueService, "getStats").mockReturnValue({
+      depth: 5000,
+      size: 5000,
+      capacity: 5000,
+      overflowCount: 3,
+      pendingCount: 5000,
+      successCount: 0,
+      failureCount: 0,
+      oldestTimestamp: null,
+    } as any);
+    const res = await supertest(app).get("/readyz");
+    expect(res.status).toBe(200);
+    expect(res.body.status).toBe("ready");
+    expect(res.body.webhookQueue).toBe("degraded");
+  });
+});
+
+describe("Readiness probe — partial dependency failure", () => {
+  it("a single unavailable dependency fails readiness while others stay ok", async () => {
+    jest.spyOn(database, "pingDatabase").mockReturnValue(false);
+    statusService.setMockCurrentLedger(100002); // lag = 2, healthy
+    const res = await supertest(app).get("/readyz");
+    expect(res.status).toBe(503);
+    expect(res.body.status).toBe("not_ready");
+    expect(res.body.database).toBe("unavailable");
+    expect(res.body.ingest).toBe("ok");
+    expect(res.body.webhookQueue).toBe("ok");
+  });
+
+  it("degraded lag plus DB down still reports both sub-statuses", async () => {
+    jest.spyOn(database, "pingDatabase").mockReturnValue(false);
+    statusService.setMockCurrentLedger(100020); // lag = 20, degraded
+    const res = await supertest(app).get("/readyz");
+    expect(res.status).toBe(503);
+    expect(res.body.database).toBe("unavailable");
+    expect(res.body.ingest).toBe("degraded");
+  });
+});
+
+// ---------------------------------------------------------------------------
+// Security — no information leakage to unauthenticated callers
+// ---------------------------------------------------------------------------
+
+describe("Readiness probe — does not leak internal details", () => {
+  const sub = ["ok", "degraded", "unavailable"];
+
+  it("readiness response contains only coarse status fields", async () => {
+    const res = await supertest(app).get("/readyz");
+    expect(Object.keys(res.body).sort()).toEqual(
+      ["database", "ingest", "status", "timestamp", "webhookQueue"].sort()
+    );
+    // No version, hostname, ledger numbers, queue depths, or error strings.
+    expect(res.body).not.toHaveProperty("version");
+    expect(res.body).not.toHaveProperty("host");
+    expect(res.body).not.toHaveProperty("lag");
+    expect(res.body).not.toHaveProperty("error");
+    expect(sub).toContain(res.body.database);
+    expect(sub).toContain(res.body.ingest);
+    expect(sub).toContain(res.body.webhookQueue);
+  });
+
+  it("does not surface the underlying error message when a dependency throws", async () => {
+    jest
+      .spyOn(lagMonitor, "getLagStatus")
+      .mockRejectedValue(new Error("postgres://secret-host:5432 refused"));
+    const res = await supertest(app).get("/readyz");
+    const serialized = JSON.stringify(res.body);
+    expect(serialized).not.toContain("secret-host");
+    expect(serialized).not.toContain("postgres");
+  });
+
+  it("liveness response contains only status and timestamp", async () => {
+    const res = await supertest(app).get("/health");
+    expect(Object.keys(res.body).sort()).toEqual(["status", "timestamp"].sort());
+    expect(res.body).not.toHaveProperty("version");
+  });
+});
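
As a reading aid for the diff above, the readiness decision in `backend/src/routes/health.ts` can be sketched as two pure functions. This is a minimal sketch under stated assumptions, not code from the PR: `classifyLag`, `aggregateReadiness`, `LagThresholds`, and `ReadinessResult` are hypothetical names introduced here for illustration, and the threshold values 10 and 50 are taken only from the comments in `readiness.test.ts`, not from `LagMonitor` itself.

```typescript
type SubStatus = "ok" | "degraded" | "unavailable";
type ReadyStatus = "ready" | "not_ready" | "maintenance";

// Hypothetical shape; the PR reads thresholds via LagMonitor instead.
interface LagThresholds {
  warn: number;
  critical: number;
}

// Hypothetical result type pairing the HTTP status with the body status.
interface ReadinessResult {
  httpStatus: 200 | 503;
  status: ReadyStatus;
}

// Maps a raw lag value to a coarse sub-status, following the rule documented
// in health.md: lag >= critical is unavailable, lag >= warn is degraded.
function classifyLag(lag: number, thresholds: LagThresholds): SubStatus {
  if (lag >= thresholds.critical) return "unavailable";
  if (lag >= thresholds.warn) return "degraded";
  return "ok";
}

// Combines maintenance mode and per-dependency sub-statuses into one result.
// Maintenance wins first; otherwise any "unavailable" fails readiness, while
// "degraded" keeps the instance in rotation.
function aggregateReadiness(
  maintenance: boolean,
  subStatuses: SubStatus[]
): ReadinessResult {
  if (maintenance) {
    return { httpStatus: 503, status: "maintenance" };
  }
  const anyUnavailable = subStatuses.some((s) => s === "unavailable");
  return anyUnavailable
    ? { httpStatus: 503, status: "not_ready" }
    : { httpStatus: 200, status: "ready" };
}

// Usage: warn-level lag with a healthy database and queue stays in rotation.
const assumedThresholds: LagThresholds = { warn: 10, critical: 50 };
const ingest = classifyLag(20, assumedThresholds);
const result = aggregateReadiness(false, ["ok", ingest, "ok"]);
console.log(ingest, result.httpStatus, result.status);
```

Keeping the decision in pure functions like these would let the edge-case table in `health.md` be checked without stubbing `statusService` or the queue, although the PR itself inlines the logic in the route handler.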