Merged
132 changes: 132 additions & 0 deletions backend/docs/health.md
@@ -0,0 +1,132 @@
# Health, Liveness, and Readiness Probes

The backend exposes two distinct kinds of health signal. They answer different
questions and orchestrators (Kubernetes, ECS, Nomad, …) act on them
differently. Conflating them — as a single flat `/health` that always returns
`ok` — causes traffic to be routed to instances that are up but unable to serve.

All probes are mounted at the **root** of the app (not under `/api/v1`) and are
**unauthenticated**, because orchestrators probe them without credentials.

| Endpoint | Kind | Cost | Checks dependencies | Healthy | Unhealthy |
| ---------- | --------- | ----- | ------------------- | ------- | --------- |
| `/health` | Liveness | cheap | no | `200` | — |
| `/livez` | Liveness | cheap | no | `200` | — |
| `/readyz` | Readiness | real | yes | `200` | `503` |

## Liveness — `/health`, `/livez`

> "Is the process up and able to serve an HTTP request at all?"

Liveness is cheap and **dependency-free**. It returns `200` whenever the event
loop can service a request. A failing liveness probe instructs the orchestrator
to **restart** the container, so it must never consult downstream dependencies —
a transient database blip should not trigger a restart loop.

`/health` is retained for backward compatibility; `/livez` is the conventional
alias. They are identical.

```json
{ "status": "ok", "timestamp": "2026-06-02T12:00:00.000Z" }
```

## Readiness — `/readyz`

> "Should this instance receive traffic right now?"

Readiness probes real dependencies. A failing readiness probe pulls the instance
out of the load-balancer rotation **without restarting it**, so it can recover
and rejoin once its dependencies are healthy again.

It returns:

- `200` with `status: "ready"` when the instance can serve traffic.
- `503` with `status: "not_ready"` when a hard dependency is unavailable.
- `503` with `status: "maintenance"` when maintenance mode is enabled.

```json
{
  "status": "ready",
  "database": "ok",
  "ingest": "ok",
  "webhookQueue": "ok",
  "timestamp": "2026-06-02T12:00:00.000Z"
}
```

### Sub-status semantics

Each dependency reports a coarse `SubStatus`, the same pattern used by
`/api/v1/admin/monitoring`:

- `ok` — healthy.
- `degraded` — serving but impaired. **Does not** fail readiness.
- `unavailable` — could not be reached / unusable. **Fails** readiness (`503`).

| Sub-status | Probe | `degraded` when | `unavailable` when |
| -------------- | ------------------------------ | ------------------------------------------ | ------------------------------------------- |
| `database` | `pingDatabase()` (`SELECT 1`) | — | connection cannot open or execute |
| `ingest` | `lagMonitor.getLagStatus()` | lag ≥ warn threshold, < critical threshold | lag ≥ critical threshold, or probe throws |
| `webhookQueue` | `webhookQueueService.getStats()` | queue is saturated (`size ≥ capacity`) | the queue's backing store is unreachable |

Ingest lag thresholds are governed by `LagMonitor` and configurable via
`LAG_WARN_THRESHOLD` / `LAG_CRITICAL_THRESHOLD`. See [reliability.md](./reliability.md).
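
The threshold rule in the `ingest` row can be sketched as a pure function. This is only an illustration: `classifyLag` and its parameters are hypothetical names, not part of `LagMonitor`, and the real comparison may differ in detail.

```typescript
type SubStatus = "ok" | "degraded" | "unavailable";

// Hypothetical helper: maps a lag measurement to a coarse sub-status.
// `warn` and `critical` stand in for LAG_WARN_THRESHOLD and
// LAG_CRITICAL_THRESHOLD; the unit is whatever LagMonitor measures lag in.
function classifyLag(lag: number, warn: number, critical: number): SubStatus {
  if (lag >= critical) return "unavailable"; // too stale to trust
  if (lag >= warn) return "degraded"; // impaired but still serviceable
  return "ok";
}
```

Checking the critical threshold first keeps the two ranges disjoint, so a lag exactly at the warn threshold is `degraded` and a lag exactly at the critical threshold is `unavailable`.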

The instance is **not ready** (`503`) if *any* sub-status is `unavailable`.
`degraded` sub-statuses are surfaced for observability but keep the instance in
rotation: a slightly stale index or a back-pressured queue is still serviceable.
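
As a minimal sketch of that aggregation rule (the `aggregate` function and `SubStatuses` interface are illustrative names, not identifiers from the codebase):

```typescript
type SubStatus = "ok" | "degraded" | "unavailable";
type ReadyStatus = "ready" | "not_ready";

interface SubStatuses {
  database: SubStatus;
  ingest: SubStatus;
  webhookQueue: SubStatus;
}

// Hypothetical helper: the instance is not ready if any sub-status is
// "unavailable"; "degraded" alone never fails readiness.
function aggregate(subs: SubStatuses): { status: ReadyStatus; httpCode: number } {
  const anyUnavailable = [subs.database, subs.ingest, subs.webhookQueue].some(
    (s) => s === "unavailable",
  );
  return anyUnavailable
    ? { status: "not_ready", httpCode: 503 }
    : { status: "ready", httpCode: 200 };
}
```

For example, `aggregate({ database: "ok", ingest: "degraded", webhookQueue: "degraded" })` yields `ready` with `200`, while a single `unavailable` entry yields `not_ready` with `503`.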

### Maintenance mode

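When `statusService.isMaintenanceEnabled()` is true, `/readyz` short-circuits
**before** probing any dependency and returns `503` with `status: "maintenance"`.
Because no probe runs, the `database`, `ingest`, and `webhookQueue` fields in
this response are hard-coded to `ok` and say nothing about actual dependency
health. The instance is intentionally not serving and should be pulled from
rotation. Liveness is unaffected: the process is healthy, just drained.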
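
The ordering matters: the maintenance check runs first, so dependency probes are never invoked while the instance is drained. A rough sketch, in which `readiness` and `runProbes` are hypothetical stand-ins for the handler and its dependency checks:

```typescript
type ReadyStatus = "ready" | "not_ready" | "maintenance";

// Hypothetical shape: `runProbes` stands in for the dependency checks and
// returns true when no dependency is unavailable.
function readiness(
  maintenance: boolean,
  runProbes: () => boolean,
): { status: ReadyStatus; httpCode: number } {
  if (maintenance) {
    // Short-circuit: return before any probe is invoked.
    return { status: "maintenance", httpCode: 503 };
  }
  return runProbes()
    ? { status: "ready", httpCode: 200 }
    : { status: "not_ready", httpCode: 503 };
}
```

Passing the probes as a callback makes the short-circuit easy to verify: under maintenance the callback is simply never called.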

## Edge-case behaviour

| Scenario | `/health`, `/livez` | `/readyz` |
| ---------------------------- | ------------------- | ----------------------------------------------- |
| All healthy | `200 ok` | `200 ready` |
| Database down | `200 ok` | `503 not_ready`, `database: unavailable` |
| Warn-level lag | `200 ok` | `200 ready`, `ingest: degraded` |
| Critical lag | `200 ok` | `503 not_ready`, `ingest: unavailable` |
| Lag probe throws | `200 ok` | `503 not_ready`, `ingest: unavailable` |
| Queue store unreachable | `200 ok` | `503 not_ready`, `webhookQueue: unavailable` |
| Queue saturated | `200 ok` | `200 ready`, `webhookQueue: degraded` |
| Maintenance mode | `200 ok` | `503 maintenance` |
| Partial failure (one dep) | `200 ok` | `503 not_ready` (failing dep `unavailable`, rest `ok`) |

## Security

These probes are unauthenticated, so their responses are deliberately minimal.
They expose only the coarse status enums above and a timestamp. They do **not**
leak:

- internal hostnames or connection strings,
- application or dependency versions,
- absolute ledger numbers or ingest-lag values,
- queue depths or capacities,
- underlying exception messages (dependency errors are caught and collapsed to
`unavailable`).
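
The last point, collapsing dependency errors, could look roughly like the following. `safeProbe` is a hypothetical wrapper used here for illustration; the route handler inlines its own `try`/`catch` blocks instead.

```typescript
type SubStatus = "ok" | "degraded" | "unavailable";

// Hypothetical wrapper: runs a probe and collapses any thrown error to
// "unavailable", so the error message never reaches the response body.
function safeProbe(probe: () => SubStatus): SubStatus {
  try {
    return probe();
  } catch {
    // The error is deliberately discarded; only the coarse enum is returned.
    return "unavailable";
  }
}
```

A probe that throws `new Error("ECONNREFUSED db.internal:5432")` therefore surfaces only as `unavailable`, with the hostname and port left out of the unauthenticated response.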

Richer, sensitive diagnostics (queue depths, invariant counters, cursor
positions, versions) remain behind API-key auth at
[`/api/v1/admin/monitoring`](./admin-monitoring.md).

## Orchestrator configuration (Kubernetes example)

```yaml
livenessProbe:
  httpGet:
    path: /livez
    port: 3000
  initialDelaySeconds: 5
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /readyz
    port: 3000
  initialDelaySeconds: 5
  periodSeconds: 5
```
12 changes: 4 additions & 8 deletions backend/src/app.ts
@@ -9,6 +9,7 @@ import { csrfMiddleware } from "./middleware/csrf";
import { corsOptionsDelegate, webhookCorsOptions } from "./config/cors";
import v1Routes from "./routes/v1";
import webhookRoutes from "./routes/webhooks";
+import healthRoutes from "./routes/health";
import { requestLogger } from "./middleware/request-logger";

const app = express();
@@ -47,14 +48,9 @@ app.use("/api/webhooks", cors(webhookCorsOptions), webhookRoutes);
app.use(csrfMiddleware);
app.use("/api/v1", v1Routes);

-// Health check (root level as well if needed)
-app.get("/health", (req, res) => {
-  res.json({
-    status: "ok",
-    version: "1.0.0",
-    timestamp: new Date().toISOString(),
-  });
-});
+// Liveness (/health, /livez) and readiness (/readyz) probes.
+// Mounted at the root and left unauthenticated so orchestrators can probe them.
+app.use(healthRoutes);

// 404 handler
app.use((req, res) => {
21 changes: 21 additions & 0 deletions backend/src/lib/database.ts
Expand Up @@ -77,6 +77,27 @@ export function getStatementCacheStats() {
};
}

/**
 * Probe database connectivity with a trivial round-trip query.
 *
 * Used by the readiness endpoint to verify the SQLite connection can both
 * open and execute. Returns true on success, false on any failure (a locked,
 * corrupt, or unopenable database). Never throws so callers can branch on the
 * boolean without their own try/catch.
 *
 * The query (`SELECT 1`) is constant and parameter-free, so it carries no
 * user input and leaks no schema details.
 */
export function pingDatabase(): boolean {
  try {
    const db = getDatabase();
    const row = db.prepare("SELECT 1 AS ok").get();
    return row?.ok === 1;
  } catch {
    return false;
  }
}
/**
* Close the database connection and clear the statement cache.
* Ensures clean shutdown and prevents memory leaks.
131 changes: 131 additions & 0 deletions backend/src/routes/health.ts
@@ -0,0 +1,131 @@
/**
 * Liveness and readiness probes.
 *
 * These endpoints are intentionally mounted at the root of the app (not under
 * /api/v1) and are unauthenticated, because container orchestrators (Kubernetes,
 * ECS, Nomad, …) probe them without credentials.
 *
 * Two distinct concerns:
 *
 *   GET /health, GET /livez: Liveness. "Is the process up and able to serve
 *     an HTTP request at all?" Cheap and dependency-free. A failing liveness
 *     probe tells the orchestrator to restart the container, so it must NOT
 *     consult downstream dependencies; a transient DB blip should not trigger
 *     a restart loop.
 *
 *   GET /readyz: Readiness. "Should this instance receive traffic right now?"
 *     Probes real dependencies (DB connectivity, ingest lag, webhook queue) and
 *     returns 503 when any hard dependency is unavailable or when maintenance
 *     mode is enabled. A failing readiness probe pulls the instance out of the
 *     load-balancer rotation without restarting it.
 *
 * Security: responses expose only coarse status enums per sub-system. They do
 * not leak internal hostnames, versions, queue depths, ledger numbers, or error
 * messages to unauthenticated callers. The richer, authenticated diagnostics
 * remain under /api/v1/admin/monitoring.
 */

import { Router, Request, Response } from "express";
import { pingDatabase } from "../lib/database";
import { statusService } from "../services/statusService";
import { lagMonitor } from "../services/lagMonitor";
import { webhookQueueService } from "../services/webhookQueueService";

const router = Router();

/**
 * Coarse per-dependency status, mirroring the SubStatus pattern used by
 * /api/v1/admin/monitoring. "degraded" means serving but impaired (does not
 * fail readiness); "unavailable" means the dependency could not be reached
 * (fails readiness).
 */
type SubStatus = "ok" | "degraded" | "unavailable";

type ReadyStatus = "ready" | "not_ready" | "maintenance";

// ---------------------------------------------------------------------------
// Liveness
// ---------------------------------------------------------------------------

function liveness(_req: Request, res: Response): void {
  res.json({
    status: "ok",
    timestamp: new Date().toISOString(),
  });
}

// Keep the historical /health path as a liveness check, and add the
// conventional /livez alias.
router.get("/health", liveness);
router.get("/livez", liveness);

// ---------------------------------------------------------------------------
// Readiness
// ---------------------------------------------------------------------------

router.get("/readyz", async (_req: Request, res: Response) => {
  // Maintenance mode short-circuits readiness: the instance is intentionally
  // not serving, so it should be pulled from rotation regardless of deps.
  if (statusService.isMaintenanceEnabled()) {
    res.status(503).json({
      status: "maintenance" as ReadyStatus,
      database: "ok" as SubStatus,
      ingest: "ok" as SubStatus,
      webhookQueue: "ok" as SubStatus,
      timestamp: new Date().toISOString(),
    });
    return;
  }

  // --- Database connectivity (hard dependency) ---------------------------
  let database: SubStatus = "ok";
  if (!pingDatabase()) {
    database = "unavailable";
  }

  // --- Ingest lag --------------------------------------------------------
  // Reuse the LagMonitor degradation logic. "warn" lag is degraded but still
  // serviceable; "critical" lag means the indexed view is too stale to trust,
  // so we treat it as unavailable for readiness.
  let ingest: SubStatus = "ok";
  try {
    const lag = await lagMonitor.getLagStatus();
    if (lag.isCritical) {
      ingest = "unavailable";
    } else if (lag.isDegraded) {
      ingest = "degraded";
    }
  } catch {
    ingest = "unavailable";
  }

  // --- Webhook queue health (hard dependency on its backing store) -------
  // A throw here means the queue's store is unreachable. Saturation (queue at
  // capacity) is back-pressure, not unreadiness, so it is reported as degraded.
  let webhookQueue: SubStatus = "ok";
  try {
    const stats = webhookQueueService.getStats();
    if (stats.capacity > 0 && stats.size >= stats.capacity) {
      webhookQueue = "degraded";
    }
  } catch {
    webhookQueue = "unavailable";
  }

  const unavailable =
    database === "unavailable" ||
    ingest === "unavailable" ||
    webhookQueue === "unavailable";

  const status: ReadyStatus = unavailable ? "not_ready" : "ready";

  res.status(unavailable ? 503 : 200).json({
    status,
    database,
    ingest,
    webhookQueue,
    timestamp: new Date().toISOString(),
  });
});

export default router;