132 changes: 132 additions & 0 deletions backend/docs/health.md
@@ -0,0 +1,132 @@
# Health, Liveness, and Readiness Probes

The backend exposes two distinct kinds of health signal. They answer different
questions and orchestrators (Kubernetes, ECS, Nomad, …) act on them
differently. Conflating them — as a single flat `/health` that always returns
`ok` — causes traffic to be routed to instances that are up but unable to serve.

All probes are mounted at the **root** of the app (not under `/api/v1`) and are
**unauthenticated**, because orchestrators probe them without credentials.

| Endpoint | Kind | Cost | Checks dependencies | Healthy | Unhealthy |
| ---------- | --------- | ----- | ------------------- | ------- | --------- |
| `/health` | Liveness | cheap | no | `200` | — |
| `/livez` | Liveness | cheap | no | `200` | — |
| `/readyz` | Readiness | real | yes | `200` | `503` |

## Liveness — `/health`, `/livez`

> "Is the process up and able to serve an HTTP request at all?"

Liveness is cheap and **dependency-free**. It returns `200` whenever the event
loop can service a request. A failing liveness probe instructs the orchestrator
to **restart** the container, so it must never consult downstream dependencies —
a transient database blip should not trigger a restart loop.

`/health` is retained for backward compatibility; `/livez` is the conventional
alias. They are identical.

```json
{ "status": "ok", "timestamp": "2026-06-02T12:00:00.000Z" }
```

## Readiness — `/readyz`

> "Should this instance receive traffic right now?"

Readiness probes real dependencies. A failing readiness probe pulls the instance
out of the load-balancer rotation **without restarting it**, so it can recover
and rejoin once its dependencies are healthy again.

It returns:

- `200` with `status: "ready"` when the instance can serve traffic.
- `503` with `status: "not_ready"` when a hard dependency is unavailable.
- `503` with `status: "maintenance"` when maintenance mode is enabled.

```json
{
  "status": "ready",
  "database": "ok",
  "ingest": "ok",
  "webhookQueue": "ok",
  "timestamp": "2026-06-02T12:00:00.000Z"
}
```

### Sub-status semantics

Each dependency reports a coarse `SubStatus`, the same pattern used by
`/api/v1/admin/monitoring`:

- `ok` — healthy.
- `degraded` — serving but impaired. **Does not** fail readiness.
- `unavailable` — could not be reached / unusable. **Fails** readiness (`503`).

| Sub-status | Probe | `degraded` when | `unavailable` when |
| -------------- | ------------------------------ | ------------------------------------------ | ------------------------------------------- |
| `database` | `pingDatabase()` (`SELECT 1`) | — | connection cannot open or execute |
| `ingest` | `lagMonitor.getLagStatus()` | lag ≥ warn threshold, < critical threshold | lag ≥ critical threshold, or probe throws |
| `webhookQueue` | `webhookQueueService.getStats()` | queue is saturated (`size ≥ capacity`) | the queue's backing store is unreachable |

Ingest lag thresholds are governed by `LagMonitor` and configurable via
`LAG_WARN_THRESHOLD` / `LAG_CRITICAL_THRESHOLD`. See [reliability.md](./reliability.md).

The instance is **not ready** (`503`) if *any* sub-status is `unavailable`.
`degraded` sub-statuses are surfaced for observability but keep the instance in
rotation: a slightly stale index or a back-pressured queue is still serviceable.
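
The aggregation rule above can be sketched as a small pure function. This is a minimal illustration only: `aggregateReadiness` and `ReadinessResult` are hypothetical names, not identifiers from the route code, and the real handler may be structured differently.

```ts
// Hypothetical sketch: SubStatus mirrors the enum described above; the
// function name and result shape are assumptions made for illustration.
type SubStatus = "ok" | "degraded" | "unavailable";

interface ReadinessResult {
  httpStatus: 200 | 503;
  status: "ready" | "not_ready";
}

function aggregateReadiness(
  subStatuses: Record<string, SubStatus>,
): ReadinessResult {
  // Only `unavailable` fails readiness; `degraded` is reported but tolerated.
  const anyUnavailable = Object.keys(subStatuses).some(
    (name) => subStatuses[name] === "unavailable",
  );
  return anyUnavailable
    ? { httpStatus: 503, status: "not_ready" }
    : { httpStatus: 200, status: "ready" };
}
```

With this shape, `{ database: "ok", ingest: "degraded", webhookQueue: "ok" }` stays in rotation, while a single `unavailable` entry takes the instance out.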

### Maintenance mode

When `statusService.isMaintenanceEnabled()` is true, `/readyz` short-circuits
**before** probing any dependency and returns `503` with `status: "maintenance"`.
The instance is intentionally not serving and should be pulled from rotation.
Liveness is unaffected — the process is healthy, just drained.
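
One way to picture the short-circuit is to pass the dependency probes as lazy functions, so that none of them runs when maintenance mode is on. The sketch below is an assumption-laden illustration: `evaluateReadyz`, `Probe`, and `ReadyzBody` are hypothetical names, the probes are synchronous for brevity, and the `timestamp` field is omitted.

```ts
// Hypothetical sketch of the maintenance short-circuit; not the actual handler.
type SubStatus = "ok" | "degraded" | "unavailable";
type Probe = () => SubStatus;

interface ReadyzBody {
  status: "ready" | "not_ready" | "maintenance";
  [dependency: string]: string;
}

function evaluateReadyz(
  maintenanceEnabled: boolean,
  probes: Record<string, Probe>,
): { httpStatus: number; body: ReadyzBody } {
  // Maintenance wins: return before any dependency probe is invoked.
  if (maintenanceEnabled) {
    return { httpStatus: 503, body: { status: "maintenance" } };
  }
  const body: ReadyzBody = { status: "ready" };
  for (const name of Object.keys(probes)) {
    const result = probes[name]();
    body[name] = result;
    if (result === "unavailable") body.status = "not_ready";
  }
  return { httpStatus: body.status === "ready" ? 200 : 503, body };
}
```

Because the check happens first, a drained instance never adds probe load to a dependency that may itself be under maintenance.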

## Edge-case behaviour

| Scenario | `/health`, `/livez` | `/readyz` |
| ---------------------------- | ------------------- | ----------------------------------------------- |
| All healthy | `200 ok` | `200 ready` |
| Database down | `200 ok` | `503 not_ready`, `database: unavailable` |
| Warn-level lag | `200 ok` | `200 ready`, `ingest: degraded` |
| Critical lag | `200 ok` | `503 not_ready`, `ingest: unavailable` |
| Lag probe throws | `200 ok` | `503 not_ready`, `ingest: unavailable` |
| Queue store unreachable | `200 ok` | `503 not_ready`, `webhookQueue: unavailable` |
| Queue saturated | `200 ok` | `200 ready`, `webhookQueue: degraded` |
| Maintenance mode | `200 ok` | `503 maintenance` |
| Partial failure (one dep) | `200 ok` | `503 not_ready` (failing dep `unavailable`, rest `ok`) |

## Security

These probes are unauthenticated, so their responses are deliberately minimal.
They expose only the coarse status enums above and a timestamp. They do **not**
leak:

- internal hostnames or connection strings,
- application or dependency versions,
- absolute ledger numbers or ingest-lag values,
- queue depths or capacities,
- underlying exception messages (dependency errors are caught and collapsed to
`unavailable`).
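
The last point, collapsing dependency errors to `unavailable`, can be illustrated with a small wrapper. This is a sketch under assumptions: `safeProbe` is a hypothetical helper name and the probe is synchronous for brevity; the actual route may catch errors differently.

```ts
// Hypothetical helper: runs a probe and hides any thrown error from the caller.
type SubStatus = "ok" | "degraded" | "unavailable";

function safeProbe(probe: () => SubStatus): SubStatus {
  try {
    return probe();
  } catch {
    // The error is deliberately discarded: its message could contain
    // hostnames, connection strings, or other internals.
    return "unavailable";
  }
}
```

The response then carries only the enum value, whatever the underlying exception said.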

Richer, sensitive diagnostics (queue depths, invariant counters, cursor
positions, versions) remain behind API-key auth at
[`/api/v1/admin/monitoring`](./admin-monitoring.md).

## Orchestrator configuration (Kubernetes example)

```yaml
livenessProbe:
  httpGet:
    path: /livez
    port: 3000
  initialDelaySeconds: 5
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /readyz
    port: 3000
  initialDelaySeconds: 5
  periodSeconds: 5
```
177 changes: 177 additions & 0 deletions backend/docs/observability.md
@@ -0,0 +1,177 @@
# Observability — Ingest Lag Alerting

This document describes how the QuickLendX backend turns indexer-lag threshold
breaches into **alerts** and how **degraded mode auto-recovery** works. It
complements [reliability.md](./reliability.md) (which covers how degraded mode
gates writes) and [logging.md](./logging.md) (log redaction policy).

The alerting logic lives in
[`src/services/lagMonitor.ts`](../src/services/lagMonitor.ts).

---

## Overview

`LagMonitor` computes indexer lag (in ledgers) as
`current_ledger - last_indexed_ledger`. Two thresholds classify the lag into a
**level**:

| Level      | Condition                                  | Effect                                   |
| ---------- | ------------------------------------------ | ---------------------------------------- |
| `none`     | `lag < warnThreshold`                      | Healthy. All endpoints available.        |
| `warn`     | `warnThreshold <= lag < criticalThreshold` | Degraded. Write endpoints gated (503).   |
| `critical` | `lag >= criticalThreshold`                 | Critically degraded. All writes blocked. |

The level is consumed by:

- **`GET /api/v1/status`** — surfaces the current level to clients.
- **`degradedGuard`** middleware — gates write/sensitive endpoints.

Prior to this feature, threshold breaches were silent (no operator signal) and
recovery was implicit (a single good reading immediately re-opened the write
guard, allowing it to flap). This feature adds **alerts on transitions** and
**hysteresis-backed auto-recovery**.

---

## Thresholds & configuration

All four parameters are configurable via environment variables. Defaults are
chosen for a ~5s ledger cadence.

| Env var | Default | Meaning |
| ------------------------- | ------- | -------------------------------------------------------------------- |
| `LAG_WARN_THRESHOLD` | `10` | Lag (ledgers) at which the system becomes degraded (`warn`). |
| `LAG_CRITICAL_THRESHOLD` | `50` | Lag (ledgers) at which the system becomes critically degraded. |
| `LAG_HYSTERESIS_MARGIN` | `3` | Ledgers **below** a threshold the lag must fall to before recovering.|
| `LAG_RECOVERY_POLLS` | `3` | Consecutive recovered polls required before a degraded level clears. |

Non-numeric or empty values fall back to the defaults. `recoveryPolls` is
clamped to `>= 1` and `hysteresisMargin` to `>= 0`.
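
The fallback and clamping rules can be sketched as follows. The environment variable names and defaults come from the table above; `readNumber`, `loadLagConfig`, and the use of `Number()` for parsing are assumptions for illustration, and the real parser may differ (for example in how it treats fractional values).

```ts
// Hypothetical sketch of env parsing with fallback and clamping.
type Env = Record<string, string | undefined>;

function readNumber(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  // Unset or empty values fall back to the default.
  if (raw === undefined || raw.trim() === "") return fallback;
  const parsed = Number(raw);
  // Non-numeric values (NaN) and infinities also fall back.
  return isFinite(parsed) ? parsed : fallback;
}

function loadLagConfig(env: Env) {
  return {
    warnThreshold: readNumber(env, "LAG_WARN_THRESHOLD", 10),
    criticalThreshold: readNumber(env, "LAG_CRITICAL_THRESHOLD", 50),
    // Clamp so the margin is never negative and at least one poll is required.
    hysteresisMargin: Math.max(0, readNumber(env, "LAG_HYSTERESIS_MARGIN", 3)),
    recoveryPolls: Math.max(1, readNumber(env, "LAG_RECOVERY_POLLS", 3)),
  };
}
```

In a Node process this would be called as `loadLagConfig(process.env)`; taking the environment as a parameter keeps the function easy to test.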

Thresholds can also be set at runtime in tests/bootstrap via
`setThresholds(warn, critical)` and `setHysteresis(margin, polls)`.

---

## Hysteresis & auto-recovery

To stop the monitor flapping when lag hovers around a threshold, the monitor
tracks an **effective level** separately from the **instantaneous level**
computed from the raw lag:

- **Escalation is immediate.** As soon as the raw lag reaches a higher level
(e.g. `lag >= criticalThreshold`), the effective level jumps there. This is
the fail-safe direction — a breach gates writes without delay.
- **De-escalation is sustained.** To clear a level, the raw lag must fall to
the **recovery threshold** (`threshold - hysteresisMargin`) and stay there
for `recoveryPolls` consecutive polls. A single breach anywhere in that
window resets the streak. Recovery steps down **one level at a time**
(`critical → warn → none`) so the `warn` write-guard window is never skipped.

Recovery thresholds with the defaults:

- Recover out of `critical` → `warn` when `lag <= 50 - 3 = 47` for 3 polls.
- Recover out of `warn` → `none` when `lag <= 10 - 3 = 7` for 3 polls.
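
The escalation and de-escalation rules above can be sketched as a small state machine. This is an illustrative reconstruction from the rules in this section, not the code in `lagMonitor.ts`: `HysteresisSketch`, `classify`, and the field names are hypothetical.

```ts
// Hypothetical sketch of the hysteresis rules; not the actual LagMonitor.
type Level = "none" | "warn" | "critical";

interface HysteresisConfig {
  warnThreshold: number;
  criticalThreshold: number;
  hysteresisMargin: number;
  recoveryPolls: number;
}

const RANK: Record<Level, number> = { none: 0, warn: 1, critical: 2 };

// Instantaneous level computed from the raw lag alone.
function classify(lag: number, cfg: HysteresisConfig): Level {
  if (lag >= cfg.criticalThreshold) return "critical";
  if (lag >= cfg.warnThreshold) return "warn";
  return "none";
}

class HysteresisSketch {
  level: Level = "none"; // effective level
  private streak = 0; // consecutive polls inside the recovery range
  private cfg: HysteresisConfig;

  constructor(cfg: HysteresisConfig) {
    this.cfg = cfg;
  }

  poll(lag: number): Level {
    const instantaneous = classify(lag, this.cfg);
    if (RANK[instantaneous] > RANK[this.level]) {
      // Escalation is immediate and discards any recovery progress.
      this.level = instantaneous;
      this.streak = 0;
      return this.level;
    }
    if (this.level === "none") return this.level;

    const threshold =
      this.level === "critical"
        ? this.cfg.criticalThreshold
        : this.cfg.warnThreshold;
    if (lag <= threshold - this.cfg.hysteresisMargin) {
      this.streak += 1;
      if (this.streak >= this.cfg.recoveryPolls) {
        // Step down exactly one level, then start a fresh streak.
        this.level = this.level === "critical" ? "warn" : "none";
        this.streak = 0;
      }
    } else {
      // A reading outside the recovery range resets the streak.
      this.streak = 0;
    }
    return this.level;
  }
}
```

With the defaults, a monitor at `warn` that sees lags of 7, 7, 9 stays at `warn` (the 9 resets the streak) and needs three further readings at or below 7 before it clears to `none`.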

`getLagStatus()` (used by `/status` and `degradedGuard`) reports the effective
level. It **escalates immediately** when called but **never auto-clears** — only
the scheduled `poll()` path performs de-escalation. This means the many guard
and status calls per interval can raise the level but can never lower it.

---

## Alert events

Alerts are emitted **only on transitions** of the effective level — never on
every poll. Each transition:

1. Logs a single structured JSON line (`type: "LAG_ALERT"`), at `WARN` for
escalations and `INFO` for recoveries.
2. Increments in-process counters (see [Metrics](#metrics)).
3. Notifies any subscribers registered via `onAlert(listener)`.

### Alert payload

```jsonc
{
  "from": "warn",              // level moved away from
  "to": "critical",            // level moved to
  "direction": "escalation",   // or "recovery"
  "lag": 62,                   // raw lag at transition (ledgers)
  "warnThreshold": 10,
  "criticalThreshold": 50,
  "at": "2026-06-02T12:00:00.000Z"
}
```

> **Security:** Alert payloads carry **only operational fields** — lag,
> thresholds, level, timestamp. They never include request bodies, wallet
> data, auth tokens, or any other secrets. The logged line uses the same
> fixed shape, so no caller-supplied data can leak into log sinks.

### Subscribing

```ts
import { lagMonitor } from "../services/lagMonitor";

const unsubscribe = lagMonitor.onAlert((event) => {
  // forward to PagerDuty / Slack / metrics exporter, etc.
});
```

A throwing subscriber is isolated and never breaks the monitor.
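
Subscriber isolation of this kind is usually achieved by wrapping each listener call in its own `try`/`catch`. The sketch below shows the idea with hypothetical names (`AlertEmitterSketch`, `LagAlert`, `emit`) and a trimmed payload; it is not the monitor's actual implementation.

```ts
// Hypothetical sketch of listener isolation; names are illustrative only.
interface LagAlert {
  from: string;
  to: string;
  direction: "escalation" | "recovery";
  lag: number;
}
type AlertListener = (event: LagAlert) => void;

class AlertEmitterSketch {
  private listeners: AlertListener[] = [];

  // Registers a listener and returns a function that removes it again.
  onAlert(listener: AlertListener): () => void {
    this.listeners.push(listener);
    return () => {
      this.listeners = this.listeners.filter((l) => l !== listener);
    };
  }

  emit(event: LagAlert): void {
    // Iterate over a copy so unsubscribing during delivery is safe.
    for (const listener of this.listeners.slice()) {
      try {
        listener(event);
      } catch {
        // One failing subscriber must not stop delivery to the others.
      }
    }
  }
}
```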

---

## Polling

`poll()` reads a fresh lag value, advances the hysteresis state machine, and
emits any resulting transition alert. Schedule it on a fixed cadence (mirroring
the invariant scheduler pattern):

```ts
import { lagMonitor } from "../services/lagMonitor";

setInterval(() => {
  void lagMonitor.poll();
}, 5000);
```

Do **not** call `poll()` per request. Request paths should call
`getLagStatus()`, which can escalate the effective level but never advances
recovery.

### Missing / corrupt current-ledger reads

If the current-ledger read throws, or yields a non-finite or negative lag, the
monitor **fails safe to `critical`** (it returns `lag = criticalThreshold`).
An unknown reading must never silently clear a degraded state or open the write
guard.
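
A minimal sketch of this fail-safe rule, assuming a synchronous ledger read for brevity. `computeLagFailSafe` and its parameters are hypothetical names, not the monitor's API.

```ts
// Hypothetical sketch: any unreadable or implausible lag is treated as critical.
function computeLagFailSafe(
  readCurrentLedger: () => number,
  lastIndexedLedger: number,
  criticalThreshold: number,
): number {
  try {
    const lag = readCurrentLedger() - lastIndexedLedger;
    // NaN, Infinity, and negative values are all treated as unknown readings.
    if (!isFinite(lag) || lag < 0) return criticalThreshold;
    return lag;
  } catch {
    // A failed read must not look like a healthy one.
    return criticalThreshold;
  }
}
```

Returning exactly `criticalThreshold` means the ordinary `lag >= criticalThreshold` classification yields `critical` without a separate error path.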

---

## Metrics

`getAlertMetrics()` returns a defensive copy of the in-process counters,
suitable for exposure via the monitoring endpoint or a scraper:

| Field | Meaning |
| -------------------------- | -------------------------------------------------------- |
| `escalations` | Total escalation transitions observed. |
| `recoveries` | Total recovery transitions observed. |
| `transitionsTo` | Transition count by destination level (`none`/`warn`/`critical`). |
| `currentLevel` | Current effective level. |
| `consecutiveRecoveryPolls` | Consecutive polls the lag has been within recovery range. |

---

## Edge cases (covered by tests)

See [`src/tests/lagMonitor.alerts.test.ts`](../src/tests/lagMonitor.alerts.test.ts):

- **Flapping** around a threshold produces no spurious transitions; a single
good poll never clears a degraded state.
- **Sustained breach** holds the level with no duplicate alerts.
- **Rapid recovery** still drains one level per recovery window, preserving the
`warn` guard window.
- **Missing current-ledger read** fails safe to `critical`.
12 changes: 4 additions & 8 deletions backend/src/app.ts
@@ -9,6 +9,7 @@ import { csrfMiddleware } from "./middleware/csrf";
import { corsOptionsDelegate, webhookCorsOptions } from "./config/cors";
import v1Routes from "./routes/v1";
import webhookRoutes from "./routes/webhooks";
+import healthRoutes from "./routes/health";
import { requestLogger } from "./middleware/request-logger";

const app = express();
@@ -47,14 +48,9 @@ app.use("/api/webhooks", cors(webhookCorsOptions), webhookRoutes);
app.use(csrfMiddleware);
app.use("/api/v1", v1Routes);

-// Health check (root level as well if needed)
-app.get("/health", (req, res) => {
-  res.json({
-    status: "ok",
-    version: "1.0.0",
-    timestamp: new Date().toISOString(),
-  });
-});
+// Liveness (/health, /livez) and readiness (/readyz) probes.
+// Mounted at the root and left unauthenticated so orchestrators can probe them.
+app.use(healthRoutes);

// 404 handler
app.use((req, res) => {
21 changes: 21 additions & 0 deletions backend/src/lib/database.ts
@@ -77,6 +77,27 @@ export function getStatementCacheStats() {
};
}

+/**
+ * Probe database connectivity with a trivial round-trip query.
+ *
+ * Used by the readiness endpoint to verify the SQLite connection can both
+ * open and execute. Returns true on success, false on any failure (a locked,
+ * corrupt, or unopenable database). Never throws so callers can branch on the
+ * boolean without their own try/catch.
+ *
+ * The query (`SELECT 1`) is constant and parameter-free, so it carries no
+ * user input and leaks no schema details.
+ */
+export function pingDatabase(): boolean {
+  try {
+    const db = getDatabase();
+    const row = db.prepare("SELECT 1 AS ok").get();
+    return row?.ok === 1;
+  } catch {
+    return false;
+  }
+}

/**
* Close the database connection and clear the statement cache.
* Ensures clean shutdown and prevents memory leaks.