feat: split liveness and readiness probes with dependency checks #1289
Merged
The app exposed a single flat /health that always returned status: ok, with no distinction between liveness (process up) and readiness (dependencies healthy). Orchestrators could therefore route traffic to instances that were up but unable to serve.

Split the signal into two probes, mounted at the root and unauthenticated so orchestrators can reach them:

- /health, /livez: cheap, dependency-free liveness check (always 200).
- /readyz: readiness check that probes DB connectivity (pingDatabase), ingest lag (lagMonitor), and webhook queue health, honours maintenance mode, and returns 503 when not ready. Reuses the SubStatus / degradation pattern from the monitoring route: "degraded" stays in rotation, "unavailable" fails readiness.

Add a pingDatabase() helper (SELECT 1 round-trip), readiness.test.ts covering DB-down, critical lag, maintenance mode, partial failure and queue-saturation edge cases plus information-leak checks, and docs/health.md documenting probe semantics.

Probe responses expose only coarse status enums: no hostnames, versions, ledger numbers, or error messages.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
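The readiness rules in this description (maintenance fails readiness, any "unavailable" dependency fails readiness, "degraded" stays in rotation) can be sketched as a pure function. This is a minimal illustration, not the code in health.ts: the function name evaluateReadiness, the body status strings "ready" and "not_ready", the check keys, and the choice to echo per-dependency sub-statuses are all assumptions.

```typescript
// Sketch only. Everything except the SubStatus values and the 200/503 mapping
// is a hypothetical name chosen for illustration.
type SubStatus = "ok" | "degraded" | "unavailable";

interface ReadinessResult {
  httpStatus: 200 | 503;
  body: {
    status: "ready" | "degraded" | "not_ready" | "maintenance";
    checks?: Record<string, SubStatus>;
  };
}

function evaluateReadiness(
  checks: Record<string, SubStatus>,
  maintenance: boolean,
): ReadinessResult {
  // Maintenance mode wins over everything else and reports no dependency detail.
  if (maintenance) {
    return { httpStatus: 503, body: { status: "maintenance" } };
  }
  const names = Object.keys(checks);
  // A single unavailable dependency takes the instance out of rotation.
  if (names.some((name) => checks[name] === "unavailable")) {
    return { httpStatus: 503, body: { status: "not_ready", checks } };
  }
  // Degraded dependencies are reported but still answer 200.
  if (names.some((name) => checks[name] === "degraded")) {
    return { httpStatus: 200, body: { status: "degraded", checks } };
  }
  return { httpStatus: 200, body: { status: "ready", checks } };
}

// Example: critical ingest lag is surfaced as "degraded" but keeps the instance in rotation.
const example = evaluateReadiness(
  { database: "ok", ingestLag: "degraded", webhookQueue: "ok" },
  false,
);
console.log(example.httpStatus, example.body.status);
```

Keeping the decision in a pure function like this makes the partial-failure cases easy to test without an HTTP server, and the body carries only enum values, which matches the stated goal of not leaking hostnames or error messages.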
@mikkyvans0-source Great news! 🎉 Based on an automated assessment of this PR, the linked Wave issue(s) no longer count against your application limits. You can now already apply to more issues while waiting for a review of this PR. Keep up the great work! 🚀
…dency-probes' of https://github.com/mikkyvans0-source/quicklendx-protocol into Backend--Add-health/readiness/liveness-split-with-dependency-probes
What I did
Split the flat /health into proper liveness and readiness probes, mounted at the root of the app and left unauthenticated so orchestrators can reach them.
Closes #1065
Files changed (commit def6bce):
backend/src/routes/health.ts (new) — the probe router:
GET /health, GET /livez — cheap, dependency-free liveness (always 200).
GET /readyz — readiness: probes DB connectivity, ingest lag, and webhook queue, honours maintenance mode, returns 503 when not ready. Reuses the SubStatus (ok/degraded/unavailable) pattern from monitoring.ts: degraded stays in rotation, unavailable fails readiness. Maintenance mode short-circuits to 503 maintenance before probing deps.
backend/src/lib/database.ts — added pingDatabase() (a SELECT 1 round-trip that never throws) for the DB connectivity check.
backend/src/app.ts — replaced the inline flat /health handler with the new router (and dropped the leaked version field).
backend/src/tests/readiness.test.ts (new) — 18 tests: happy path, DB-down, critical lag, lag-probe-throws, maintenance, partial failure, queue saturation, queue-store-unreachable, and information-leak checks. All 18 pass; health.ts is at 100% coverage.
backend/docs/health.md (new) — documents probe semantics, sub-status table, edge cases, security notes, and a Kubernetes config example.
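The "never throws" behaviour of pingDatabase() and the lag-probe-throws test case both come down to the same idea: a probe failure becomes a status value, not an exception. A minimal sketch follows, assuming a hypothetical Queryable client interface, a hypothetical safeProbe wrapper, and a timeout that the PR description does not mention; the real helper in backend/src/lib/database.ts may look different.

```typescript
// Sketch only: Queryable, safeProbe and timeoutMs are hypothetical and not taken from the PR.
type SubStatus = "ok" | "degraded" | "unavailable";

interface Queryable {
  query(sql: string): Promise<unknown>;
}

// Resolves to true or false and never rejects, so the probe handler cannot crash
// when the database is down. Error details are dropped on purpose.
async function pingDatabase(db: Queryable, timeoutMs = 1000): Promise<boolean> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => reject(new Error("ping timed out")), timeoutMs);
  });
  try {
    // Whichever settles first decides the outcome: the SELECT 1 round-trip or the timer.
    await Promise.race([db.query("SELECT 1"), timeout]);
    return true;
  } catch {
    return false;
  } finally {
    clearTimeout(timer);
  }
}

// Wraps any dependency probe so that a thrown error is reported as "unavailable".
async function safeProbe(probe: () => Promise<SubStatus>): Promise<SubStatus> {
  try {
    return await probe();
  } catch {
    return "unavailable";
  }
}
```

With helpers shaped like these, a readiness handler can await every probe without its own try/catch, and a hung database connection is bounded by the timeout instead of stalling the orchestrator's probe request.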
Things worth flagging
Pre-existing breakage on the base branch (not mine, left untouched): routes/v1/index.ts imports FileSystemRawEventStore which was renamed to FileRawEventStore; controllers/v1/invoices.ts has missing imports; several migration version-mismatch failures; webhookQueue.persist.test.ts fails identically on the clean base. Because of these, any test importing the full app.ts won't compile, so I mounted the health router in isolation in my test (verifying the same wiring app.ts uses). The full suite shows 31 pre-existing failures; my suite is green and none of the failures touch files I changed. I did not fix these since you asked me to stay on task.
The two test-rbac-*.db files were SQLite artifacts the RBAC test wrote to backend/.data/ during the full run. I deleted them per your choice; they're regenerated each run. Consider gitignoring backend/.data/*.db separately.
Docs went in backend/docs/health.md (where all other docs live) rather than a top-level docs/health.md.