Skip to content

fix(backend): retry duckdb health at boot + warn on unset DUCKDB_SERVICE_URL#193

Open
schutera wants to merge 1 commit into
mainfrom
fix/backend-duckdb-bootstrap-retry
Open

fix(backend): retry duckdb health at boot + warn on unset DUCKDB_SERVICE_URL#193
schutera wants to merge 1 commit into
mainfrom
fix/backend-duckdb-bootstrap-retry

Conversation

@schutera

Copy link
Copy Markdown
Owner

Symptom

The admin server-logs panel shows, seemingly as a live outage:

⚠ DuckDB service not reachable: TypeError: fetch failed … ECONNREFUSED 127.0.0.1:8000
    at async duckdbHealth (…/duckdbClient.js)
    at async bootstrap (…/server.js)

…even though duckdb-service is up and /health returns 200.

Root cause — a startup race, not an outage

bootstrap() in server.ts probes duckdbHealth() once, immediately at boot. The API and duckdb-service are started together (pm2/compose), so the probe fires ~1s before duckdb finishes binding its port → ECONNREFUSED. It's caught and non-fatal (the API serves anyway), but:

Verified on prod: highfive-api and duckdb-service last started 2026-06-26 23:00:24 / :25 UTC — 1 second apart; /health is 200 now, service has not restarted since.

Changes

  1. Retry the boot health probe (server.ts) — bootstrap() now retries duckdbHealth() up to 10 × 500ms before logging "not reachable". It still starts serving regardless (the check is advisory). A normal boot resolves on attempt 1–2; a genuinely-down service still surfaces the warning after ~5s.
  2. Warn when DUCKDB_SERVICE_URL is unset (duckdbClient.tsserver.ts) — mirrors the existing PORT-unset warning. Makes a real misconfiguration loud instead of silently pointing at the wrong port. Also trims / treats blank as unset.

Deliberately NOT changed

  • The http://127.0.0.1:8002 default is kept8002 is the documented Docker host-port mapping (duckdb-service is 8002:8000), correct for a backend running on the host against a composed duckdb. A bare-metal/pm2 box (duckdb directly on :8000) sets DUCKDB_SERVICE_URL explicitly, which ecosystem.config.js already does. Changing the default to 8000 would break the compose convention; the new startup warning addresses the silent-fallback footgun instead.

Verification

  • tsc --noEmit is clean for both changed files. (Other files in the tree report errors only from a stale @highfive/contracts build and a missing rotating-file-stream dep in this host's node_modules — pre-existing environment drift, unrelated to this change; CI installs/builds both.)

🤖 Generated with Claude Code

@cofade

cofade commented Jun 28, 2026

Copy link
Copy Markdown
Collaborator

@schutera — senior-review + full local test sweep of this PR. Verdict: mergeable, but one real availability regression and a stale doc the PR touches should be fixed first. The core fix is sound and small; the .trim() + || rewrite is a genuine correctness improvement. Details below.

Test results — every layer I can run locally is green

Layer Result Notes
Backend vitest 216/216 Parallel run hit a Windows tinypool "Worker exited unexpectedly" flake; single-threaded (--no-file-parallelism --pool=forks) is clean. CI green too — not a real failure.
Backend tsc build + --noEmit clean The PR body's "stale contracts / missing rotating-file-stream" caveat does not reproduce on a fresh checkout here — contracts built, deps present.
Homepage vitest ✅ 191/191 regression baseline
duckdb-service pytest ✅ 232/232
image-service pytest ✅ 97/97
ESP32-CAM native ✅ 291/291
All 9 CI checks ✅ pass

Findings (ranked, independently verified against the files — not just the diff)

P1 — should fix before merge

1. app.listen is delayed ~4.5 s when duckdb is genuinely down — a real availability regression. In server.ts the retry loop runs before app.listen. I reproduced it by running the built backend against a refused duckdb port:

21:56:31.773  WARN  ⚠ DuckDB service not reachable after 10 attempts (http://127.0.0.1:9)
21:56:31.777  INFO  🐝 HighFive Backend API listening on port 3999   ← 4ms AFTER the loop exhausts

So /api/health (and every route) connection-refuses for the whole ~4.5 s loop when duckdb is down. The old single probe bound the port in milliseconds. Since the probe is explicitly advisory ("we start serving regardless"), the fix is to app.listen first, then run the probe loop in the background — that erases this stall and finding #2's worst case.

2. duckdbHealth() has no fetch timeout. duckdbClient.ts does a bare fetch, while the other three DUCKDB_URL fetches in app.ts all use AbortSignal.timeout(15000). Against a duckdb host that accepts TCP but never answers (hung, not refused), await duckdbHealth() blocks indefinitely → the loop never advances → app.listen is never called and the backend never comes up. Pre-existing, but the PR's whole premise is "duckdb might not be ready," so it's the moment to add signal: AbortSignal.timeout(...).

3. Stale doc the PR touches. docs/05-building-block-view/backend.md (Operational notes) claims the backend "retries duckdb-service on startup with exponential backoff … it does not block." After this PR it's wrong on both counts: the delay is a constant 500 ms (no backoff), and per #1 it now does block. CLAUDE.md's mandatory doc-update table makes this the author's responsibility. Suggest: "10 fixed-interval (500 ms) retries, non-blocking" — and make "non-blocking" actually true via #1.

4. Zero tests for the new logic. The analogous PORT work was deliberately extracted into a pure helper (port.tsresolvePort(envValue)) with tests/port-default.test.ts specifically so it's testable without triggering bootstrap(). This PR instead computes duckdbUrlFromDefault as a module-load-time constant (untestable without env juggling) and adds nothing. Extracting resolveDuckdbUrl(envValue) → {url, fromDefault} turns the unset/blank/set cases into three one-line assertions and pins the one genuine bug fix here.

P2 — nits / defer

5. health.ok is never checked in duckdbClient.ts — a 200 with {"ok":false} still logs "reachable". Pre-existing; flagging since the PR rewrote this exact branch.

6. The chapter-11 incident this warning-pattern guards against was actually triggered by IMAGE_SERVICE_URL falling back on the PM2 host — which still gets no unset-warning. Out of scope, but the warning currently lands on the env var that didn't cause the documented outage.

Genuinely fine — don't change

  • The ??.trim() + || rewrite is a correct upgrade: a blank DUCKDB_SERVICE_URL= used to become the literal fetch URL; now it falls back. || (not ??) is the right operator here.
  • Keeping the 127.0.0.1:8002 default is correct (documented 8002:8000 host mapping) and consistent with the portUnsetWarning precedent.

Manual test plan (what local tests can't cover — needs the Docker stack)

A — the fix: no stale "not reachable" after a normal boot

docker compose up --build -d
curl.exe http://localhost:3002/api/health
curl.exe -s -H "X-Admin-Key: hf_dev_key_2026" "http://localhost:3002/api/admin/logs?service=backend" | Select-String "DuckDB service"

Expect 🗄 DuckDB service reachable: {...} and no ⚠ DuckDB service not reachable.

B — the new unset-URL warning fires

docker compose run --rm -e DUCKDB_SERVICE_URL= --service-ports backend node dist/server.js

Expect [startup] DUCKDB_SERVICE_URL unset — defaulting to http://127.0.0.1:8002....

C — confirm the P1 delay (optional)

docker compose stop duckdb-service; docker compose restart backend
(Invoke-WebRequest http://localhost:3002/api/health -UseBasicParsing).StatusCode   # ~4.5s stall, then 200
docker compose start duckdb-service

Bottom line

Happy-path is correct (hence green CI), but #1 (listen first, probe in background) and #3 (fix the doc) shouldn't merge as-is. #2 and #4 are strongly recommended alongside. Full local sweep above is green.

🤖 Review generated with Claude Code (senior-reviewer + local test sweep).

…B_SERVICE_URL

The API and duckdb-service start together (pm2/compose), so the one-shot health probe in bootstrap() races the service binding its port. On every restart it logs '⚠ DuckDB service not reachable', which the in-memory log ring (#171) then keeps at the top of the admin server-logs panel for the whole process lifetime — looking like an outage when the service is fine. (Prod: API + duckdb-service started 1s apart; duckdb bound :8000 just after the probe -> ECONNREFUSED, logged once, non-fatal.)

bootstrap(): retry duckdbHealth() up to 10x500ms before warning; still serves regardless (advisory). duckdbClient: keep the 8002 default (documented docker host-port mapping 8002:8000) but expose duckdbUrlFromDefault so server.ts warns when DUCKDB_SERVICE_URL is unset, mirroring the existing PORT-unset warning. Trims/treats blank as unset.

Type-checked clean (tsc --noEmit) for both changed files.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@cofade cofade force-pushed the fix/backend-duckdb-bootstrap-retry branch from 9690438 to 9818abf Compare June 29, 2026 20:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants