fix(backend): retry duckdb health at boot + warn on unset DUCKDB_SERVICE_URL by schutera · Pull Request #193 · schutera/highfive

schutera · 2026-06-27T10:11:47Z

Symptom

The admin server-logs panel shows, seemingly as a live outage:

⚠ DuckDB service not reachable: TypeError: fetch failed … ECONNREFUSED 127.0.0.1:8000
    at async duckdbHealth (…/duckdbClient.js)
    at async bootstrap (…/server.js)

…even though duckdb-service is up and /health returns 200.

Root cause — a startup race, not an outage

bootstrap() in server.ts probes duckdbHealth() once, immediately at boot. The API and duckdb-service are started together (pm2/compose), so the probe fires ~1s before duckdb finishes binding its port → ECONNREFUSED. It's caught and non-fatal (the API serves anyway), but:

it's logged once at boot, and
the Admin-gated endpoint to retrieve server process logs (nginx/backend/image-service/duckdb) #171 in-memory log ring retains boot lines for the whole process lifetime, so that single stale warning sits at the top of the admin panel and reads like a current failure.

Verified on prod: highfive-api and duckdb-service last started 2026-06-26 23:00:24 / :25 UTC — 1 second apart; /health is 200 now, service has not restarted since.

Changes

Retry the boot health probe (server.ts) — bootstrap() now retries duckdbHealth() up to 10 × 500ms before logging "not reachable". It still starts serving regardless (the check is advisory). A normal boot resolves on attempt 1–2; a genuinely-down service still surfaces the warning after ~5s.
Warn when DUCKDB_SERVICE_URL is unset (duckdbClient.ts → server.ts) — mirrors the existing PORT-unset warning. Makes a real misconfiguration loud instead of silently pointing at the wrong port. Also trims / treats blank as unset.

Deliberately NOT changed

The http://127.0.0.1:8002 default is kept — 8002 is the documented Docker host-port mapping (duckdb-service is 8002:8000), correct for a backend running on the host against a composed duckdb. A bare-metal/pm2 box (duckdb directly on :8000) sets DUCKDB_SERVICE_URL explicitly, which ecosystem.config.js already does. Changing the default to 8000 would break the compose convention; the new startup warning addresses the silent-fallback footgun instead.

Verification

tsc --noEmit is clean for both changed files. (Other files in the tree report errors only from a stale @highfive/contracts build and a missing rotating-file-stream dep in this host's node_modules — pre-existing environment drift, unrelated to this change; CI installs/builds both.)

🤖 Generated with Claude Code

cofade · 2026-06-28T22:01:14Z

@schutera — senior-review + full local test sweep of this PR. Verdict: mergeable, but one real availability regression and a stale doc the PR touches should be fixed first. The core fix is sound and small; the .trim() + || rewrite is a genuine correctness improvement. Details below.

Test results — every layer I can run locally is green

Layer	Result	Notes
Backend vitest	✅ 216/216	Parallel run hit a Windows `tinypool` "Worker exited unexpectedly" flake; single-threaded (`--no-file-parallelism --pool=forks`) is clean. CI green too — not a real failure.
Backend `tsc` build + `--noEmit`	✅ clean	The PR body's "stale contracts / missing `rotating-file-stream`" caveat does not reproduce on a fresh checkout here — contracts built, deps present.
Homepage vitest	✅ 191/191	regression baseline
duckdb-service pytest	✅ 232/232
image-service pytest	✅ 97/97
ESP32-CAM native	✅ 291/291
All 9 CI checks	✅ pass

Findings (ranked, independently verified against the files — not just the diff)

P1 — should fix before merge

1. app.listen is delayed ~4.5 s when duckdb is genuinely down — a real availability regression. In server.ts the retry loop runs before app.listen. I reproduced it by running the built backend against a refused duckdb port:

21:56:31.773  WARN  ⚠ DuckDB service not reachable after 10 attempts (http://127.0.0.1:9)
21:56:31.777  INFO  🐝 HighFive Backend API listening on port 3999   ← 4ms AFTER the loop exhausts

So /api/health (and every route) connection-refuses for the whole ~4.5 s loop when duckdb is down. The old single probe bound the port in milliseconds. Since the probe is explicitly advisory ("we start serving regardless"), the fix is to app.listen first, then run the probe loop in the background — that erases this stall and finding #2's worst case.

2. duckdbHealth() has no fetch timeout. duckdbClient.ts does a bare fetch, while the other three DUCKDB_URL fetches in app.ts all use AbortSignal.timeout(15000). Against a duckdb host that accepts TCP but never answers (hung, not refused), await duckdbHealth() blocks indefinitely → the loop never advances → app.listen is never called and the backend never comes up. Pre-existing, but the PR's whole premise is "duckdb might not be ready," so it's the moment to add signal: AbortSignal.timeout(...).

3. Stale doc the PR touches. docs/05-building-block-view/backend.md (Operational notes) claims the backend "retries duckdb-service on startup with exponential backoff … it does not block." After this PR it's wrong on both counts: the delay is a constant 500 ms (no backoff), and per #1 it now does block. CLAUDE.md's mandatory doc-update table makes this the author's responsibility. Suggest: "10 fixed-interval (500 ms) retries, non-blocking" — and make "non-blocking" actually true via #1.

4. Zero tests for the new logic. The analogous PORT work was deliberately extracted into a pure helper (port.ts → resolvePort(envValue)) with tests/port-default.test.ts specifically so it's testable without triggering bootstrap(). This PR instead computes duckdbUrlFromDefault as a module-load-time constant (untestable without env juggling) and adds nothing. Extracting resolveDuckdbUrl(envValue) → {url, fromDefault} turns the unset/blank/set cases into three one-line assertions and pins the one genuine bug fix here.

P2 — nits / defer

5. health.ok is never checked in duckdbClient.ts — a 200 with {"ok":false} still logs "reachable". Pre-existing; flagging since the PR rewrote this exact branch.

6. The chapter-11 incident this warning-pattern guards against was actually triggered by IMAGE_SERVICE_URL falling back on the PM2 host — which still gets no unset-warning. Out of scope, but the warning currently lands on the env var that didn't cause the documented outage.

Genuinely fine — don't change

The ?? → .trim() + || rewrite is a correct upgrade: a blank DUCKDB_SERVICE_URL= used to become the literal fetch URL; now it falls back. || (not ??) is the right operator here.
Keeping the 127.0.0.1:8002 default is correct (documented 8002:8000 host mapping) and consistent with the portUnsetWarning precedent.

Manual test plan (what local tests can't cover — needs the Docker stack)

A — the fix: no stale "not reachable" after a normal boot

docker compose up --build -d
curl.exe http://localhost:3002/api/health
curl.exe -s -H "X-Admin-Key: hf_dev_key_2026" "http://localhost:3002/api/admin/logs?service=backend" | Select-String "DuckDB service"

Expect 🗄 DuckDB service reachable: {...} and no ⚠ DuckDB service not reachable.

B — the new unset-URL warning fires

docker compose run --rm -e DUCKDB_SERVICE_URL= --service-ports backend node dist/server.js

Expect [startup] DUCKDB_SERVICE_URL unset — defaulting to http://127.0.0.1:8002....

C — confirm the P1 delay (optional)

docker compose stop duckdb-service; docker compose restart backend
(Invoke-WebRequest http://localhost:3002/api/health -UseBasicParsing).StatusCode   # ~4.5s stall, then 200
docker compose start duckdb-service

Bottom line

Happy-path is correct (hence green CI), but #1 (listen first, probe in background) and #3 (fix the doc) shouldn't merge as-is. #2 and #4 are strongly recommended alongside. Full local sweep above is green.

_{🤖 Review generated with Claude Code (senior-reviewer + local test sweep).}

…B_SERVICE_URL The API and duckdb-service start together (pm2/compose), so the one-shot health probe in bootstrap() races the service binding its port. On every restart it logs '⚠ DuckDB service not reachable', which the in-memory log ring (#171) then keeps at the top of the admin server-logs panel for the whole process lifetime — looking like an outage when the service is fine. (Prod: API + duckdb-service started 1s apart; duckdb bound :8000 just after the probe -> ECONNREFUSED, logged once, non-fatal.) bootstrap(): retry duckdbHealth() up to 10x500ms before warning; still serves regardless (advisory). duckdbClient: keep the 8002 default (documented docker host-port mapping 8002:8000) but expose duckdbUrlFromDefault so server.ts warns when DUCKDB_SERVICE_URL is unset, mirroring the existing PORT-unset warning. Trims/treats blank as unset. Type-checked clean (tsc --noEmit) for both changed files. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

cofade force-pushed the fix/backend-duckdb-bootstrap-retry branch from 9690438 to 9818abf Compare June 29, 2026 20:26

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(backend): retry duckdb health at boot + warn on unset DUCKDB_SERVICE_URL#193

fix(backend): retry duckdb health at boot + warn on unset DUCKDB_SERVICE_URL#193
schutera wants to merge 1 commit into
mainfrom
fix/backend-duckdb-bootstrap-retry

schutera commented Jun 27, 2026

Uh oh!

cofade commented Jun 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

schutera commented Jun 27, 2026

Symptom

Root cause — a startup race, not an outage

Changes

Deliberately NOT changed

Verification

Uh oh!

cofade commented Jun 28, 2026

Test results — every layer I can run locally is green

Findings (ranked, independently verified against the files — not just the diff)

P1 — should fix before merge

P2 — nits / defer

Genuinely fine — don't change

Manual test plan (what local tests can't cover — needs the Docker stack)

Bottom line

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants