From 144b836c800eb6b5109bfd113ac865c2e35d63e4 Mon Sep 17 00:00:00 2001 From: Mark Schutera Date: Sat, 27 Jun 2026 13:15:50 +0000 Subject: [PATCH 1/2] fix(deploy): install workspace npm + service pip deps in deploy.sh deploy.sh rebuilt/reloaded but never installed new deps, so a dep-adding release failed its build and auto-rolled-back. npm: it gated npm ci on backend/package-lock.json, but this is a workspaces monorepo with one ROOT lockfile, so new backend/homepage deps were missed (broke on rotating-file-stream, #178). Now a single root 'npm ci' gated on the root lockfile / any workspace package.json, before the builds; dropped the wrong per-prefix ci. pip: never ran. Now 'python3 -m pip install -r /requirements.txt' for duckdb-service/image-service when their requirements changed, into the system python3 pm2 uses; non-fatal, the post-reload health check is the real gate (graceful degradation on a missing optional dep). Also rewrites the stale production-runbook 'Updates & Redeployment' section to match reality (main branch, root npm ci, pip into system python3, all 4 pm2 apps, health checks, Python 3.10 / onnxruntime 1.23.2 note). Co-Authored-By: Claude Opus 4.8 (1M context) --- docs/07-deployment-view/production-runbook.md | 69 +++++++++++++------ scripts/deploy.sh | 27 +++++++- 2 files changed, 72 insertions(+), 24 deletions(-) diff --git a/docs/07-deployment-view/production-runbook.md b/docs/07-deployment-view/production-runbook.md index 862707e..9dcd848 100644 --- a/docs/07-deployment-view/production-runbook.md +++ b/docs/07-deployment-view/production-runbook.md @@ -2,9 +2,11 @@ > ⚠️ **Non-recommended legacy path.** This runbook covers only the > Node backend (PM2) and the static frontend (Nginx-served). It does -> **not** describe how to deploy `image-service` and `duckdb-service` -> on bare metal — those would each need their own systemd service, -> shared filesystem volume, and reverse-proxy plumbing. 
The supported +> **not** cover the *initial* bare-metal provisioning of `image-service` +> and `duckdb-service` (each needs its own pm2/systemd unit, shared +> filesystem volume, and reverse-proxy plumbing) — though ongoing +> **redeploys** of those Python services (dependency install + reload) are +> covered under [Updates & Redeployment](#updates--redeployment). The supported > production path is **Docker Compose + host-Nginx**: > [production-deployment.md](production-deployment.md). Use this PM2 > runbook only if Docker is not an option on the target host; expect @@ -231,30 +233,53 @@ pm2 startup ## Updates & Redeployment -To deploy updates from the production branch: +> The host normally self-deploys via [`scripts/deploy.sh`](../../scripts/deploy.sh) +> (auto-deploy driver — pulls `main`, installs deps, rebuilds only what changed, +> reloads the affected pm2 apps, health-checks, rolls back on failure; the +> `highfive-deploy.timer` may be inactive). The manual steps below mirror what it +> does — use them for a hand-deploy or to recover. + +The live PM2 stack is **four** apps, not just the backend: `highfive-api` +(Node, cluster), `duckdb-service` and `image-service` (Python, run on the +**system `python3` — no venv**), plus the Nginx-served `homepage/dist`. ```bash cd /var/www/highfive - -# Pull latest changes -git pull origin production - -# Rebuild backend -cd backend -npm install --production -npm run build -cd .. - -# Rebuild frontend -cd homepage -npm install --production -VITE_API_URL=https://highfive.schutera.com/api npm run build -cd .. - -# Restart backend -pm2 restart highfive-api +git pull --ff-only origin main + +# 1) Node deps — npm WORKSPACES monorepo, so install from the ROOT. A new +# backend/homepage dep lands in the ROOT package-lock.json; a per-package +# `npm --prefix ci` misses it (that broke a deploy on +# rotating-file-stream, #178). Safe to skip if no package*.json changed. 
+npm ci + +# 2) Python deps — install into the SAME system python3 pm2 runs the services +# with (no venv). duckdb-service deps are all pure / cp310-ok. image-service +# adds the hole-detection deps (opencv-python-headless, numpy, onnxruntime); +# the host is Python 3.10, so onnxruntime is pinned to 1.23.2 (max cp310 +# wheel). image-service BOOTS without these (detection degrades to no-op, +# ADR-028), so this step is only needed to ACTIVATE server-side detection. +python3 -m pip install -r duckdb-service/requirements.txt +python3 -m pip install -r image-service/requirements.txt + +# 3) Build the Node side (contracts is source-only — no build step) +npm --prefix backend run build +( cd homepage && VITE_API_URL=https://highfive.schutera.com/api npm run build ) + +# 4) Reload (zero-downtime for the api cluster) and health-check +pm2 reload highfive-api duckdb-service image-service +curl -fsS http://127.0.0.1:3001/api/health # backend +curl -fsS http://127.0.0.1:8000/health # duckdb-service +curl -fsS http://127.0.0.1:4444/health # image-service +curl -fsS -o /dev/null https://highfive.schutera.com/ && echo "homepage ok" ``` +**Python 3.10 ceiling.** The host's `python3` is 3.10, so the services must stay +3.10-compatible (no `from datetime import UTC`, which is 3.11+) and `onnxruntime` +stays pinned to `1.23.2` (the highest version with a CPython 3.10 wheel). All +AI/ML inference is server-side — the ESP runs no models +([ADR-028](../09-architecture-decisions/adr-028-ml-inference-server-side-only.md)). 
+ ## Verification ### Check Backend is Running diff --git a/scripts/deploy.sh b/scripts/deploy.sh index 459cf27..b716520 100755 --- a/scripts/deploy.sh +++ b/scripts/deploy.sh @@ -224,19 +224,42 @@ main() { local actions="" HOMEPAGE_REBUILT=0 # ---- build phase (no live mutation) --------------------------------------- + # Workspaces monorepo (contracts/backend/homepage share ONE root lockfile): a + # new backend/homepage dependency changes the ROOT `package-lock.json`, not + # `/package-lock.json`, so reinstall from the root — BEFORE the builds, or + # `tsc`/`vite` compile against missing deps and roll back (hit this with + # rotating-file-stream, #178). `npm ci` at root installs every workspace. + if changed_match '^package-lock\.json$|^(backend|homepage|contracts)/package\.json$'; then + log "npm deps changed — root npm ci (workspaces)" + npm ci >/dev/null 2>&1 || rollback "root npm ci failed" + fi if changed_match '^backend/|^contracts/'; then log "building backend" - changed_match '^backend/package-lock\.json$' && npm --prefix backend ci >/dev/null 2>&1 npm --prefix backend run build >/dev/null 2>&1 || rollback "backend build (tsc) failed" actions+="backend "; RELOADED+="highfive-api " fi if changed_match '^homepage/|^contracts/'; then log "building homepage -> dist.new" - changed_match '^homepage/package-lock\.json$' && npm --prefix homepage ci >/dev/null 2>&1 ( cd homepage && npx tsc && npx vite build --outDir dist.new ) >/dev/null 2>&1 || rollback "homepage build failed" [ -f "$REPO/homepage/dist.new/index.html" ] || rollback "homepage dist.new missing index.html" HOMEPAGE_REBUILT=1; actions+="homepage " fi + # Python services run under pm2 on the system `python3` (no venv) — install new + # deps into it BEFORE reload. NON-FATAL: a resolver miss (e.g. 
an onnxruntime + # with no wheel for this Python) must not block the deploy; the post-reload + # health check is the real gate (services degrade gracefully on a missing + # OPTIONAL dep, while a genuinely-required missing module crashes the reload → + # health fails → rollback). + if changed_match '^duckdb-service/requirements\.txt$'; then + log "duckdb-service deps changed — pip install" + python3 -m pip install -r duckdb-service/requirements.txt >/dev/null 2>&1 \ + || log "WARN: duckdb-service pip install had failures (health check will gate)" + fi + if changed_match '^image-service/requirements\.txt$'; then + log "image-service deps changed — pip install" + python3 -m pip install -r image-service/requirements.txt >/dev/null 2>&1 \ + || log "WARN: image-service pip install had failures (health check will gate)" + fi changed_match '^duckdb-service/' && { RELOADED+="duckdb-service "; actions+="duckdb-service "; } changed_match '^image-service/' && { RELOADED+="image-service "; actions+="image-service "; } From 135dc1f83c886074e8049cbc45ad346f98ab1b8c Mon Sep 17 00:00:00 2001 From: cofade Date: Mon, 29 Jun 2026 21:02:34 +0200 Subject: [PATCH 2/2] docs(deploy): align runbook Python-deps section with floated pins (ADR-029) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The PR's runbook rewrite asserted onnxruntime is "pinned to 1.23.2" under a "Python 3.10 ceiling". After folding in main (#195 / ADR-029), the real requirements float numpy>=2.0.0 / onnxruntime>=1.23.2 / pydantic>=2.12.5 for a 3.10-3.14 matrix — a floor, not a pin. Rewrote the step-2 comment and the "Python 3.10 floor" paragraph to match, citing ADR-029, and noted that a pip upgrade is not reverted on rollback. Added a ch11 lessons-learned entry for the workspace-lockfile npm-ci miss this PR corrects (per CLAUDE.md's mandatory doc gate). Addresses the senior-reviewer P0/P2 findings on the PR. 
Co-Authored-By: Claude Opus 4.8 --- docs/07-deployment-view/production-runbook.md | 33 ++++++++++++------- docs/11-risks-and-technical-debt/README.md | 8 +++++ 2 files changed, 30 insertions(+), 11 deletions(-) diff --git a/docs/07-deployment-view/production-runbook.md b/docs/07-deployment-view/production-runbook.md index 9dcd848..c8060fc 100644 --- a/docs/07-deployment-view/production-runbook.md +++ b/docs/07-deployment-view/production-runbook.md @@ -2,7 +2,7 @@ > ⚠️ **Non-recommended legacy path.** This runbook covers only the > Node backend (PM2) and the static frontend (Nginx-served). It does -> **not** cover the *initial* bare-metal provisioning of `image-service` +> **not** cover the _initial_ bare-metal provisioning of `image-service` > and `duckdb-service` (each needs its own pm2/systemd unit, shared > filesystem volume, and reverse-proxy plumbing) — though ongoing > **redeploys** of those Python services (dependency install + reload) are @@ -237,7 +237,10 @@ pm2 startup > (auto-deploy driver — pulls `main`, installs deps, rebuilds only what changed, > reloads the affected pm2 apps, health-checks, rolls back on failure; the > `highfive-deploy.timer` may be inactive). The manual steps below mirror what it -> does — use them for a hand-deploy or to recover. +> does — use them for a hand-deploy or to recover. **Caveat:** a rollback restores +> the git tree and Node build artifacts, but a `pip install` that _upgraded_ a +> shared dependency (e.g. `numpy` → 2.x) is **not** reverted — pip upgrades are +> forward-only across a rollback. The live PM2 stack is **four** apps, not just the backend: `highfive-api` (Node, cluster), `duckdb-service` and `image-service` (Python, run on the @@ -254,11 +257,12 @@ git pull --ff-only origin main npm ci # 2) Python deps — install into the SAME system python3 pm2 runs the services -# with (no venv). duckdb-service deps are all pure / cp310-ok. 
image-service -# adds the hole-detection deps (opencv-python-headless, numpy, onnxruntime); -# the host is Python 3.10, so onnxruntime is pinned to 1.23.2 (max cp310 -# wheel). image-service BOOTS without these (detection degrades to no-op, -# ADR-028), so this step is only needed to ACTIVATE server-side detection. +# with (no venv). Native deps whose wheel windows can't span the 3.10–3.14 CI +# matrix are floated to >= bounds (numpy>=2.0.0, onnxruntime>=1.23.2, +# pydantic>=2.12.5), so pip resolves a per-interpreter wheel — on this 3.10 +# host that's onnxruntime 1.23.2 / numpy 2.x (ADR-029). image-service BOOTS +# without the hole-detection deps (detection degrades to a no-op, ADR-028), +# which is why the pip step is non-fatal in scripts/deploy.sh. python3 -m pip install -r duckdb-service/requirements.txt python3 -m pip install -r image-service/requirements.txt @@ -274,10 +278,17 @@ curl -fsS http://127.0.0.1:4444/health # image-service curl -fsS -o /dev/null https://highfive.schutera.com/ && echo "homepage ok" ``` -**Python 3.10 ceiling.** The host's `python3` is 3.10, so the services must stay -3.10-compatible (no `from datetime import UTC`, which is 3.11+) and `onnxruntime` -stays pinned to `1.23.2` (the highest version with a CPython 3.10 wheel). All -AI/ML inference is server-side — the ESP runs no models +**Python 3.10 floor (not a pin).** The host's `python3` is 3.10, so the services +must stay 3.10-compatible (no `from datetime import UTC`, which is 3.11+). The CI +matrix runs them across **3.10–3.14**, so native deps whose wheel windows can't +span that range are floated to `>=` lower bounds rather than `==`-pinned — +`numpy>=2.0.0`, `onnxruntime>=1.23.2` (image-service) and `pydantic>=2.12.5` (both +services). pip then resolves the newest interpreter-compatible wheel per host: on +this 3.10 box that's `onnxruntime` 1.23.2 (its highest cp310 wheel) and `numpy` +2.x. 
Rationale and trade-offs (prod moves to numpy 2.x; looser reproducibility on +the floated deps) are in +[ADR-029](../09-architecture-decisions/adr-029-python-version-matrix-floated-pins.md). +All AI/ML inference is server-side — the ESP runs no models ([ADR-028](../09-architecture-decisions/adr-028-ml-inference-server-side-only.md)). ## Verification diff --git a/docs/11-risks-and-technical-debt/README.md b/docs/11-risks-and-technical-debt/README.md index 84dae45..db635d0 100644 --- a/docs/11-risks-and-technical-debt/README.md +++ b/docs/11-risks-and-technical-debt/README.md @@ -3485,3 +3485,11 @@ The seed value lives in the schema, has a plausible-looking name, and never wins **What happened.** While adding the serial-console server override (#156), the highest-risk interaction was that `host.cpp`'s `saveConfig` built a fresh `StaticJsonDocument` containing only SSID/PASSWORD and wrote it over `/config.json`. The override writer (`esp_init.cpp` `writeServerUrlsToConfig`) writes `NETWORK.INIT_URL`/`UPLOAD_URL` into the **same file** — so any later Wi-Fi reconfigure through the captive portal would have silently erased the override, sending the module back to its baked default on the next boot. **Lesson.** When two writers share one config file, a "build it fresh" writer is a latent data-loss bug the moment a _second_ writer adds keys it doesn't know about. Make every writer read-modify-write (preserve unknown keys), and factor the mutation into a host-tested pure function so the "preserve a key I don't own" invariant is pinned by a test (`test_wifi_save_preserves_existing_init_url`) rather than living only in a careful author's head. Bonus: computing the new JSON _before_ opening the file for `"w"` also closes the older #19 truncate-then-fail window — an overflow now leaves the existing file byte-for-byte intact instead of stranding an empty one. 
+
+### `scripts/deploy.sh` never installed new deps — per-package `npm ci` misses the root workspace lockfile (#178, #196)
+
+**What happened.** The auto-deploy driver (`scripts/deploy.sh`) gated its npm install on `backend/package-lock.json` changing and ran `npm --prefix backend ci`. But this is an **npm workspaces** monorepo — `contracts`/`backend`/`homepage` share one **root** `package-lock.json` — so a new backend/homepage dependency changes the _root_ lockfile, not a per-package one. The gate never fired, `npm ci` was skipped, and `tsc`/`vite` then compiled against the missing dependency and rolled the deploy back. It first bit on `rotating-file-stream` (#178). The pip side was worse: it never ran at all, so a new Python dependency was never installed into the system `python3` before `pm2 reload`.
+
+**Lesson.** In a workspaces monorepo there is exactly one authoritative lockfile — the **root** one — and `npm --prefix ci` is the wrong command: it neither reads the root lockfile nor installs sibling workspaces. Gate on the root `package-lock.json` (or any workspace `package.json`) and run a single root `npm ci`, **before** the builds. For the PM2-host Python services (system `python3`, no venv, no lockfile), install `requirements.txt` explicitly — a `reload` alone never picks up a new dependency.
+
+**Fix (#196).** `deploy.sh` runs one root `npm ci` gated on `^package-lock\.json$|^(backend|homepage|contracts)/package\.json$` before the Node builds, and `python3 -m pip install -r /requirements.txt` for each Python service whose `requirements.txt` changed. The pip step is deliberately **non-fatal**: a resolver miss (e.g. an `onnxruntime` with no wheel for the host interpreter) must not block the deploy, because image-service degrades to a detection no-op (ADR-028) and the post-reload health check is the real gate. One sharp edge remains: rollback restores the git tree + build artifacts but does **not** downgrade a pip-_upgraded_ shared dep, so a forward bump (e.g. `numpy` → 2.x via ADR-029's float) is sticky across a rollback.
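
The npm gate from the fix can be exercised in isolation. The following is a minimal sketch, assuming `changed_match` tests the list of changed paths against an extended regex; the real helper is not shown in this patch, so `changed_match_sketch` and `needs_root_npm_ci` are hypothetical names used only for illustration. The regex itself is copied from the patch.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for deploy.sh's changed_match (the real helper is not
# part of this patch). Reads changed paths on stdin, one per line, and
# succeeds if any path matches the extended regex in $1.
changed_match_sketch() {
  grep -Eq "$1"
}

# Gate regex copied from the patch: the root lockfile or any workspace manifest.
NPM_GATE='^package-lock\.json$|^(backend|homepage|contracts)/package\.json$'

# Prints "yes" when the given newline-separated path list should trigger a
# root `npm ci`, otherwise "no".
needs_root_npm_ci() {
  if printf '%s\n' "$1" | changed_match_sketch "$NPM_GATE"; then
    echo yes
  else
    echo no
  fi
}
```

Under these assumptions, a change to the root `package-lock.json` or to `backend/package.json` yields `yes`, while a path such as `backend/package-lock.json` (the file the old gate watched, which a workspaces install does not update) yields `no`.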