80 changes: 58 additions & 22 deletions docs/07-deployment-view/production-runbook.md
@@ -2,9 +2,11 @@

> ⚠️ **Non-recommended legacy path.** This runbook covers only the
> Node backend (PM2) and the static frontend (Nginx-served). It does
> **not** describe how to deploy `image-service` and `duckdb-service`
> on bare metal — those would each need their own systemd service,
> shared filesystem volume, and reverse-proxy plumbing. The supported
> **not** cover the _initial_ bare-metal provisioning of `image-service`
> and `duckdb-service` (each needs its own pm2/systemd unit, shared
> filesystem volume, and reverse-proxy plumbing) — though ongoing
> **redeploys** of those Python services (dependency install + reload) are
> covered under [Updates & Redeployment](#updates--redeployment). The supported
> production path is **Docker Compose + host-Nginx**:
> [production-deployment.md](production-deployment.md). Use this PM2
> runbook only if Docker is not an option on the target host; expect
@@ -231,30 +233,64 @@ pm2 startup

## Updates & Redeployment

To deploy updates from the production branch:
> The host normally self-deploys via [`scripts/deploy.sh`](../../scripts/deploy.sh),
> the auto-deploy driver: it pulls `main`, installs deps, rebuilds only what changed,
> reloads the affected pm2 apps, health-checks, and rolls back on failure. (The
> `highfive-deploy.timer` may be inactive.) The manual steps below mirror what it
> does; use them for a hand-deploy or to recover. **Caveat:** a rollback restores
> the git tree and Node build artifacts, but a `pip install` that _upgraded_ a
> shared dependency (e.g. `numpy` → 2.x) is **not** reverted: pip upgrades are
> forward-only across a rollback.

The live stack is **four** components, not just the backend: three PM2 apps,
`highfive-api` (Node, cluster) plus `duckdb-service` and `image-service` (Python,
run on the **system `python3`, no venv**), and the Nginx-served `homepage/dist`.

```bash
cd /var/www/highfive

# Pull latest changes
git pull origin production

# Rebuild backend
cd backend
npm install --production
npm run build
cd ..

# Rebuild frontend
cd homepage
npm install --production
VITE_API_URL=https://highfive.schutera.com/api npm run build
cd ..

# Restart backend
pm2 restart highfive-api
git pull --ff-only origin main

# 1) Node deps — npm WORKSPACES monorepo, so install from the ROOT. A new
# backend/homepage dep lands in the ROOT package-lock.json; a per-package
# `npm --prefix <pkg> ci` misses it (that broke a deploy on
# rotating-file-stream, #178). Safe to skip if no package*.json changed.
npm ci

# 2) Python deps — install into the SAME system python3 pm2 runs the services
# with (no venv). Native deps whose wheel windows can't span the 3.10–3.14 CI
# matrix are floated to >= bounds (numpy>=2.0.0, onnxruntime>=1.23.2,
# pydantic>=2.12.5), so pip resolves a per-interpreter wheel — on this 3.10
# host that's onnxruntime 1.23.2 / numpy 2.x (ADR-029). image-service BOOTS
# without the hole-detection deps (detection degrades to a no-op, ADR-028),
# which is why the pip step is non-fatal in scripts/deploy.sh.
python3 -m pip install -r duckdb-service/requirements.txt
python3 -m pip install -r image-service/requirements.txt

# 3) Build the Node side (contracts is source-only — no build step)
npm --prefix backend run build
( cd homepage && VITE_API_URL=https://highfive.schutera.com/api npm run build )

# 4) Reload (zero-downtime for the api cluster) and health-check
pm2 reload highfive-api duckdb-service image-service
curl -fsS http://127.0.0.1:3001/api/health # backend
curl -fsS http://127.0.0.1:8000/health # duckdb-service
curl -fsS http://127.0.0.1:4444/health # image-service
curl -fsS -o /dev/null https://highfive.schutera.com/ && echo "homepage ok"
```
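
The health checks in step 4 are single-shot. If a service needs a moment to come back after `pm2 reload`, a small retry wrapper can avoid a false alarm during a hand-deploy. This is a minimal sketch, not what `scripts/deploy.sh` does; the `retry` helper and its argument order are hypothetical.

```shell
# retry <attempts> <delay-seconds> <command...>
# Hypothetical helper: runs the command until it succeeds or the attempts
# run out. Returns 0 on the first success, 1 if every attempt failed.
retry() {
  _retry_attempts=$1
  _retry_delay=$2
  shift 2
  _retry_i=1
  while [ "$_retry_i" -le "$_retry_attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Only sleep between attempts, not after the last one.
    if [ "$_retry_i" -lt "$_retry_attempts" ]; then
      sleep "$_retry_delay"
    fi
    _retry_i=$((_retry_i + 1))
  done
  return 1
}
```

For example, `retry 10 2 curl -fsS http://127.0.0.1:3001/api/health` would poll the backend for up to roughly 20 seconds before giving up; the attempt count and delay are illustrative values, not tuned ones.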

**Python 3.10 floor (not a pin).** The host's `python3` is 3.10, so the services
must stay 3.10-compatible (no `from datetime import UTC`, which is 3.11+). The CI
matrix runs them across **3.10–3.14**, so native deps whose wheel windows can't
span that range are floated to `>=` lower bounds rather than `==`-pinned —
`numpy>=2.0.0`, `onnxruntime>=1.23.2` (image-service) and `pydantic>=2.12.5` (both
services). pip then resolves the newest interpreter-compatible wheel per host: on
this 3.10 box that's `onnxruntime` 1.23.2 (its highest cp310 wheel) and `numpy`
2.x. Rationale and trade-offs (prod moves to numpy 2.x; looser reproducibility on
the floated deps) are in
[ADR-029](../09-architecture-decisions/adr-029-python-version-matrix-floated-pins.md).
All AI/ML inference is server-side — the ESP runs no models
([ADR-028](../09-architecture-decisions/adr-028-ml-inference-server-side-only.md)).
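
To confirm a host meets the 3.10 floor before installing, the interpreter version can be compared numerically (a plain string comparison would wrongly rank `3.9` above `3.10`). A minimal sketch; `meets_py_floor` is a hypothetical helper, not part of the repo.

```shell
# meets_py_floor <major.minor[.patch]>
# Hypothetical helper: succeeds when the given Python version is >= 3.10.
# Compares major and minor as integers, so 3.10 correctly ranks above 3.9.
meets_py_floor() {
  _py_major=${1%%.*}      # text before the first dot
  _py_rest=${1#*.}        # text after the first dot
  _py_minor=${_py_rest%%.*}
  [ "$_py_major" -gt 3 ] || { [ "$_py_major" -eq 3 ] && [ "$_py_minor" -ge 10 ]; }
}
```

On a host it could be fed the running interpreter's version, for instance `meets_py_floor "$(python3 -c 'import platform; print(platform.python_version())')" || echo "python3 is below 3.10" >&2`.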

## Verification

### Check Backend is Running
8 changes: 8 additions & 0 deletions docs/11-risks-and-technical-debt/README.md
@@ -3485,3 +3485,11 @@ The seed value lives in the schema, has a plausible-looking name, and never wins
**What happened.** While adding the serial-console server override (#156), the highest-risk interaction was that `host.cpp`'s `saveConfig` built a fresh `StaticJsonDocument` containing only SSID/PASSWORD and wrote it over `/config.json`. The override writer (`esp_init.cpp` `writeServerUrlsToConfig`) writes `NETWORK.INIT_URL`/`UPLOAD_URL` into the **same file** — so any later Wi-Fi reconfigure through the captive portal would have silently erased the override, sending the module back to its baked default on the next boot.

**Lesson.** When two writers share one config file, a "build it fresh" writer is a latent data-loss bug the moment a _second_ writer adds keys it doesn't know about. Make every writer read-modify-write (preserve unknown keys), and factor the mutation into a host-tested pure function so the "preserve a key I don't own" invariant is pinned by a test (`test_wifi_save_preserves_existing_init_url`) rather than living only in a careful author's head. Bonus: computing the new JSON _before_ opening the file for `"w"` also closes the older #19 truncate-then-fail window — an overflow now leaves the existing file byte-for-byte intact instead of stranding an empty one.

### `scripts/deploy.sh` never installed new deps — per-package `npm ci` misses the root workspace lockfile (#178, #196)

**What happened.** The auto-deploy driver (`scripts/deploy.sh`) gated its npm install on `backend/package-lock.json` changing and ran `npm --prefix backend ci`. But this is an **npm workspaces** monorepo — `contracts`/`backend`/`homepage` share one **root** `package-lock.json` — so a new backend/homepage dependency changes the _root_ lockfile, not a per-package one. The gate never fired, `npm ci` was skipped, and `tsc`/`vite` then compiled against the missing dependency and rolled the deploy back. It first bit on `rotating-file-stream` (#178). The pip side was worse: it never ran at all, so a new Python dependency was never installed into the system `python3` before `pm2 reload`.

**Lesson.** In a workspaces monorepo there is exactly one authoritative lockfile — the **root** one — and `npm --prefix <pkg> ci` is the wrong command: it neither reads the root lockfile nor installs sibling workspaces. Gate on the root `package-lock.json` (or any workspace `package.json`) and run a single root `npm ci`, **before** the builds. For the PM2-host Python services (system `python3`, no venv, no lockfile), install `requirements.txt` explicitly — a `reload` alone never picks up a new dependency.

**Fix (#196).** `deploy.sh` runs one root `npm ci` gated on `^package-lock\.json$|^(backend|homepage|contracts)/package\.json$` before the Node builds, and `python3 -m pip install -r <svc>/requirements.txt` for each Python service whose `requirements.txt` changed. The pip step is deliberately **non-fatal**: a resolver miss (e.g. an `onnxruntime` with no wheel for the host interpreter) must not block the deploy, because image-service degrades to a detection no-op (ADR-028) and the post-reload health check is the real gate. One sharp edge remains: rollback restores the git tree + build artifacts but does **not** downgrade a pip-_upgraded_ shared dep, so a forward bump (e.g. `numpy` → 2.x via ADR-029's float) is sticky across a rollback.
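
The gate pattern can be exercised in isolation. The sketch below assumes `changed_match` greps a newline-separated list of changed paths; the real helper in `scripts/deploy.sh` is not shown in this diff, so `CHANGED_FILES`, `NPM_GATE`, and `needs_root_npm_ci` are hypothetical names used only for illustration.

```shell
# Hypothetical stand-in for deploy.sh's changed_match helper.
# Assumes CHANGED_FILES holds one changed path per line, e.g. the output of
# `git diff --name-only <old> <new>`.
changed_match() {
  printf '%s\n' "$CHANGED_FILES" | grep -Eq -- "$1"
}

# The gate pattern quoted in the fix: the root lockfile, or any workspace
# package.json. A per-package lockfile path does not match.
NPM_GATE='^package-lock\.json$|^(backend|homepage|contracts)/package\.json$'

# Succeeds when the changed paths call for a root `npm ci`.
needs_root_npm_ci() {
  changed_match "$NPM_GATE"
}
```

With `CHANGED_FILES` set to `homepage/package.json` the gate fires; with only `backend/src/...` paths, or with the old `backend/package-lock.json` trigger, it does not.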
27 changes: 25 additions & 2 deletions scripts/deploy.sh
@@ -224,19 +224,42 @@ main() {
local actions="" HOMEPAGE_REBUILT=0

# ---- build phase (no live mutation) ---------------------------------------
# Workspaces monorepo (contracts/backend/homepage share ONE root lockfile): a
# new backend/homepage dependency changes the ROOT `package-lock.json`, not
# `<pkg>/package-lock.json`, so reinstall from the root — BEFORE the builds, or
# `tsc`/`vite` compile against missing deps and roll back (hit this with
# rotating-file-stream, #178). `npm ci` at root installs every workspace.
if changed_match '^package-lock\.json$|^(backend|homepage|contracts)/package\.json$'; then
log "npm deps changed — root npm ci (workspaces)"
npm ci >/dev/null 2>&1 || rollback "root npm ci failed"
fi
if changed_match '^backend/|^contracts/'; then
log "building backend"
changed_match '^backend/package-lock\.json$' && npm --prefix backend ci >/dev/null 2>&1
npm --prefix backend run build >/dev/null 2>&1 || rollback "backend build (tsc) failed"
actions+="backend "; RELOADED+="highfive-api "
fi
if changed_match '^homepage/|^contracts/'; then
log "building homepage -> dist.new"
changed_match '^homepage/package-lock\.json$' && npm --prefix homepage ci >/dev/null 2>&1
( cd homepage && npx tsc && npx vite build --outDir dist.new ) >/dev/null 2>&1 || rollback "homepage build failed"
[ -f "$REPO/homepage/dist.new/index.html" ] || rollback "homepage dist.new missing index.html"
HOMEPAGE_REBUILT=1; actions+="homepage "
fi
# Python services run under pm2 on the system `python3` (no venv) — install new
# deps into it BEFORE reload. NON-FATAL: a resolver miss (e.g. an onnxruntime
# with no wheel for this Python) must not block the deploy; the post-reload
# health check is the real gate (services degrade gracefully on a missing
# OPTIONAL dep, while a genuinely-required missing module crashes the reload →
# health fails → rollback).
if changed_match '^duckdb-service/requirements\.txt$'; then
log "duckdb-service deps changed — pip install"
python3 -m pip install -r duckdb-service/requirements.txt >/dev/null 2>&1 \
|| log "WARN: duckdb-service pip install had failures (health check will gate)"
fi
if changed_match '^image-service/requirements\.txt$'; then
log "image-service deps changed — pip install"
python3 -m pip install -r image-service/requirements.txt >/dev/null 2>&1 \
|| log "WARN: image-service pip install had failures (health check will gate)"
fi
changed_match '^duckdb-service/' && { RELOADED+="duckdb-service "; actions+="duckdb-service "; }
changed_match '^image-service/' && { RELOADED+="image-service "; actions+="image-service "; }
