Skip to content

feat: live structured server-log console (closes #178)#180

Merged
cofade merged 16 commits into
mainfrom
claude/issue-178-analysis-f3tstv
Jun 19, 2026
Merged

feat: live structured server-log console (closes #178)#180
cofade merged 16 commits into
mainfrom
claude/issue-178-analysis-f3tstv

Conversation

@cofade

@cofade cofade commented Jun 19, 2026

Copy link
Copy Markdown
Collaborator

Summary

Turns the admin Server Logs panel (/admin → Server Logs) from a ~2-line static boot-banner dump into a genuine live operations console. The panel UI was already fine — the real gap was the data source: the services barely wrote to stdout, the ring was in-memory only, and lines were unstructured. Per the locked decisions in #178: structured timestamped entries, per-request access logging, disk persistence that survives restart (30 days / 100 MB), SSE live tail, and client-side search/filter/export — across all three services.

Delivered as 5 reviewable commits (one per phase) plus review-fix commits, all in this single PR. Architecture + rationale: ADR-022.

What changed

Contract (defined once, in contracts/): ServerLogsResponse.lines: string[]entries: LogEntry[] ({ ts, level, msg }), plus LogLevel. The same LogEntry is the SSE event payload — no separate type.

Backend + both Flask services (symmetric):

  • A structured logger and the existing stdout/stderr tee feed one ring with no double-capture (the logger writes via the saved original stream to bypass the tee).
  • Per-request access logging (method path status ms, level by status) — path-only, never headers/body/query, so the admin key, the login password, and ?token= values can't reach the admin-readable (and now disk-persisted) ring.
  • Disk persistence gated on LOG_DIR: JSONL, rotate daily-or-50 MB, retain ≤30 files and ≤100 MB total, with startup backfill so history survives a restart. Node uses rotating-file-stream; Flask uses a custom TimedRotatingFileHandler subclass (overridden doRollover so size-rolls actually bound the active file).
  • SSE live tail GET /api/admin/logs/stream — backend streams its own ring or pipes the Flask services' internal /logs/stream.

Frontend (ServerLogsPanel.tsx): REST backfill → EventSource live tail with follow-mode + live indicator; rows color-coded by level; a search box, level checkboxes, and a client-side Download .log export.

Docs: ADR-022 (supersedes ADR-021's in-memory consequence), api-reference + api-contracts, auth.md (disk-persisted-secret caveat), chapter 11 (retention/PII + per-process SSE), compose volumes + host-nginx SSE no-buffer.

Side-fix

The misleading http://localhost:3001 boot banner now reads listening on port N (all interfaces) — it never was localhost-only (issue comment 2).

Testing

  • backend 200 · homepage 169 · duckdb 204 · image 79 — all green; tsc/ruff/build/make check-citations clean.
  • New tests at every layer: structured ring + no-double-capture, access-log redaction (asserts no secret/header/query leaks), disk persistence + rotation (asserts the active file stays bounded under sustained writes and both retention bounds hold), SSE (real-server backend stream, abort-on-disconnect regression, Flask stream), component (live append, filter, export), and a Playwright spec asserting a real request streams into the panel live and exports.
  • A test enforces the two Flask log_ring.py files stay byte-identical (it caught a real drift mid-development).
  • Reviewed across three senior-review passes — each caught a real defect (an SSE-disconnect process crash and a rotation bug that let the active file grow unbounded); final verdict: mergeable.
  • ⚠️ Playwright (make test-ui) was not run in the dev sandbox — relies on the CI test-ui job.

Open considerations

  • Multi-worker prod: the on-disk file is shared but the SSE emitter is per-process — a future gunicorn/PM2-cluster backend would stream only the serving worker's live entries (history via disk stays complete). Noted in chapter 11.
  • PII / retention: 30-day persisted access logs are closer to an audit log than ADR-021's "recent tail" — request paths carry no PII today (MACs + module ids); retention is documented in ADR-022.
  • Secret-on-disk: persistence hardens the "never print/console.log secrets" rule — access logs are path-only and the dev-key banner bypasses the ring via writeStdout; called out in ADR-022 + auth.md.

🤖 Generated with Claude Code

https://claude.ai/code/session_01HU7JHMbvtUKRfZY98qeWdK


Generated by Claude Code

claude added 10 commits June 18, 2026 21:18
Reshape the admin Server Logs pipeline from raw `lines: string[]` to
structured `entries: LogEntry[]` ({ ts, level, msg }) end-to-end, with no
new data sources yet (Phase 1 of #178).

- contracts: add LogLevel + LogEntry; ServerLogsResponse now carries
  `entries` instead of `lines` (wire type lives only in contracts/).
- backend: logRing holds LogEntry[] via two no-double-capture paths
  (stdout/stderr tee fallback + pushEntry); new log.ts structured logger
  writes through the saved original stream so the tee never re-captures it;
  server.ts banners go through the logger; the misleading
  `http://localhost:PORT` boot banner now states the port + all-interfaces.
- duckdb-service + image-service: symmetric log_ring upgrade (entry dicts,
  level-aware tee, log_event) and /logs returns `entries`.
- homepage: ServerLogsPanel renders timestamp + level badge + msg,
  color-coded by level.
- tests + api-reference/api-contracts docs updated for the new wire shape.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HU7JHMbvtUKRfZY98qeWdK
Senior-review follow-ups (no behavior change beyond the boot banners):

- Wire a real production caller for the Python structured logger: both Flask
  services emit a boot banner via log_event() at import, so the structured
  ingestion path is exercised in production, not just scaffolding.
- Rewrite the Python no-double-capture tests to run in the installed state
  (tee on sys.stdout), so they actually fail if log_event wrote through the
  tee instead of the saved real stream.
- Keep the dev admin key out of the admin-readable ring: server.ts writes the
  dev-key convenience lines via the writeStdout bypass (terminal only).
- contracts: correct the entries comment — in-memory in Phase 1; on-disk
  persistence is ADR-022/Phase 2.
- backend proxy: also validate `truncated` is a boolean before forwarding.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HU7JHMbvtUKRfZY98qeWdK
Re-review follow-ups (test/comment accuracy only, no production change):

- no-double-capture test: assert the ring holds exactly one entry after a
  single log_event in the installed (tee-on-stdout) state — the unambiguous
  discriminator. The prior raw-substring count could not catch a buggy write
  through the tee (the re-captured entry holds the formatted line, a different
  string); comment corrected to say so.
- backend proxy: add a case pinning the new `truncated` boolean guard branch
  (valid service + entries but non-boolean truncated → 502).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HU7JHMbvtUKRfZY98qeWdK
Every handled request now emits one structured access entry through the
Phase-1 logger, so the admin Server Logs panel fills with real traffic (on
success too), visible per service:

- backend: new accessLog.ts Express middleware mounted after cookieParser;
  on res 'finish' emits "method path status ms" via log.* (>=500 error,
  >=400 warn, else info).
- duckdb-service + image-service: symmetric @app.before_request /
  @app.after_request hooks emitting the same shape via log_event.

Security (load-bearing): logs req.path ONLY — never headers, body, or query
string — so the X-Admin-Key header, the admin login password, and any
?token=/?key= value cannot reach the admin-readable ring. Tests assert a
real request lands as an entry at the right level AND that no secret/query/
header substring appears.

Docs: api-reference notes the access entry + path-only redaction; ADR-021's
now-stale "dev key appears in service=backend" consequence corrected (Phase 1
moved it to the ring-bypassing writeStdout).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HU7JHMbvtUKRfZY98qeWdK
…178 phase 3)

Server logs now survive restart, bounded to 30 days / 100 MB (ADR-022,
supersedes ADR-021's in-memory consequence). Gated on LOG_DIR (unset =
in-memory only, as before).

- backend: logRing.ts gains initLogPersistence — appends each entry as JSONL
  via rotating-file-stream (daily + 50MB rotate; retain <=30 files AND
  <=100MB total) and backfills the ring from the active file tail at startup.
  Wired in server.ts before the boot banners.
- duckdb-service + image-service: byte-identical log_ring.py init_persistence
  using stdlib TimedRotatingFileHandler subclassed with a 100MB size-prune
  sweep; same JSONL + backfill. Distinct LOG_DIR per service (subdirs of the
  shared duckdb_data volume) so they never collide on one file.
- compose (dev + prod): LOG_DIR env + a backend_logs volume; Flask services
  persist under /data/logs/<svc>.
- ADR-022 written + indexed; ADR-021 marked superseded-in-part. Docs: auth.md
  (disk-persisted-secret caveat), docker-compose.md (log volumes), ch.11
  (30-day retention/PII + per-process SSE emitter trade-offs).

Tests: JSONL append, startup backfill (survives restart), no-op without
LOG_DIR, malformed-line skip — backend + both Flask. Gate green: backend 191,
duckdb 198, image 74, homepage 161; tsc/ruff/citations clean; compose validates.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HU7JHMbvtUKRfZY98qeWdK
The panel now streams new entries live after the REST backfill — "tail -f"
with follow-mode auto-scroll — across all three services.

- Ring pub/sub at the existing choke points: backend logRing.ts emits from
  pushEntryInternal to a Set of subscribers (subscribeEntries); Flask _push
  fans entries out to per-connection bounded queues (subscribe/unsubscribe),
  drop-on-full so the logger never blocks.
- Backend GET /api/admin/logs/stream (requireAdmin): streams its own ring for
  `backend`; for the Flask services it awaits the upstream fetch first (connect
  failure still 502s) then pipes their /logs/stream through. Sets
  X-Accel-Buffering: no + a 25s keepalive; cleans up on client close. New
  logStream.ts helper keeps the route thin.
- Flask /logs/stream (both services, X-Admin-Key gated): stream_with_context
  generator yielding one LogEntry per data: event + keepalives, unsubscribe in
  finally on disconnect.
- Frontend: api.streamServerLogs() -> EventSource(withCredentials); panel does
  REST backfill then opens the stream, appends live entries (capped), follow
  mode with "Jump to latest", and a live/connecting indicator. EventSource
  stub added to test-setup for jsdom.
- Docs: host-nginx proxy_buffering off block, api-reference + api-contracts SSE
  endpoint. No wire-shape change — the SSE event payload is one LogEntry.

Tests: ring subscribe/unsubscribe; backend SSE endpoint over a real ephemeral
server (backend ring) + mocked ReadableStream pipe + 502 on connect-fail; Flask
subscribe/route stream; component live-append + close-on-unmount/service-change.
Gate green: backend 197, homepage 164, duckdb 201, image 77; tsc/ruff/build/
check-citations clean. (Full live-stream E2E is the Phase 5 Playwright spec.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HU7JHMbvtUKRfZY98qeWdK
… phase 5)

Makes the live console usable at volume — all client-side, no endpoint or
contract changes.

- Panel: a search box (case-insensitive substring over msg) + three level
  checkboxes drive a derived visibleEntries; live-append, follow-mode, and
  "Jump to latest" keep working off the full buffer. "Download .log" builds a
  client-side <ts> <LEVEL> <msg> blob and downloads it — all loaded entries
  when unfiltered, the filtered view when a search/level filter is active
  (per the agreed scope); button disables when nothing would export. Adds a
  "No entries match the filter" state.
- Component tests: search narrows, level toggle hides, no-match disables
  export, and export blob content for both scopes (Blob-constructor spy since
  jsdom Blob has no .text()).
- Playwright (tests/ui/admin-server-logs.spec.ts): the end-to-end proof — wait
  for the live indicator, fire a uniquely-pathed backend request, assert its
  access-log entry STREAMS into the panel with no refresh, search-filter to it,
  and assert a real .log download with structured lines. UI stack is in-memory
  (no LOG_DIR) which is fine for a live assertion.
- Docs: api-reference notes the panel's live indicator + search/filter/export.

Gate: homepage 169 (0 errors), tsc + build clean, UI spec typechecks,
check-citations clean; backend/Flask unchanged from phase 4 (197/201/77).
Playwright runs under the existing `make test-ui` CI job.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HU7JHMbvtUKRfZY98qeWdK
…unds

P0 — backend process crash on every Flask-service live-tail disconnect:
the proxy piped Readable.fromWeb(upstream.body) with a bare .pipe(), so the
AbortError raised when the client disconnects (controller.abort()) was an
unhandled stream 'error'. Switched to stream.pipeline() with an error callback
that ignores the expected abort. Added a regression test that opens the proxied
stream, aborts mid-stream, and asserts the process is still serving.

P1 — rotation bounds were untested and asymmetric:
- Flask now also rolls the active file at 50 MB (_PruningTimedHandler.shouldRollover
  size check), matching the Node rfs size:'50M' trigger, so the 100 MB total
  bound holds continuously instead of only at the daily boundary.
- Behavioural tests (both Flask services): _prune_by_size evicts the oldest
  rotated file and never the active one; the active file rolls over once it
  passes the size cap. Node: a config test pins maxFiles:30 + maxSize:'100M' +
  size:'50M' (eviction itself is rfs's tested behaviour).

P2 + hardening:
- log_event now uses the entry returned by _push instead of reading _ring[-1]
  outside the lock (correct ts under concurrency).
- streamBackendRing guards res.write against a dead socket and tears down.
- Panel comments the SSE reconnect-gap behaviour.
- New test asserts the two services' log_ring.py stay byte-identical (it already
  caught a real drift while writing this commit); ADR-022 updated to match.

Gate: backend 200, homepage 169, duckdb 204, image 79; tsc/ruff/check-citations clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HU7JHMbvtUKRfZY98qeWdK
…iew P0)

The previous commit's size trigger fired but stdlib TimedRotatingFileHandler.
doRollover collides on the same-day dated filename (service.log.YYYY-MM-DD) and
refuses to rotate a second time the same day — so after the first size-roll the
active file grew unbounded (reproduced: 200-byte cap → 184 KB active file), and
_prune_by_size (which only runs inside doRollover) stopped enforcing the 100 MB
total too. This shipped into the shared duckdb_data volume.

Fully override doRollover: rename to a UNIQUE timestamped file
(service.log.YYYY-MM-DD_HH-MM-SS[.idx]) on every trigger, recompute the next
daily boundary, reopen, then prune by count (<=30) and size (<=100 MB). New
_prune_by_count helper. shouldRollover no longer delegates the time check to
the broken super().

The rollover test now asserts the real property: under 2000 sustained writes at
a 500-byte cap the ACTIVE file stays < 4x the cap (not the ~100 KB written),
multiple distinct rotated files exist, and the count bound holds. Verified the
test fails against the old broken doRollover (184 KB active). Both Flask
services stay byte-identical (symmetry guard green).

Gate: duckdb 204, image 79 (backend 200, homepage 169 unchanged); ruff clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HU7JHMbvtUKRfZY98qeWdK
… re-review P2)

- _prune_by_size: comment the _MAX_FILE_BYTES < _MAX_TOTAL_BYTES invariant that
  makes skipping the active file safe (the active file alone can't exceed the
  total cap, and a size-roll always moves its bytes into an evictable file).
- rollover test: assert the rotated-file count reaches exactly _MAX_BACKUP_FILES
  (the count bound the scenario actually exercises), not just ">= 2".

Both Flask services stay byte-identical. duckdb 204, image 79; ruff clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HU7JHMbvtUKRfZY98qeWdK
cofade and others added 6 commits June 19, 2026 23:39
Resolve the ADR-number collision: main shipped
adr-022-build-time-feature-flags (#177) while this branch added its own
ADR-022. Renumber this branch''s persistent-structured-server-logs ADR to
ADR-023 — rename the file, fix the ADR index + the ADR-021 supersede
pointer, and update all 25 persistent-logs references (code comments,
docker-compose, docs) from ADR-022 to ADR-023. Feature-flag ADR-022
references are left untouched. docker-compose.yml and chapter 11
auto-merged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The `<select>` for backend/duckdb-service/image-service inherited a
near-white text color, rendering the selected service almost invisible on
the white panel header. Pin it to `text-gray-800` to match the panel
heading.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Two branches both claimed ADR-022 (build-time-feature-flags and
persistent-structured-server-logs) and collided on merge. Record the
convention in the ADR index intro so the next contributor knows the
later-merged branch renumbers.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The ADR-022->023 renumber for the server-logs ADR accidentally flipped two
feature-flag references that share the same string (the renumber review
filter hid lines containing VITE_): docker-compose.yml''s dev feature-flag
comment and a chapter-11 link. Both point at build-time-feature-flags
(ADR-022); restore them.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…en log invariants

- Pin `threaded=True` explicitly on both Flask `app.run` calls (already the
  Flask default, but load-bearing): the SSE live tail holds a worker for the
  stream''s lifetime, so concurrent request handling must be guaranteed or an
  open admin tail would stall uploads/reads. Documented in ADR-023 Negative
  and chapter 11 (with the empirical /health-stays-fast check).
- Validate `level` on disk backfill (backend + both byte-identical Flask
  log_ring.py) so a malformed/partial line missing `level` is skipped rather
  than admitted and saved only by the renderer''s fallback — the contract
  requires `level`.
- Tighten the active-file size assertion from `< 4x cap` to `< cap + 256`
  (one entry) in both Flask test_logs.py, pinning what the comment claims.
- Fix contradictory phase labels: disk persistence is Phase 3, not Phase 2
  (access logging is Phase 2) — contracts/index.ts + backend/logRing.ts.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Per senior review: comment that the disk write and SSE fan-out are
deliberately outside _lock so the next contributor doesn''t fix the
sub-millisecond ordering by widening the lock over the blocking disk
write — which would serialize every logging call across request threads.
Applied identically to both byte-identical Flask log_ring.py files.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@cofade cofade marked this pull request as ready for review June 19, 2026 22:05
@cofade cofade merged commit b4f3648 into main Jun 19, 2026
9 checks passed
@schutera

Copy link
Copy Markdown
Owner

⚠️ Heads-up for contributors (cc @cofade) — a commit from this PR (8938899) introduced a Python 3.11-only API that broke production.

services/log_ring.py (both the duckdb-service and image-service copies) uses:

from datetime import UTC, datetime
...
datetime.now(UTC)

datetime.UTC was only added in Python 3.11. Our production host runs Python 3.10, and since log_ring.install() runs at import time in app.py, deploying current main (aac9c78) crashed both python services with an ImportError before they could serve traffic.

Fix is up in #191 — it swaps UTCtimezone.utc, which is identical behavior and works on 3.10 (stamp output verified byte-identical on 3.10.12). Prod is being held on the prior known-good revision until #191 merges.

Nothing for you to do here — just flagging so the 3.11-ism doesn't get reintroduced. For the 3.10 target, datetime.now(timezone.utc) is the safe form. A CI lane pinned to 3.10 would have caught this at PR time and is probably worth adding.

cofade added a commit that referenced this pull request Jun 28, 2026
…195)

* ci: matrix duckdb-unit + image-unit across Python 3.10-3.14 (#192)

Production runs Python 3.10 but CI only ran 3.11. That gap let a
3.11-only `from datetime import UTC` reach main in services/log_ring.py
and crash both Python services at import time on deploy (see #180);
the symptom was addressed separately by #191.

Run both pytest lanes across a ['3.10'..'3.14'] matrix (fail-fast:
false) so a version-specific API fails at PR time on the 3.10 floor
instead of on deploy.

Exact == pins can't span the range: no single numpy ships both a cp310
and a cp314 wheel, and onnxruntime 1.27.0 has no cp310 wheel. So float
only the blocking native deps to >= lower bounds and let pip resolve a
per-interpreter wheel:
- image-service: numpy>=2.0.0, onnxruntime>=1.23.2
- both services: pydantic>=2.12.5 (pinned pydantic-core lacks cp314)
duckdb 1.4.4 already spans cp310-cp314 and stays pinned. This moves a
fresh prod install onto numpy 2.x; rationale + trade-off in ADR-028.

Also carry the identical UTC -> timezone.utc change from #191 into both
log_ring.py copies, so the new 3.10 cells pass on this branch.

Verified: pip --dry-run resolves on 3.10-3.14; both suites pass locally
on numpy 2.4.6 (image-service 97, duckdb-service 232).

Docs: ADR-028 + index, ci-gates rows, chapter-11 lessons entry.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01WTpzfM7M4q7gXzuwsxxhzp

* test(ci): harden 3.10-3.14 matrix per senior review (#192)

Addresses senior-review findings on the version-matrix change:

- P1: image-service detection tests `importorskip` onnxruntime/cv2, so a
  broken floated-wheel resolution on a cell would SKIP them and pass the
  lane green — false confidence for exactly the deps the matrix de-risks.
  Add tests/test_native_runtime.py, a non-skipping guard asserting
  hole_detection._RUNTIME_AVAILABLE is True, so a busted install turns the
  cell red.
- P1: correct the "prod runs 3.10" premise. The repo specifies its runtime
  three ways (Dockerfiles build 3.12-slim, deploy docs say 3.11, the #180
  host crashed on 3.10). Reframe 3.10 as a conservative floor proven by the
  incident, not a flat fact, across the workflow comment, ci-gates, ADR-028,
  and the chapter-11 lesson; flag the inconsistency as the deeper risk.
- P2: pin image-service ruff target-version py312 -> py310 so the UP
  (pyupgrade) rule can't rewrite timezone.utc back to the 3.11-only UTC.
- P2: note the #191 overlap in ADR-028 consequences.

Verified: image-service 98 passed on numpy 2.4.6; ruff check clean at
py310 (no UTC rewrite suggested); check-citations clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01WTpzfM7M4q7gXzuwsxxhzp

* docs(ci): tighten ADR-028 precision per re-review (#192)

Doc-only precision fixes from the second senior-review pass; no behaviour
change:

- ruff comment no longer implies CI enforces the py310 floor (ruff is not
  run in any CI lane); reword to local `ruff --fix` / editor-on-save.
- Reframe the runtime drift accurately: the container path is consistently
  3.12-slim (dev + prod); the disagreement is deploy-docs prose (3.11) vs
  the bare-metal PM2 host that crashed (3.10) — not the Dockerfiles.
- onnxruntime cp314 ceiling 1.24 -> 1.24.1 (no 1.24.0 on PyPI).
- ADR-028: note duckdb-service is unexposed to the pyupgrade rewrite (no
  ruff config -> default rules exclude UP), and ground the numpy-2.x
  "low-risk" claim in the existing test_hole_detection.py inference tests.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01WTpzfM7M4q7gXzuwsxxhzp

---------

Co-authored-by: Claude <noreply@anthropic.com>
cofade added a commit that referenced this pull request Jun 29, 2026
…closes #197) (#198)

The repo named its Python runtime three disagreeing ways — container
python:3.12-slim, deploy-docs prose 3.11, bare-metal PM2 host 3.10 — the
documented root cause behind incident #180 (a 3.11-only datetime.UTC that
crashed both Python services on deploy because CI/containers ran a different
version than the host). The 3.10-3.14 CI matrix from #195 (ADR-029) mitigated
the symptom but left the runtime-naming drift open; this closes it.

Adopt 3.10 as the authoritative floor and make it enforced, not hand-kept:

- Add /.python-version (=3.10) as the single source of truth.
- Add scripts/check-python-version.sh: asserts both Dockerfile.dev FROM lines,
  the image-service ruff floor (py310), and each CI matrix's version-aware
  minimum all match the anchor. Wired into make check-python-version, the
  husky pre-push hook, and a new python-version-consistency CI job.
- Pin both Python service images python:3.12-slim -> python:3.10-slim, so the
  container path mirrors the live 3.10 PM2 host (verified: both images build
  and import the native stack on 3.10 — onnxruntime resolves its 3.10 wheel,
  1.23.2, exactly as ADR-029 predicts).
- Reconcile every prose surface to 3.10 (constraints, building-block view,
  docker-compose doc, CLAUDE.md, CONTRIBUTING, runbook, production-deployment,
  PR/issue templates) and correct the now-ten CI job count across the docs.
- Record the reconciliation as done in ADR-029 and the chapter-11 lesson; the
  historical "named three ways" narrative is left intact (the guard scopes to
  machine-readable surfaces, so a literal-version grep can't self-trip on it).

The matrix stays 3.10-3.14: 3.10 is the floor we ship, not a cap on what the
code must tolerate. Requirements pins are untouched — the floated bounds
already resolve real wheels on 3.10. See ADR-029 for the floor rationale.

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants