feat: live structured server-log console (closes #178)#180
Conversation
Reshape the admin Server Logs pipeline from raw `lines: string[]` to
structured `entries: LogEntry[]` ({ ts, level, msg }) end-to-end, with no
new data sources yet (Phase 1 of #178).
- contracts: add LogLevel + LogEntry; ServerLogsResponse now carries
`entries` instead of `lines` (wire type lives only in contracts/).
- backend: logRing holds LogEntry[] via two no-double-capture paths
(stdout/stderr tee fallback + pushEntry); new log.ts structured logger
writes through the saved original stream so the tee never re-captures it;
server.ts banners go through the logger; the misleading
`http://localhost:PORT` boot banner now states the port + all-interfaces.
- duckdb-service + image-service: symmetric log_ring upgrade (entry dicts,
level-aware tee, log_event) and /logs returns `entries`.
- homepage: ServerLogsPanel renders timestamp + level badge + msg,
color-coded by level.
- tests + api-reference/api-contracts docs updated for the new wire shape.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HU7JHMbvtUKRfZY98qeWdK
Senior-review follow-ups (no behavior change beyond the boot banners): - Wire a real production caller for the Python structured logger: both Flask services emit a boot banner via log_event() at import, so the structured ingestion path is exercised in production, not just scaffolding. - Rewrite the Python no-double-capture tests to run in the installed state (tee on sys.stdout), so they actually fail if log_event wrote through the tee instead of the saved real stream. - Keep the dev admin key out of the admin-readable ring: server.ts writes the dev-key convenience lines via the writeStdout bypass (terminal only). - contracts: correct the entries comment — in-memory in Phase 1; on-disk persistence is ADR-022/Phase 2. - backend proxy: also validate `truncated` is a boolean before forwarding. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01HU7JHMbvtUKRfZY98qeWdK
Re-review follow-ups (test/comment accuracy only, no production change): - no-double-capture test: assert the ring holds exactly one entry after a single log_event in the installed (tee-on-stdout) state — the unambiguous discriminator. The prior raw-substring count could not catch a buggy write through the tee (the re-captured entry holds the formatted line, a different string); comment corrected to say so. - backend proxy: add a case pinning the new `truncated` boolean guard branch (valid service + entries but non-boolean truncated → 502). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01HU7JHMbvtUKRfZY98qeWdK
Every handled request now emits one structured access entry through the Phase-1 logger, so the admin Server Logs panel fills with real traffic (on success too), visible per service: - backend: new accessLog.ts Express middleware mounted after cookieParser; on res 'finish' emits "method path status ms" via log.* (>=500 error, >=400 warn, else info). - duckdb-service + image-service: symmetric @app.before_request / @app.after_request hooks emitting the same shape via log_event. Security (load-bearing): logs req.path ONLY — never headers, body, or query string — so the X-Admin-Key header, the admin login password, and any ?token=/?key= value cannot reach the admin-readable ring. Tests assert a real request lands as an entry at the right level AND that no secret/query/ header substring appears. Docs: api-reference notes the access entry + path-only redaction; ADR-021's now-stale "dev key appears in service=backend" consequence corrected (Phase 1 moved it to the ring-bypassing writeStdout). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01HU7JHMbvtUKRfZY98qeWdK
…178 phase 3) Server logs now survive restart, bounded to 30 days / 100 MB (ADR-022, supersedes ADR-021's in-memory consequence). Gated on LOG_DIR (unset = in-memory only, as before). - backend: logRing.ts gains initLogPersistence — appends each entry as JSONL via rotating-file-stream (daily + 50MB rotate; retain <=30 files AND <=100MB total) and backfills the ring from the active file tail at startup. Wired in server.ts before the boot banners. - duckdb-service + image-service: byte-identical log_ring.py init_persistence using stdlib TimedRotatingFileHandler subclassed with a 100MB size-prune sweep; same JSONL + backfill. Distinct LOG_DIR per service (subdirs of the shared duckdb_data volume) so they never collide on one file. - compose (dev + prod): LOG_DIR env + a backend_logs volume; Flask services persist under /data/logs/<svc>. - ADR-022 written + indexed; ADR-021 marked superseded-in-part. Docs: auth.md (disk-persisted-secret caveat), docker-compose.md (log volumes), ch.11 (30-day retention/PII + per-process SSE emitter trade-offs). Tests: JSONL append, startup backfill (survives restart), no-op without LOG_DIR, malformed-line skip — backend + both Flask. Gate green: backend 191, duckdb 198, image 74, homepage 161; tsc/ruff/citations clean; compose validates. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01HU7JHMbvtUKRfZY98qeWdK
The panel now streams new entries live after the REST backfill — "tail -f" with follow-mode auto-scroll — across all three services. - Ring pub/sub at the existing choke points: backend logRing.ts emits from pushEntryInternal to a Set of subscribers (subscribeEntries); Flask _push fans entries out to per-connection bounded queues (subscribe/unsubscribe), drop-on-full so the logger never blocks. - Backend GET /api/admin/logs/stream (requireAdmin): streams its own ring for `backend`; for the Flask services it awaits the upstream fetch first (connect failure still 502s) then pipes their /logs/stream through. Sets X-Accel-Buffering: no + a 25s keepalive; cleans up on client close. New logStream.ts helper keeps the route thin. - Flask /logs/stream (both services, X-Admin-Key gated): stream_with_context generator yielding one LogEntry per data: event + keepalives, unsubscribe in finally on disconnect. - Frontend: api.streamServerLogs() -> EventSource(withCredentials); panel does REST backfill then opens the stream, appends live entries (capped), follow mode with "Jump to latest", and a live/connecting indicator. EventSource stub added to test-setup for jsdom. - Docs: host-nginx proxy_buffering off block, api-reference + api-contracts SSE endpoint. No wire-shape change — the SSE event payload is one LogEntry. Tests: ring subscribe/unsubscribe; backend SSE endpoint over a real ephemeral server (backend ring) + mocked ReadableStream pipe + 502 on connect-fail; Flask subscribe/route stream; component live-append + close-on-unmount/service-change. Gate green: backend 197, homepage 164, duckdb 201, image 77; tsc/ruff/build/ check-citations clean. (Full live-stream E2E is the Phase 5 Playwright spec.) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01HU7JHMbvtUKRfZY98qeWdK
… phase 5) Makes the live console usable at volume — all client-side, no endpoint or contract changes. - Panel: a search box (case-insensitive substring over msg) + three level checkboxes drive a derived visibleEntries; live-append, follow-mode, and "Jump to latest" keep working off the full buffer. "Download .log" builds a client-side <ts> <LEVEL> <msg> blob and downloads it — all loaded entries when unfiltered, the filtered view when a search/level filter is active (per the agreed scope); button disables when nothing would export. Adds a "No entries match the filter" state. - Component tests: search narrows, level toggle hides, no-match disables export, and export blob content for both scopes (Blob-constructor spy since jsdom Blob has no .text()). - Playwright (tests/ui/admin-server-logs.spec.ts): the end-to-end proof — wait for the live indicator, fire a uniquely-pathed backend request, assert its access-log entry STREAMS into the panel with no refresh, search-filter to it, and assert a real .log download with structured lines. UI stack is in-memory (no LOG_DIR) which is fine for a live assertion. - Docs: api-reference notes the panel's live indicator + search/filter/export. Gate: homepage 169 (0 errors), tsc + build clean, UI spec typechecks, check-citations clean; backend/Flask unchanged from phase 4 (197/201/77). Playwright runs under the existing `make test-ui` CI job. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01HU7JHMbvtUKRfZY98qeWdK
…unds P0 — backend process crash on every Flask-service live-tail disconnect: the proxy piped Readable.fromWeb(upstream.body) with a bare .pipe(), so the AbortError raised when the client disconnects (controller.abort()) was an unhandled stream 'error'. Switched to stream.pipeline() with an error callback that ignores the expected abort. Added a regression test that opens the proxied stream, aborts mid-stream, and asserts the process is still serving. P1 — rotation bounds were untested and asymmetric: - Flask now also rolls the active file at 50 MB (_PruningTimedHandler.shouldRollover size check), matching the Node rfs size:'50M' trigger, so the 100 MB total bound holds continuously instead of only at the daily boundary. - Behavioural tests (both Flask services): _prune_by_size evicts the oldest rotated file and never the active one; the active file rolls over once it passes the size cap. Node: a config test pins maxFiles:30 + maxSize:'100M' + size:'50M' (eviction itself is rfs's tested behaviour). P2 + hardening: - log_event now uses the entry returned by _push instead of reading _ring[-1] outside the lock (correct ts under concurrency). - streamBackendRing guards res.write against a dead socket and tears down. - Panel comments the SSE reconnect-gap behaviour. - New test asserts the two services' log_ring.py stay byte-identical (it already caught a real drift while writing this commit); ADR-022 updated to match. Gate: backend 200, homepage 169, duckdb 204, image 79; tsc/ruff/check-citations clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01HU7JHMbvtUKRfZY98qeWdK
…iew P0) The previous commit's size trigger fired but stdlib TimedRotatingFileHandler. doRollover collides on the same-day dated filename (service.log.YYYY-MM-DD) and refuses to rotate a second time the same day — so after the first size-roll the active file grew unbounded (reproduced: 200-byte cap → 184 KB active file), and _prune_by_size (which only runs inside doRollover) stopped enforcing the 100 MB total too. This shipped into the shared duckdb_data volume. Fully override doRollover: rename to a UNIQUE timestamped file (service.log.YYYY-MM-DD_HH-MM-SS[.idx]) on every trigger, recompute the next daily boundary, reopen, then prune by count (<=30) and size (<=100 MB). New _prune_by_count helper. shouldRollover no longer delegates the time check to the broken super(). The rollover test now asserts the real property: under 2000 sustained writes at a 500-byte cap the ACTIVE file stays < 4x the cap (not the ~100 KB written), multiple distinct rotated files exist, and the count bound holds. Verified the test fails against the old broken doRollover (184 KB active). Both Flask services stay byte-identical (symmetry guard green). Gate: duckdb 204, image 79 (backend 200, homepage 169 unchanged); ruff clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01HU7JHMbvtUKRfZY98qeWdK
… re-review P2) - _prune_by_size: comment the _MAX_FILE_BYTES < _MAX_TOTAL_BYTES invariant that makes skipping the active file safe (the active file alone can't exceed the total cap, and a size-roll always moves its bytes into an evictable file). - rollover test: assert the rotated-file count reaches exactly _MAX_BACKUP_FILES (the count bound the scenario actually exercises), not just ">= 2". Both Flask services stay byte-identical. duckdb 204, image 79; ruff clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01HU7JHMbvtUKRfZY98qeWdK
Resolve the ADR-number collision: main shipped adr-022-build-time-feature-flags (#177) while this branch added its own ADR-022. Renumber this branch''s persistent-structured-server-logs ADR to ADR-023 — rename the file, fix the ADR index + the ADR-021 supersede pointer, and update all 25 persistent-logs references (code comments, docker-compose, docs) from ADR-022 to ADR-023. Feature-flag ADR-022 references are left untouched. docker-compose.yml and chapter 11 auto-merged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The `<select>` for backend/duckdb-service/image-service inherited a near-white text color, rendering the selected service almost invisible on the white panel header. Pin it to `text-gray-800` to match the panel heading. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Two branches both claimed ADR-022 (build-time-feature-flags and persistent-structured-server-logs) and collided on merge. Record the convention in the ADR index intro so the next contributor knows the later-merged branch renumbers. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The ADR-022->023 renumber for the server-logs ADR accidentally flipped two feature-flag references that share the same string (the renumber review filter hid lines containing VITE_): docker-compose.yml''s dev feature-flag comment and a chapter-11 link. Both point at build-time-feature-flags (ADR-022); restore them. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…en log invariants - Pin `threaded=True` explicitly on both Flask `app.run` calls (already the Flask default, but load-bearing): the SSE live tail holds a worker for the stream''s lifetime, so concurrent request handling must be guaranteed or an open admin tail would stall uploads/reads. Documented in ADR-023 Negative and chapter 11 (with the empirical /health-stays-fast check). - Validate `level` on disk backfill (backend + both byte-identical Flask log_ring.py) so a malformed/partial line missing `level` is skipped rather than admitted and saved only by the renderer''s fallback — the contract requires `level`. - Tighten the active-file size assertion from `< 4x cap` to `< cap + 256` (one entry) in both Flask test_logs.py, pinning what the comment claims. - Fix contradictory phase labels: disk persistence is Phase 3, not Phase 2 (access logging is Phase 2) — contracts/index.ts + backend/logRing.ts. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Per senior review: comment that the disk write and SSE fan-out are deliberately outside _lock so the next contributor doesn''t fix the sub-millisecond ordering by widening the lock over the blocking disk write — which would serialize every logging call across request threads. Applied identically to both byte-identical Flask log_ring.py files. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
from datetime import UTC, datetime
...
datetime.now(UTC)
Fix is up in #191 — it swaps Nothing for you to do here — just flagging so the 3.11-ism doesn't get reintroduced. For the 3.10 target, |
…195) * ci: matrix duckdb-unit + image-unit across Python 3.10-3.14 (#192) Production runs Python 3.10 but CI only ran 3.11. That gap let a 3.11-only `from datetime import UTC` reach main in services/log_ring.py and crash both Python services at import time on deploy (see #180); the symptom was addressed separately by #191. Run both pytest lanes across a ['3.10'..'3.14'] matrix (fail-fast: false) so a version-specific API fails at PR time on the 3.10 floor instead of on deploy. Exact == pins can't span the range: no single numpy ships both a cp310 and a cp314 wheel, and onnxruntime 1.27.0 has no cp310 wheel. So float only the blocking native deps to >= lower bounds and let pip resolve a per-interpreter wheel: - image-service: numpy>=2.0.0, onnxruntime>=1.23.2 - both services: pydantic>=2.12.5 (pinned pydantic-core lacks cp314) duckdb 1.4.4 already spans cp310-cp314 and stays pinned. This moves a fresh prod install onto numpy 2.x; rationale + trade-off in ADR-028. Also carry the identical UTC -> timezone.utc change from #191 into both log_ring.py copies, so the new 3.10 cells pass on this branch. Verified: pip --dry-run resolves on 3.10-3.14; both suites pass locally on numpy 2.4.6 (image-service 97, duckdb-service 232). Docs: ADR-028 + index, ci-gates rows, chapter-11 lessons entry. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01WTpzfM7M4q7gXzuwsxxhzp * test(ci): harden 3.10-3.14 matrix per senior review (#192) Addresses senior-review findings on the version-matrix change: - P1: image-service detection tests `importorskip` onnxruntime/cv2, so a broken floated-wheel resolution on a cell would SKIP them and pass the lane green — false confidence for exactly the deps the matrix de-risks. Add tests/test_native_runtime.py, a non-skipping guard asserting hole_detection._RUNTIME_AVAILABLE is True, so a busted install turns the cell red. - P1: correct the "prod runs 3.10" premise. The repo specifies its runtime three ways (Dockerfiles build 3.12-slim, deploy docs say 3.11, the #180 host crashed on 3.10). Reframe 3.10 as a conservative floor proven by the incident, not a flat fact, across the workflow comment, ci-gates, ADR-028, and the chapter-11 lesson; flag the inconsistency as the deeper risk. - P2: pin image-service ruff target-version py312 -> py310 so the UP (pyupgrade) rule can't rewrite timezone.utc back to the 3.11-only UTC. - P2: note the #191 overlap in ADR-028 consequences. Verified: image-service 98 passed on numpy 2.4.6; ruff check clean at py310 (no UTC rewrite suggested); check-citations clean. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01WTpzfM7M4q7gXzuwsxxhzp * docs(ci): tighten ADR-028 precision per re-review (#192) Doc-only precision fixes from the second senior-review pass; no behaviour change: - ruff comment no longer implies CI enforces the py310 floor (ruff is not run in any CI lane); reword to local `ruff --fix` / editor-on-save. - Reframe the runtime drift accurately: the container path is consistently 3.12-slim (dev + prod); the disagreement is deploy-docs prose (3.11) vs the bare-metal PM2 host that crashed (3.10) — not the Dockerfiles. - onnxruntime cp314 ceiling 1.24 -> 1.24.1 (no 1.24.0 on PyPI). - ADR-028: note duckdb-service is unexposed to the pyupgrade rewrite (no ruff config -> default rules exclude UP), and ground the numpy-2.x "low-risk" claim in the existing test_hole_detection.py inference tests. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01WTpzfM7M4q7gXzuwsxxhzp --------- Co-authored-by: Claude <noreply@anthropic.com>
…closes #197) (#198) The repo named its Python runtime three disagreeing ways — container python:3.12-slim, deploy-docs prose 3.11, bare-metal PM2 host 3.10 — the documented root cause behind incident #180 (a 3.11-only datetime.UTC that crashed both Python services on deploy because CI/containers ran a different version than the host). The 3.10-3.14 CI matrix from #195 (ADR-029) mitigated the symptom but left the runtime-naming drift open; this closes it. Adopt 3.10 as the authoritative floor and make it enforced, not hand-kept: - Add /.python-version (=3.10) as the single source of truth. - Add scripts/check-python-version.sh: asserts both Dockerfile.dev FROM lines, the image-service ruff floor (py310), and each CI matrix's version-aware minimum all match the anchor. Wired into make check-python-version, the husky pre-push hook, and a new python-version-consistency CI job. - Pin both Python service images python:3.12-slim -> python:3.10-slim, so the container path mirrors the live 3.10 PM2 host (verified: both images build and import the native stack on 3.10 — onnxruntime resolves its 3.10 wheel, 1.23.2, exactly as ADR-029 predicts). - Reconcile every prose surface to 3.10 (constraints, building-block view, docker-compose doc, CLAUDE.md, CONTRIBUTING, runbook, production-deployment, PR/issue templates) and correct the now-ten CI job count across the docs. - Record the reconciliation as done in ADR-029 and the chapter-11 lesson; the historical "named three ways" narrative is left intact (the guard scopes to machine-readable surfaces, so a literal-version grep can't self-trip on it). The matrix stays 3.10-3.14: 3.10 is the floor we ship, not a cap on what the code must tolerate. Requirements pins are untouched — the floated bounds already resolve real wheels on 3.10. See ADR-029 for the floor rationale. Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Summary
Turns the admin Server Logs panel (
/admin → Server Logs) from a ~2-line static boot-banner dump into a genuine live operations console. The panel UI was already fine — the real gap was the data source: the services barely wrote to stdout, the ring was in-memory only, and lines were unstructured. Per the locked decisions in #178: structured timestamped entries, per-request access logging, disk persistence that survives restart (30 days / 100 MB), SSE live tail, and client-side search/filter/export — across all three services.Delivered as 5 reviewable commits (one per phase) plus review-fix commits, all in this single PR. Architecture + rationale: ADR-022.
What changed
Contract (defined once, in
contracts/):ServerLogsResponse.lines: string[]→entries: LogEntry[]({ ts, level, msg }), plusLogLevel. The sameLogEntryis the SSE event payload — no separate type.Backend + both Flask services (symmetric):
method path status ms, level by status) — path-only, never headers/body/query, so the admin key, the login password, and?token=values can't reach the admin-readable (and now disk-persisted) ring.LOG_DIR: JSONL, rotate daily-or-50 MB, retain ≤30 files and ≤100 MB total, with startup backfill so history survives a restart. Node usesrotating-file-stream; Flask uses a customTimedRotatingFileHandlersubclass (overriddendoRolloverso size-rolls actually bound the active file).GET /api/admin/logs/stream— backend streams its own ring or pipes the Flask services' internal/logs/stream.Frontend (
ServerLogsPanel.tsx): REST backfill →EventSourcelive tail with follow-mode + live indicator; rows color-coded by level; a search box, level checkboxes, and a client-sideDownload .logexport.Docs: ADR-022 (supersedes ADR-021's in-memory consequence), api-reference + api-contracts, auth.md (disk-persisted-secret caveat), chapter 11 (retention/PII + per-process SSE), compose volumes + host-nginx SSE no-buffer.
Side-fix
The misleading
http://localhost:3001boot banner now readslistening on port N (all interfaces)— it never was localhost-only (issue comment 2).Testing
tsc/ruff/build/make check-citationsclean.log_ring.pyfiles stay byte-identical (it caught a real drift mid-development).make test-ui) was not run in the dev sandbox — relies on the CItest-uijob.Open considerations
print/console.logsecrets" rule — access logs are path-only and the dev-key banner bypasses the ring viawriteStdout; called out in ADR-022 + auth.md.🤖 Generated with Claude Code
https://claude.ai/code/session_01HU7JHMbvtUKRfZY98qeWdK
Generated by Claude Code