fix(duckdb): stamp silence-watcher clock in UTC so ages survive TZ #199

Open
sronix wants to merge 1 commit into schutera:main from sronix:fix/silence-watcher-utc

sronix commented Jul 2, 2026

What changed

  • duckdb-service/services/silence_watcher.py: check_silence now takes its clock from datetime.now(timezone.utc).replace(tzinfo=None) instead of naive-local datetime.now(), mirroring the received_at writer in routes/heartbeats.py.
  • Adds the first tests for the watcher (tests/test_silence_watcher.py). The regression test flips the process TZ to a fixed UTC+2 zone (Etc/GMT-2 + time.tzset(), POSIX-guarded) and pins that a module seen 1.5 h ago stays quiet, a 4 h-silent module still alerts, and last_silence_alert_at lands UTC-naive; recovery-fires-once and re-alert-suppression baselines are included as well. conftest.py now purges services.silence_watcher alongside the other service modules so each test binds the fresh DB.
  • Chapter 11: adds the lesson entry that record_image's comment promised ("chapter-11 entry to follow"); that comment now points at it. ADR-005's four silence_watcher.py:NN citations are converted to symbol style because this diff shifted the line numbers.
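
For reviewers who want the clock change in isolation, here is a minimal sketch of the pattern. The expression `datetime.now(timezone.utc).replace(tzinfo=None)` is taken from this PR; the helper names `utc_naive_now` and `is_silent`, the `SILENCE_THRESHOLD` constant, and the strict `>` comparison are assumptions made for illustration and are not the actual contents of silence_watcher.py.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumed from the "3 h threshold" mentioned in this PR; the real constant may differ.
SILENCE_THRESHOLD = timedelta(hours=3)


def utc_naive_now() -> datetime:
    """Return the current UTC time with tzinfo stripped.

    The result is directly comparable with timestamps that were stored
    UTC-naive, whatever TZ the process happens to run under.
    """
    return datetime.now(timezone.utc).replace(tzinfo=None)


def is_silent(last_seen_at: datetime, now: Optional[datetime] = None) -> bool:
    """Hypothetical check: has the module been quiet longer than the threshold?"""
    if now is None:
        now = utc_naive_now()
    return now - last_seen_at > SILENCE_THRESHOLD
```

Passing `now` explicitly keeps the check deterministic in tests, while the default path always reads the UTC-naive clock.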

Why

Every timestamp the watcher compares against is stamped UTC-naive by its writers, but the watcher's clock was container-local, which agrees only while the container TZ is UTC (the python-slim default). A TZ=Europe/Berlin override would inflate every liveness age by 1-2 h against the 3 h threshold and fire false module-down Discord alerts on each 15-minute tick; a TZ behind UTC would suppress real outages. This is the reader-side twin of the chapter-11 image_uploads.uploaded_at incident. Acknowledged follow-up, deliberately not in this diff: add_module's last_seen_at = NOW() and the schema's DEFAULT CURRENT_TIMESTAMP columns remain container-local timestamp sources (tracked in the new chapter-11 entry).
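
The arithmetic behind both failure modes can be worked through with fixed timestamps. This is an illustrative sketch only: `apparent_age`, `would_alert`, and the sample date are hypothetical, and the 3 h threshold with a strict comparison is assumed from the description above.

```python
from datetime import datetime, timedelta

THRESHOLD = timedelta(hours=3)  # assumed from the "3 h threshold" above


def apparent_age(true_age: timedelta, clock_offset: timedelta) -> timedelta:
    """Age the watcher computes when its clock runs `clock_offset` ahead of UTC."""
    last_seen_utc = datetime(2026, 7, 2, 12, 0, 0)  # arbitrary UTC-naive stamp
    true_now_utc = last_seen_utc + true_age
    watcher_now = true_now_utc + clock_offset  # what a naive-local clock reads
    return watcher_now - last_seen_utc


def would_alert(true_age: timedelta, clock_offset: timedelta) -> bool:
    return apparent_age(true_age, clock_offset) > THRESHOLD


# Healthy module seen 1.5 h ago, clock 2 h ahead of UTC: 3.5 h apparent age.
false_alarm = would_alert(timedelta(hours=1.5), timedelta(hours=2))

# Module silent for 4 h, clock 5 h behind UTC: apparent age is -1 h.
missed_outage = not would_alert(timedelta(hours=4), timedelta(hours=-5))

# With a UTC clock (offset 0) both cases resolve correctly.
healthy_ok = not would_alert(timedelta(hours=1.5), timedelta(0))
outage_ok = would_alert(timedelta(hours=4), timedelta(0))
```

In this sketch `false_alarm` and `missed_outage` both evaluate to `True`, which corresponds to the two symptoms described above: a spurious alert ahead of UTC and a suppressed one behind it.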

How tested

  • ESP32-CAM native (pio test -e native)
  • End-to-end (pytest tests/e2e)
  • Backend unit (Node 22 + TS)
  • image-service unit (pytest, Python 3.10-3.14 matrix)
  • duckdb-service unit (pytest, Python 3.10-3.14 matrix)
  • Homepage unit (React 19 + Vite)
  • Manual / hardware-in-the-loop verification

237 passed locally (232 on main + new) on Python 3.14. The regression test fails against the unfixed watcher: false "down" alert for a module seen 90 min ago under UTC+2, and last_silence_alert_at lands 7200 s off UTC. scripts/check-doc-citations.sh: 3 OK, 0 problems. ruff check + format clean. Other suites untouched by this change.
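
The TZ flip the regression test relies on can be sketched roughly as follows. This is not the code in tests/test_silence_watcher.py: `process_tz` and `local_minus_utc_seconds` are hypothetical helpers, and only the use of the `TZ` environment variable with `time.tzset()` on POSIX is taken from the description.

```python
import os
import time
from contextlib import contextmanager
from datetime import datetime, timezone


@contextmanager
def process_tz(name: str):
    """Temporarily switch the process time zone (POSIX only: needs time.tzset)."""
    if not hasattr(time, "tzset"):
        raise RuntimeError("time.tzset() is unavailable on this platform")
    previous = os.environ.get("TZ")
    os.environ["TZ"] = name
    time.tzset()
    try:
        yield
    finally:
        # Restore the original setting so other tests are unaffected.
        if previous is None:
            os.environ.pop("TZ", None)
        else:
            os.environ["TZ"] = previous
        time.tzset()


def local_minus_utc_seconds() -> int:
    """Offset of the naive-local clock from the UTC-naive clock, in whole seconds."""
    local = datetime.now()
    utc = datetime.now(timezone.utc).replace(tzinfo=None)
    return round((local - utc).total_seconds())
```

Inside `with process_tz("Etc/GMT-2"):` (which denotes UTC+2, since the POSIX sign convention is inverted) the naive-local clock would be expected to run 7200 s ahead of the UTC-naive one on a system with tz data installed, matching the skew reported for last_silence_alert_at against the unfixed watcher.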

Checklist

  • Tests added or updated to cover the change
  • Documentation updated where applicable (chapter 11, ADR-005 citations)
  • No secrets, credentials, or large binaries committed
  • CI is green on this branch
  • Breaking changes called out in the description (none)

Every timestamp the watcher compares against is stamped UTC-naive by its
writers, but check_silence took its clock from naive-local datetime.now(),
which only agrees while the container TZ is UTC (the python-slim default).
Under TZ=Europe/Berlin every liveness age would inflate by 1-2 h against
the 3 h threshold and fire false module-down Discord alerts on each
15-minute tick, while a TZ behind UTC would suppress real ones; the
re-alert spacing and the last_silence_alert_at write-back skew the same
way. The new regression test flips the process TZ to a fixed UTC+2 zone
via time.tzset() and pins that a module seen 1.5 h ago stays quiet. The
NOW() and DEFAULT CURRENT_TIMESTAMP writers on module_configs carry the
residual naive-local risk and stay as a follow-up (see the chapter-11
entry this change adds).