fix(session): recover orphan active sessions on daemon startup#18
Chen17-sq wants to merge 2 commits into Einsia:main
Conversation
The graceful shutdown path calls ``SessionManager.force_end`` so any
in-progress session is persisted as ``ended`` before exit. A SIGKILL
/ OOM kill / kernel panic / power loss skips that — the ``sessions``
row stays at ``status='active'`` with ``end_time=NULL``, and the
daily safety-net's ``reduce_all_pending`` query explicitly excludes
``active`` rows
(``session/store.list_pending_reduction`` filters
``status IN ('ended', 'failed') AND end_time IS NOT NULL``). The
session is then never reduced, never classified, and silently absent
from ``event-*.md``: a full block of the user's day is lost without
any error or log.
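To make the exclusion concrete, here is a minimal, hypothetical reproduction of the safety-net filter; the schema and table DDL are assumptions for illustration, not the project's actual definitions:

```python
import sqlite3

# Hypothetical minimal schema matching the description above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sessions "
    "(id INTEGER PRIMARY KEY, status TEXT, start_time REAL, end_time REAL)"
)
# A cleanly ended session and an orphan left behind by a hard crash.
conn.execute(
    "INSERT INTO sessions (status, start_time, end_time) VALUES ('ended', 100.0, 160.0)"
)
conn.execute(
    "INSERT INTO sessions (status, start_time, end_time) VALUES ('active', 200.0, NULL)"
)

def list_pending_reduction(conn):
    """The safety-net's catch-up filter: active rows never match."""
    return conn.execute(
        "SELECT id FROM sessions "
        "WHERE status IN ('ended', 'failed') AND end_time IS NOT NULL"
    ).fetchall()

# Only the cleanly ended row (id=1) is returned; the orphan (id=2)
# is invisible to every catch-up pass, forever.
print(list_pending_reduction(conn))
```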
This adds ``session/recovery.recover_orphan_sessions(cfg)``, called
once at daemon startup before ``build_manager`` (so the new
SessionManager can't race the recovery). It lists every ``active``
row, infers a plausible ``end_time`` from the timeline blocks the
previous run did persist, and force-ends each row so the existing
catch-up machinery picks it up on the next pass.
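A sketch of the startup hook's shape, with the inference step abstracted out (the function signature and SQL here are assumptions, not the module's actual API):

```python
import sqlite3

def recover_orphan_sessions(conn: sqlite3.Connection, infer_end_time) -> int:
    """Force-end every row a hard crash left at status='active'.

    ``infer_end_time(conn, start_time)`` stands in for the block-based
    inference described below; it returns the end_time to persist.
    Returns the number of rows recovered (0 on a clean start).
    """
    orphans = conn.execute(
        "SELECT id, start_time FROM sessions "
        "WHERE status = 'active' ORDER BY start_time"
    ).fetchall()
    for session_id, start_time in orphans:
        end_time = infer_end_time(conn, start_time)
        # Write the same terminal state force_end would have written,
        # so the safety-net's reduce_all_pending query now matches.
        conn.execute(
            "UPDATE sessions SET status = 'ended', end_time = ? WHERE id = ?",
            (end_time, session_id),
        )
    conn.commit()
    return len(orphans)
```

A second call naturally returns 0 — there are no ``active`` rows left to scan — which is the idempotence property the tests pin down.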
End-time inference uses two new SQL helpers:
* ``timeline_store.latest_end_in_window(conn, start, end)`` — most
recent block end whose ``start_time ∈ [start, end)``.
* ``session_store.next_session_start_after(conn, start_time)`` — the
earliest start_time of any session that opened after this one. Used
as the upper bound on the orphan's plausible window so two
consecutive orphans don't claim each other's blocks.
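A sketch of what the two helpers might look like over a SQLite store; the table and column names are guesses from the description:

```python
import sqlite3

def latest_end_in_window(conn: sqlite3.Connection, start: float, end: float):
    """Most recent block end among blocks whose start_time is in [start, end)."""
    row = conn.execute(
        "SELECT MAX(end_time) FROM timeline_blocks "
        "WHERE start_time >= ? AND start_time < ?",
        (start, end),
    ).fetchone()
    return row[0]  # None if no block starts in the window

def next_session_start_after(conn: sqlite3.Connection, start_time: float):
    """Earliest start_time of any session that opened after this one."""
    row = conn.execute(
        "SELECT MIN(start_time) FROM sessions WHERE start_time > ?",
        (start_time,),
    ).fetchone()
    return row[0]  # None if this is the latest session
```

Note that the second query has no ``status`` filter — it is status-agnostic, so a normally-ended session sitting between two orphans still bounds the older orphan's window.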
The window is ``[start_time, min(next_session_start, start_time +
max_session_hours))``. If no block exists in that window (the crash
beat the 1-min aggregator), end_time falls back to ``start_time + 1
minute`` so the reducer's empty-window code path runs and marks the
row ``reduced`` as a no-op.
Out of scope (intentionally):
* Firing the reducer immediately on startup. Recovered sessions are
reduced by the next daily safety-net tick (worst case ~24h
latency). A faster-recovery follow-up can fold in the on_done +
classifier wiring without changing the storage layer.
* Age-based filtering. Even year-old orphans are recovered; if the
blocks have been pruned, the empty-window branch cleans them up
naturally.
Tests (``threading.Barrier``-style precision is unnecessary here —
recovery is single-threaded, single-DB, deterministic):
* ``test_recover_no_orphans_is_noop`` — empty DB, 0 returned.
* ``test_recover_orphan_with_blocks_uses_latest_block_end`` — happy
path: orphan plus three blocks, ends at the latest block end.
* ``test_recover_orphan_with_no_blocks_uses_one_minute_fallback`` —
crash before aggregator wrote a block.
* ``test_recover_two_orphans_dont_pollute_each_other`` — the key
edge case: A's window must not extend into B's blocks. Without
``next_session_start_after`` this fails.
* ``test_recover_caps_at_max_session_hours`` — a stray far-future
block can't stretch the orphan past the 2h ceiling.
* ``test_recover_is_idempotent`` — second call sees no active rows.
* ``test_recovered_session_visible_to_reduce_all_pending_query`` —
integration: the recovered row matches the safety-net's catch-up
SQL.
76/76 pass; ruff clean. Independent of Einsia#11 / Einsia#12 (different files;
either order merges cleanly).
Code Review
This pull request implements a session recovery mechanism to handle 'orphan' active sessions resulting from hard crashes. It introduces a new recovery module that infers session end times from persisted timeline blocks or a fallback duration, ensuring these sessions can be processed by the daily safety-net. The changes include new database queries and comprehensive tests. A review comment suggests capping the fallback end time by the next session's start time to ensure sessions remain disjoint.
```python
            # cuts are aligned, but be defensive).
            return min(block_end, upper_bound)

    return start_time + _EMPTY_SESSION_FALLBACK
```
The fallback end_time should be capped by upper_bound to ensure it does not overlap with the start of the next session. This maintains the invariant that sessions are disjoint, even in cases where a crash occurs very shortly before a subsequent session begins.
```diff
-    return start_time + _EMPTY_SESSION_FALLBACK
+    return min(start_time + _EMPTY_SESSION_FALLBACK, upper_bound)
```
Address review feedback: ``_infer_end_time``'s fallback path for an orphan with no persisted timeline blocks was returning ``start_time + 1 minute`` unconditionally. When two consecutive hard crashes leave back-to-back orphans within a minute of each other, that pushes the older orphan's end_time past the newer one's start, and the eventual reducer pass over ``[start, end)`` would double-attribute the newer session's first block. Apply ``min(..., upper_bound)`` so the fallback respects the same disjoint-sessions invariant the block-based path already enforced. Adds ``test_recover_blockless_orphan_clamps_fallback_to_next_session_start`` to lock in the boundary. 77/77 pass; ruff clean.
Addressed in b27ad59 — applied the clamp exactly as suggested. The interesting half is the failure mode: it's not just an "overlap" concern, it's a double-attribution bug. With back-to-back hard crashes inside a minute, the older orphan's reducer pass over ``[start, end)`` would also sweep up the newer session's first block and attribute it twice. Added ``test_recover_blockless_orphan_clamps_fallback_to_next_session_start`` to lock in the boundary. 77/77 pass, ruff clean.
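A toy illustration of the double-attribution bug the clamp prevents (seconds-based times; names are stand-ins, not the module's):

```python
FALLBACK = 60  # stand-in for _EMPTY_SESSION_FALLBACK, in seconds

def fallback_end(start, next_start, clamp):
    """Blockless fallback end_time, with and without the review's clamp."""
    end = start + FALLBACK
    if clamp and next_start is not None:
        end = min(end, next_start)
    return end

# Two back-to-back hard crashes: orphan A opens at t=0, orphan B at
# t=30, and B's first persisted block also starts at t=30.
b_start = 30
a_end_unclamped = fallback_end(0, b_start, clamp=False)  # 60
a_end_clamped = fallback_end(0, b_start, clamp=True)     # 30

# Unclamped, A's reducer window [0, 60) swallows B's block at t=30 —
# the block is attributed to both sessions. Clamped, A ends exactly
# where B begins and the sessions stay disjoint.
assert a_end_unclamped > b_start
assert a_end_clamped == b_start
```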
Summary — silent data loss across hard crashes

The graceful shutdown path calls ``SessionManager.force_end`` so any
in-progress session is persisted as ``ended`` before exit. A SIGKILL /
OOM kill / kernel panic / power loss skips that — the ``sessions`` row
stays at ``status='active'`` with ``end_time=NULL``, and the daily
safety-net's ``reduce_all_pending`` query explicitly excludes
``active`` rows (``status IN ('ended', 'failed') AND end_time IS NOT
NULL``). The session is then never reduced, never classified, and
silently absent from ``event-*.md`` — a full block of the user's day
is lost without any error or log entry. The bug surfaces only when
reproduced: ``kill -9`` the daemon mid-session, restart, watch the
orphan row sit forever.

Fix

``session/recovery.recover_orphan_sessions(cfg)`` runs once at daemon
startup, before ``build_manager`` (so the new SessionManager can't
race the recovery). It lists every ``active`` row, infers a plausible
``end_time`` from the timeline blocks the previous run did persist,
and force-ends each row so the existing catch-up machinery picks it
up on the next pass.

End-time inference adds two SQL helpers:

* ``timeline_store.latest_end_in_window(conn, start, end)`` — most
  recent block end whose ``start_time ∈ [start, end)``.
* ``session_store.next_session_start_after(conn, start_time)`` —
  earliest ``start_time`` of any session that opened after this one.
  Used as the upper bound so two consecutive orphans (e.g. two
  crashes with one normal session in between) don't claim each
  other's blocks.

The inferred window is ``[start_time, min(next_session_start,
start_time + max_session_hours))``. If no block exists in that window
(the crash beat the 1-min aggregator), end_time falls back to
``start_time + 1 minute`` and the reducer's empty-window branch marks
the row ``reduced`` as a no-op — the orphan is gone either way.

Out of scope (intentionally)

* Firing the reducer immediately on startup. Recovered sessions are
  reduced by the next daily safety-net tick (worst case ~24h
  latency). A faster-recovery follow-up can fold in the ``on_done`` +
  classifier wiring without changing the storage layer.
* Age-based filtering. Even year-old orphans are recovered; if the
  blocks have been pruned, the empty-window branch cleans them up
  naturally.

Relationship to #11 / #12

Independent — different files (``daemon.py``, ``session/``,
``timeline/store.py``), no overlap. Mergeable in any order.

Test plan

* ``uv run pytest`` — 76/76 (69 existing + 7 new)
* ``uv run ruff check src/ tests/test_session_recovery.py`` — clean
* The ``next_session_start_after`` query is status-agnostic, so a
  good session's start still bounds an older orphan's window. Working
  correctly.

New tests:

* ``test_recover_no_orphans_is_noop``
* ``test_recover_orphan_with_blocks_uses_latest_block_end``
* ``test_recover_orphan_with_no_blocks_uses_one_minute_fallback``
* ``test_recover_two_orphans_dont_pollute_each_other`` ← the key edge
  case: A's window must not extend into B's blocks. Without
  ``next_session_start_after`` this would fail.
* ``test_recover_caps_at_max_session_hours``
* ``test_recover_is_idempotent``
* ``test_recovered_session_visible_to_reduce_all_pending_query`` —
  integration check that the recovered row matches the safety-net's
  catch-up SQL.