fix(session): recover orphan active sessions on daemon startup #18

Open
Chen17-sq wants to merge 2 commits into Einsia:main from Chen17-sq:fix/recover-orphan-active-sessions
Conversation

@Chen17-sq
Contributor

Summary — silent data loss across hard crashes

The graceful shutdown path calls SessionManager.force_end so any in-progress session is persisted as ended before exit. A SIGKILL / OOM kill / kernel panic / power loss skips that — the sessions row stays at status='active' with end_time=NULL, and the daily safety-net's reduce_all_pending query explicitly excludes active rows:

# session/store.py:list_pending_reduction
WHERE status IN ('ended', 'failed') AND end_time IS NOT NULL

The session is then never reduced, never classified, and silently absent from event-*.md — a full block of the user's day is lost without any error or log entry. The bug is easy to reproduce: kill -9 the daemon mid-session, restart, and watch the orphan row sit forever.
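The gap can be demonstrated in isolation. A minimal sketch, assuming a plain SQLite `sessions` table with the columns named above (the real schema lives in `session/store.py` and may differ):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sessions (id INTEGER PRIMARY KEY, start_time TEXT, end_time TEXT, status TEXT)"
)
# Row left behind by a SIGKILL mid-session: still 'active', no end_time.
conn.execute(
    "INSERT INTO sessions (start_time, end_time, status) "
    "VALUES ('2024-01-01T09:00:00', NULL, 'active')"
)
# The safety-net's catch-up filter, as quoted from list_pending_reduction:
pending = conn.execute(
    "SELECT id FROM sessions "
    "WHERE status IN ('ended', 'failed') AND end_time IS NOT NULL"
).fetchall()
print(pending)  # prints [] -- the orphan is invisible to the reducer forever
```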

Fix

session/recovery.recover_orphan_sessions(cfg) runs once at daemon startup, before build_manager (so the new SessionManager can't race the recovery). It lists every active row, infers a plausible end_time from the timeline blocks the previous run did persist, and force-ends each row so the existing catch-up machinery picks it up on the next pass.
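In outline, the recovery pass is a select-then-force-end loop. A self-contained sketch against an in-memory SQLite table — table, column, and helper names are assumptions for illustration, and end-time inference is reduced to the 1-minute fallback (the real code infers it from timeline blocks, as described below):

```python
import sqlite3
from datetime import datetime, timedelta

FALLBACK = timedelta(minutes=1)

def recover_orphan_sessions(conn: sqlite3.Connection) -> int:
    """Force-end every row left at status='active' by a hard crash.

    Returns the number of rows recovered so the caller can log it.
    """
    orphans = conn.execute(
        "SELECT id, start_time FROM sessions "
        "WHERE status = 'active' AND end_time IS NULL"
    ).fetchall()
    for session_id, start_time in orphans:
        start = datetime.fromisoformat(start_time)
        # Real code: infer end from persisted timeline blocks; sketch: fallback only.
        end = start + FALLBACK
        conn.execute(
            "UPDATE sessions SET status = 'ended', end_time = ? WHERE id = ?",
            (end.isoformat(), session_id),
        )
    conn.commit()
    return len(orphans)
```

Because it runs before `build_manager`, no live SessionManager can insert a new `active` row while the loop is in flight.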

End-time inference adds two SQL helpers:

  • timeline_store.latest_end_in_window(conn, start, end) — most recent block end whose start_time ∈ [start, end).
  • session_store.next_session_start_after(conn, start_time) — earliest start_time of any session that opened after this one. Used as the upper bound so two consecutive orphans (e.g. two crashes with one normal session in between) don't claim each other's blocks.
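Both helpers reduce to single aggregate queries. A hedged sketch — the `timeline_blocks`/`sessions` table and column names are assumptions standing in for the real stores:

```python
import sqlite3

def latest_end_in_window(conn, start, end):
    """Most recent block end among blocks whose start_time is in [start, end)."""
    row = conn.execute(
        "SELECT MAX(end_time) FROM timeline_blocks "
        "WHERE start_time >= ? AND start_time < ?",
        (start, end),
    ).fetchone()
    return row[0]  # None when no block falls inside the window

def next_session_start_after(conn, start_time):
    """Earliest start_time of any session that opened after this one.

    Deliberately status-agnostic: a healthy session sitting between two
    orphans still bounds the older orphan's window.
    """
    row = conn.execute(
        "SELECT MIN(start_time) FROM sessions WHERE start_time > ?",
        (start_time,),
    ).fetchone()
    return row[0]
```

With ISO-8601 text timestamps, SQLite's lexicographic `MAX`/`MIN` order coincides with chronological order, so no datetime parsing is needed inside the query.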

The inferred window is [start_time, min(next_session_start, start_time + max_session_hours)). If no block exists in that window (the crash beat the 1-min aggregator), end_time falls back to start_time + 1 minute and the reducer's empty-window branch marks the row reduced as a no-op — the orphan is gone either way.

Out of scope (intentionally)

  • Firing the reducer immediately on startup. Recovered sessions are reduced by the next daily safety-net tick (worst case ~24h latency). A faster-recovery follow-up can fold in the on_done + classifier wiring without changing the storage layer.
  • Age-based filtering. Even year-old orphans are recovered; if the blocks have been pruned, the empty-window branch cleans them up naturally.

Relationship to #11 / #12

Independent — different files (daemon.py, session/, timeline/store.py), no overlap. Mergeable in any order.

Test plan

  • uv run pytest — 76/76 (69 existing + 7 new)
  • uv run ruff check src/ tests/test_session_recovery.py — clean
  • Edge case manually traced: two consecutive hard-crashes with one good session in the middle — next_session_start_after query is status-agnostic, so the good session's start still bounds the older orphan's window. Working correctly.
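The manually traced edge case above can be replayed mechanically. A sketch with an illustrative schema — orphan A, one good session, orphan B — showing that the status-agnostic bound keeps A's window from reaching past the good session:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sessions (start_time TEXT, end_time TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO sessions VALUES (?, ?, ?)",
    [
        ("2024-01-01T08:00:00", None, "active"),                  # orphan A (crash 1)
        ("2024-01-01T09:00:00", "2024-01-01T09:30:00", "ended"),  # good session
        ("2024-01-01T10:00:00", None, "active"),                  # orphan B (crash 2)
    ],
)
# Status-agnostic upper bound for orphan A: the good session still counts.
bound = conn.execute(
    "SELECT MIN(start_time) FROM sessions WHERE start_time > ?",
    ("2024-01-01T08:00:00",),
).fetchone()[0]
print(bound)  # prints 2024-01-01T09:00:00 -- A can never claim blocks past 09:00
```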

New tests:

  • test_recover_no_orphans_is_noop — empty DB, 0 returned.
  • test_recover_orphan_with_blocks_uses_latest_block_end — happy path: orphan plus three blocks, ends at the latest block end.
  • test_recover_orphan_with_no_blocks_uses_one_minute_fallback — crash before the aggregator wrote a block.
  • test_recover_two_orphans_dont_pollute_each_other — the key edge case: A's window must not extend into B's blocks. Without next_session_start_after this would fail.
  • test_recover_caps_at_max_session_hours — a stray far-future block can't stretch the orphan past the 2h ceiling.
  • test_recover_is_idempotent — a second call sees no active rows.
  • test_recovered_session_visible_to_reduce_all_pending_query — integration check that the recovered row matches the safety-net's catch-up SQL.


@gemini-code-assist (Bot) left a comment


Code Review

This pull request implements a session recovery mechanism to handle 'orphan' active sessions resulting from hard crashes. It introduces a new recovery module that infers session end times from persisted timeline blocks or a fallback duration, ensuring these sessions can be processed by the daily safety-net. The changes include new database queries and comprehensive tests. A review comment suggests capping the fallback end time by the next session's start time to ensure sessions remain disjoint.

Comment thread: src/openchronicle/session/recovery.py (Outdated)
# cuts are aligned, but be defensive).
return min(block_end, upper_bound)

return start_time + _EMPTY_SESSION_FALLBACK
Severity: medium

The fallback end_time should be capped by upper_bound to ensure it does not overlap with the start of the next session. This maintains the invariant that sessions are disjoint, even in cases where a crash occurs very shortly before a subsequent session begins.

Suggested change:

-    return start_time + _EMPTY_SESSION_FALLBACK
+    return min(start_time + _EMPTY_SESSION_FALLBACK, upper_bound)

Address review feedback: ``_infer_end_time``'s fallback path for an
orphan with no persisted timeline blocks was returning
``start_time + 1 minute`` unconditionally. When two consecutive hard
crashes leave back-to-back orphans within a minute of each other,
that pushes the older orphan's end_time past the newer one's start,
and the eventual reducer pass over ``[start, end)`` would
double-attribute the newer session's first block.

Apply ``min(..., upper_bound)`` so the fallback respects the same
disjoint-sessions invariant the block-based path already enforced.
Adds ``test_recover_blockless_orphan_clamps_fallback_to_next_session_start``
to lock in the boundary.

77/77 pass; ruff clean.
@Chen17-sq
Contributor Author

Addressed in b27ad59 — applied the clamp exactly as suggested. The interesting half is the failure mode: it's not just an "overlap" concern, it's a double-attribution bug. With back-to-back hard crashes inside a minute, the older orphan's reducer pass over [start, start + 1min) would sweep up the newer orphan's first block and the newer orphan's reducer would also see it, so the same activity ends up cited under two sessions.

Added test_recover_blockless_orphan_clamps_fallback_to_next_session_start to lock in the boundary — A at 09:00:00, B at 09:00:30, no blocks anywhere, asserts A.end_time == B.start rather than A.start + 1min. Without the clamp it fails with a 30-second overshoot.

77/77 pass, ruff clean.
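Putting the thread together, the post-review inference logic can be sketched as pure datetime arithmetic with the DB lookups passed in as values. Names, signature, and the 2h default (matching the max_session_hours ceiling from the tests) are assumptions, not the actual implementation:

```python
from datetime import datetime, timedelta

EMPTY_SESSION_FALLBACK = timedelta(minutes=1)

def infer_end_time(start_time, latest_block_end, next_session_start,
                   max_session=timedelta(hours=2)):
    """Pick a plausible end_time for an orphan 'active' session."""
    # Upper bound: the next session's start, capped by the max session length.
    upper_bound = start_time + max_session
    if next_session_start is not None:
        upper_bound = min(upper_bound, next_session_start)
    if latest_block_end is not None:
        # Block cuts should already be aligned, but be defensive.
        return min(latest_block_end, upper_bound)
    # Blockless crash: 1-minute fallback, clamped so sessions stay disjoint.
    return min(start_time + EMPTY_SESSION_FALLBACK, upper_bound)
```

The final `min` is exactly the review fix: without it, a blockless orphan 30 seconds before the next session would overshoot into that session's window.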
