fix(session): recover orphan active sessions on daemon startup#18
Chen17-sq wants to merge 2 commits into Einsia:main
Conversation
The graceful shutdown path calls ``SessionManager.force_end`` so any
in-progress session is persisted as ``ended`` before exit. A SIGKILL
/ OOM kill / kernel panic / power loss skips that — the ``sessions``
row stays at ``status='active'`` with ``end_time=NULL``, and the
daily safety-net's ``reduce_all_pending`` query explicitly excludes
``active`` rows
(``session/store.list_pending_reduction`` filters
``status IN ('ended', 'failed') AND end_time IS NOT NULL``). The
session is then never reduced, never classified, and silently absent
from ``event-*.md``: a full block of the user's day is lost without
any error or log.
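To make the exclusion concrete, here is a minimal, hypothetical reproduction of the safety-net filter; the schema and table DDL are assumptions for illustration, not the project's actual definitions:

```python
import sqlite3

# Hypothetical minimal schema matching the description above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sessions "
    "(id INTEGER PRIMARY KEY, status TEXT, start_time REAL, end_time REAL)"
)
# A cleanly ended session and an orphan left behind by a hard crash.
conn.execute(
    "INSERT INTO sessions (status, start_time, end_time) VALUES ('ended', 100.0, 160.0)"
)
conn.execute(
    "INSERT INTO sessions (status, start_time, end_time) VALUES ('active', 200.0, NULL)"
)

def list_pending_reduction(conn):
    """The safety-net's catch-up filter: active rows never match."""
    return conn.execute(
        "SELECT id FROM sessions "
        "WHERE status IN ('ended', 'failed') AND end_time IS NOT NULL"
    ).fetchall()

# Only the cleanly ended row (id=1) is returned; the orphan (id=2)
# is invisible to every catch-up pass, forever.
print(list_pending_reduction(conn))
```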
This adds ``session/recovery.recover_orphan_sessions(cfg)``, called
once at daemon startup before ``build_manager`` (so the new
SessionManager can't race the recovery). It lists every ``active``
row, infers a plausible ``end_time`` from the timeline blocks the
previous run did persist, and force-ends each row so the existing
catch-up machinery picks it up on the next pass.
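A sketch of the startup hook's shape, with the inference step abstracted out (the function signature and SQL here are assumptions, not the module's actual API):

```python
import sqlite3

def recover_orphan_sessions(conn: sqlite3.Connection, infer_end_time) -> int:
    """Force-end every row a hard crash left at status='active'.

    ``infer_end_time(conn, start_time)`` stands in for the block-based
    inference described below; it returns the end_time to persist.
    Returns the number of rows recovered (0 on a clean start).
    """
    orphans = conn.execute(
        "SELECT id, start_time FROM sessions "
        "WHERE status = 'active' ORDER BY start_time"
    ).fetchall()
    for session_id, start_time in orphans:
        end_time = infer_end_time(conn, start_time)
        # Write the same terminal state force_end would have written,
        # so the safety-net's reduce_all_pending query now matches.
        conn.execute(
            "UPDATE sessions SET status = 'ended', end_time = ? WHERE id = ?",
            (end_time, session_id),
        )
    conn.commit()
    return len(orphans)
```

A second call naturally returns 0 — there are no ``active`` rows left to scan — which is the idempotence property the tests pin down.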
End-time inference uses two new SQL helpers:
* ``timeline_store.latest_end_in_window(conn, start, end)`` — most
recent block end whose ``start_time ∈ [start, end)``.
* ``session_store.next_session_start_after(conn, start_time)`` — the
earliest start_time of any session that opened after this one. Used
as the upper bound on the orphan's plausible window so two
consecutive orphans don't claim each other's blocks.
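A sketch of what the two helpers might look like over a SQLite store; the table and column names are guesses from the description:

```python
import sqlite3

def latest_end_in_window(conn: sqlite3.Connection, start: float, end: float):
    """Most recent block end among blocks whose start_time is in [start, end)."""
    row = conn.execute(
        "SELECT MAX(end_time) FROM timeline_blocks "
        "WHERE start_time >= ? AND start_time < ?",
        (start, end),
    ).fetchone()
    return row[0]  # None if no block starts in the window

def next_session_start_after(conn: sqlite3.Connection, start_time: float):
    """Earliest start_time of any session that opened after this one."""
    row = conn.execute(
        "SELECT MIN(start_time) FROM sessions WHERE start_time > ?",
        (start_time,),
    ).fetchone()
    return row[0]  # None if this is the latest session
```

Note that the second query has no ``status`` filter — it is status-agnostic, so a normally-ended session sitting between two orphans still bounds the older orphan's window.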
The window is ``[start_time, min(next_session_start, start_time +
max_session_hours))``. If no block exists in that window (the crash
beat the 1-min aggregator), end_time falls back to ``start_time + 1
minute`` so the reducer's empty-window code path runs and marks the
row ``reduced`` as a no-op.
Out of scope (intentionally):
* Firing the reducer immediately on startup. Recovered sessions are
reduced by the next daily safety-net tick (worst case ~24h
latency). A faster-recovery follow-up can fold in the on_done +
classifier wiring without changing the storage layer.
* Age-based filtering. Even year-old orphans are recovered; if the
blocks have been pruned, the empty-window branch cleans them up
naturally.
Tests (``threading.Barrier``-style precision is unnecessary here —
recovery is single-threaded, single-DB, deterministic):
* ``test_recover_no_orphans_is_noop`` — empty DB, 0 returned.
* ``test_recover_orphan_with_blocks_uses_latest_block_end`` — happy
path: orphan plus three blocks, ends at the latest block end.
* ``test_recover_orphan_with_no_blocks_uses_one_minute_fallback`` —
crash before aggregator wrote a block.
* ``test_recover_two_orphans_dont_pollute_each_other`` — the key
edge case: A's window must not extend into B's blocks. Without
``next_session_start_after`` this fails.
* ``test_recover_caps_at_max_session_hours`` — a stray far-future
block can't stretch the orphan past the 2h ceiling.
* ``test_recover_is_idempotent`` — second call sees no active rows.
* ``test_recovered_session_visible_to_reduce_all_pending_query`` —
integration: the recovered row matches the safety-net's catch-up
SQL.
76/76 pass; ruff clean. Independent of Einsia#11 / Einsia#12 (different files;
either order merges cleanly).
Code Review
This pull request implements a session recovery mechanism to handle 'orphan' active sessions resulting from hard crashes. It introduces a new recovery module that infers session end times from persisted timeline blocks or a fallback duration, ensuring these sessions can be processed by the daily safety-net. The changes include new database queries and comprehensive tests. A review comment suggests capping the fallback end time by the next session's start time to ensure sessions remain disjoint.
```python
            # cuts are aligned, but be defensive).
            return min(block_end, upper_bound)

    return start_time + _EMPTY_SESSION_FALLBACK
```
The fallback end_time should be capped by upper_bound to ensure it does not overlap with the start of the next session. This maintains the invariant that sessions are disjoint, even in cases where a crash occurs very shortly before a subsequent session begins.
```diff
-    return start_time + _EMPTY_SESSION_FALLBACK
+    return min(start_time + _EMPTY_SESSION_FALLBACK, upper_bound)
```
Address review feedback: ``_infer_end_time``'s fallback path for an orphan with no persisted timeline blocks was returning ``start_time + 1 minute`` unconditionally. When two consecutive hard crashes leave back-to-back orphans within a minute of each other, that pushes the older orphan's end_time past the newer one's start, and the eventual reducer pass over ``[start, end)`` would double-attribute the newer session's first block. Apply ``min(..., upper_bound)`` so the fallback respects the same disjoint-sessions invariant the block-based path already enforced. Adds ``test_recover_blockless_orphan_clamps_fallback_to_next_session_start`` to lock in the boundary. 77/77 pass; ruff clean.
Addressed in b27ad59 — applied the clamp exactly as suggested. The interesting half is the failure mode: it's not just an "overlap" concern, it's a double-attribution bug. With back-to-back hard crashes inside a minute, the older orphan's reducer pass over ``[start, end)`` would also sweep up the newer session's first block and attribute it twice. Added ``test_recover_blockless_orphan_clamps_fallback_to_next_session_start`` to lock in the boundary. 77/77 pass, ruff clean.
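A toy illustration of the double-attribution bug the clamp prevents (seconds-based times; names are stand-ins, not the module's):

```python
FALLBACK = 60  # stand-in for _EMPTY_SESSION_FALLBACK, in seconds

def fallback_end(start, next_start, clamp):
    """Blockless fallback end_time, with and without the review's clamp."""
    end = start + FALLBACK
    if clamp and next_start is not None:
        end = min(end, next_start)
    return end

# Two back-to-back hard crashes: orphan A opens at t=0, orphan B at
# t=30, and B's first persisted block also starts at t=30.
b_start = 30
a_end_unclamped = fallback_end(0, b_start, clamp=False)  # 60
a_end_clamped = fallback_end(0, b_start, clamp=True)     # 30

# Unclamped, A's reducer window [0, 60) swallows B's block at t=30 —
# the block is attributed to both sessions. Clamped, A ends exactly
# where B begins and the sessions stay disjoint.
assert a_end_unclamped > b_start
assert a_end_clamped == b_start
```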
Summary — silent data loss across hard crashes

The graceful shutdown path calls ``SessionManager.force_end`` so any
in-progress session is persisted as ``ended`` before exit. A SIGKILL /
OOM kill / kernel panic / power loss skips that — the ``sessions`` row
stays at ``status='active'`` with ``end_time=NULL``, and the daily
safety-net's ``reduce_all_pending`` query explicitly excludes
``active`` rows (``status IN ('ended', 'failed') AND end_time IS NOT
NULL``). The session is then never reduced, never classified, and
silently absent from ``event-*.md`` — a full block of the user's day
is lost without any error or log entry. The bug surfaces only when
reproduced: ``kill -9`` the daemon mid-session, restart, watch the
orphan row sit forever.

Fix

``session/recovery.recover_orphan_sessions(cfg)`` runs once at daemon
startup, before ``build_manager`` (so the new SessionManager can't
race the recovery). It lists every ``active`` row, infers a plausible
``end_time`` from the timeline blocks the previous run did persist,
and force-ends each row so the existing catch-up machinery picks it
up on the next pass.

End-time inference adds two SQL helpers:

* ``timeline_store.latest_end_in_window(conn, start, end)`` — most
  recent block end whose ``start_time ∈ [start, end)``.
* ``session_store.next_session_start_after(conn, start_time)`` —
  earliest ``start_time`` of any session that opened after this one.
  Used as the upper bound so two consecutive orphans (e.g. two
  crashes with one normal session in between) don't claim each
  other's blocks.

The inferred window is ``[start_time, min(next_session_start,
start_time + max_session_hours))``. If no block exists in that window
(the crash beat the 1-min aggregator), end_time falls back to
``start_time + 1 minute`` and the reducer's empty-window branch marks
the row ``reduced`` as a no-op — the orphan is gone either way.

Out of scope (intentionally)

* Firing the reducer immediately on startup. Recovered sessions are
  reduced by the next daily safety-net tick (worst case ~24h
  latency). A faster-recovery follow-up can fold in the ``on_done`` +
  classifier wiring without changing the storage layer.
* Age-based filtering. Even year-old orphans are recovered; if the
  blocks have been pruned, the empty-window branch cleans them up
  naturally.

Relationship to #11 / #12

Independent — different files (``daemon.py``, ``session/``,
``timeline/store.py``), no overlap. Mergeable in any order.

Test plan

* ``uv run pytest`` — 76/76 (69 existing + 7 new)
* ``uv run ruff check src/ tests/test_session_recovery.py`` — clean
* The ``next_session_start_after`` query is status-agnostic, so a
  good session's start still bounds an older orphan's window. Working
  correctly.

New tests:

* ``test_recover_no_orphans_is_noop``
* ``test_recover_orphan_with_blocks_uses_latest_block_end``
* ``test_recover_orphan_with_no_blocks_uses_one_minute_fallback``
* ``test_recover_two_orphans_dont_pollute_each_other`` ← the key edge
  case: A's window must not extend into B's blocks. Without
  ``next_session_start_after`` this would fail.
* ``test_recover_caps_at_max_session_hours``
* ``test_recover_is_idempotent``
* ``test_recovered_session_visible_to_reduce_all_pending_query`` —
  integration check that the recovered row matches the safety-net's
  catch-up SQL.