fix: retry compaction events stuck in pending on startup #1438
trigger_compaction_for_dropped advances the trim watermark synchronously, then runs the compaction LLM call as a fire-and-forget background task. When the process dies mid-call (deploy restart, OOM) or the call fails (provider outage), the CompactionEvent row stays 'pending' forever: the watermark is already advanced, so the seq range never reaches the LLM again and its facts are never extracted into MEMORY.md. The row records everything needed to recover (the seq range) and messages are never deleted, but nothing acted on it.

recover_pending_compactions sweeps stale pending rows on app startup, mirroring the inbound_recovery pattern: per-process pg_try_advisory_lock on a dedicated AsyncConnection, lookback window (new compaction_retry_lookback_minutes setting, default 7 days, 0 disables), best-effort semantics. Each row is claimed by incrementing the new retry_count column (migration 039) before the LLM call so a crash mid-retry still consumes the attempt; rows at the 3-attempt cap stay 'pending' for admin visibility but stop being selected, so a poisoned range cannot retry forever. A 10-minute grace floor keeps the sweep from racing a compaction call that is legitimately still in flight. Ranges whose messages were deleted are exhausted immediately.

Fixes #1431

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Covers the .env.example and configuration.md completeness checks in test_env_example.py for the setting added by the compaction retry sweep.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
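The claim-before-call step described in the commit message above can be illustrated with a minimal sketch. It is not the PR's code: it uses an in-memory SQLite table in place of PostgreSQL, and the table name `compaction_events`, the `'completed'` status value, and the helpers `claim` and `retry` are hypothetical names chosen for illustration. Only the `retry_count` column, the `'pending'` status, and the 3-attempt cap come from the PR.

```python
import sqlite3

MAX_ATTEMPTS = 3  # attempt cap stated in the commit message


def claim(conn: sqlite3.Connection, event_id: int) -> bool:
    """Consume one attempt; return False if the row is no longer eligible."""
    cur = conn.execute(
        "UPDATE compaction_events SET retry_count = retry_count + 1 "
        "WHERE id = ? AND status = 'pending' AND retry_count < ?",
        (event_id, MAX_ATTEMPTS),
    )
    conn.commit()  # persist the claim before the call that might crash
    return cur.rowcount == 1


def retry(conn: sqlite3.Connection, event_id: int, compact) -> str:
    """Claim the row, then run the (hypothetical) compaction callable."""
    if not claim(conn, event_id):
        return "skipped"  # capped or already completed
    try:
        compact(event_id)
    except Exception:
        return "failed"  # row stays 'pending'; the attempt is already spent
    conn.execute(
        "UPDATE compaction_events SET status = 'completed' WHERE id = ?",
        (event_id,),
    )
    conn.commit()
    return "completed"


def demo():
    """Retry a row four times against a call that always fails."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE compaction_events ("
        "id INTEGER PRIMARY KEY, status TEXT NOT NULL, "
        "retry_count INTEGER NOT NULL DEFAULT 0)"
    )
    conn.execute("INSERT INTO compaction_events (id, status) VALUES (1, 'pending')")
    conn.commit()

    def always_fails(event_id):
        raise RuntimeError("provider outage")

    outcomes = [retry(conn, 1, always_fails) for _ in range(4)]
    row = conn.execute(
        "SELECT status, retry_count FROM compaction_events WHERE id = 1"
    ).fetchone()
    return outcomes, row
```

Because the counter is committed before the call, a failure or crash during the call cannot return the attempt, so a range that always fails stops being selected once it reaches the cap while the row itself remains `'pending'`.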
Description
Fixes #1431.
`trigger_compaction_for_dropped` advances the trim watermark synchronously, then runs the compaction LLM call as a fire-and-forget background task. When the process dies mid-call (deploy restart, OOM) or the call fails (provider outage), the `CompactionEvent` row stays `'pending'` forever: the watermark is already advanced, so the seq range never reaches the LLM again, and its facts are never extracted into MEMORY.md. The row records everything needed to recover (the seq range) and message rows are never deleted, but nothing acted on it.

New `recover_pending_compactions` startup sweep, mirroring the `inbound_recovery` pattern:

- Per-process `pg_try_advisory_lock` on a dedicated `AsyncConnection` (same connection-pinning rationale documented in `inbound_recovery.py`).
- Lookback window set by `COMPACTION_RETRY_LOOKBACK_MINUTES` (default 7 days; unlike orphan inbounds, a stale compaction stays fully recoverable for as long as the rows exist, so the window is generous; `0` disables).
- Each row is claimed by incrementing the new `retry_count` column (migration 039, verified up/down/up) before the LLM call, so a crash mid-retry still consumes the attempt. Rows at the 3-attempt cap stay `'pending'` for admin visibility but stop being selected; a poisoned range cannot retry forever. Ranges whose messages were deleted are exhausted immediately.
- Messages are reloaded for the `[min_message_seq, max_message_seq]` range and rebuilt through the same `_stored_messages_to_agent_messages` conversion the live path uses, then `compact_session` runs with the original `event_id` so the same row is completed with full audit snapshots.
- Called from `lifespan` after the other startup recoveries, same best-effort try/except shape.

Type
Checklist
- Tests pass (`uv run pytest -v`) (2892 passed, 2 skipped)
- Lint passes (`ruff check backend/ && ruff format --check backend/`)

AI Usage
Overview
This PR adds a startup recovery sweep that retries compaction events left in "pending" when the asynchronous compaction LLM call fails or the process restarts mid-call, preventing trimmed message ranges from being lost.
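The selection rules this PR describes (lookback window, 10-minute grace floor, 3-attempt cap) can be sketched as a pure filter. This is an illustrative approximation, not the PR's query: the `PendingEvent` dataclass, its `created_at` field, and the `eligible` function are hypothetical, and treating negative lookback values the same as `0` is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

MAX_ATTEMPTS = 3               # attempt cap from the PR description
GRACE = timedelta(minutes=10)  # grace floor from the commit message


@dataclass(frozen=True)
class PendingEvent:
    """Hypothetical stand-in for a CompactionEvent row in 'pending' state."""
    id: int
    created_at: datetime
    retry_count: int


def eligible(events, now, lookback_minutes):
    """Return the pending events a startup sweep would pick up."""
    if lookback_minutes <= 0:  # 0 disables the sweep (negative: assumed same)
        return []
    oldest = now - timedelta(minutes=lookback_minutes)
    newest = now - GRACE  # anything newer may still be legitimately in flight
    return [
        e for e in events
        if oldest <= e.created_at <= newest and e.retry_count < MAX_ATTEMPTS
    ]


NOW = datetime(2025, 1, 15, 12, 0, tzinfo=timezone.utc)
SEVEN_DAYS = 7 * 24 * 60  # default lookback expressed in minutes (10080)

EVENTS = [
    PendingEvent(1, NOW - timedelta(minutes=5), 0),  # inside the grace floor
    PendingEvent(2, NOW - timedelta(hours=1), 0),    # stale and retryable
    PendingEvent(3, NOW - timedelta(hours=1), 3),    # at the attempt cap
    PendingEvent(4, NOW - timedelta(days=8), 0),     # outside the lookback
]

selected_ids = [e.id for e in eligible(EVENTS, NOW, SEVEN_DAYS)]
disabled_ids = [e.id for e in eligible(EVENTS, NOW, 0)]
```

Keeping the rules in one predicate makes each boundary (grace floor, lookback, cap) testable without a database; in the real sweep the same conditions would more plausibly live in the SQL `WHERE` clause.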
What was added / changed
High-level behavior and safeguards
Benefits
Technical notes (brief)