Skip to content

fix: retry compaction events stuck in pending on startup#1438

Merged
njbrake merged 3 commits into
mainfrom
fix/retry-pending-compaction
Jun 10, 2026
Merged

fix: retry compaction events stuck in pending on startup#1438
njbrake merged 3 commits into
mainfrom
fix/retry-pending-compaction

Conversation

@njbrake

@njbrake njbrake commented Jun 10, 2026

Copy link
Copy Markdown
Member

Note: this PR was drafted by Claude via back-and-forth with @njbrake. The reasoning and decisions are his; the prose is Claude's.

Description

Fixes #1431.

trigger_compaction_for_dropped advances the trim watermark synchronously, then runs the compaction LLM call as a fire-and-forget background task. When the process dies mid-call (deploy restart, OOM) or the call fails (provider outage), the CompactionEvent row stays 'pending' forever: the watermark is already advanced, so the seq range never reaches the LLM again, and its facts are never extracted into MEMORY.md. The row records everything needed to recover (the seq range) and message rows are never deleted, but nothing acted on it.

New recover_pending_compactions startup sweep, mirroring the inbound_recovery pattern:

  • One worker per rolling restart: pg_try_advisory_lock on a dedicated AsyncConnection (same connection-pinning rationale documented in inbound_recovery.py).
  • Bounded selection: rows older than a 10-minute grace floor (never races a call legitimately still in flight), younger than the new COMPACTION_RETRY_LOOKBACK_MINUTES (default 7 days; unlike orphan inbounds, a stale compaction stays fully recoverable for as long as the rows exist, so the window is generous; 0 disables).
  • Bounded retries: each attempt increments the new retry_count column (migration 039, verified up/down/up) before the LLM call, so a crash mid-retry still consumes the attempt. Rows at the 3-attempt cap stay 'pending' for admin visibility but stop being selected; a poisoned range cannot retry forever. Ranges whose messages were deleted are exhausted immediately.
  • Replay fidelity: messages are reloaded by the recorded [min_message_seq, max_message_seq] range and rebuilt through the same _stored_messages_to_agent_messages conversion the live path uses, then compact_session runs with the original event_id so the same row is completed with full audit snapshots.
  • Hooked into lifespan after the other startup recoveries, same best-effort try/except shape.

Type

  • Feature
  • Bug fix
  • Refactor
  • Test
  • CI/CD
  • Documentation

Checklist

  • Tests pass (uv run pytest -v) (2892 passed, 2 skipped)
  • Lint passes (ruff check backend/ && ruff format --check backend/)
  • New tests added for new functionality (7 tests: recover, grace floor, lookback, attempt cap, failure accounting, deleted-range exhaustion, disable switch)
  • Bug fixes include regression tests

AI Usage

  • AI-assisted (describe how): Claude designed the sweep after the inbound_recovery pattern, implemented it with the migration, and wrote the tests, with direction and review by @njbrake.
  • No AI used

Overview

This PR adds a startup recovery sweep that retries compaction events left in "pending" when the asynchronous compaction LLM call fails or the process restarts mid-call, preventing trimmed message ranges from being lost.

What was added / changed

  • New recovery module: backend/app/agent/compaction_recovery.py — scans for stale pending CompactionEvent rows and re-runs compaction for their recorded [min_message_seq,max_message_seq] ranges.
  • Database migration: alembic/versions/039_add_compaction_event_retry_count.py — adds a non-null integer retry_count column (server default 0) to compaction_events.
  • Model: backend/app/models.py — CompactionEvent gains retry_count.
  • Config: backend/app/config.py and .env.example — new compaction_retry_lookback_minutes setting (default 10_080 = 7 days; 0 disables).
  • Startup wiring: backend/app/main.py — calls recover_pending_compactions() during app lifespan startup (after other recoveries) with guarded try/except and logging.
  • Docs: docs/self-host/configuration.md updated to document the new setting.
  • Tests: tests/test_compaction_recovery.py — seven tests covering success, grace floor, lookback window, attempt cap, failure handling, deleted-range exhaustion, and disabling the sweep.

High-level behavior and safeguards

  • Sweep selection: only considers events older than a 10-minute grace floor and younger than compaction_retry_lookback_minutes (0 disables).
  • Concurrency: serializes work with a per-process Postgres advisory lock (pg_try_advisory_lock) so one worker runs the sweep per rolling restart.
  • Retry bounding: retry_count is incremented under row lock before invoking the LLM so crashes still consume attempts; a 3-attempt cap (_MAX_ATTEMPTS) prevents infinite retries and leaves rows pending for admin visibility.
  • Empty-range handling: events whose message ranges are gone or filter to no recoverable messages are immediately exhausted (retry_count set to max) to stop future retries.
  • Replay fidelity: reloads stored messages for the recorded seq range, reconstructs agent messages, and calls compact_session with the original event_id so the same row is completed and original snapshots/audit remain intact.
  • Best-effort startup call: hooked into lifespan with try/except; returns the count of events completed and logs failures without preventing startup.

Benefits

  • Prevents permanent loss of facts from trimmed ranges when compaction background tasks fail or are interrupted.
  • Improves operational resilience by automatically recovering transient failures on startup.
  • Configurable and safe: operators can tune or disable the sweep; advisory locks and retry caps avoid duplicate work and endless loops.
  • Visibility: exhausted or persistently failing rows remain visible for admin inspection.

Technical notes (brief)

  • Migration: revision 039 adds compaction_events.retry_count (int, default 0).
  • Key constants: grace floor = 600 seconds (10 minutes); _MAX_ATTEMPTS = 3; default lookback = 10_080 minutes (7 days).
  • Implementation details: uses a dedicated AsyncConnection for pg_try_advisory_lock, selects pending events with non-null min/max seq and retry_count < _MAX_ATTEMPTS, claims attempts via SELECT ... FOR UPDATE and increments retry_count, rehydrates messages (Message rows → stored format → _stored_messages_to_agent_messages) and calls compact_session(user_id, agent_messages, max_message_seq=max_seq, event_id=event_id).

njbrake and others added 2 commits June 10, 2026 08:39
trigger_compaction_for_dropped advances the trim watermark
synchronously, then runs the compaction LLM call as a fire-and-forget
background task. When the process dies mid-call (deploy restart, OOM)
or the call fails (provider outage), the CompactionEvent row stays
'pending' forever: the watermark is already advanced, so the seq range
never reaches the LLM again and its facts are never extracted into
MEMORY.md. The row records everything needed to recover (the seq range)
and messages are never deleted, but nothing acted on it.

recover_pending_compactions sweeps stale pending rows on app startup,
mirroring the inbound_recovery pattern: per-process pg_try_advisory_lock
on a dedicated AsyncConnection, lookback window (new
compaction_retry_lookback_minutes setting, default 7 days, 0 disables),
best-effort semantics. Each row is claimed by incrementing the new
retry_count column (migration 039) before the LLM call so a crash
mid-retry still consumes the attempt; rows at the 3-attempt cap stay
'pending' for admin visibility but stop being selected, so a poisoned
range cannot retry forever. A 10-minute grace floor keeps the sweep
from racing a compaction call that is legitimately still in flight.
Ranges whose messages were deleted are exhausted immediately.

Fixes #1431

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Covers the .env.example and configuration.md completeness checks in
test_env_example.py for the setting added by the compaction retry sweep.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jun 10, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 4a2d84a4-7815-4331-abba-15e79529f9ad

📥 Commits

Reviewing files that changed from the base of the PR and between d034b68 and 98b8b7e.

📒 Files selected for processing (1)
  • backend/app/config.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • backend/app/config.py

Walkthrough

This PR implements a startup recovery sweep for compaction events stuck in 'pending' status after crashes. It introduces a retry_count column, a configurable lookback window, and a best-effort advisory-lock-protected sweep that rehydrates messages and re-runs compaction deterministically, with caps to prevent infinite retries.

Changes

Compaction Recovery Feature

Layer / File(s) Summary
Data Model & Configuration
alembic/versions/039_add_compaction_event_retry_count.py, backend/app/models.py, backend/app/config.py
Migration adds retry_count (non-null, default 0) to compaction_events table. ORM model gains the same field. Settings adds compaction_retry_lookback_minutes (default 1 week, 0 disables) to control the lookback and enable/disable the recovery sweep.
Recovery Logic Implementation
backend/app/agent/compaction_recovery.py
Core module with configuration constants, query builders for stale pending events and message ranges, Postgres advisory lock helpers, transactional claim/exhaust helpers using row locking to prevent races, per-event retry routine that loads session/user, rehydrates messages from stored seq bounds, converts to agent messages, exhausts un-compactable ranges, calls compact_session, and a main recover_pending_compactions() sweep that sequences retries and counts completions.
Startup Integration & Documentation
backend/app/main.py, .env.example, docs/self-host/configuration.md
Startup lifespan calls recover_pending_compactions() in guarded try/except, logging recovered count on success and exception on failure. Environment variable and docs describe COMPACTION_RETRY_LOOKBACK_MINUTES and disabling semantics (0).
Test Suite
tests/test_compaction_recovery.py
Seven scenarios covering recovery success (event retried/completed, retry_count incremented), skip fresh events (within grace), skip beyond lookback, skip exhausted attempts, failed retry keeps pending and increments retry_count, empty-range exhausts selection, and disabling the sweep when lookback is zero. Uses DB seeding and LLM mocking.

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title follows the Conventional Commit format with 'fix:' prefix, uses imperative mood, and clearly summarizes the main change under 70 characters.
Description check ✅ Passed The description is comprehensive, addresses the linked issue, includes detailed implementation notes, and fully completes the provided template with all sections filled.
Linked Issues check ✅ Passed The PR fully satisfies #1431's acceptance criteria: pending rows are retried and complete on success, messages remain recoverable, repeated failures prevent infinite loops, and tests cover success and give-up paths.
Out of Scope Changes check ✅ Passed All changes are directly scoped to implementing the startup recovery sweep for pending compactions; no unrelated modifications are present.
Docstring Coverage ✅ Passed Docstring coverage is 81.82% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/retry-pending-compaction
✨ Simplify code
  • Create PR with simplified code
  • Commit simplified code in branch fix/retry-pending-compaction

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@njbrake njbrake merged commit 449f85b into main Jun 10, 2026
12 checks passed
@njbrake njbrake deleted the fix/retry-pending-compaction branch June 10, 2026 10:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Pending compaction events are never retried; facts in the trimmed range are lost on crash

1 participant