fix: retry compaction events stuck in pending on startup by njbrake · Pull Request #1438 · mozilla-ai/clawbolt

njbrake · 2026-06-10T08:42:14Z

Note: this PR was drafted by Claude via back-and-forth with @njbrake. The reasoning and decisions are his; the prose is Claude's.

Description

Fixes #1431.

trigger_compaction_for_dropped advances the trim watermark synchronously, then runs the compaction LLM call as a fire-and-forget background task. When the process dies mid-call (deploy restart, OOM) or the call fails (provider outage), the CompactionEvent row stays 'pending' forever: the watermark is already advanced, so the seq range never reaches the LLM again, and its facts are never extracted into MEMORY.md. The row records everything needed to recover (the seq range) and message rows are never deleted, but nothing acted on it.

New recover_pending_compactions startup sweep, mirroring the inbound_recovery pattern:

One worker per rolling restart: pg_try_advisory_lock on a dedicated AsyncConnection (same connection-pinning rationale documented in inbound_recovery.py).
Bounded selection: rows older than a 10-minute grace floor (never races a call legitimately still in flight), younger than the new COMPACTION_RETRY_LOOKBACK_MINUTES (default 7 days; unlike orphan inbounds, a stale compaction stays fully recoverable for as long as the rows exist, so the window is generous; 0 disables).
Bounded retries: each attempt increments the new retry_count column (migration 039, verified up/down/up) before the LLM call, so a crash mid-retry still consumes the attempt. Rows at the 3-attempt cap stay 'pending' for admin visibility but stop being selected; a poisoned range cannot retry forever. Ranges whose messages were deleted are exhausted immediately.
Replay fidelity: messages are reloaded by the recorded [min_message_seq, max_message_seq] range and rebuilt through the same _stored_messages_to_agent_messages conversion the live path uses, then compact_session runs with the original event_id so the same row is completed with full audit snapshots.
Hooked into lifespan after the other startup recoveries, same best-effort try/except shape.

Type

Checklist

Tests pass (uv run pytest -v) (2892 passed, 2 skipped)
Lint passes (ruff check backend/ && ruff format --check backend/)
New tests added for new functionality (7 tests: recover, grace floor, lookback, attempt cap, failure accounting, deleted-range exhaustion, disable switch)
Bug fixes include regression tests

AI Usage

AI-assisted (describe how): Claude designed the sweep after the inbound_recovery pattern, implemented it with the migration, and wrote the tests, with direction and review by @njbrake.
No AI used

Overview

This PR adds a startup recovery sweep that retries compaction events left in "pending" when the asynchronous compaction LLM call fails or the process restarts mid-call, preventing trimmed message ranges from being lost.

What was added / changed

New recovery module: backend/app/agent/compaction_recovery.py — scans for stale pending CompactionEvent rows and re-runs compaction for their recorded [min_message_seq,max_message_seq] ranges.
Database migration: alembic/versions/039_add_compaction_event_retry_count.py — adds a non-null integer retry_count column (server default 0) to compaction_events.
Model: backend/app/models.py — CompactionEvent gains retry_count.
Config: backend/app/config.py and .env.example — new compaction_retry_lookback_minutes setting (default 10_080 = 7 days; 0 disables).
Startup wiring: backend/app/main.py — calls recover_pending_compactions() during app lifespan startup (after other recoveries) with guarded try/except and logging.
Docs: docs/self-host/configuration.md updated to document the new setting.
Tests: tests/test_compaction_recovery.py — seven tests covering success, grace floor, lookback window, attempt cap, failure handling, deleted-range exhaustion, and disabling the sweep.

High-level behavior and safeguards

Sweep selection: only considers events older than a 10-minute grace floor and younger than compaction_retry_lookback_minutes (0 disables).
Concurrency: serializes work with a per-process Postgres advisory lock (pg_try_advisory_lock) so one worker runs the sweep per rolling restart.
Retry bounding: retry_count is incremented under row lock before invoking the LLM so crashes still consume attempts; a 3-attempt cap (_MAX_ATTEMPTS) prevents infinite retries and leaves rows pending for admin visibility.
Empty-range handling: events whose message ranges are gone or filter to no recoverable messages are immediately exhausted (retry_count set to max) to stop future retries.
Replay fidelity: reloads stored messages for the recorded seq range, reconstructs agent messages, and calls compact_session with the original event_id so the same row is completed and original snapshots/audit remain intact.
Best-effort startup call: hooked into lifespan with try/except; returns the count of events completed and logs failures without preventing startup.

Benefits

Prevents permanent loss of facts from trimmed ranges when compaction background tasks fail or are interrupted.
Improves operational resilience by automatically recovering transient failures on startup.
Configurable and safe: operators can tune or disable the sweep; advisory locks and retry caps avoid duplicate work and endless loops.
Visibility: exhausted or persistently failing rows remain visible for admin inspection.

Technical notes (brief)

Migration: revision 039 adds compaction_events.retry_count (int, default 0).
Key constants: grace floor = 600 seconds (10 minutes); _MAX_ATTEMPTS = 3; default lookback = 10_080 minutes (7 days).
Implementation details: uses a dedicated AsyncConnection for pg_try_advisory_lock, selects pending events with non-null min/max seq and retry_count < _MAX_ATTEMPTS, claims attempts via SELECT ... FOR UPDATE and increments retry_count, rehydrates messages (Message rows → stored format → _stored_messages_to_agent_messages) and calls compact_session(user_id, agent_messages, max_message_seq=max_seq, event_id=event_id).

trigger_compaction_for_dropped advances the trim watermark synchronously, then runs the compaction LLM call as a fire-and-forget background task. When the process dies mid-call (deploy restart, OOM) or the call fails (provider outage), the CompactionEvent row stays 'pending' forever: the watermark is already advanced, so the seq range never reaches the LLM again and its facts are never extracted into MEMORY.md. The row records everything needed to recover (the seq range) and messages are never deleted, but nothing acted on it. recover_pending_compactions sweeps stale pending rows on app startup, mirroring the inbound_recovery pattern: per-process pg_try_advisory_lock on a dedicated AsyncConnection, lookback window (new compaction_retry_lookback_minutes setting, default 7 days, 0 disables), best-effort semantics. Each row is claimed by incrementing the new retry_count column (migration 039) before the LLM call so a crash mid-retry still consumes the attempt; rows at the 3-attempt cap stay 'pending' for admin visibility but stop being selected, so a poisoned range cannot retry forever. A 10-minute grace floor keeps the sweep from racing a compaction call that is legitimately still in flight. Ranges whose messages were deleted are exhausted immediately. Fixes #1431 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Covers the .env.example and configuration.md completeness checks in test_env_example.py for the setting added by the compaction retry sweep. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

coderabbitai · 2026-06-10T08:45:57Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 4a2d84a4-7815-4331-abba-15e79529f9ad

📥 Commits

Reviewing files that changed from the base of the PR and between d034b68 and 98b8b7e.

📒 Files selected for processing (1)

backend/app/config.py

🚧 Files skipped from review as they are similar to previous changes (1)

backend/app/config.py

Walkthrough

This PR implements a startup recovery sweep for compaction events stuck in 'pending' status after crashes. It introduces a retry_count column, a configurable lookback window, and a best-effort advisory-lock-protected sweep that rehydrates messages and re-runs compaction deterministically, with caps to prevent infinite retries.

Changes

Compaction Recovery Feature

Layer / File(s)	Summary
Data Model & Configuration `alembic/versions/039_add_compaction_event_retry_count.py`, `backend/app/models.py`, `backend/app/config.py`	Migration adds `retry_count` (non-null, default `0`) to `compaction_events` table. ORM model gains the same field. Settings adds `compaction_retry_lookback_minutes` (default 1 week, `0` disables) to control the lookback and enable/disable the recovery sweep.
Recovery Logic Implementation `backend/app/agent/compaction_recovery.py`	Core module with configuration constants, query builders for stale pending events and message ranges, Postgres advisory lock helpers, transactional claim/exhaust helpers using row locking to prevent races, per-event retry routine that loads session/user, rehydrates messages from stored seq bounds, converts to agent messages, exhausts un-compactable ranges, calls `compact_session`, and a main `recover_pending_compactions()` sweep that sequences retries and counts completions.
Startup Integration & Documentation `backend/app/main.py`, `.env.example`, `docs/self-host/configuration.md`	Startup lifespan calls `recover_pending_compactions()` in guarded `try/except`, logging recovered count on success and exception on failure. Environment variable and docs describe `COMPACTION_RETRY_LOOKBACK_MINUTES` and disabling semantics (`0`).
Test Suite `tests/test_compaction_recovery.py`	Seven scenarios covering recovery success (event retried/completed, `retry_count` incremented), skip fresh events (within grace), skip beyond lookback, skip exhausted attempts, failed retry keeps pending and increments `retry_count`, empty-range exhausts selection, and disabling the sweep when lookback is zero. Uses DB seeding and LLM mocking.

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title follows the Conventional Commit format with 'fix:' prefix, uses imperative mood, and clearly summarizes the main change under 70 characters.
Description check	✅ Passed	The description is comprehensive, addresses the linked issue, includes detailed implementation notes, and fully completes the provided template with all sections filled.
Linked Issues check	✅ Passed	The PR fully satisfies `#1431`'s acceptance criteria: pending rows are retried and complete on success, messages remain recoverable, repeated failures prevent infinite loops, and tests cover success and give-up paths.
Out of Scope Changes check	✅ Passed	All changes are directly scoped to implementing the startup recovery sweep for pending compactions; no unrelated modifications are present.
Docstring Coverage	✅ Passed	Docstring coverage is 81.82% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/retry-pending-compaction

✨ Simplify code

Create PR with simplified code
Commit simplified code in branch fix/retry-pending-compaction

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

njbrake and others added 2 commits June 10, 2026 08:39

docs: document COMPACTION_RETRY_LOOKBACK_MINUTES

d034b68

Covers the .env.example and configuration.md completeness checks in test_env_example.py for the setting added by the compaction retry sweep. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

This was referenced Jun 10, 2026

fix: cover the trim-to-compaction amnesia window with a continuity note #1439

Merged

fix: conversation pipeline hardening (watermark, dead query, trim heuristic, erosion signal) #1440

Merged

Merge branch 'main' into fix/retry-pending-compaction

98b8b7e

njbrake merged commit 449f85b into main Jun 10, 2026
12 checks passed

njbrake deleted the fix/retry-pending-compaction branch June 10, 2026 10:55

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: retry compaction events stuck in pending on startup#1438

fix: retry compaction events stuck in pending on startup#1438
njbrake merged 3 commits into
mainfrom
fix/retry-pending-compaction

njbrake commented Jun 10, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 10, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

njbrake commented Jun 10, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Type

Checklist

AI Usage

Overview

Uh oh!

coderabbitai Bot commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

njbrake commented Jun 10, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 10, 2026 •

edited

Loading