fix: compare-and-swap memory file writes against compaction races #1436

Merged
njbrake merged 2 commits into main from fix/compaction-rewrite-race on Jun 10, 2026

Conversation

@njbrake njbrake commented Jun 9, 2026

Member

Note: this PR was drafted by Claude via back-and-forth with @njbrake. The reasoning and decisions are his; the prose is Claude's.

Description

Fixes #1429.

compact_session reads MEMORY.md / USER.md / SOUL.md, runs an LLM call that takes tens of seconds, then writes full rewrites. Compaction runs as a fire-and-forget background task, so the conversation keeps going during the call: the agent's workspace tools can write a new fact in that window, and a second compaction for the same user can land first (the snapshot logic already acknowledges concurrent compact_session tasks). The blind overwrite then clobbers the newer value silently, including explicit user saves. HISTORY.md was already protected (advisory lock + FOR UPDATE in append_history); MEMORY/USER/SOUL were last-writer-wins.
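The lost update described above can be reproduced in miniature. This is a minimal sketch, not the project's code: the `store` dict, `slow_llm_rewrite`, and `agent_tool_write` are hypothetical stand-ins for the stored file, the LLM call, and a workspace-tool write.

```python
import asyncio

# Hypothetical in-memory stand-in for the stored MEMORY.md content.
store = {"MEMORY.md": "likes tea"}


async def slow_llm_rewrite(snapshot: str) -> str:
    # Stand-in for the long LLM call; sleeping yields control to other tasks.
    await asyncio.sleep(0.05)
    return snapshot + "\n(compacted)"


async def blind_compaction() -> None:
    snapshot = store["MEMORY.md"]          # read at the top of the function
    rewritten = await slow_llm_rewrite(snapshot)
    store["MEMORY.md"] = rewritten         # full rewrite, no check for drift


async def agent_tool_write() -> None:
    await asyncio.sleep(0.01)              # lands while the LLM call is in flight
    store["MEMORY.md"] += "\nallergic to peanuts"


async def main() -> str:
    await asyncio.gather(blind_compaction(), agent_tool_write())
    return store["MEMORY.md"]


final = asyncio.run(main())
print(final)  # the peanut-allergy fact is gone: the rewrite was based on a stale snapshot
```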

Changes:

  • write_memory_async / write_user_async / write_soul_async accept an optional keyword-only expected_current and return bool. When provided, the compare runs inside the write transaction under FOR UPDATE (plus the per-user advisory lock for the no-row-yet MEMORY.md branch, mirroring append_history). On drift the write is skipped and False is returned. Comparison uses the same normalized form the corresponding read_*_async returns, so callers pass exactly what they read.
  • compact_session passes its top-of-function reads as expected_current, logs a warning on a CAS miss, and records memory_updated=False (etc.) on the audit row so the before/after snapshots stay truthful.
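The write contract in the first bullet can be sketched as follows. This is an illustration under stated assumptions, not the code in memory_db.py: `InMemoryDocStore` is hypothetical, an `asyncio.Lock` stands in for the transaction with FOR UPDATE, and `strip()` stands in for whatever normalization the real read path applies.

```python
import asyncio
from typing import Optional


class InMemoryDocStore:
    """Hypothetical stand-in for one stored memory file."""

    def __init__(self) -> None:
        self._content = ""
        # The lock plays the role FOR UPDATE plays in the database:
        # the compare and the write happen as one indivisible step.
        self._lock = asyncio.Lock()

    @staticmethod
    def _normalize(text: str) -> str:
        # Assumed normalization; the real read_*_async may normalize differently.
        return text.strip()

    async def read(self) -> str:
        async with self._lock:
            return self._normalize(self._content)

    async def write(self, content: str, *, expected_current: Optional[str] = None) -> bool:
        async with self._lock:
            if expected_current is not None:
                if self._normalize(self._content) != expected_current:
                    return False  # drift detected: skip the write
            self._content = content
            return True


async def demo() -> tuple[bool, bool, str]:
    store = InMemoryDocStore()
    await store.write("likes tea\n")
    snapshot = await store.read()                           # what a compaction would read
    await store.write("likes tea\nallergic to peanuts\n")   # plain write, last-writer-wins
    stale = await store.write("compacted", expected_current=snapshot)
    fresh = await store.write("compacted v2", expected_current=await store.read())
    return stale, fresh, await store.read()


stale_ok, fresh_ok, final = asyncio.run(demo())
print(stale_ok, fresh_ok, final)
```

Because the compare uses the same normalized form that `read` returns, the caller can pass its earlier read result unchanged, which is the property the bullet calls out.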

Tradeoff, made deliberately: on a CAS miss the batch's extraction is dropped rather than re-merged with a second LLM call. Losing one batch's extraction is recoverable (the conversation sent to the LLM is preserved in the event row's prompt_text audit column, migration 031); clobbering a durable file is not. A bounded re-merge retry would be a reasonable follow-up if CAS misses turn out to be common; the new warning log line makes that measurable.
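For reference, the bounded re-merge retry mentioned as a possible follow-up might look roughly like the sketch below. It is not part of this PR; `cas_with_bounded_retry` and its callback parameters are hypothetical names, and each retry costs one more LLM call.

```python
import asyncio
from typing import Awaitable, Callable


async def cas_with_bounded_retry(
    read: Callable[[], Awaitable[str]],
    rewrite: Callable[[str], Awaitable[str]],
    cas_write: Callable[[str, str], Awaitable[bool]],
    max_attempts: int = 2,
) -> bool:
    """Re-read and re-merge on a CAS miss, giving up after max_attempts."""
    for _ in range(max_attempts):
        current = await read()
        proposed = await rewrite(current)      # one LLM call per attempt
        if await cas_write(proposed, current):
            return True
    return False


async def demo() -> tuple[bool, str, int]:
    doc = {"text": "likes tea"}
    calls = 0

    async def read() -> str:
        return doc["text"]

    async def rewrite(current: str) -> str:
        nonlocal calls
        calls += 1
        if calls == 1:
            # Simulate a live write landing during the first LLM call.
            doc["text"] = current + "; allergic to peanuts"
        return current + " [merged]"

    async def cas_write(proposed: str, expected: str) -> bool:
        if doc["text"] != expected:
            return False
        doc["text"] = proposed
        return True

    ok = await cas_with_bounded_retry(read, rewrite, cas_write)
    return ok, doc["text"], calls


ok, text, llm_calls = asyncio.run(demo())
print(ok, text, llm_calls)
```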

Plain calls without expected_current (e.g. the user-facing memory editor in user_memory.py) keep last-writer-wins semantics, which is correct for direct user edits.

Type

  • Feature
  • Bug fix
  • Refactor
  • Test
  • CI/CD
  • Documentation

Checklist

  • Tests pass (uv run pytest -v) (2892 passed, 2 skipped)
  • Lint passes (ruff check backend/ && ruff format --check backend/)
  • New tests added for new functionality
  • Bug fixes include regression tests (CAS unit tests for all three files plus an end-to-end mid-compaction-write race test)

AI Usage

  • AI-assisted (describe how): Claude analyzed the race, implemented the CAS writes, and wrote the tests, with direction and review by @njbrake.
  • No AI used

Overview

This PR fixes a critical race condition where compaction operations could silently overwrite concurrent writes to memory files (MEMORY.md, USER.md, SOUL.md) during long-running LLM operations. The solution implements compare-and-swap (CAS) semantics to detect and skip writes when the underlying file has changed since the initial read.

What Changed

Compaction Process (compaction.py)

  • Modified compact_session to use compare-and-swap when persisting LLM-generated updates
  • Instead of unconditionally overwriting memory files, the method now validates that the file hasn't changed since the initial read
  • If a concurrent write is detected, the compaction update is skipped (rather than clobbering the new content) and logged appropriately
  • Compaction audit records now accurately reflect whether an update succeeded or was skipped
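The caller side of the change can be sketched as below, assuming a write method that returns False on a CAS miss. `Doc`, `AuditRow`, `compact`, and the log wording are hypothetical stand-ins and do not reproduce compaction.py.

```python
import asyncio
import logging
from dataclasses import dataclass
from typing import Awaitable, Callable, Optional

logger = logging.getLogger("compaction_sketch")


@dataclass
class AuditRow:
    # Hypothetical stand-in for the compaction audit record.
    memory_updated: bool = False


class Doc:
    """Hypothetical single-file store with an optional compare-and-swap write."""

    def __init__(self, text: str) -> None:
        self.text = text

    async def read(self) -> str:
        return self.text

    async def write(self, content: str, *, expected_current: Optional[str] = None) -> bool:
        if expected_current is not None and self.text != expected_current:
            return False
        self.text = content
        return True


async def compact(doc: Doc, llm: Callable[[str], Awaitable[str]]) -> AuditRow:
    snapshot = await doc.read()                 # top-of-function read
    proposed = await llm(snapshot)              # long-running call
    memory_changed = proposed != snapshot
    if memory_changed and not await doc.write(proposed, expected_current=snapshot):
        # CAS miss: keep the newer content and record that nothing was written.
        logger.warning("MEMORY.md: file changed since read; skipping rewrite")
        memory_changed = False
    return AuditRow(memory_updated=memory_changed)


async def demo() -> tuple[AuditRow, str]:
    doc = Doc("likes tea")

    async def llm(snapshot: str) -> str:
        # A live agent write lands while the "LLM call" is in flight.
        await doc.write("likes tea\nallergic to peanuts")
        return snapshot + " (compacted)"

    audit = await compact(doc, llm)
    return audit, doc.text


audit, final_text = asyncio.run(demo())
print(audit, repr(final_text))
```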

Memory Storage Layer (memory_db.py)

  • Enhanced three write methods (write_memory_async, write_user_async, write_soul_async) to support optional compare-and-swap validation
  • When CAS is enabled, writes acquire database locks and perform atomic check-then-write operations
  • Methods now return a boolean indicating success/failure instead of None
  • If the stored content has changed since the initial read, the write is safely skipped without overwriting

What Was Added

Test Coverage

  • New end-to-end test simulating a mid-compaction concurrent write to validate correct race condition handling
  • Six dedicated CAS unit tests covering match/mismatch scenarios for memory, user, and soul files
  • Tests verify that concurrent writes are preserved rather than lost

Benefits

  • Eliminates silent data loss: Concurrent writes during compaction are no longer lost
  • Maintains accuracy: Compaction audit records now truthfully reflect what changed
  • Graceful degradation: On race detection, the compaction batch is dropped but remains recoverable from the event log
  • Backward compatible: Regular user-facing writes (without CAS) maintain simple last-writer-wins semantics
  • Production-safe: Database-level locking ensures correctness even with multiple concurrent compactions

compact_session reads MEMORY.md / USER.md / SOUL.md, runs an LLM call
that takes tens of seconds, then writes full rewrites. Compaction runs
as a background task, so the conversation keeps going during the call:
the agent's workspace tools can write a new fact in that window, and a
second compaction for the same user can land first (the code already
acknowledges concurrent compact_session tasks). The blind overwrite
then clobbers the newer value, silently, including explicit user saves.
HISTORY.md was already protected (advisory lock + FOR UPDATE in
append_history); the other three files were last-writer-wins.

write_memory_async / write_user_async / write_soul_async now accept an
optional expected_current and perform the compare inside the write
transaction under FOR UPDATE (plus the per-user advisory lock for the
no-row-yet MEMORY.md branch, mirroring append_history). On drift the
write is skipped and False is returned; compaction logs the skip and
records memory_updated=False on the audit row. Losing one batch's
extraction is recoverable (the conversation sent to the LLM is kept in
the event row's prompt_text audit column); clobbering a durable file is
not.

Fixes #1429

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

coderabbitai Bot commented Jun 9, 2026

Caution

Review failed

Pull request was closed or merged during review

Walkthrough

This PR implements compare-and-swap (CAS) semantics to prevent race-condition data loss in compaction. When compact_session rewrites memory documents after a long LLM call, concurrent agent-tool writes or inter-compaction edits are no longer silently lost; instead, the write is skipped and logged.

Changes

Compare-and-swap protection for compaction writes to memory documents

| Layer / File(s) | Summary |
| --- | --- |
| Lock-based CAS infrastructure / backend/app/agent/memory_db.py | Adds _user_select_for_update() builder that performs FOR UPDATE on the User row, establishing atomic locked-read semantics for subsequent CAS write methods. |
| Compare-and-swap write methods / backend/app/agent/memory_db.py | Updates write_memory_async, write_soul_async, and write_user_async to accept optional expected_current parameter and return bool. Acquires per-user advisory lock and row-level lock; returns False and rolls back on mismatch, True on success. |
| Unit test coverage for CAS semantics / tests/test_memory_db_async.py | Validates CAS match (write succeeds), mismatch (concurrent write blocks stale CAS), and empty-expected-current (missing row) scenarios across all three write methods. |
| Compaction integration using CAS writes / backend/app/agent/compaction.py | Updates compact_session to use CAS-based writes for MEMORY, USER, and SOUL rewrites. On CAS miss, clears the *_changed flag and logs "file changed since read" to prevent blind overwrites of concurrent edits. |
| Compaction concurrent-write regression test / tests/test_compaction.py | Simulates a concurrent write_memory_async during the LLM call and verifies the compaction result respects the concurrent change (empty memory update, preserved fact, CompactionEvent with memory_updated=False). |

🎯 3 (Moderate) | ⏱️ ~22 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | Title uses correct Conventional Commit 'fix:' prefix, imperative mood, and is concise at 65 characters; clearly describes the compare-and-swap mechanism addressing race conditions. |
| Description check | ✅ Passed | Description includes detailed explanation of the fix, type selection, comprehensive checklist completion, and clear AI usage disclosure; all template sections are addressed. |
| Linked Issues check | ✅ Passed | Changes fully implement all acceptance criteria from #1429: compare-and-swap detection prevents blind overwrites, concurrent compactions are protected, and regression tests cover both unit and end-to-end scenarios. |
| Out of Scope Changes check | ✅ Passed | All changes are scoped to fixing the race condition in compaction writes; no unrelated refactoring, dependency updates, or infrastructure changes are present. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |

@njbrake njbrake merged commit a5ba926 into main Jun 10, 2026
12 checks passed
@njbrake njbrake deleted the fix/compaction-rewrite-race branch June 10, 2026 10:42

Development

Successfully merging this pull request may close these issues.

Compaction full rewrites of MEMORY/USER/SOUL race with live agent writes (lost updates)