
fix: conversation pipeline hardening (watermark, dead query, trim heuristic, erosion signal) #1440

Merged
njbrake merged 2 commits into main from fix/conversation-hardening
Jun 10, 2026

Conversation

@njbrake

@njbrake njbrake commented Jun 10, 2026

Member

Note: this PR was drafted by Claude via back-and-forth with @njbrake. The reasoning and decisions are his; the prose is Claude's.

Description

Fixes #1433. Four small fixes from the conversation-design review, bundled as agreed on the issue.

1. Same-row trim atomicity (correctness). History rebuild expands one outbound DB row into a tool-call AssistantMessage, its ToolResultMessages, and a final-reply AssistantMessage, all sharing the row's seq (context._expand_outbound_with_tools). trim_messages treated the reply as a separate block, so it could drop the tool-call half while keeping the reply; the watermark then advanced over the shared seq and silently filtered the kept reply from the next turn's history, and the reply prose never reached the compactor. The reply now trims atomically with its block. Live-loop messages carry seq=None and are unaffected.

2. Remove the dead cross-session context section (perf/cleanup). Migration 026 collapsed sessions to one per user, so build_cross_session_context, which excludes the current session, always returned an empty result; every message paid for a wasted query. Removed the function, the get_other_session_messages_async store method, and the current_session_id plumbing through the prompt assemblers. All channels share the single session, so channel switches are already covered by ordinary history.

3. Accurate proactive trim (correctness of trigger). A fresh ClawboltAgent is built per message, so self._last_input_tokens was always 0 at the proactive trim and the decision fell back to the chars/4 + flat 10k-overhead heuristic. That heuristic ignores tool schemas and the real system prompt size, so the trim fired later than configured. A bounded process-local LRU now carries the last API-reported input_tokens per user across agent instances (cleared by reset_stores() for test isolation); the heuristic remains the cold-start fallback after a restart.

4. Memory erosion signal (observability). Compaction full-rewrites MEMORY.md and the compliance audit explicitly encourages deletion, so a valid line can vanish on any cycle with no signal. The compaction.summary log line now carries memory_lines_added / memory_lines_removed (multiset line diff, reorder-insensitive) so a log aggregator can alert on large unexplained removals.
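
For readers who want a concrete picture of item 1, here is a minimal sketch of block-atomic trimming. It is not the code in trimming.py: the Msg dataclass, group_blocks, trim_oldest, and the character budget are all hypothetical stand-ins, and the only behavior taken from the description is that messages sharing a non-None seq are dropped or kept together, while seq=None messages stand alone.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Msg:
    # Hypothetical message shape; the real message classes are not shown in this PR.
    role: str
    text: str
    seq: Optional[int] = None  # DB row seq; None for live-loop messages


def group_blocks(messages: List[Msg]) -> List[List[Msg]]:
    """Group consecutive messages that share a non-None seq into one block."""
    blocks: List[List[Msg]] = []
    for msg in messages:
        if blocks and msg.seq is not None and blocks[-1][-1].seq == msg.seq:
            blocks[-1].append(msg)
        else:
            blocks.append([msg])
    return blocks


def trim_oldest(messages: List[Msg], max_chars: int) -> Tuple[List[Msg], List[Msg]]:
    """Drop whole blocks from the front until the rest fits in max_chars.

    The newest block is always kept, even if it alone exceeds the budget.
    """
    blocks = group_blocks(messages)
    total = sum(len(m.text) for block in blocks for m in block)
    dropped: List[Msg] = []
    while len(blocks) > 1 and total > max_chars:
        block = blocks.pop(0)
        total -= sum(len(m.text) for m in block)
        dropped.extend(block)
    kept = [m for block in blocks for m in block]
    return kept, dropped


history = [
    Msg("user", "hi", seq=1),
    Msg("assistant", "call tool", seq=2),    # tool-call half of row 2
    Msg("tool", "result", seq=2),            # tool result of row 2
    Msg("assistant", "final reply", seq=2),  # reply half of row 2
    Msg("user", "next", seq=3),
]
kept, dropped = trim_oldest(history, max_chars=20)
```

Because row 2 is dropped as a unit, no seq appears in both kept and dropped, so a watermark advanced to the highest dropped seq cannot filter out a message that was kept.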

Deliberate deviation from the issue: item 4's acceptance criterion asked for the counts on compaction_events rows. This PR puts them on the structured log line only, because adding a migration here would collide with migration 039 in PR #1438 (independent PRs, same Alembic head). Persisting the counts is a 10-line follow-up once 039 lands.
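
As an illustration of what a reorder-insensitive multiset line diff means for item 4, the counts can be sketched with collections.Counter. The function name memory_line_diff is hypothetical, and stripping whitespace and ignoring blank lines are assumptions; the PR text does not say how compaction.py normalizes lines.

```python
from collections import Counter
from typing import Tuple


def memory_line_diff(before: str, after: str) -> Tuple[int, int]:
    """Return (lines_added, lines_removed) between two versions of a file.

    Lines are compared as a multiset, so pure reordering counts as no change,
    while dropping one of two identical lines counts as one removal.
    Assumption: whitespace is stripped and blank lines are ignored.
    """
    old = Counter(line.strip() for line in before.splitlines() if line.strip())
    new = Counter(line.strip() for line in after.splitlines() if line.strip())
    added = sum((new - old).values())    # Counter subtraction keeps positive counts only
    removed = sum((old - new).values())
    return added, removed


before = "- likes tea\n- lives in Oslo\n- likes tea\n"
after = "- lives in Oslo\n- likes tea\n- has a dog\n"
print(memory_line_diff(before, after))  # (1, 1)
```

A rewrite that only shuffles lines yields (0, 0), which is why an alert on a large memory_lines_removed value points at real deletions rather than reformatting.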

Type

  • Feature
  • Bug fix
  • Refactor
  • Test
  • CI/CD
  • Documentation

Checklist

  • Tests pass (uv run pytest -v) (2884 passed, 2 skipped; 6 dead cross-session tests removed, 5 added)
  • Lint passes (ruff check backend/ && ruff format --check backend/)
  • New tests added for new functionality
  • Bug fixes include regression tests

AI Usage

  • AI-assisted (describe how): Claude implemented all four fixes and their tests, with direction and review by @njbrake.
  • No AI used

…ristic, erosion signal)

Four small fixes from the conversation-design review, bundled per issue
#1433:

1. Same-row trim atomicity. History rebuild expands one outbound DB row
   into a tool-call AssistantMessage, its ToolResultMessages, and a
   final-reply AssistantMessage, all sharing the row's seq. trim_messages
   treated the reply as a separate block, so it could drop the tool-call
   half while keeping the reply; the watermark then advanced over the
   shared seq and silently filtered the kept reply from the next turn.
   The reply now trims atomically with its block (live-loop messages
   carry seq=None and are unaffected).

2. Remove the dead cross-session context section. Migration 026
   collapsed sessions to one per user, so the query excluding the
   current session always returned empty; every message paid a wasted
   DB query. Removed build_cross_session_context, the
   get_other_session_messages_async store method, and the
   current_session_id plumbing through the prompt assemblers.

3. Accurate proactive trim. A fresh ClawboltAgent is built per message,
   so the trim decision always used the chars/4 + flat-overhead
   heuristic, which ignores tool schemas and real system prompt size.
   A bounded process-local LRU now carries the last API-reported
   input_tokens per user across agent instances; the heuristic remains
   the cold-start fallback.

4. Memory erosion signal. Compaction full-rewrites MEMORY.md and the
   compliance audit encourages deletion, so a valid line can vanish on
   any cycle with no signal. The compaction.summary log line now carries
   memory_lines_added / memory_lines_removed (multiset diff,
   reorder-insensitive) so an aggregator can alert on large unexplained
   removals. Persisting the counts on compaction_events can follow once
   migration 039 (PR #1438) lands, avoiding an Alembic head collision.

Fixes #1433

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jun 10, 2026


Warning

Review limit reached

@njbrake, we couldn't start this review because you've reached your PR review rate limit.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: f308ecaa-c543-4e93-912f-b8e75a5ae0f8

📥 Commits

Reviewing files that changed from the base of the PR and between f30e8ce and f24282f.

📒 Files selected for processing (11)
  • backend/app/agent/compaction.py
  • backend/app/agent/core.py
  • backend/app/agent/session_db.py
  • backend/app/agent/stores.py
  • backend/app/agent/system_prompt.py
  • backend/app/agent/trimming.py
  • backend/app/routers/user_sessions.py
  • tests/test_agent.py
  • tests/test_compaction.py
  • tests/test_session_db_async.py
  • tests/test_system_prompt.py

…stem_prompt.py)

# Conflicts:
#	backend/app/agent/system_prompt.py
@njbrake njbrake merged commit 33418a0 into main Jun 10, 2026
10 checks passed
@njbrake njbrake deleted the fix/conversation-hardening branch June 10, 2026 11:09


Development

Successfully merging this pull request may close these issues.

Conversation pipeline hardening: partial-row watermark, dead cross-session query, trim heuristic, erosion metric

1 participant