Skip to content

fix(cuga_lite): reset task_todos state between invocations#315

Open
Sergey-Zeltyn wants to merge 1 commit into
mainfrom
fix/cross-task-leak-todos
Open

fix(cuga_lite): reset task_todos state between invocations#315
Sergey-Zeltyn wants to merge 1 commit into
mainfrom
fix/cross-task-leak-todos

Conversation

@Sergey-Zeltyn

@Sergey-Zeltyn Sergey-Zeltyn commented Jun 8, 2026

Copy link
Copy Markdown
Collaborator

Bug fix

Fixes #314

Summary

When advanced_features.enable_todos=true, the closure-scoped task_todos_ref (plus the parallel state.task_todos checkpointer field) outlived a single agent.invoke(). From the second task onward the previous task's plan leaked into the next task's turn-1 system prompt as ## Current task todos, biasing the model's first reasoning step before it had any chance to call create_update_todos for the new task.

Root cause: CugaAgent._create_graph memoizes on self._graph and only resets in add_tool() (sdk.py:2224), so every .invoke() reuses the same compiled graph and the same closure-captured list created in create_cuga_lite_graph (cuga_lite_graph.py:180). The tool body in todos.py:123-126 only does clear() + extend(new) — it never resets on a new task. Nothing in the SDK, agent loop, or framework cleared these between invocations.

Fix: at the top of prepare_tools_and_apps (prepare_node.py), detect a fresh conversation via len(state.chat_messages) <= 1 and clear both adapter._task_todos_ref and state.task_todos. The guard preserves HITL resume (turn 2+ keeps the in-flight plan).

Verification

Reproduced and verified on a 43-task AppWorld eval bundle (gpt-4.1, enable_todos=true):

Metric Before fix After fix
Traces with ## Current task todos in turn-1 system prompt 42 / 43 0 / 43
Traces with ## Current Plan in turn-1 (state path) 0 / 43
Mean fraction of LLM turns spent on todo-only blocks 7.1% 5.6%

Concrete example from the pre-fix bundle — task asking about HR/candidates/report_template.md opens with a stale Gmail plan injected as authoritative:

## Current task todos
1. **[pending]** Fetch all unread email threads from Gmail inbox (handle pagination)
2. **[pending]** Filter threads for priority-1
3. **[pending]** Count the number of priority-1 unread threads
4. **[pending]** Return the count to the user

After the fix, every one of the 43 traces starts turn 1 with a clean system prompt.

Testing

  • Verified fix locally; tests pass
  • Re-ran 43-task AppWorld eval with enable_todos=true: leak signature dropped from 42/43 to 0/43
  • Existing unit tests under cuga_lite/tests/test_agent_graph_adapter.py exercise prepare_system_content directly (not prepare_tools_and_apps), so they're unaffected by the new clear logic

Summary by CodeRabbit

  • Bug Fixes
    • Fixed an issue where previous task information could persist into new conversation turns. The system now properly clears state when starting a fresh conversation to prevent unintended data carry-over between separate interactions.

The closure-scoped `task_todos_ref` (and the parallel `state.task_todos`
checkpointer field) outlived a single `agent.invoke()`. With
`enable_todos=true`, the previous task's plan leaked into the next task's
turn-1 system prompt as `## Current task todos`, biasing the model's
first reasoning step toward an unrelated plan.

Clear both at the top of `prepare_tools_and_apps`, gated on a fresh
conversation (`len(chat_messages) <= 1`) so HITL resume is unaffected.

Fixes #314.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Sergey Zeltyn <sergeyz@il.ibm.com>
@coderabbitai

coderabbitai Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

The PR fixes a bug where task plans from a prior CugaAgent.invoke() call leak into the system prompt of subsequent invocations when todos are enabled. A guard at the start of prepare_tools_and_apps now clears both the closure-scoped adapter._task_todos_ref and state.task_todos when a fresh conversation begins.

Changes

Task todos reset on new conversation

Layer / File(s) Summary
Clear task todos on conversation start
src/cuga/backend/cuga_graph/nodes/cuga_lite/adapter/prepare_node.py
When chat_messages is empty or contains only the first message, the function clears adapter._task_todos_ref and resets state.task_todos to None to prevent stale task plans from contaminating the turn-1 system prompt across multiple invocations.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Poem

🐰 A todo list that wouldn't fade,
From task to task, its mark was made!
But now each chat begins fresh and clean,
No phantom plans will cross between.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and concisely describes the main fix: resetting task_todos state between invocations to prevent cross-task leakage, which is the core objective of the PR.
Linked Issues check ✅ Passed The PR implements the exact fix suggested in issue #314: clearing adapter._task_todos_ref and state.task_todos at the top of prepare_tools_and_apps when a fresh conversation is detected (len(state.chat_messages) <= 1), with verification showing the bug is resolved.
Out of Scope Changes check ✅ Passed All changes are directly scoped to fixing issue #314; no extraneous modifications, refactoring, or unrelated features are present in the diff.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/cross-task-leak-todos

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/cuga/backend/cuga_graph/nodes/cuga_lite/adapter/prepare_node.py`:
- Around line 73-80: Run ruff formatting and linting on the cuga_graph/nodes
changes: execute "ruff format" then "ruff check" (or use the project's uv
wrapper if available) targeting the src/cuga/backend/cuga_graph/nodes/ tree so
prepare_node.py and related files are formatted and any lint errors are
surfaced; after fixing any reported issues re-run to ensure
adapter._task_todos_ref usage and state.task_todos handling in prepare_node.py
meet ruff rules.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0cae5ba2-d548-4644-857a-63d0aa75c3b8

📥 Commits

Reviewing files that changed from the base of the PR and between 5bd8644 and 1780408.

📒 Files selected for processing (1)
  • src/cuga/backend/cuga_graph/nodes/cuga_lite/adapter/prepare_node.py

Comment on lines +73 to +80
# The adapter's task_todos_ref is closure-scoped per compiled graph and
# outlives a single .invoke(); state.task_todos can also persist via a
# thread-keyed checkpointer. Reset both at the start of a new
# conversation so a previous task's plan doesn't leak into this one's
# turn-1 system prompt.
if len(state.chat_messages or []) <= 1:
adapter._task_todos_ref.clear()
state.task_todos = None

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

uv run ruff format src/cuga/backend/cuga_graph/nodes/
uv run ruff check src/cuga/backend/cuga_graph/nodes/

Repository: cuga-project/cuga-agent

Length of output: 151


Run Ruff format + Ruff check for cuga_graph/nodes changes

The required ruff format and ruff check commands haven’t been executed in this environment (the uv run ... approach requires uv to be available). Run formatting and linting for src/cuga/backend/cuga_graph/nodes/ (via uv if available, otherwise ruff format + ruff check directly).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cuga/backend/cuga_graph/nodes/cuga_lite/adapter/prepare_node.py` around
lines 73 - 80, Run ruff formatting and linting on the cuga_graph/nodes changes:
execute "ruff format" then "ruff check" (or use the project's uv wrapper if
available) targeting the src/cuga/backend/cuga_graph/nodes/ tree so
prepare_node.py and related files are formatted and any lint errors are
surfaced; after fixing any reported issues re-run to ensure
adapter._task_todos_ref usage and state.task_todos handling in prepare_node.py
meet ruff rules.

Source: Coding guidelines

@Sergey-Zeltyn Sergey-Zeltyn self-assigned this Jun 8, 2026
@Sergey-Zeltyn Sergey-Zeltyn added the bug Something isn't working label Jun 8, 2026
@Sergey-Zeltyn Sergey-Zeltyn moved this from Backlog to In progress in CUGA Workitems Jun 8, 2026
@sami-marreed

Copy link
Copy Markdown
Contributor

Code Review

Verdict: merge with fixes. The primary half of the fix is correct and demonstrably works (42/43→0/43 on the eval), but the second half — the state.task_todos reset — is effectively a no-op, and the scope is narrower than the title claims.

What's solid

  • Root-cause diagnosis is right. _task_todos_ref is a closure-shared list (cuga_lite_graph.py:180) held by the long-lived adapter (graph_adapter.py:62) and read first in prepare_system_content (graph_adapter.py:92). It's the dominant leak path, and _task_todos_ref.clear() fixes it for real — the ref lives outside LangGraph channels, so the mutation is immediately visible to call_model.
  • .clear() is safe — the ref is always a list, never None.
  • Correctly placed at the top of the START node, before any prompt construction.

Issues

🔴 Critical — state.task_todos = None is a no-op (prepare_node.py:78).
The node returns Command(goto="call_model", update={...}) and that update dict (prepare_node.py:630-642) does not contain task_todos. LangGraph reduces state only from the returned update; the in-place assignment is discarded. So when _task_todos_ref is empty, prepare_system_content falls through to the still-stale state.task_todos (graph_adapter.py:94) — the exact fallback this PR claims to close.
Fix: add "task_todos": None to the Command.update dict (gated on the same reset condition), not just the local mutation. One line.

🟠 Important — guard doesn't cover new-task-on-same-thread.
Both invoke paths accumulate chat_messages (SDK append path; cuga_lite_node.py:312 "Preserve existing chat history"). A genuinely new task as a follow-up turn has chat_messages > 1, so the guard is false and nothing clears → the original leak can still occur. The fix only covers the fresh-invoke-per-task scenario — which is exactly the AppWorld harness it was verified on. The title "between invocations" overstates coverage; consider re-scoping to "on a fresh conversation (≤1 message)."

🟠 Important — HITL preservation is incidental, not designed.
Resume re-enters the subgraph from START so prepare re-runs, but at that point chat_messages > 1 so the guard is already false. HITL is preserved by message-count accumulation, not by design intent — fine outcome, but the comment's framing is coincidental.

🟡 Minor

  • Concurrency: the adapter / _task_todos_ref is a process-global singleton (built once at startup); .clear() races across concurrent threads. Pre-existing (the todos tool already mutates the shared ref), but worth a follow-up for thread-scoped todos.
  • No regression test (+9/−0). A test that seeds a stale task_todos + populated _task_todos_ref, runs a fresh invoke, and asserts both the prompt section and the persisted state.task_todos are empty would catch the Critical bug directly.

Recommendation

  1. Must-fix: add task_todos to the Command.update dict — otherwise the checkpointer fallback path stays open.
  2. Add the regression test covering both reset targets.
  3. Tighten the title/description (scope to fresh conversation; drop the "between invocations" generality).
  4. File a follow-up for thread-scoped todos (concurrency + multi-turn coverage).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: cross-task leak of task_todos_ref pollutes turn-1 system prompt when enable_todos=true

2 participants