fix(cuga_lite): reset task_todos state between invocations by Sergey-Zeltyn · Pull Request #315 · cuga-project/cuga-agent

Sergey-Zeltyn · 2026-06-08T08:22:06Z

Bug fix

Fixes #314

Summary

When advanced_features.enable_todos=true, the closure-scoped task_todos_ref (plus the parallel state.task_todos checkpointer field) outlived a single agent.invoke(). From the second task onward the previous task's plan leaked into the next task's turn-1 system prompt as ## Current task todos, biasing the model's first reasoning step before it had any chance to call create_update_todos for the new task.

Root cause: CugaAgent._create_graph memoizes on self._graph and only resets in add_tool() (sdk.py:2224), so every .invoke() reuses the same compiled graph and the same closure-captured list created in create_cuga_lite_graph (cuga_lite_graph.py:180). The tool body in todos.py:123-126 only does clear() + extend(new) — it never resets on a new task. Nothing in the SDK, agent loop, or framework cleared these between invocations.

Fix: at the top of prepare_tools_and_apps (prepare_node.py), detect a fresh conversation via len(state.chat_messages) <= 1 and clear both adapter._task_todos_ref and state.task_todos. The guard preserves HITL resume (turn 2+ keeps the in-flight plan).

Verification

Reproduced and verified on a 43-task AppWorld eval bundle (gpt-4.1, enable_todos=true):

Metric	Before fix	After fix
Traces with `## Current task todos` in turn-1 system prompt	42 / 43	0 / 43
Traces with `## Current Plan` in turn-1 (state path)	—	0 / 43
Mean fraction of LLM turns spent on todo-only blocks	7.1%	5.6%

Concrete example from the pre-fix bundle — task asking about HR/candidates/report_template.md opens with a stale Gmail plan injected as authoritative:

## Current task todos
1. **[pending]** Fetch all unread email threads from Gmail inbox (handle pagination)
2. **[pending]** Filter threads for priority-1
3. **[pending]** Count the number of priority-1 unread threads
4. **[pending]** Return the count to the user

After the fix, every one of the 43 traces starts turn 1 with a clean system prompt.

Testing

Verified fix locally; tests pass
Re-ran 43-task AppWorld eval with enable_todos=true: leak signature dropped from 42/43 to 0/43
Existing unit tests under cuga_lite/tests/test_agent_graph_adapter.py exercise prepare_system_content directly (not prepare_tools_and_apps), so they're unaffected by the new clear logic

Summary by CodeRabbit

Bug Fixes
- Fixed an issue where previous task information could persist into new conversation turns. The system now properly clears state when starting a fresh conversation to prevent unintended data carry-over between separate interactions.

The closure-scoped `task_todos_ref` (and the parallel `state.task_todos` checkpointer field) outlived a single `agent.invoke()`. With `enable_todos=true`, the previous task's plan leaked into the next task's turn-1 system prompt as `## Current task todos`, biasing the model's first reasoning step toward an unrelated plan. Clear both at the top of `prepare_tools_and_apps`, gated on a fresh conversation (`len(chat_messages) <= 1`) so HITL resume is unaffected. Fixes #314. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Sergey Zeltyn <sergeyz@il.ibm.com>

coderabbitai · 2026-06-08T08:22:22Z

📝 Walkthrough

Walkthrough

The PR fixes a bug where task plans from a prior CugaAgent.invoke() call leak into the system prompt of subsequent invocations when todos are enabled. A guard at the start of prepare_tools_and_apps now clears both the closure-scoped adapter._task_todos_ref and state.task_todos when a fresh conversation begins.

Changes

Task todos reset on new conversation

Layer / File(s)	Summary
Clear task todos on conversation start `src/cuga/backend/cuga_graph/nodes/cuga_lite/adapter/prepare_node.py`	When `chat_messages` is empty or contains only the first message, the function clears `adapter._task_todos_ref` and resets `state.task_todos` to `None` to prevent stale task plans from contaminating the turn-1 system prompt across multiple invocations.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Poem

🐰 A todo list that wouldn't fade,
From task to task, its mark was made!
But now each chat begins fresh and clean,
No phantom plans will cross between.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and concisely describes the main fix: resetting task_todos state between invocations to prevent cross-task leakage, which is the core objective of the PR.
Linked Issues check	✅ Passed	The PR implements the exact fix suggested in issue `#314`: clearing adapter._task_todos_ref and state.task_todos at the top of prepare_tools_and_apps when a fresh conversation is detected (len(state.chat_messages) <= 1), with verification showing the bug is resolved.
Out of Scope Changes check	✅ Passed	All changes are directly scoped to fixing issue `#314`; no extraneous modifications, refactoring, or unrelated features are present in the diff.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/cross-task-leak-todos

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/cuga/backend/cuga_graph/nodes/cuga_lite/adapter/prepare_node.py`:
- Around line 73-80: Run ruff formatting and linting on the cuga_graph/nodes
changes: execute "ruff format" then "ruff check" (or use the project's uv
wrapper if available) targeting the src/cuga/backend/cuga_graph/nodes/ tree so
prepare_node.py and related files are formatted and any lint errors are
surfaced; after fixing any reported issues re-run to ensure
adapter._task_todos_ref usage and state.task_todos handling in prepare_node.py
meet ruff rules.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0cae5ba2-d548-4644-857a-63d0aa75c3b8

📥 Commits

Reviewing files that changed from the base of the PR and between 5bd8644 and 1780408.

📒 Files selected for processing (1)

src/cuga/backend/cuga_graph/nodes/cuga_lite/adapter/prepare_node.py

coderabbitai · 2026-06-08T08:24:56Z

+        # The adapter's task_todos_ref is closure-scoped per compiled graph and
+        # outlives a single .invoke(); state.task_todos can also persist via a
+        # thread-keyed checkpointer. Reset both at the start of a new
+        # conversation so a previous task's plan doesn't leak into this one's
+        # turn-1 system prompt.
+        if len(state.chat_messages or []) <= 1:
+            adapter._task_todos_ref.clear()
+            state.task_todos = None


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

uv run ruff format src/cuga/backend/cuga_graph/nodes/ uv run ruff check src/cuga/backend/cuga_graph/nodes/

Repository: cuga-project/cuga-agent

Length of output: 151

Run Ruff format + Ruff check for cuga_graph/nodes changes

The required ruff format and ruff check commands haven’t been executed in this environment (the uv run ... approach requires uv to be available). Run formatting and linting for src/cuga/backend/cuga_graph/nodes/ (via uv if available, otherwise ruff format + ruff check directly).

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cuga/backend/cuga_graph/nodes/cuga_lite/adapter/prepare_node.py` around lines 73 - 80, Run ruff formatting and linting on the cuga_graph/nodes changes: execute "ruff format" then "ruff check" (or use the project's uv wrapper if available) targeting the src/cuga/backend/cuga_graph/nodes/ tree so prepare_node.py and related files are formatted and any lint errors are surfaced; after fixing any reported issues re-run to ensure adapter._task_todos_ref usage and state.task_todos handling in prepare_node.py meet ruff rules.

Source: Coding guidelines

sami-marreed · 2026-06-09T22:49:55Z

Code Review

Verdict: merge with fixes. The primary half of the fix is correct and demonstrably works (42/43→0/43 on the eval), but the second half — the state.task_todos reset — is effectively a no-op, and the scope is narrower than the title claims.

What's solid

Root-cause diagnosis is right. _task_todos_ref is a closure-shared list (cuga_lite_graph.py:180) held by the long-lived adapter (graph_adapter.py:62) and read first in prepare_system_content (graph_adapter.py:92). It's the dominant leak path, and _task_todos_ref.clear() fixes it for real — the ref lives outside LangGraph channels, so the mutation is immediately visible to call_model.
.clear() is safe — the ref is always a list, never None.
Correctly placed at the top of the START node, before any prompt construction.

Issues

🔴 Critical — state.task_todos = None is a no-op (prepare_node.py:78).
The node returns Command(goto="call_model", update={...}) and that update dict (prepare_node.py:630-642) does not contain task_todos. LangGraph reduces state only from the returned update; the in-place assignment is discarded. So when _task_todos_ref is empty, prepare_system_content falls through to the still-stale state.task_todos (graph_adapter.py:94) — the exact fallback this PR claims to close.
Fix: add "task_todos": None to the Command.update dict (gated on the same reset condition), not just the local mutation. One line.

🟠 Important — guard doesn't cover new-task-on-same-thread.
Both invoke paths accumulate chat_messages (SDK append path; cuga_lite_node.py:312 "Preserve existing chat history"). A genuinely new task as a follow-up turn has chat_messages > 1, so the guard is false and nothing clears → the original leak can still occur. The fix only covers the fresh-invoke-per-task scenario — which is exactly the AppWorld harness it was verified on. The title "between invocations" overstates coverage; consider re-scoping to "on a fresh conversation (≤1 message)."

🟠 Important — HITL preservation is incidental, not designed.
Resume re-enters the subgraph from START so prepare re-runs, but at that point chat_messages > 1 so the guard is already false. HITL is preserved by message-count accumulation, not by design intent — fine outcome, but the comment's framing is coincidental.

🟡 Minor

Concurrency: the adapter / _task_todos_ref is a process-global singleton (built once at startup); .clear() races across concurrent threads. Pre-existing (the todos tool already mutates the shared ref), but worth a follow-up for thread-scoped todos.
No regression test (+9/−0). A test that seeds a stale task_todos + populated _task_todos_ref, runs a fresh invoke, and asserts both the prompt section and the persisted state.task_todos are empty would catch the Critical bug directly.

Recommendation

Must-fix: add task_todos to the Command.update dict — otherwise the checkpointer fallback path stays open.
Add the regression test covering both reset targets.
Tighten the title/description (scope to fresh conversation; drop the "between invocations" generality).
File a follow-up for thread-scoped todos (concurrency + multi-turn coverage).

Sergey-Zeltyn requested a review from sami-marreed June 8, 2026 08:22

coderabbitai Bot reviewed Jun 8, 2026

View reviewed changes

Sergey-Zeltyn self-assigned this Jun 8, 2026

Sergey-Zeltyn added the bug Something isn't working label Jun 8, 2026

Sergey-Zeltyn added this to CUGA Workitems Jun 8, 2026

github-project-automation Bot moved this to Backlog in CUGA Workitems Jun 8, 2026

Sergey-Zeltyn moved this from Backlog to In progress in CUGA Workitems Jun 8, 2026

Sergey-Zeltyn removed this from CUGA Workitems Jun 9, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(cuga_lite): reset task_todos state between invocations#315

fix(cuga_lite): reset task_todos state between invocations#315
Sergey-Zeltyn wants to merge 1 commit into
mainfrom
fix/cross-task-leak-todos

Sergey-Zeltyn commented Jun 8, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 8, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jun 8, 2026

Uh oh!

sami-marreed commented Jun 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Sergey-Zeltyn commented Jun 8, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Bug fix

Summary

Verification

Testing

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 8, 2026

Choose a reason for hiding this comment

Uh oh!

sami-marreed commented Jun 9, 2026

Code Review

What's solid

Issues

Recommendation

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Sergey-Zeltyn commented Jun 8, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 8, 2026 •

edited

Loading