test(e2e): add e2e test suite for knowledge and skills components#232
test(e2e): add e2e test suite for knowledge and skills components#232Iftach-Shoham wants to merge 6 commits into
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughSanitizes SKILL.md frontmatter, rejects unsafe skill names, adds install verification, enables per-call skills overrides, makes OpenSandbox concurrency-safe and resilient, forwards SDK skills options into runs, and adds comprehensive unit and multi-tier e2e tests including real-LLM scenarios. ChangesSkills Security & Reliability
SDK Skills Integration & Answer Fallback
E2E Test Infrastructure & Coverage
Sequence Diagram(s) sequenceDiagram
participant Caller as create_sandbox_tools
participant Executor as OpenSandboxExecutor
participant Lock as asyncio.Lock
participant Sandbox
participant CI as CodeInterpreter
Caller->>Executor: _get_or_create_interpreter(thread_id, requested_skills)
Executor->>Lock: acquire per-thread lock
Executor->>Executor: check cached interpreter & health
alt no healthy interpreter
Executor->>Sandbox: create sandbox workspace
Executor->>CI: create CodeInterpreter for sandbox
Executor->>Executor: _upload_skills_to_sandbox(resolved_cuga_folder) (if enabled)
Executor->>Executor: cache interpreter and record active_skills_config
end
Executor-->>Caller: return cached or new interpreter
Estimated code review effort 🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
tests/e2e/test_knowledge_e2e.py (2)
4-7: ⚡ Quick winReplace EN DASH characters to clear Ruff warnings.
Line 4, Line 6, Line 53, Line 107, Line 150, and Line 204 use
–, which triggersRUF002/RUF003. Replace with plain-to keep lint clean.Proposed diff
- Tier 1 – component-level (no LLM, no graph): exercises KnowledgeEngine's public + Tier 1 - component-level (no LLM, no graph): exercises KnowledgeEngine's public - Tier 2 – graph-level (CaptureChatModel): runs CugaLite with a mock LLM and asserts + Tier 2 - graph-level (CaptureChatModel): runs CugaLite with a mock LLM and asserts -# Tier 1 – KnowledgeEngine lifecycle +# Tier 1 - KnowledgeEngine lifecycle -# Tier 1 – Knowledge awareness (prompt-injection path) +# Tier 1 - Knowledge awareness (prompt-injection path) -# Tier 1 – RAG retrieval path (tool boundary) +# Tier 1 - RAG retrieval path (tool boundary) -# Tier 2 – CugaLite graph integration +# Tier 2 - CugaLite graph integrationAlso applies to: 53-53, 107-107, 150-150, 204-204
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/e2e/test_knowledge_e2e.py` around lines 4 - 7, Replace all EN DASH characters (–) with ASCII hyphens (-) in the test docstrings/descriptions to silence Ruff RUF002/RUF003 warnings: update the strings such as "Tier 1 – component-level (no LLM, no graph)" and "Tier 2 – graph-level (CaptureChatModel)" and any other occurrences on the noted lines so they use "-" instead of "–". Search for the literal "–" in tests/e2e/test_knowledge_e2e.py (e.g., the Tier lines and the other flagged lines) and perform a simple character substitution, then run the linter to confirm warnings are cleared.
172-175: ⚡ Quick winAvoid positional tool selection in RAG tests.
Using
tools[0]makes these tests order-dependent. Select the knowledge tool by name to prevent brittle failures if provider ordering changes.Proposed diff
provider = RealSearchKnowledgeToolProvider(engine, _COLLECTION) tools = await provider.get_all_tools() - knowledge_tool = tools[0] + knowledge_tool = next( + (t for t in tools if getattr(t, "name", "") == "knowledge_search_knowledge"), + None, + ) + assert knowledge_tool is not None, "knowledge_search_knowledge tool not found" @@ provider = RealSearchKnowledgeToolProvider(engine, _COLLECTION) tools = await provider.get_all_tools() - knowledge_tool = tools[0] + knowledge_tool = next( + (t for t in tools if getattr(t, "name", "") == "knowledge_search_knowledge"), + None, + ) + assert knowledge_tool is not None, "knowledge_search_knowledge tool not found"Also applies to: 193-195
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/e2e/test_knowledge_e2e.py` around lines 172 - 175, The test currently selects a knowledge tool by position (tools[0]) which is order-dependent; instead, locate the correct tool returned by RealSearchKnowledgeToolProvider.get_all_tools() by matching its name/identifier (e.g., filter for tool.name or tool.id that matches the expected knowledge tool) and assign that result to knowledge_tool; make the same change for the other occurrence that uses positional selection so both tests pick the tool by explicit name rather than by index.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/e2e/conftest.py`:
- Around line 236-238: The teardown should guarantee engine.shutdown() runs even
if await engine.aclose() raises: wrap the await engine.aclose() call in a
try/finally (or call engine.shutdown() from a finally block) so that
engine.shutdown() is always invoked; if you need to preserve the original
exception, re-raise it after shutdown or log the aclose() error before
re-raising so resources are not leaked (refer to the engine.aclose() and
engine.shutdown() calls).
---
Nitpick comments:
In `@tests/e2e/test_knowledge_e2e.py`:
- Around line 4-7: Replace all EN DASH characters (–) with ASCII hyphens (-) in
the test docstrings/descriptions to silence Ruff RUF002/RUF003 warnings: update
the strings such as "Tier 1 – component-level (no LLM, no graph)" and "Tier 2 –
graph-level (CaptureChatModel)" and any other occurrences on the noted lines so
they use "-" instead of "–". Search for the literal "–" in
tests/e2e/test_knowledge_e2e.py (e.g., the Tier lines and the other flagged
lines) and perform a simple character substitution, then run the linter to
confirm warnings are cleared.
- Around line 172-175: The test currently selects a knowledge tool by position
(tools[0]) which is order-dependent; instead, locate the correct tool returned
by RealSearchKnowledgeToolProvider.get_all_tools() by matching its
name/identifier (e.g., filter for tool.name or tool.id that matches the expected
knowledge tool) and assign that result to knowledge_tool; make the same change
for the other occurrence that uses positional selection so both tests pick the
tool by explicit name rather than by index.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 364c908c-72b0-4424-b9c4-177311503a5e
📒 Files selected for processing (4)
tests/e2e/__init__.pytests/e2e/conftest.pytests/e2e/test_knowledge_e2e.pytests/e2e/test_skills_e2e.py
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/e2e/test_skills_e2e.py`:
- Line 99: Replace hardcoded "/tmp" string literals in the test fixtures with a
temporary directory fixture; locate instances like
source=f"/tmp/{name}/SKILL.md" in tests/e2e/test_skills_e2e.py and change them
to use the pytest tmp_path (or tempfile.gettempdir/tmp_path_factory) to
construct paths (e.g., str(tmp_path / name / "SKILL.md")), and update the other
occurrences on the same file (lines referencing "/tmp/...") similarly so all
test paths are built from the tmp_path/tempfile API instead of a hardcoded
"/tmp".
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: ba2ee880-4eb2-40ad-af65-317306217a57
📒 Files selected for processing (3)
tests/e2e/conftest.pytests/e2e/test_knowledge_e2e.pytests/e2e/test_skills_e2e.py
🚧 Files skipped from review as they are similar to previous changes (2)
- tests/e2e/test_knowledge_e2e.py
- tests/e2e/conftest.py
| name=name, | ||
| description=f"{name} description", | ||
| body=body, | ||
| source=f"/tmp/{name}/SKILL.md", |
There was a problem hiding this comment.
Replace hardcoded /tmp literals to avoid Ruff S108 CI failures.
Line 99, Line 164, Line 176, Line 177, and Line 187 use hardcoded /tmp/... paths in test fixtures. Ruff flags these as errors (S108), which can block CI even though these are test-only strings.
Proposed patch
- source=f"/tmp/{name}/SKILL.md",
+ source=f"skills/{name}/SKILL.md",- source="/tmp/SKILL.md",
+ source="skills/SKILL.md",- SkillEntry("alpha", "Alpha skill", "## Body", "/tmp/a/SKILL.md", ()),
- SkillEntry("beta", "Beta skill", "## Body", "/tmp/b/SKILL.md", ()),
+ SkillEntry("alpha", "Alpha skill", "## Body", "skills/a/SKILL.md", ()),
+ SkillEntry("beta", "Beta skill", "## Body", "skills/b/SKILL.md", ()),- entry = SkillEntry("gamma", "Gamma makes reports", "## Body", "/tmp/g/SKILL.md", ())
+ entry = SkillEntry("gamma", "Gamma makes reports", "## Body", "skills/g/SKILL.md", ())Also applies to: 164-164, 176-177, 187-187
🧰 Tools
🪛 Ruff (0.15.12)
[error] 99-99: Probable insecure usage of temporary file or directory: "/tmp/"
(S108)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/e2e/test_skills_e2e.py` at line 99, Replace hardcoded "/tmp" string
literals in the test fixtures with a temporary directory fixture; locate
instances like source=f"/tmp/{name}/SKILL.md" in tests/e2e/test_skills_e2e.py
and change them to use the pytest tmp_path (or
tempfile.gettempdir/tmp_path_factory) to construct paths (e.g., str(tmp_path /
name / "SKILL.md")), and update the other occurrences on the same file (lines
referencing "/tmp/...") similarly so all test paths are built from the
tmp_path/tempfile API instead of a hardcoded "/tmp".
Implements the full test coverage requested in issue #206. **What's covered** Knowledge (`tests/e2e/test_knowledge_e2e.py`): - Tier 1 – KnowledgeEngine lifecycle: ingest → search → delete → replace cycle - Tier 1 – Awareness path: get_knowledge_summary() output and format_knowledge_context() collection naming / sanitization - Tier 1 – RAG retrieval path: RealSearchKnowledgeToolProvider calls engine.search() directly, asserts ingested content is returned by the tool - Tier 2 – CugaLiteGraph integration: agent-scoped and session-scoped knowledge summaries injected into the system prompt; knowledge + skills coexistence Skills (`tests/e2e/test_skills_e2e.py`): - Tier 1 – Discovery: discover_skills() finds all skills, preserves description, parses pip and npm requirements - Tier 1 – Registry: load_skill() body ordering, uv/npm install commands, mixed deps, normalization hint - Tier 1 – Tools/block: create_skill_tools() returns load_skill; format_available_skills_block() lists names and descriptions - Tier 2 – CugaLiteGraph integration: skills block appears in system prompt; load_skill bound to model in native tool-calling mode - Blocked: test_skill_executed_via_sub_agent skipped pending #199 **Shared infrastructure (`tests/e2e/conftest.py`)** - CaptureChatModel: hermetic mock LLM that records inputs and replays scripted responses - KnowledgeToolProvider: stub search tool (triggers awareness detection path) - RealSearchKnowledgeToolProvider: real search tool backed by KnowledgeEngine (covers RAG path) - knowledge_engine fixture: isolated fastembed + sqlite-vec engine in tmp_path Closes #206
- conftest.py: wrap engine.aclose() in try/finally so shutdown() is always called even if aclose() raises (resource leak fix) - test_knowledge_e2e.py: replace EN dash characters with ASCII hyphens in docstrings/comments to silence Ruff RUF002/RUF003 - test_knowledge_e2e.py: select knowledge_search_knowledge tool by name instead of positional index tools[0] to avoid order-dependent failures
What was broken
Security
1 — Prompt injection via SKILL.md frontmatter (loader.py)
Jinja2 delimiters ({{ }}, {% %}, {# #}) in a skill's name or description were forwarded unescaped into the system prompt template. A malicious or misconfigured
SKILL.md could inject arbitrary template directives. Fixed by stripping matching patterns at parse time and logging a warning.
2 — Path traversal in skill names (loader.py)
A skill named ../../etc/passwd was accepted and later used to construct file paths during sandbox upload. Fixed by rejecting names that contain /, \, or ...
---
Reliability
3 — Concurrent sandbox creation race (opensandbox_executor.py)
Two coroutines calling _get_or_create_interpreter for the same thread_id simultaneously could both pass the cache check and create two remote containers,
orphaning one. Fixed with a per-key asyncio.Lock.
4 — Skills upload failure killed the sandbox (opensandbox_executor.py)
If _upload_skills_to_sandbox raised (e.g. transient network error), the sandbox was discarded rather than cached. On the next request a new container was
created, leaking the first one. Fixed by catching the exception, logging a warning, and caching the sandbox regardless.
5 — Package install not verified (registry.py)
After generating a uv pip install / npm install step in the skill script, there was no verification that the install actually succeeded. Fixed by appending uv
pip show <pkgs> / npm list <pkgs> so the agent surfaces install failures immediately instead of getting a confusing import error later.
6 — Silent skill name collision (loader.py)
When two SKILL.md files declared the same name, the later one silently replaced the earlier one. Now logs a DEBUG message identifying both paths.
---
SDK / graph wiring
7 — enable_skills / skills_folder ignored by the graph (sdk.py, cuga_lite_graph.py)
CugaAgent(enable_skills=True, skills_folder=...) stored the values but never forwarded them into run_config[configurable]. The graph always fell back to
settings.skills.enabled and CUGA_FOLDER, making the SDK parameters a silent no-op. Fixed for both invoke() and chat().
8 — Double .cuga suffix (sdk.py)
If a caller passed a path already ending in .cuga as skills_folder, the SDK appended another .cuga, directing discovery to the wrong location with no error.
Fixed with a guard that skips the suffix when already present.
9 — enable_shell_tool=False suppressed skills silently (prompt_utils.py)
When advanced_features.enable_shell_tool=False, the skills block was cleared with no indication. Fixed by emitting a WARNING log before clearing it.
10 — final_answer empty for reasoning models (sdk.py)
Some reasoning models return content='' with the answer in additional_kwargs[reasoning_content]. The answer-extraction loop missed this field and fell through
to the interrupt-handling path, returning an empty answer. Fixed with a fallback pass over chat_messages.
---
Tests added
Shared infrastructure — tests/e2e/conftest.py
- CaptureChatModel — hermetic mock LLM that records all inputs and replays a scripted response queue; fails loudly if the graph makes more calls than expected
- MinimalToolProvider — no-op tool provider base for test isolation
- write_skill / extract_system_content — helpers used across all skill test tiers
- real_llm fixture — builds the real configured LLM from project settings for Tier 3 tests
- pytest_terminal_summary hook — prints a pass/fail table of all Tier 3 real-LLM results at session end
Component & graph — tests/e2e/test_skills_e2e.py
┌──────┬────────────────────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Tier │ Class │ Coverage │
├──────┼────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 1 │ TestSkillDiscovery │ filesystem scan, multiple skills, description preserved, pip/npm requirements parsed │
├──────┼────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 1 │ TestSkillRegistry │ load_skill returns body, install preamble generated for pip/npm, missing skill raises │
├──────┼────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 1 │ TestSkillTools │ create_skill_tools returns a load_skill StructuredTool; calling it returns the body │
├──────┼────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 1 │ TestSkillBlock │ format_available_skills_block includes all names and descriptions │
├──────┼────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 2 │ TestSkillsGraphIntegration │ CugaLiteGraph + CaptureChatModel — skills block present in system message; load_skill bound to model │
└──────┴────────────────────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────┘
Real LLM — tests/e2e/test_skills_llm_e2e.py
Three skills containing fabricated proprietary data (formula coefficients, internal part codes, invented system names) that the model cannot know from training.
Each positive test asserts the expected value appears; each paired negative control runs with skills disabled and asserts it is absent — confirming the skill
is the genuine gating factor, not the model's prior knowledge.
SDK surface — tests/e2e/test_skills_sdk_e2e.py
- Tier 1: enable_skills / skills_folder stored on agent; forwarded into configurable; .cuga not double-appended; plain workspace root gets .cuga appended
- Tier 3: same three fabricated-data skills driven through CugaAgent.invoke() instead of the raw graph
Real public skills — tests/e2e/test_skills_real_e2e.py
Downloads the real Vercel agent-skills SKILL.md, invokes the agent, and asserts that taxonomy-specific rule identifiers (long enough to be unlikely from
training alone) appear in the answer. Skips automatically when GitHub is unreachable.
Presentation demo — tests/e2e/test_skills_presentation_e2e.py
Downloads the real pptx skill from anthropics/skills, pre-installs pptxgenjs, provides a run_node_script tool, and asserts a non-trivially-sized .pptx file is
produced. Output saved to tests/e2e/demo_outputs/ for manual inspection. Run with -m e2e.
Unit — tests/unit/test_skill_loader.py (additions)
- _copy_skills_to_workspace copies SKILL.md into the per-thread workspace
- _copy_skills_to_workspace is a no-op when skills_enabled=False
- SKILL.md with a path-traversal name is silently rejected
Unit — tests/unit/test_opensandbox_executor.py
- Concurrent _get_or_create_interpreter calls produce exactly one Sandbox.create call (lock serialises creation)
- Upload failure logs a warning but the sandbox is still cached (no orphan container)
- create_sandbox_tools warns when the requested skills config differs from what was active at sandbox creation time
- Reformat cuga_lite_graph.py, local/native/opensandbox executors, and skills loader - Reformat e2e test files (test_skills_e2e, test_skills_llm_e2e, test_skills_sdk_e2e) - Reformat unit test files (test_opensandbox_executor, test_skill_loader)
e838da9 to
3cac208
Compare
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/cuga/backend/cuga_graph/nodes/cuga_lite/executors/opensandbox/opensandbox_executor.py (1)
282-302:⚠️ Potential issue | 🟠 Major | ⚡ Quick winMissing
_skills_configpopulation — skills overrides are never applied.Unlike
LocalSandboxExecutorandNativeSandboxExecutor, this executor'screate_sandbox_tools()does not acceptcuga_folderorskills_enabledparameters and never setsself._skills_config[key]. When_get_or_create_interpreter()later reads_skills_config.get(key, (None, None))at line 164, it always gets the default, so caller-supplied skills configuration is ignored.The unit tests work around this by manually setting
executor._skills_config["thread-A"] = (...)before calling the interpreter method, but real callers won't do this.🐛 Proposed fix to add skills config parameters and storage
def create_sandbox_tools( self, thread_id: Optional[str] = None, + cuga_folder: Optional[str] = None, + skills_enabled: Optional[bool] = None, ) -> list[StructuredTool]: """Return the run_command StructuredTool bound to thread_id. Filesystem tools (read/write/list/edit/...) now come from the consolidated ``filesystem`` package via a ``RemoteSandboxBackend`` (wired in ``cuga_lite_graph``); they are no longer produced here. """ + key = thread_id or "_default" + self._skills_config[key] = (cuga_folder, skills_enabled) return [ StructuredTool.from_function(🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cuga/backend/cuga_graph/nodes/cuga_lite/executors/opensandbox/opensandbox_executor.py` around lines 282 - 302, create_sandbox_tools currently ignores caller-provided skills overrides so _get_or_create_interpreter always sees defaults; change create_sandbox_tools(self, thread_id: Optional[str] = None) to accept the skills params (e.g. add cuga_folder: Optional[str]=None, skills_enabled: Optional[bool]=None), compute the key used by _get_or_create_interpreter (same thread_id logic), and store the tuple into self._skills_config[key] = (cuga_folder, skills_enabled) before returning the StructuredTool; keep the StructuredTool.from_function call using self.create_run_command_tool(thread_id) unchanged so only the skills config storage is added.
🧹 Nitpick comments (1)
src/cuga/backend/cuga_graph/nodes/cuga_lite/executors/opensandbox/opensandbox_executor.py (1)
91-97: 💤 Low valueMutable class-level dictionaries are shared across all instances.
These dictionaries are defined at class level, meaning all
OpenSandboxExecutorinstances share the same cache. While this may be intentional for singleton-style usage, it's fragile if multiple instances are created. Consider usingClassVartype hints to make the intent explicit, or initialize these in__init__if instance isolation is needed.💡 Suggested type annotation to document intent
+from typing import Any, Callable, ClassVar, Optional + class OpenSandboxExecutor(RemoteExecutor): """Provides a persistent sandbox with shell + filesystem tools.""" # Interpreter cache: thread_id -> CodeInterpreter instance - _sandboxes: dict[str, Any] = {} + _sandboxes: ClassVar[dict[str, Any]] = {} # Per-thread skills config requested on the current invocation - _skills_config: dict[str, tuple[Optional[str], Optional[bool]]] = {} + _skills_config: ClassVar[dict[str, tuple[Optional[str], Optional[bool]]]] = {} # Skills config that was actually applied when the sandbox was created - _active_skills_config: dict[str, tuple[Optional[str], Optional[bool]]] = {} + _active_skills_config: ClassVar[dict[str, tuple[Optional[str], Optional[bool]]]] = {} # Per-key asyncio locks to prevent concurrent sandbox creation for the same thread - _locks: dict[str, asyncio.Lock] = {} + _locks: ClassVar[dict[str, asyncio.Lock]] = {}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cuga/backend/cuga_graph/nodes/cuga_lite/executors/opensandbox/opensandbox_executor.py` around lines 91 - 97, The class-level mutable dicts _sandboxes, _skills_config, _active_skills_config, and _locks on OpenSandboxExecutor are shared across all instances; either make the intent explicit by annotating them as ClassVar[...] (to document shared, singleton-style state) or move their initialization into the instance __init__ (self._sandboxes, self._skills_config, self._active_skills_config, self._locks) to ensure instance isolation; update type hints accordingly and ensure all methods that reference these symbols use self.<name> when you switch to instance attributes.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In
`@src/cuga/backend/cuga_graph/nodes/cuga_lite/executors/opensandbox/opensandbox_executor.py`:
- Around line 282-302: create_sandbox_tools currently ignores caller-provided
skills overrides so _get_or_create_interpreter always sees defaults; change
create_sandbox_tools(self, thread_id: Optional[str] = None) to accept the skills
params (e.g. add cuga_folder: Optional[str]=None, skills_enabled:
Optional[bool]=None), compute the key used by _get_or_create_interpreter (same
thread_id logic), and store the tuple into self._skills_config[key] =
(cuga_folder, skills_enabled) before returning the StructuredTool; keep the
StructuredTool.from_function call using self.create_run_command_tool(thread_id)
unchanged so only the skills config storage is added.
---
Nitpick comments:
In
`@src/cuga/backend/cuga_graph/nodes/cuga_lite/executors/opensandbox/opensandbox_executor.py`:
- Around line 91-97: The class-level mutable dicts _sandboxes, _skills_config,
_active_skills_config, and _locks on OpenSandboxExecutor are shared across all
instances; either make the intent explicit by annotating them as ClassVar[...]
(to document shared, singleton-style state) or move their initialization into
the instance __init__ (self._sandboxes, self._skills_config,
self._active_skills_config, self._locks) to ensure instance isolation; update
type hints accordingly and ensure all methods that reference these symbols use
self.<name> when you switch to instance attributes.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 7fa69b50-f1ff-4626-a376-59fb19beb610
📒 Files selected for processing (17)
src/cuga/backend/cuga_graph/nodes/cuga_lite/executors/local/local_sandbox_executor.pysrc/cuga/backend/cuga_graph/nodes/cuga_lite/executors/native/native_sandbox_executor.pysrc/cuga/backend/cuga_graph/nodes/cuga_lite/executors/opensandbox/opensandbox_executor.pysrc/cuga/backend/cuga_graph/nodes/cuga_lite/prompt_utils.pysrc/cuga/backend/skills/loader.pysrc/cuga/backend/skills/registry.pysrc/cuga/sdk.pytests/e2e/__init__.pytests/e2e/conftest.pytests/e2e/demo_outputs/.gitignoretests/e2e/test_skills_e2e.pytests/e2e/test_skills_llm_e2e.pytests/e2e/test_skills_presentation_e2e.pytests/e2e/test_skills_real_e2e.pytests/e2e/test_skills_sdk_e2e.pytests/unit/test_opensandbox_executor.pytests/unit/test_skill_loader.py
💤 Files with no reviewable changes (10)
- tests/e2e/demo_outputs/.gitignore
- tests/unit/test_opensandbox_executor.py
- tests/unit/test_skill_loader.py
- src/cuga/sdk.py
- tests/e2e/test_skills_presentation_e2e.py
- tests/e2e/test_skills_sdk_e2e.py
- tests/e2e/test_skills_llm_e2e.py
- tests/e2e/conftest.py
- tests/e2e/test_skills_real_e2e.py
- tests/e2e/test_skills_e2e.py
✅ Files skipped from review due to trivial changes (1)
- src/cuga/backend/cuga_graph/nodes/cuga_lite/prompt_utils.py
- Add cuga_folder and skills_enabled optional params (matches local/native) - Store (cuga_folder, skills_enabled) into _skills_config[key] so that _get_or_create_interpreter reads the caller-supplied overrides instead of always falling back to settings defaults
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@src/cuga/backend/cuga_graph/nodes/cuga_lite/executors/opensandbox/opensandbox_executor.py`:
- Around line 294-295: When updating _skills_config for a given key (computed
from thread_id or "_default"), first check if a live interpreter exists in
self._interpreters for that same key and compare its active config to the
incoming (cuga_folder, skills_enabled); if they differ, emit a warning that the
new config will be ignored until release_sandbox() is called (so callers know
the per-call override won’t take effect). Implement this check alongside the
existing assignment in the same place where _skills_config is updated,
referencing _get_or_create_interpreter, _interpreters, _skills_config, and
release_sandbox in the message so the log is clear and matches the tested
behavior.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 51b9f8ed-9ea4-479c-9814-31377c333be9
📒 Files selected for processing (1)
src/cuga/backend/cuga_graph/nodes/cuga_lite/executors/opensandbox/opensandbox_executor.py
| key = thread_id or "_default" | ||
| self._skills_config[key] = (cuga_folder, skills_enabled) |
There was a problem hiding this comment.
Warn when a live sandbox's skills config changes.
Line 295 updates _skills_config, but _get_or_create_interpreter() returns the cached live interpreter on Lines 131-135 before any new config can be applied. That means a different cuga_folder/skills_enabled is silently ignored until release_sandbox() is called, which breaks the per-call override behavior and the stale-config warning path exercised in tests/unit/test_opensandbox_executor.py:251-278.
Suggested fix
key = thread_id or "_default"
- self._skills_config[key] = (cuga_folder, skills_enabled)
+ requested = (cuga_folder, skills_enabled)
+ active = self._active_skills_config.get(key)
+ if key in self._sandboxes and active is not None and requested != active:
+ logger.warning(
+ "[OpenSandboxExecutor] Skills config changed for a live sandbox; "
+ "call release_sandbox() before requesting different "
+ "cuga_folder/skills_enabled values."
+ )
+ self._skills_config[key] = requested🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@src/cuga/backend/cuga_graph/nodes/cuga_lite/executors/opensandbox/opensandbox_executor.py`
around lines 294 - 295, When updating _skills_config for a given key (computed
from thread_id or "_default"), first check if a live interpreter exists in
self._interpreters for that same key and compare its active config to the
incoming (cuga_folder, skills_enabled); if they differ, emit a warning that the
new config will be ignored until release_sandbox() is called (so callers know
the per-call override won’t take effect). Implement this check alongside the
existing assignment in the same place where _skills_config is updated,
referencing _get_or_create_interpreter, _interpreters, _skills_config, and
release_sandbox in the message so the log is clear and matches the tested
behavior.
Summary
Closes #206.
Closes #311.
Adds a new
tests/e2e/directory with full end-to-end test coverage for the knowledge and skills components as requested in the issue. Tests are structured in two tiers per component: Tier 1 exercises the component API directly (no LLM, no graph — fast and surgical), Tier 2 runs the real compiledCugaLiteGraphwith a hermetic mock LLM and asserts on what actually reached the model.What's covered
Knowledge (
test_knowledge_e2e.py) — 12 testsTestKnowledgeEngineLifecycleTestKnowledgeAwarenessget_knowledge_summary()output;format_knowledge_context()naming and special-char sanitizationTestKnowledgeRagPathknowledge_search_knowledgetool viaRealSearchKnowledgeToolProvidercallsengine.search()and returns ingested content; stale chunk not returned after replaceTestKnowledgeCugaLiteIntegrationSkills (
test_skills_e2e.py) — 15 tests + 1 skipTestSkillDiscoverydiscover_skills()finds single/multiple skills, preserves description, parses pip and npm requirementsTestSkillRegistryload_skill()body order,uv pip install/npm installcommands, mixed deps (STEP 1 before STEP 2), normalization hintTestSkillToolsAndBlockcreate_skill_tools()returnsload_skill; block contains names and descriptionsTestSkillsCugaLiteIntegrationload_skillbound to model in native tool-calling modeTestSkillsBlockedPathstest_skill_executed_via_sub_agent— skipped, blocked on #199Shared infrastructure (
conftest.py)CaptureChatModel— hermetic mock LLM: records every message list and tool binding, replays scripted responses, raises on queue exhaustion (catches unexpected graph loops)KnowledgeToolProvider— stubknowledge_search_knowledgetool (triggers thestartswith("knowledge_")detection incuga_lite_graph.py:1493) without requiring a running MCP serverRealSearchKnowledgeToolProvider— realknowledge_search_knowledgetool backed byengine.search(), covers the RAG retrieval pathknowledge_enginefixture — isolated fastembed + sqlite-vec engine intmp_path, monkeypatched off any real DBTest plan
uv run pytest tests/e2e/test_knowledge_e2e.py -v— 12 passeduv run pytest tests/e2e/test_skills_e2e.py -v— 15 passed, 1 skipped (expected, blocked on [Feature] Skills should be able to spawn sub-agents at runtime via use_sub_agents #199)Summary by CodeRabbit
New Features
Bug Fixes / Improvements
Security
Tests