test(e2e): add e2e test suite for knowledge and skills components by Iftach-Shoham · Pull Request #232 · cuga-project/cuga-agent

Iftach-Shoham · 2026-05-17T07:59:15Z

Summary

Closes #206.
Closes #311.

Adds a new tests/e2e/ directory with full end-to-end test coverage for the knowledge and skills components as requested in the issue. Tests are structured in two tiers per component: Tier 1 exercises the component API directly (no LLM, no graph — fast and surgical), Tier 2 runs the real compiled CugaLiteGraph with a hermetic mock LLM and asserts on what actually reached the model.

What's covered

Knowledge (`test_knowledge_e2e.py`) — 12 tests

Class	What it tests
`TestKnowledgeEngineLifecycle`	ingest → search → delete → re-ingest (stale chunk purge)
`TestKnowledgeAwareness`	`get_knowledge_summary()` output; `format_knowledge_context()` naming and special-char sanitization
`TestKnowledgeRagPath`	real `knowledge_search_knowledge` tool via `RealSearchKnowledgeToolProvider` calls `engine.search()` and returns ingested content; stale chunk not returned after replace
`TestKnowledgeCugaLiteIntegration`	agent-scoped knowledge in system prompt; session-scoped knowledge in system prompt; knowledge + skills coexistence — all assert both filename and document content appear

Skills (`test_skills_e2e.py`) — 15 tests + 1 skip

Class	What it tests
`TestSkillDiscovery`	`discover_skills()` finds single/multiple skills, preserves description, parses pip and npm requirements
`TestSkillRegistry`	`load_skill()` body order, `uv pip install` / `npm install` commands, mixed deps (STEP 1 before STEP 2), normalization hint
`TestSkillToolsAndBlock`	`create_skill_tools()` returns `load_skill`; block contains names and descriptions
`TestSkillsCugaLiteIntegration`	skill name in system prompt; `load_skill` bound to model in native tool-calling mode
`TestSkillsBlockedPaths`	`test_skill_executed_via_sub_agent` — skipped, blocked on #199

Shared infrastructure (`conftest.py`)

CaptureChatModel — hermetic mock LLM: records every message list and tool binding, replays scripted responses, raises on queue exhaustion (catches unexpected graph loops)
KnowledgeToolProvider — stub knowledge_search_knowledge tool (triggers the startswith("knowledge_") detection in cuga_lite_graph.py:1493) without requiring a running MCP server
RealSearchKnowledgeToolProvider — real knowledge_search_knowledge tool backed by engine.search(), covers the RAG retrieval path
knowledge_engine fixture — isolated fastembed + sqlite-vec engine in tmp_path, monkeypatched off any real DB

Test plan

uv run pytest tests/e2e/test_knowledge_e2e.py -v — 12 passed
uv run pytest tests/e2e/test_skills_e2e.py -v — 15 passed, 1 skipped (expected, blocked on [Feature] Skills should be able to spawn sub-agents at runtime via use_sub_agents #199)
No real LLM, no API keys, no network required — fully hermetic

Summary by CodeRabbit

New Features
- Agent "skills" support: enable/disable per-agent, configurable skills folder, and tooling to load skills at runtime.
- Post-install verification added for pip and npm dependencies.
Bug Fixes / Improvements
- Per-thread sandbox and tool lifecycle improvements to reliably apply per-request skills settings and avoid concurrent creation issues.
- Prompt now warns when skills are requested but required shell tooling is disabled.
Security
- Frontmatter sanitization to prevent prompt-injection and rejection of skill names with path traversal.
Tests
- Extensive end-to-end and unit tests added for skills, SDK wiring, LLM gating, presentation skill, and sandbox behavior.

coderabbitai · 2026-05-17T07:59:30Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Sanitizes SKILL.md frontmatter, rejects unsafe skill names, adds install verification, enables per-call skills overrides, makes OpenSandbox concurrency-safe and resilient, forwards SDK skills options into runs, and adds comprehensive unit and multi-tier e2e tests including real-LLM scenarios.

Changes

Skills Security & Reliability

Layer / File(s)	Summary
Jinja2 sanitization & path traversal prevention `src/cuga/backend/skills/loader.py`	Strips Jinja2 delimiters from skill `name`/`description` frontmatter and rejects sanitized names containing `/`, `\`, or `..`.
Skills verification commands `src/cuga/backend/skills/registry.py`	`load_skill()` now appends `uv pip show ...` for pip and `npm list ...` for npm to verify installs.
OpenSandbox concurrency & upload resilience `src/cuga/backend/cuga_graph/nodes/cuga_lite/executors/opensandbox/opensandbox_executor.py`, `tests/unit/test_opensandbox_executor.py`	Adds per-thread asyncio locks, requested/active skills config, logs and caches interpreter on upload failures, resolves cuga_folder from argument→env→policy, and clears per-thread state on release; unit tests cover concurrency, upload-failure caching, stale-config warnings, and release cleanup.
Executor skill parameter plumbing `src/cuga/backend/cuga_graph/nodes/cuga_lite/executors/local/local_sandbox_executor.py`, `native/native_sandbox_executor.py`	`_copy_skills_to_workspace` and `create_sandbox_tools` accept `cuga_folder` and `skills_enabled` overrides and resolve the effective folder from argument→env→policy.
Prompt gating warning `src/cuga/backend/cuga_graph/nodes/cuga_lite/prompt_utils.py`	Logs a warning when `skills_enabled=True` but `advanced_features.enable_shell_tool=False`.

SDK Skills Integration & Answer Fallback

Layer / File(s)	Summary
CugaAgent skills parameters & graph wiring `src/cuga/sdk.py`	Constructor accepts `enable_skills` and `skills_folder`; `invoke()`/`stream()` inject these into `run_config["configurable"]` with `.cuga` normalization to avoid double-suffix issues.
Final answer extraction fallback `src/cuga/sdk.py`	`invoke()` falls back to last AI message `content` or `additional_kwargs["reasoning_content"]` when `final_answer` is empty.

E2E Test Infrastructure & Coverage

Layer / File(s)	Summary
E2E fixtures & helpers `tests/e2e/conftest.py`	Adds `CaptureChatModel`, `MinimalToolProvider`, `write_skill()`, `extract_system_content()`, `real_llm` fixture, and a pytest terminal summary aggregator.
Skills component discovery & registry tests `tests/e2e/test_skills_e2e.py`, `tests/unit/test_skill_loader.py`	Tiered tests for filesystem discovery, YAML frontmatter preservation, requirement parsing (pip/npm), `load_skill()` script formatting and verification ordering, tool creation, and sanitization/path-traversal behavior.
SDK configuration & Tier 3 LLM gating `tests/e2e/test_skills_sdk_e2e.py`, `tests/e2e/test_skills_llm_e2e.py`	Tier 1 SDK wiring checks and Tier 3 gating tests using fabricated skills and real LLMs to assert expected outputs appear when skills are enabled and are absent when disabled.
Presentation & real-skill E2E `tests/e2e/test_skills_presentation_e2e.py`, `tests/e2e/test_skills_real_e2e.py`, `tests/e2e/demo_outputs/.gitignore`	Presentation skill test installs npm deps, runs node tool, asserts .pptx artifact generation and saves sample outputs; real-skill tests fetch upstream SKILL.md and assert expected rule identifiers; demo outputs now ignore `*.pptx`.
E2E harness & CugaLite integration `tests/e2e/test_skills_e2e.py`	Uses `CaptureChatModel` to assert system prompt injection of available skills, bind-tools capture, and behavior when no skills are present.

Sequence Diagram(s)

sequenceDiagram
  participant Caller as create_sandbox_tools
  participant Executor as OpenSandboxExecutor
  participant Lock as asyncio.Lock
  participant Sandbox
  participant CI as CodeInterpreter
  Caller->>Executor: _get_or_create_interpreter(thread_id, requested_skills)
  Executor->>Lock: acquire per-thread lock
  Executor->>Executor: check cached interpreter & health
  alt no healthy interpreter
    Executor->>Sandbox: create sandbox workspace
    Executor->>CI: create CodeInterpreter for sandbox
    Executor->>Executor: _upload_skills_to_sandbox(resolved_cuga_folder) (if enabled)
    Executor->>Executor: cache interpreter and record active_skills_config
  end
  Executor-->>Caller: return cached or new interpreter

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

cuga-project/cuga-agent#121: Core skills feature and earlier skills/prompt changes related to discovery and load_skill tooling.
cuga-project/cuga-agent#148: Related changes to create_mcp_prompt that affect skills prompt generation and gating.

Suggested labels

bug

Suggested reviewers

sami-marreed
offerakrabi

Poem

"A rabbit hops through code at night,
polishing prompts until they're right.
It strips the tricks and guards the door,
locks the sandboxes, then writes some more.
Tests bloom bright — the pipeline's tight! 🐇"

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 66.43% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The PR title "test(e2e): add e2e test suite for knowledge and skills components" is clear, concise, and directly describes the main change—adding end-to-end tests for knowledge and skills functionality.
Linked Issues check	✅ Passed	The PR comprehensively addresses all objectives from issues `#206` and `#311`, including e2e test coverage for knowledge/skills lifecycle, security fixes (prompt injection, path traversal), reliability fixes (concurrency, upload failure handling), and SDK/graph wiring corrections.
Out of Scope Changes check	✅ Passed	All changes are directly scoped to testing infrastructure and addressing skills pipeline bugs across security, reliability, and SDK wiring—no unrelated modifications detected.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch test/206-e2e-test-for-knowledge-and-skills

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (2)

tests/e2e/test_knowledge_e2e.py (2)

4-7: ⚡ Quick win

Replace EN DASH characters to clear Ruff warnings.

Line 4, Line 6, Line 53, Line 107, Line 150, and Line 204 use –, which triggers RUF002/RUF003. Replace with plain - to keep lint clean.

Proposed diff

-  Tier 1 – component-level (no LLM, no graph): exercises KnowledgeEngine's public
+  Tier 1 - component-level (no LLM, no graph): exercises KnowledgeEngine's public
-  Tier 2 – graph-level (CaptureChatModel): runs CugaLite with a mock LLM and asserts
+  Tier 2 - graph-level (CaptureChatModel): runs CugaLite with a mock LLM and asserts
-# Tier 1 – KnowledgeEngine lifecycle
+# Tier 1 - KnowledgeEngine lifecycle
-# Tier 1 – Knowledge awareness (prompt-injection path)
+# Tier 1 - Knowledge awareness (prompt-injection path)
-# Tier 1 – RAG retrieval path (tool boundary)
+# Tier 1 - RAG retrieval path (tool boundary)
-# Tier 2 – CugaLite graph integration
+# Tier 2 - CugaLite graph integration

Also applies to: 53-53, 107-107, 150-150, 204-204

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/e2e/test_knowledge_e2e.py` around lines 4 - 7, Replace all EN DASH
characters (–) with ASCII hyphens (-) in the test docstrings/descriptions to
silence Ruff RUF002/RUF003 warnings: update the strings such as "Tier 1 –
component-level (no LLM, no graph)" and "Tier 2 – graph-level
(CaptureChatModel)" and any other occurrences on the noted lines so they use "-"
instead of "–". Search for the literal "–" in tests/e2e/test_knowledge_e2e.py
(e.g., the Tier lines and the other flagged lines) and perform a simple
character substitution, then run the linter to confirm warnings are cleared.

172-175: ⚡ Quick win

Avoid positional tool selection in RAG tests.

Using tools[0] makes these tests order-dependent. Select the knowledge tool by name to prevent brittle failures if provider ordering changes.

Proposed diff

         provider = RealSearchKnowledgeToolProvider(engine, _COLLECTION)
         tools = await provider.get_all_tools()
-        knowledge_tool = tools[0]
+        knowledge_tool = next(
+            (t for t in tools if getattr(t, "name", "") == "knowledge_search_knowledge"),
+            None,
+        )
+        assert knowledge_tool is not None, "knowledge_search_knowledge tool not found"
@@
         provider = RealSearchKnowledgeToolProvider(engine, _COLLECTION)
         tools = await provider.get_all_tools()
-        knowledge_tool = tools[0]
+        knowledge_tool = next(
+            (t for t in tools if getattr(t, "name", "") == "knowledge_search_knowledge"),
+            None,
+        )
+        assert knowledge_tool is not None, "knowledge_search_knowledge tool not found"

Also applies to: 193-195

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/e2e/test_knowledge_e2e.py` around lines 172 - 175, The test currently
selects a knowledge tool by position (tools[0]) which is order-dependent;
instead, locate the correct tool returned by
RealSearchKnowledgeToolProvider.get_all_tools() by matching its name/identifier
(e.g., filter for tool.name or tool.id that matches the expected knowledge tool)
and assign that result to knowledge_tool; make the same change for the other
occurrence that uses positional selection so both tests pick the tool by
explicit name rather than by index.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/e2e/conftest.py`:
- Around line 236-238: The teardown should guarantee engine.shutdown() runs even
if await engine.aclose() raises: wrap the await engine.aclose() call in a
try/finally (or call engine.shutdown() from a finally block) so that
engine.shutdown() is always invoked; if you need to preserve the original
exception, re-raise it after shutdown or log the aclose() error before
re-raising so resources are not leaked (refer to the engine.aclose() and
engine.shutdown() calls).

---

Nitpick comments:
In `@tests/e2e/test_knowledge_e2e.py`:
- Around line 4-7: Replace all EN DASH characters (–) with ASCII hyphens (-) in
the test docstrings/descriptions to silence Ruff RUF002/RUF003 warnings: update
the strings such as "Tier 1 – component-level (no LLM, no graph)" and "Tier 2 –
graph-level (CaptureChatModel)" and any other occurrences on the noted lines so
they use "-" instead of "–". Search for the literal "–" in
tests/e2e/test_knowledge_e2e.py (e.g., the Tier lines and the other flagged
lines) and perform a simple character substitution, then run the linter to
confirm warnings are cleared.
- Around line 172-175: The test currently selects a knowledge tool by position
(tools[0]) which is order-dependent; instead, locate the correct tool returned
by RealSearchKnowledgeToolProvider.get_all_tools() by matching its
name/identifier (e.g., filter for tool.name or tool.id that matches the expected
knowledge tool) and assign that result to knowledge_tool; make the same change
for the other occurrence that uses positional selection so both tests pick the
tool by explicit name rather than by index.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 364c908c-72b0-4424-b9c4-177311503a5e

📥 Commits

Reviewing files that changed from the base of the PR and between 5a38799 and bc13f53.

📒 Files selected for processing (4)

tests/e2e/__init__.py
tests/e2e/conftest.py
tests/e2e/test_knowledge_e2e.py
tests/e2e/test_skills_e2e.py

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/e2e/test_skills_e2e.py`:
- Line 99: Replace hardcoded "/tmp" string literals in the test fixtures with a
temporary directory fixture; locate instances like
source=f"/tmp/{name}/SKILL.md" in tests/e2e/test_skills_e2e.py and change them
to use the pytest tmp_path (or tempfile.gettempdir/tmp_path_factory) to
construct paths (e.g., str(tmp_path / name / "SKILL.md")), and update the other
occurrences on the same file (lines referencing "/tmp/...") similarly so all
test paths are built from the tmp_path/tempfile API instead of a hardcoded
"/tmp".

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ba2ee880-4eb2-40ad-af65-317306217a57

📥 Commits

Reviewing files that changed from the base of the PR and between f433c8b and 7e2734f.

📒 Files selected for processing (3)

tests/e2e/conftest.py
tests/e2e/test_knowledge_e2e.py
tests/e2e/test_skills_e2e.py

🚧 Files skipped from review as they are similar to previous changes (2)

tests/e2e/test_knowledge_e2e.py
tests/e2e/conftest.py

coderabbitai · 2026-05-17T11:23:41Z

+            name=name,
+            description=f"{name} description",
+            body=body,
+            source=f"/tmp/{name}/SKILL.md",


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Replace hardcoded /tmp literals to avoid Ruff S108 CI failures.

Line 99, Line 164, Line 176, Line 177, and Line 187 use hardcoded /tmp/... paths in test fixtures. Ruff flags these as errors (S108), which can block CI even though these are test-only strings.

Proposed patch

- source=f"/tmp/{name}/SKILL.md", + source=f"skills/{name}/SKILL.md",

- source="/tmp/SKILL.md", + source="skills/SKILL.md",

- SkillEntry("alpha", "Alpha skill", "## Body", "/tmp/a/SKILL.md", ()), - SkillEntry("beta", "Beta skill", "## Body", "/tmp/b/SKILL.md", ()), + SkillEntry("alpha", "Alpha skill", "## Body", "skills/a/SKILL.md", ()), + SkillEntry("beta", "Beta skill", "## Body", "skills/b/SKILL.md", ()),

- entry = SkillEntry("gamma", "Gamma makes reports", "## Body", "/tmp/g/SKILL.md", ()) + entry = SkillEntry("gamma", "Gamma makes reports", "## Body", "skills/g/SKILL.md", ())

Also applies to: 164-164, 176-177, 187-187

🧰 Tools

🪛 Ruff (0.15.12)

[error] 99-99: Probable insecure usage of temporary file or directory: "/tmp/"

(S108)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/e2e/test_skills_e2e.py` at line 99, Replace hardcoded "/tmp" string literals in the test fixtures with a temporary directory fixture; locate instances like source=f"/tmp/{name}/SKILL.md" in tests/e2e/test_skills_e2e.py and change them to use the pytest tmp_path (or tempfile.gettempdir/tmp_path_factory) to construct paths (e.g., str(tmp_path / name / "SKILL.md")), and update the other occurrences on the same file (lines referencing "/tmp/...") similarly so all test paths are built from the tmp_path/tempfile API instead of a hardcoded "/tmp".

Implements the full test coverage requested in issue #206. **What's covered** Knowledge (`tests/e2e/test_knowledge_e2e.py`): - Tier 1 – KnowledgeEngine lifecycle: ingest → search → delete → replace cycle - Tier 1 – Awareness path: get_knowledge_summary() output and format_knowledge_context() collection naming / sanitization - Tier 1 – RAG retrieval path: RealSearchKnowledgeToolProvider calls engine.search() directly, asserts ingested content is returned by the tool - Tier 2 – CugaLiteGraph integration: agent-scoped and session-scoped knowledge summaries injected into the system prompt; knowledge + skills coexistence Skills (`tests/e2e/test_skills_e2e.py`): - Tier 1 – Discovery: discover_skills() finds all skills, preserves description, parses pip and npm requirements - Tier 1 – Registry: load_skill() body ordering, uv/npm install commands, mixed deps, normalization hint - Tier 1 – Tools/block: create_skill_tools() returns load_skill; format_available_skills_block() lists names and descriptions - Tier 2 – CugaLiteGraph integration: skills block appears in system prompt; load_skill bound to model in native tool-calling mode - Blocked: test_skill_executed_via_sub_agent skipped pending #199 **Shared infrastructure (`tests/e2e/conftest.py`)** - CaptureChatModel: hermetic mock LLM that records inputs and replays scripted responses - KnowledgeToolProvider: stub search tool (triggers awareness detection path) - RealSearchKnowledgeToolProvider: real search tool backed by KnowledgeEngine (covers RAG path) - knowledge_engine fixture: isolated fastembed + sqlite-vec engine in tmp_path Closes #206

- conftest.py: wrap engine.aclose() in try/finally so shutdown() is always called even if aclose() raises (resource leak fix) - test_knowledge_e2e.py: replace EN dash characters with ASCII hyphens in docstrings/comments to silence Ruff RUF002/RUF003 - test_knowledge_e2e.py: select knowledge_search_knowledge tool by name instead of positional index tools[0] to avoid order-dependent failures

What was broken Security 1 — Prompt injection via SKILL.md frontmatter (loader.py) Jinja2 delimiters ({{ }}, {% %}, {# #}) in a skill's name or description were forwarded unescaped into the system prompt template. A malicious or misconfigured SKILL.md could inject arbitrary template directives. Fixed by stripping matching patterns at parse time and logging a warning. 2 — Path traversal in skill names (loader.py) A skill named ../../etc/passwd was accepted and later used to construct file paths during sandbox upload. Fixed by rejecting names that contain /, \, or ... --- Reliability 3 — Concurrent sandbox creation race (opensandbox_executor.py) Two coroutines calling _get_or_create_interpreter for the same thread_id simultaneously could both pass the cache check and create two remote containers, orphaning one. Fixed with a per-key asyncio.Lock. 4 — Skills upload failure killed the sandbox (opensandbox_executor.py) If _upload_skills_to_sandbox raised (e.g. transient network error), the sandbox was discarded rather than cached. On the next request a new container was created, leaking the first one. Fixed by catching the exception, logging a warning, and caching the sandbox regardless. 5 — Package install not verified (registry.py) After generating a uv pip install / npm install step in the skill script, there was no verification that the install actually succeeded. Fixed by appending uv pip show <pkgs> / npm list <pkgs> so the agent surfaces install failures immediately instead of getting a confusing import error later. 6 — Silent skill name collision (loader.py) When two SKILL.md files declared the same name, the later one silently replaced the earlier one. Now logs a DEBUG message identifying both paths. --- SDK / graph wiring 7 — enable_skills / skills_folder ignored by the graph (sdk.py, cuga_lite_graph.py) CugaAgent(enable_skills=True, skills_folder=...) stored the values but never forwarded them into run_config[configurable]. The graph always fell back to settings.skills.enabled and CUGA_FOLDER, making the SDK parameters a silent no-op. Fixed for both invoke() and chat(). 8 — Double .cuga suffix (sdk.py) If a caller passed a path already ending in .cuga as skills_folder, the SDK appended another .cuga, directing discovery to the wrong location with no error. Fixed with a guard that skips the suffix when already present. 9 — enable_shell_tool=False suppressed skills silently (prompt_utils.py) When advanced_features.enable_shell_tool=False, the skills block was cleared with no indication. Fixed by emitting a WARNING log before clearing it. 10 — final_answer empty for reasoning models (sdk.py) Some reasoning models return content='' with the answer in additional_kwargs[reasoning_content]. The answer-extraction loop missed this field and fell through to the interrupt-handling path, returning an empty answer. Fixed with a fallback pass over chat_messages. --- Tests added Shared infrastructure — tests/e2e/conftest.py - CaptureChatModel — hermetic mock LLM that records all inputs and replays a scripted response queue; fails loudly if the graph makes more calls than expected - MinimalToolProvider — no-op tool provider base for test isolation - write_skill / extract_system_content — helpers used across all skill test tiers - real_llm fixture — builds the real configured LLM from project settings for Tier 3 tests - pytest_terminal_summary hook — prints a pass/fail table of all Tier 3 real-LLM results at session end Component & graph — tests/e2e/test_skills_e2e.py ┌──────┬────────────────────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────┐ │ Tier │ Class │ Coverage │ ├──────┼────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────┤ │ 1 │ TestSkillDiscovery │ filesystem scan, multiple skills, description preserved, pip/npm requirements parsed │ ├──────┼────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────┤ │ 1 │ TestSkillRegistry │ load_skill returns body, install preamble generated for pip/npm, missing skill raises │ ├──────┼────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────┤ │ 1 │ TestSkillTools │ create_skill_tools returns a load_skill StructuredTool; calling it returns the body │ ├──────┼────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────┤ │ 1 │ TestSkillBlock │ format_available_skills_block includes all names and descriptions │ ├──────┼────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────┤ │ 2 │ TestSkillsGraphIntegration │ CugaLiteGraph + CaptureChatModel — skills block present in system message; load_skill bound to model │ └──────┴────────────────────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────┘ Real LLM — tests/e2e/test_skills_llm_e2e.py Three skills containing fabricated proprietary data (formula coefficients, internal part codes, invented system names) that the model cannot know from training. Each positive test asserts the expected value appears; each paired negative control runs with skills disabled and asserts it is absent — confirming the skill is the genuine gating factor, not the model's prior knowledge. SDK surface — tests/e2e/test_skills_sdk_e2e.py - Tier 1: enable_skills / skills_folder stored on agent; forwarded into configurable; .cuga not double-appended; plain workspace root gets .cuga appended - Tier 3: same three fabricated-data skills driven through CugaAgent.invoke() instead of the raw graph Real public skills — tests/e2e/test_skills_real_e2e.py Downloads the real Vercel agent-skills SKILL.md, invokes the agent, and asserts that taxonomy-specific rule identifiers (long enough to be unlikely from training alone) appear in the answer. Skips automatically when GitHub is unreachable. Presentation demo — tests/e2e/test_skills_presentation_e2e.py Downloads the real pptx skill from anthropics/skills, pre-installs pptxgenjs, provides a run_node_script tool, and asserts a non-trivially-sized .pptx file is produced. Output saved to tests/e2e/demo_outputs/ for manual inspection. Run with -m e2e. Unit — tests/unit/test_skill_loader.py (additions) - _copy_skills_to_workspace copies SKILL.md into the per-thread workspace - _copy_skills_to_workspace is a no-op when skills_enabled=False - SKILL.md with a path-traversal name is silently rejected Unit — tests/unit/test_opensandbox_executor.py - Concurrent _get_or_create_interpreter calls produce exactly one Sandbox.create call (lock serialises creation) - Upload failure logs a warning but the sandbox is still cached (no orphan container) - create_sandbox_tools warns when the requested skills config differs from what was active at sandbox creation time

- Reformat cuga_lite_graph.py, local/native/opensandbox executors, and skills loader - Reformat e2e test files (test_skills_e2e, test_skills_llm_e2e, test_skills_sdk_e2e) - Reformat unit test files (test_opensandbox_executor, test_skill_loader)

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/cuga/backend/cuga_graph/nodes/cuga_lite/executors/opensandbox/opensandbox_executor.py (1)
282-302: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Missing _skills_config population — skills overrides are never applied.

Unlike LocalSandboxExecutor and NativeSandboxExecutor, this executor's create_sandbox_tools() does not accept cuga_folder or skills_enabled parameters and never sets self._skills_config[key]. When _get_or_create_interpreter() later reads _skills_config.get(key, (None, None)) at line 164, it always gets the default, so caller-supplied skills configuration is ignored.

The unit tests work around this by manually setting executor._skills_config["thread-A"] = (...) before calling the interpreter method, but real callers won't do this.
🐛 Proposed fix to add skills config parameters and storage
     def create_sandbox_tools(
         self,
         thread_id: Optional[str] = None,
+        cuga_folder: Optional[str] = None,
+        skills_enabled: Optional[bool] = None,
     ) -> list[StructuredTool]:
         """Return the run_command StructuredTool bound to thread_id.

         Filesystem tools (read/write/list/edit/...) now come from the
         consolidated ``filesystem`` package via a ``RemoteSandboxBackend``
         (wired in ``cuga_lite_graph``); they are no longer produced here.
         """
+        key = thread_id or "_default"
+        self._skills_config[key] = (cuga_folder, skills_enabled)
         return [
             StructuredTool.from_function(
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@src/cuga/backend/cuga_graph/nodes/cuga_lite/executors/opensandbox/opensandbox_executor.py`
around lines 282 - 302, create_sandbox_tools currently ignores caller-provided
skills overrides so _get_or_create_interpreter always sees defaults; change
create_sandbox_tools(self, thread_id: Optional[str] = None) to accept the skills
params (e.g. add cuga_folder: Optional[str]=None, skills_enabled:
Optional[bool]=None), compute the key used by _get_or_create_interpreter (same
thread_id logic), and store the tuple into self._skills_config[key] =
(cuga_folder, skills_enabled) before returning the StructuredTool; keep the
StructuredTool.from_function call using self.create_run_command_tool(thread_id)
unchanged so only the skills config storage is added.

🧹 Nitpick comments (1)

src/cuga/backend/cuga_graph/nodes/cuga_lite/executors/opensandbox/opensandbox_executor.py (1)

91-97: 💤 Low value

Mutable class-level dictionaries are shared across all instances.

These dictionaries are defined at class level, meaning all OpenSandboxExecutor instances share the same cache. While this may be intentional for singleton-style usage, it's fragile if multiple instances are created. Consider using ClassVar type hints to make the intent explicit, or initialize these in __init__ if instance isolation is needed.

💡 Suggested type annotation to document intent

+from typing import Any, Callable, ClassVar, Optional
+
 class OpenSandboxExecutor(RemoteExecutor):
     """Provides a persistent sandbox with shell + filesystem tools."""

     # Interpreter cache: thread_id -> CodeInterpreter instance
-    _sandboxes: dict[str, Any] = {}
+    _sandboxes: ClassVar[dict[str, Any]] = {}
     # Per-thread skills config requested on the current invocation
-    _skills_config: dict[str, tuple[Optional[str], Optional[bool]]] = {}
+    _skills_config: ClassVar[dict[str, tuple[Optional[str], Optional[bool]]]] = {}
     # Skills config that was actually applied when the sandbox was created
-    _active_skills_config: dict[str, tuple[Optional[str], Optional[bool]]] = {}
+    _active_skills_config: ClassVar[dict[str, tuple[Optional[str], Optional[bool]]]] = {}
     # Per-key asyncio locks to prevent concurrent sandbox creation for the same thread
-    _locks: dict[str, asyncio.Lock] = {}
+    _locks: ClassVar[dict[str, asyncio.Lock]] = {}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@src/cuga/backend/cuga_graph/nodes/cuga_lite/executors/opensandbox/opensandbox_executor.py`
around lines 91 - 97, The class-level mutable dicts _sandboxes, _skills_config,
_active_skills_config, and _locks on OpenSandboxExecutor are shared across all
instances; either make the intent explicit by annotating them as ClassVar[...]
(to document shared, singleton-style state) or move their initialization into
the instance __init__ (self._sandboxes, self._skills_config,
self._active_skills_config, self._locks) to ensure instance isolation; update
type hints accordingly and ensure all methods that reference these symbols use
self.<name> when you switch to instance attributes.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In
`@src/cuga/backend/cuga_graph/nodes/cuga_lite/executors/opensandbox/opensandbox_executor.py`:
- Around line 282-302: create_sandbox_tools currently ignores caller-provided
skills overrides so _get_or_create_interpreter always sees defaults; change
create_sandbox_tools(self, thread_id: Optional[str] = None) to accept the skills
params (e.g. add cuga_folder: Optional[str]=None, skills_enabled:
Optional[bool]=None), compute the key used by _get_or_create_interpreter (same
thread_id logic), and store the tuple into self._skills_config[key] =
(cuga_folder, skills_enabled) before returning the StructuredTool; keep the
StructuredTool.from_function call using self.create_run_command_tool(thread_id)
unchanged so only the skills config storage is added.

---

Nitpick comments:
In
`@src/cuga/backend/cuga_graph/nodes/cuga_lite/executors/opensandbox/opensandbox_executor.py`:
- Around line 91-97: The class-level mutable dicts _sandboxes, _skills_config,
_active_skills_config, and _locks on OpenSandboxExecutor are shared across all
instances; either make the intent explicit by annotating them as ClassVar[...]
(to document shared, singleton-style state) or move their initialization into
the instance __init__ (self._sandboxes, self._skills_config,
self._active_skills_config, self._locks) to ensure instance isolation; update
type hints accordingly and ensure all methods that reference these symbols use
self.<name> when you switch to instance attributes.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7fa69b50-f1ff-4626-a376-59fb19beb610

📥 Commits

Reviewing files that changed from the base of the PR and between 7e2734f and 3cac208.

📒 Files selected for processing (17)

src/cuga/backend/cuga_graph/nodes/cuga_lite/executors/local/local_sandbox_executor.py
src/cuga/backend/cuga_graph/nodes/cuga_lite/executors/native/native_sandbox_executor.py
src/cuga/backend/cuga_graph/nodes/cuga_lite/executors/opensandbox/opensandbox_executor.py
src/cuga/backend/cuga_graph/nodes/cuga_lite/prompt_utils.py
src/cuga/backend/skills/loader.py
src/cuga/backend/skills/registry.py
src/cuga/sdk.py
tests/e2e/__init__.py
tests/e2e/conftest.py
tests/e2e/demo_outputs/.gitignore
tests/e2e/test_skills_e2e.py
tests/e2e/test_skills_llm_e2e.py
tests/e2e/test_skills_presentation_e2e.py
tests/e2e/test_skills_real_e2e.py
tests/e2e/test_skills_sdk_e2e.py
tests/unit/test_opensandbox_executor.py
tests/unit/test_skill_loader.py

💤 Files with no reviewable changes (10)

tests/e2e/demo_outputs/.gitignore
tests/unit/test_opensandbox_executor.py
tests/unit/test_skill_loader.py
src/cuga/sdk.py
tests/e2e/test_skills_presentation_e2e.py
tests/e2e/test_skills_sdk_e2e.py
tests/e2e/test_skills_llm_e2e.py
tests/e2e/conftest.py
tests/e2e/test_skills_real_e2e.py
tests/e2e/test_skills_e2e.py

✅ Files skipped from review due to trivial changes (1)

src/cuga/backend/cuga_graph/nodes/cuga_lite/prompt_utils.py

- Add cuga_folder and skills_enabled optional params (matches local/native) - Store (cuga_folder, skills_enabled) into _skills_config[key] so that _get_or_create_interpreter reads the caller-supplied overrides instead of always falling back to settings defaults

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@src/cuga/backend/cuga_graph/nodes/cuga_lite/executors/opensandbox/opensandbox_executor.py`:
- Around line 294-295: When updating _skills_config for a given key (computed
from thread_id or "_default"), first check if a live interpreter exists in
self._interpreters for that same key and compare its active config to the
incoming (cuga_folder, skills_enabled); if they differ, emit a warning that the
new config will be ignored until release_sandbox() is called (so callers know
the per-call override won’t take effect). Implement this check alongside the
existing assignment in the same place where _skills_config is updated,
referencing _get_or_create_interpreter, _interpreters, _skills_config, and
release_sandbox in the message so the log is clear and matches the tested
behavior.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 51b9f8ed-9ea4-479c-9814-31377c333be9

📥 Commits

Reviewing files that changed from the base of the PR and between 3cac208 and f675da3.

📒 Files selected for processing (1)

src/cuga/backend/cuga_graph/nodes/cuga_lite/executors/opensandbox/opensandbox_executor.py

coderabbitai · 2026-06-04T12:19:57Z

+        key = thread_id or "_default"
+        self._skills_config[key] = (cuga_folder, skills_enabled)


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Warn when a live sandbox's skills config changes.

Line 295 updates _skills_config, but _get_or_create_interpreter() returns the cached live interpreter on Lines 131-135 before any new config can be applied. That means a different cuga_folder/skills_enabled is silently ignored until release_sandbox() is called, which breaks the per-call override behavior and the stale-config warning path exercised in tests/unit/test_opensandbox_executor.py:251-278.

Suggested fix

key = thread_id or "_default" - self._skills_config[key] = (cuga_folder, skills_enabled) + requested = (cuga_folder, skills_enabled) + active = self._active_skills_config.get(key) + if key in self._sandboxes and active is not None and requested != active: + logger.warning( + "[OpenSandboxExecutor] Skills config changed for a live sandbox; " + "call release_sandbox() before requesting different " + "cuga_folder/skills_enabled values." + ) + self._skills_config[key] = requested

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cuga/backend/cuga_graph/nodes/cuga_lite/executors/opensandbox/opensandbox_executor.py` around lines 294 - 295, When updating _skills_config for a given key (computed from thread_id or "_default"), first check if a live interpreter exists in self._interpreters for that same key and compare its active config to the incoming (cuga_folder, skills_enabled); if they differ, emit a warning that the new config will be ignored until release_sandbox() is called (so callers know the per-call override won’t take effect). Implement this check alongside the existing assignment in the same place where _skills_config is updated, referencing _get_or_create_interpreter, _interpreters, _skills_config, and release_sandbox in the message so the log is clear and matches the tested behavior.

Iftach-Shoham requested a review from sami-marreed May 17, 2026 07:59

coderabbitai Bot reviewed May 17, 2026

View reviewed changes

Comment thread tests/e2e/conftest.py Outdated

coderabbitai Bot reviewed May 17, 2026

View reviewed changes

sami-marreed marked this pull request as draft May 17, 2026 12:35

Iftach Shoham and others added 5 commits June 4, 2026 14:49

style(e2e): apply ruff format to e2e test files

d018e38

style: apply ruff format

3cac208

- Reformat cuga_lite_graph.py, local/native/opensandbox executors, and skills loader - Reformat e2e test files (test_skills_e2e, test_skills_llm_e2e, test_skills_sdk_e2e) - Reformat unit test files (test_opensandbox_executor, test_skill_loader)

Iftach-Shoham force-pushed the test/206-e2e-test-for-knowledge-and-skills branch from e838da9 to 3cac208 Compare June 4, 2026 11:58

Iftach-Shoham mentioned this pull request Jun 4, 2026

[Bug]: Skills pipeline — 10 reliability, security, and SDK wiring issues #311

Open

Iftach-Shoham marked this pull request as ready for review June 4, 2026 12:00

Iftach-Shoham mentioned this pull request Jun 4, 2026

[Epic]: Support Agent Skills via On-Demand SKILL.md Discovery #91

Open

4 tasks

coderabbitai Bot reviewed Jun 4, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test(e2e): add e2e test suite for knowledge and skills components#232

test(e2e): add e2e test suite for knowledge and skills components#232
Iftach-Shoham wants to merge 6 commits into
mainfrom
test/206-e2e-test-for-knowledge-and-skills

Iftach-Shoham commented May 17, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented May 17, 2026 •

edited

Loading

Reviews paused

Walkthrough

Changes

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot May 17, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jun 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		key = thread_id or "_default"
		self._skills_config[key] = (cuga_folder, skills_enabled)

Conversation

Iftach-Shoham commented May 17, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What's covered

Knowledge (test_knowledge_e2e.py) — 12 tests

Skills (test_skills_e2e.py) — 15 tests + 1 skip

Shared infrastructure (conftest.py)

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 17, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Iftach-Shoham commented May 17, 2026 •

edited by coderabbitai Bot

Loading

Knowledge (`test_knowledge_e2e.py`) — 12 tests

Skills (`test_skills_e2e.py`) — 15 tests + 1 skip

Shared infrastructure (`conftest.py`)

coderabbitai Bot commented May 17, 2026 •

edited

Loading