Skip to content

test(e2e): add e2e test suite for knowledge and skills components#232

Open
Iftach-Shoham wants to merge 6 commits into
mainfrom
test/206-e2e-test-for-knowledge-and-skills
Open

test(e2e): add e2e test suite for knowledge and skills components#232
Iftach-Shoham wants to merge 6 commits into
mainfrom
test/206-e2e-test-for-knowledge-and-skills

Conversation

@Iftach-Shoham

@Iftach-Shoham Iftach-Shoham commented May 17, 2026

Copy link
Copy Markdown
Collaborator

Summary

Closes #206.
Closes #311.

Adds a new tests/e2e/ directory with full end-to-end test coverage for the knowledge and skills components as requested in the issue. Tests are structured in two tiers per component: Tier 1 exercises the component API directly (no LLM, no graph — fast and surgical), Tier 2 runs the real compiled CugaLiteGraph with a hermetic mock LLM and asserts on what actually reached the model.


What's covered

Knowledge (test_knowledge_e2e.py) — 12 tests

Class What it tests
TestKnowledgeEngineLifecycle ingest → search → delete → re-ingest (stale chunk purge)
TestKnowledgeAwareness get_knowledge_summary() output; format_knowledge_context() naming and special-char sanitization
TestKnowledgeRagPath real knowledge_search_knowledge tool via RealSearchKnowledgeToolProvider calls engine.search() and returns ingested content; stale chunk not returned after replace
TestKnowledgeCugaLiteIntegration agent-scoped knowledge in system prompt; session-scoped knowledge in system prompt; knowledge + skills coexistence — all assert both filename and document content appear

Skills (test_skills_e2e.py) — 15 tests + 1 skip

Class What it tests
TestSkillDiscovery discover_skills() finds single/multiple skills, preserves description, parses pip and npm requirements
TestSkillRegistry load_skill() body order, uv pip install / npm install commands, mixed deps (STEP 1 before STEP 2), normalization hint
TestSkillToolsAndBlock create_skill_tools() returns load_skill; block contains names and descriptions
TestSkillsCugaLiteIntegration skill name in system prompt; load_skill bound to model in native tool-calling mode
TestSkillsBlockedPaths test_skill_executed_via_sub_agent — skipped, blocked on #199

Shared infrastructure (conftest.py)

  • CaptureChatModel — hermetic mock LLM: records every message list and tool binding, replays scripted responses, raises on queue exhaustion (catches unexpected graph loops)
  • KnowledgeToolProvider — stub knowledge_search_knowledge tool (triggers the startswith("knowledge_") detection in cuga_lite_graph.py:1493) without requiring a running MCP server
  • RealSearchKnowledgeToolProvider — real knowledge_search_knowledge tool backed by engine.search(), covers the RAG retrieval path
  • knowledge_engine fixture — isolated fastembed + sqlite-vec engine in tmp_path, monkeypatched off any real DB

Test plan

Summary by CodeRabbit

  • New Features

    • Agent "skills" support: enable/disable per-agent, configurable skills folder, and tooling to load skills at runtime.
    • Post-install verification added for pip and npm dependencies.
  • Bug Fixes / Improvements

    • Per-thread sandbox and tool lifecycle improvements to reliably apply per-request skills settings and avoid concurrent creation issues.
    • Prompt now warns when skills are requested but required shell tooling is disabled.
  • Security

    • Frontmatter sanitization to prevent prompt-injection and rejection of skill names with path traversal.
  • Tests

    • Extensive end-to-end and unit tests added for skills, SDK wiring, LLM gating, presentation skill, and sandbox behavior.

@coderabbitai

coderabbitai Bot commented May 17, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Sanitizes SKILL.md frontmatter, rejects unsafe skill names, adds install verification, enables per-call skills overrides, makes OpenSandbox concurrency-safe and resilient, forwards SDK skills options into runs, and adds comprehensive unit and multi-tier e2e tests including real-LLM scenarios.

Changes

Skills Security & Reliability

Layer / File(s) Summary
Jinja2 sanitization & path traversal prevention
src/cuga/backend/skills/loader.py
Strips Jinja2 delimiters from skill name/description frontmatter and rejects sanitized names containing /, \, or ...
Skills verification commands
src/cuga/backend/skills/registry.py
load_skill() now appends uv pip show ... for pip and npm list ... for npm to verify installs.
OpenSandbox concurrency & upload resilience
src/cuga/backend/cuga_graph/nodes/cuga_lite/executors/opensandbox/opensandbox_executor.py, tests/unit/test_opensandbox_executor.py
Adds per-thread asyncio locks, requested/active skills config, logs and caches interpreter on upload failures, resolves cuga_folder from argument→env→policy, and clears per-thread state on release; unit tests cover concurrency, upload-failure caching, stale-config warnings, and release cleanup.
Executor skill parameter plumbing
src/cuga/backend/cuga_graph/nodes/cuga_lite/executors/local/local_sandbox_executor.py, native/native_sandbox_executor.py
_copy_skills_to_workspace and create_sandbox_tools accept cuga_folder and skills_enabled overrides and resolve the effective folder from argument→env→policy.
Prompt gating warning
src/cuga/backend/cuga_graph/nodes/cuga_lite/prompt_utils.py
Logs a warning when skills_enabled=True but advanced_features.enable_shell_tool=False.

SDK Skills Integration & Answer Fallback

Layer / File(s) Summary
CugaAgent skills parameters & graph wiring
src/cuga/sdk.py
Constructor accepts enable_skills and skills_folder; invoke()/stream() inject these into run_config["configurable"] with .cuga normalization to avoid double-suffix issues.
Final answer extraction fallback
src/cuga/sdk.py
invoke() falls back to last AI message content or additional_kwargs["reasoning_content"] when final_answer is empty.

E2E Test Infrastructure & Coverage

Layer / File(s) Summary
E2E fixtures & helpers
tests/e2e/conftest.py
Adds CaptureChatModel, MinimalToolProvider, write_skill(), extract_system_content(), real_llm fixture, and a pytest terminal summary aggregator.
Skills component discovery & registry tests
tests/e2e/test_skills_e2e.py, tests/unit/test_skill_loader.py
Tiered tests for filesystem discovery, YAML frontmatter preservation, requirement parsing (pip/npm), load_skill() script formatting and verification ordering, tool creation, and sanitization/path-traversal behavior.
SDK configuration & Tier 3 LLM gating
tests/e2e/test_skills_sdk_e2e.py, tests/e2e/test_skills_llm_e2e.py
Tier 1 SDK wiring checks and Tier 3 gating tests using fabricated skills and real LLMs to assert expected outputs appear when skills are enabled and are absent when disabled.
Presentation & real-skill E2E
tests/e2e/test_skills_presentation_e2e.py, tests/e2e/test_skills_real_e2e.py, tests/e2e/demo_outputs/.gitignore
Presentation skill test installs npm deps, runs node tool, asserts .pptx artifact generation and saves sample outputs; real-skill tests fetch upstream SKILL.md and assert expected rule identifiers; demo outputs now ignore *.pptx.
E2E harness & CugaLite integration
tests/e2e/test_skills_e2e.py
Uses CaptureChatModel to assert system prompt injection of available skills, bind-tools capture, and behavior when no skills are present.

Sequence Diagram(s)

sequenceDiagram
  participant Caller as create_sandbox_tools
  participant Executor as OpenSandboxExecutor
  participant Lock as asyncio.Lock
  participant Sandbox
  participant CI as CodeInterpreter
  Caller->>Executor: _get_or_create_interpreter(thread_id, requested_skills)
  Executor->>Lock: acquire per-thread lock
  Executor->>Executor: check cached interpreter & health
  alt no healthy interpreter
    Executor->>Sandbox: create sandbox workspace
    Executor->>CI: create CodeInterpreter for sandbox
    Executor->>Executor: _upload_skills_to_sandbox(resolved_cuga_folder) (if enabled)
    Executor->>Executor: cache interpreter and record active_skills_config
  end
  Executor-->>Caller: return cached or new interpreter
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested labels

bug

Suggested reviewers

  • sami-marreed
  • offerakrabi

Poem

"A rabbit hops through code at night,
polishing prompts until they're right.
It strips the tricks and guards the door,
locks the sandboxes, then writes some more.
Tests bloom bright — the pipeline's tight! 🐇"

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 66.43% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title "test(e2e): add e2e test suite for knowledge and skills components" is clear, concise, and directly describes the main change—adding end-to-end tests for knowledge and skills functionality.
Linked Issues check ✅ Passed The PR comprehensively addresses all objectives from issues #206 and #311, including e2e test coverage for knowledge/skills lifecycle, security fixes (prompt injection, path traversal), reliability fixes (concurrency, upload failure handling), and SDK/graph wiring corrections.
Out of Scope Changes check ✅ Passed All changes are directly scoped to testing infrastructure and addressing skills pipeline bugs across security, reliability, and SDK wiring—no unrelated modifications detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch test/206-e2e-test-for-knowledge-and-skills

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
tests/e2e/test_knowledge_e2e.py (2)

4-7: ⚡ Quick win

Replace EN DASH characters to clear Ruff warnings.

Line 4, Line 6, Line 53, Line 107, Line 150, and Line 204 use , which triggers RUF002/RUF003. Replace with plain - to keep lint clean.

Proposed diff
-  Tier 1 – component-level (no LLM, no graph): exercises KnowledgeEngine's public
+  Tier 1 - component-level (no LLM, no graph): exercises KnowledgeEngine's public
-  Tier 2 – graph-level (CaptureChatModel): runs CugaLite with a mock LLM and asserts
+  Tier 2 - graph-level (CaptureChatModel): runs CugaLite with a mock LLM and asserts
-# Tier 1 – KnowledgeEngine lifecycle
+# Tier 1 - KnowledgeEngine lifecycle
-# Tier 1 – Knowledge awareness (prompt-injection path)
+# Tier 1 - Knowledge awareness (prompt-injection path)
-# Tier 1 – RAG retrieval path (tool boundary)
+# Tier 1 - RAG retrieval path (tool boundary)
-# Tier 2 – CugaLite graph integration
+# Tier 2 - CugaLite graph integration

Also applies to: 53-53, 107-107, 150-150, 204-204

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/e2e/test_knowledge_e2e.py` around lines 4 - 7, Replace all EN DASH
characters (–) with ASCII hyphens (-) in the test docstrings/descriptions to
silence Ruff RUF002/RUF003 warnings: update the strings such as "Tier 1 –
component-level (no LLM, no graph)" and "Tier 2 – graph-level
(CaptureChatModel)" and any other occurrences on the noted lines so they use "-"
instead of "–". Search for the literal "–" in tests/e2e/test_knowledge_e2e.py
(e.g., the Tier lines and the other flagged lines) and perform a simple
character substitution, then run the linter to confirm warnings are cleared.

172-175: ⚡ Quick win

Avoid positional tool selection in RAG tests.

Using tools[0] makes these tests order-dependent. Select the knowledge tool by name to prevent brittle failures if provider ordering changes.

Proposed diff
         provider = RealSearchKnowledgeToolProvider(engine, _COLLECTION)
         tools = await provider.get_all_tools()
-        knowledge_tool = tools[0]
+        knowledge_tool = next(
+            (t for t in tools if getattr(t, "name", "") == "knowledge_search_knowledge"),
+            None,
+        )
+        assert knowledge_tool is not None, "knowledge_search_knowledge tool not found"
@@
         provider = RealSearchKnowledgeToolProvider(engine, _COLLECTION)
         tools = await provider.get_all_tools()
-        knowledge_tool = tools[0]
+        knowledge_tool = next(
+            (t for t in tools if getattr(t, "name", "") == "knowledge_search_knowledge"),
+            None,
+        )
+        assert knowledge_tool is not None, "knowledge_search_knowledge tool not found"

Also applies to: 193-195

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/e2e/test_knowledge_e2e.py` around lines 172 - 175, The test currently
selects a knowledge tool by position (tools[0]) which is order-dependent;
instead, locate the correct tool returned by
RealSearchKnowledgeToolProvider.get_all_tools() by matching its name/identifier
(e.g., filter for tool.name or tool.id that matches the expected knowledge tool)
and assign that result to knowledge_tool; make the same change for the other
occurrence that uses positional selection so both tests pick the tool by
explicit name rather than by index.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/e2e/conftest.py`:
- Around line 236-238: The teardown should guarantee engine.shutdown() runs even
if await engine.aclose() raises: wrap the await engine.aclose() call in a
try/finally (or call engine.shutdown() from a finally block) so that
engine.shutdown() is always invoked; if you need to preserve the original
exception, re-raise it after shutdown or log the aclose() error before
re-raising so resources are not leaked (refer to the engine.aclose() and
engine.shutdown() calls).

---

Nitpick comments:
In `@tests/e2e/test_knowledge_e2e.py`:
- Around line 4-7: Replace all EN DASH characters (–) with ASCII hyphens (-) in
the test docstrings/descriptions to silence Ruff RUF002/RUF003 warnings: update
the strings such as "Tier 1 – component-level (no LLM, no graph)" and "Tier 2 –
graph-level (CaptureChatModel)" and any other occurrences on the noted lines so
they use "-" instead of "–". Search for the literal "–" in
tests/e2e/test_knowledge_e2e.py (e.g., the Tier lines and the other flagged
lines) and perform a simple character substitution, then run the linter to
confirm warnings are cleared.
- Around line 172-175: The test currently selects a knowledge tool by position
(tools[0]) which is order-dependent; instead, locate the correct tool returned
by RealSearchKnowledgeToolProvider.get_all_tools() by matching its
name/identifier (e.g., filter for tool.name or tool.id that matches the expected
knowledge tool) and assign that result to knowledge_tool; make the same change
for the other occurrence that uses positional selection so both tests pick the
tool by explicit name rather than by index.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 364c908c-72b0-4424-b9c4-177311503a5e

📥 Commits

Reviewing files that changed from the base of the PR and between 5a38799 and bc13f53.

📒 Files selected for processing (4)
  • tests/e2e/__init__.py
  • tests/e2e/conftest.py
  • tests/e2e/test_knowledge_e2e.py
  • tests/e2e/test_skills_e2e.py

Comment thread tests/e2e/conftest.py Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/e2e/test_skills_e2e.py`:
- Line 99: Replace hardcoded "/tmp" string literals in the test fixtures with a
temporary directory fixture; locate instances like
source=f"/tmp/{name}/SKILL.md" in tests/e2e/test_skills_e2e.py and change them
to use the pytest tmp_path (or tempfile.gettempdir/tmp_path_factory) to
construct paths (e.g., str(tmp_path / name / "SKILL.md")), and update the other
occurrences on the same file (lines referencing "/tmp/...") similarly so all
test paths are built from the tmp_path/tempfile API instead of a hardcoded
"/tmp".
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ba2ee880-4eb2-40ad-af65-317306217a57

📥 Commits

Reviewing files that changed from the base of the PR and between f433c8b and 7e2734f.

📒 Files selected for processing (3)
  • tests/e2e/conftest.py
  • tests/e2e/test_knowledge_e2e.py
  • tests/e2e/test_skills_e2e.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • tests/e2e/test_knowledge_e2e.py
  • tests/e2e/conftest.py

name=name,
description=f"{name} description",
body=body,
source=f"/tmp/{name}/SKILL.md",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Replace hardcoded /tmp literals to avoid Ruff S108 CI failures.

Line 99, Line 164, Line 176, Line 177, and Line 187 use hardcoded /tmp/... paths in test fixtures. Ruff flags these as errors (S108), which can block CI even though these are test-only strings.

Proposed patch
-            source=f"/tmp/{name}/SKILL.md",
+            source=f"skills/{name}/SKILL.md",
-            source="/tmp/SKILL.md",
+            source="skills/SKILL.md",
-            SkillEntry("alpha", "Alpha skill", "## Body", "/tmp/a/SKILL.md", ()),
-            SkillEntry("beta", "Beta skill", "## Body", "/tmp/b/SKILL.md", ()),
+            SkillEntry("alpha", "Alpha skill", "## Body", "skills/a/SKILL.md", ()),
+            SkillEntry("beta", "Beta skill", "## Body", "skills/b/SKILL.md", ()),
-        entry = SkillEntry("gamma", "Gamma makes reports", "## Body", "/tmp/g/SKILL.md", ())
+        entry = SkillEntry("gamma", "Gamma makes reports", "## Body", "skills/g/SKILL.md", ())

Also applies to: 164-164, 176-177, 187-187

🧰 Tools
🪛 Ruff (0.15.12)

[error] 99-99: Probable insecure usage of temporary file or directory: "/tmp/"

(S108)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/e2e/test_skills_e2e.py` at line 99, Replace hardcoded "/tmp" string
literals in the test fixtures with a temporary directory fixture; locate
instances like source=f"/tmp/{name}/SKILL.md" in tests/e2e/test_skills_e2e.py
and change them to use the pytest tmp_path (or
tempfile.gettempdir/tmp_path_factory) to construct paths (e.g., str(tmp_path /
name / "SKILL.md")), and update the other occurrences on the same file (lines
referencing "/tmp/...") similarly so all test paths are built from the
tmp_path/tempfile API instead of a hardcoded "/tmp".

@sami-marreed sami-marreed marked this pull request as draft May 17, 2026 12:35
Iftach Shoham and others added 5 commits June 4, 2026 14:49
Implements the full test coverage requested in issue #206.

**What's covered**

Knowledge (`tests/e2e/test_knowledge_e2e.py`):
- Tier 1 – KnowledgeEngine lifecycle: ingest → search → delete → replace cycle
- Tier 1 – Awareness path: get_knowledge_summary() output and format_knowledge_context() collection naming / sanitization
- Tier 1 – RAG retrieval path: RealSearchKnowledgeToolProvider calls engine.search() directly, asserts ingested content is returned by the tool
- Tier 2 – CugaLiteGraph integration: agent-scoped and session-scoped knowledge summaries injected into the system prompt; knowledge + skills coexistence

Skills (`tests/e2e/test_skills_e2e.py`):
- Tier 1 – Discovery: discover_skills() finds all skills, preserves description, parses pip and npm requirements
- Tier 1 – Registry: load_skill() body ordering, uv/npm install commands, mixed deps, normalization hint
- Tier 1 – Tools/block: create_skill_tools() returns load_skill; format_available_skills_block() lists names and descriptions
- Tier 2 – CugaLiteGraph integration: skills block appears in system prompt; load_skill bound to model in native tool-calling mode
- Blocked: test_skill_executed_via_sub_agent skipped pending #199

**Shared infrastructure (`tests/e2e/conftest.py`)**
- CaptureChatModel: hermetic mock LLM that records inputs and replays scripted responses
- KnowledgeToolProvider: stub search tool (triggers awareness detection path)
- RealSearchKnowledgeToolProvider: real search tool backed by KnowledgeEngine (covers RAG path)
- knowledge_engine fixture: isolated fastembed + sqlite-vec engine in tmp_path

Closes #206
- conftest.py: wrap engine.aclose() in try/finally so shutdown() is always
  called even if aclose() raises (resource leak fix)
- test_knowledge_e2e.py: replace EN dash characters with ASCII hyphens in
  docstrings/comments to silence Ruff RUF002/RUF003
- test_knowledge_e2e.py: select knowledge_search_knowledge tool by name instead
  of positional index tools[0] to avoid order-dependent failures
  What was broken

  Security

  1 — Prompt injection via SKILL.md frontmatter (loader.py)
  Jinja2 delimiters ({{ }}, {% %}, {# #}) in a skill's name or description were forwarded unescaped into the system prompt template. A malicious or misconfigured
  SKILL.md could inject arbitrary template directives. Fixed by stripping matching patterns at parse time and logging a warning.

  2 — Path traversal in skill names (loader.py)
  A skill named ../../etc/passwd was accepted and later used to construct file paths during sandbox upload. Fixed by rejecting names that contain /, \, or ...

  ---
  Reliability

  3 — Concurrent sandbox creation race (opensandbox_executor.py)
  Two coroutines calling _get_or_create_interpreter for the same thread_id simultaneously could both pass the cache check and create two remote containers,
  orphaning one. Fixed with a per-key asyncio.Lock.

  4 — Skills upload failure killed the sandbox (opensandbox_executor.py)
  If _upload_skills_to_sandbox raised (e.g. transient network error), the sandbox was discarded rather than cached. On the next request a new container was
  created, leaking the first one. Fixed by catching the exception, logging a warning, and caching the sandbox regardless.

  5 — Package install not verified (registry.py)
  After generating a uv pip install / npm install step in the skill script, there was no verification that the install actually succeeded. Fixed by appending uv
  pip show <pkgs> / npm list <pkgs> so the agent surfaces install failures immediately instead of getting a confusing import error later.

  6 — Silent skill name collision (loader.py)
  When two SKILL.md files declared the same name, the later one silently replaced the earlier one. Now logs a DEBUG message identifying both paths.

  ---
  SDK / graph wiring

  7 — enable_skills / skills_folder ignored by the graph (sdk.py, cuga_lite_graph.py)
  CugaAgent(enable_skills=True, skills_folder=...) stored the values but never forwarded them into run_config[configurable]. The graph always fell back to
  settings.skills.enabled and CUGA_FOLDER, making the SDK parameters a silent no-op. Fixed for both invoke() and chat().

  8 — Double .cuga suffix (sdk.py)
  If a caller passed a path already ending in .cuga as skills_folder, the SDK appended another .cuga, directing discovery to the wrong location with no error.
  Fixed with a guard that skips the suffix when already present.

  9 — enable_shell_tool=False suppressed skills silently (prompt_utils.py)
  When advanced_features.enable_shell_tool=False, the skills block was cleared with no indication. Fixed by emitting a WARNING log before clearing it.

  10 — final_answer empty for reasoning models (sdk.py)
  Some reasoning models return content='' with the answer in additional_kwargs[reasoning_content]. The answer-extraction loop missed this field and fell through
  to the interrupt-handling path, returning an empty answer. Fixed with a fallback pass over chat_messages.

  ---
  Tests added

  Shared infrastructure — tests/e2e/conftest.py

  - CaptureChatModel — hermetic mock LLM that records all inputs and replays a scripted response queue; fails loudly if the graph makes more calls than expected
  - MinimalToolProvider — no-op tool provider base for test isolation
  - write_skill / extract_system_content — helpers used across all skill test tiers
  - real_llm fixture — builds the real configured LLM from project settings for Tier 3 tests
  - pytest_terminal_summary hook — prints a pass/fail table of all Tier 3 real-LLM results at session end

  Component & graph — tests/e2e/test_skills_e2e.py

  ┌──────┬────────────────────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────┐
  │ Tier │           Class            │                                               Coverage                                               │
  ├──────┼────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────┤
  │ 1    │ TestSkillDiscovery         │ filesystem scan, multiple skills, description preserved, pip/npm requirements parsed                 │
  ├──────┼────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────┤
  │ 1    │ TestSkillRegistry          │ load_skill returns body, install preamble generated for pip/npm, missing skill raises                │
  ├──────┼────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────┤
  │ 1    │ TestSkillTools             │ create_skill_tools returns a load_skill StructuredTool; calling it returns the body                  │
  ├──────┼────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────┤
  │ 1    │ TestSkillBlock             │ format_available_skills_block includes all names and descriptions                                    │
  ├──────┼────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────┤
  │ 2    │ TestSkillsGraphIntegration │ CugaLiteGraph + CaptureChatModel — skills block present in system message; load_skill bound to model │
  └──────┴────────────────────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────┘

  Real LLM — tests/e2e/test_skills_llm_e2e.py

  Three skills containing fabricated proprietary data (formula coefficients, internal part codes, invented system names) that the model cannot know from training.
  Each positive test asserts the expected value appears; each paired negative control runs with skills disabled and asserts it is absent — confirming the skill
  is the genuine gating factor, not the model's prior knowledge.

  SDK surface — tests/e2e/test_skills_sdk_e2e.py

  - Tier 1: enable_skills / skills_folder stored on agent; forwarded into configurable; .cuga not double-appended; plain workspace root gets .cuga appended
  - Tier 3: same three fabricated-data skills driven through CugaAgent.invoke() instead of the raw graph

  Real public skills — tests/e2e/test_skills_real_e2e.py

  Downloads the real Vercel agent-skills SKILL.md, invokes the agent, and asserts that taxonomy-specific rule identifiers (long enough to be unlikely from
  training alone) appear in the answer. Skips automatically when GitHub is unreachable.

  Presentation demo — tests/e2e/test_skills_presentation_e2e.py

  Downloads the real pptx skill from anthropics/skills, pre-installs pptxgenjs, provides a run_node_script tool, and asserts a non-trivially-sized .pptx file is
  produced. Output saved to tests/e2e/demo_outputs/ for manual inspection. Run with -m e2e.

  Unit — tests/unit/test_skill_loader.py (additions)

  - _copy_skills_to_workspace copies SKILL.md into the per-thread workspace
  - _copy_skills_to_workspace is a no-op when skills_enabled=False
  - SKILL.md with a path-traversal name is silently rejected

  Unit — tests/unit/test_opensandbox_executor.py

  - Concurrent _get_or_create_interpreter calls produce exactly one Sandbox.create call (lock serialises creation)
  - Upload failure logs a warning but the sandbox is still cached (no orphan container)
  - create_sandbox_tools warns when the requested skills config differs from what was active at sandbox creation time
- Reformat cuga_lite_graph.py, local/native/opensandbox executors, and skills loader
- Reformat e2e test files (test_skills_e2e, test_skills_llm_e2e, test_skills_sdk_e2e)
- Reformat unit test files (test_opensandbox_executor, test_skill_loader)

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/cuga/backend/cuga_graph/nodes/cuga_lite/executors/opensandbox/opensandbox_executor.py (1)

282-302: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Missing _skills_config population — skills overrides are never applied.

Unlike LocalSandboxExecutor and NativeSandboxExecutor, this executor's create_sandbox_tools() does not accept cuga_folder or skills_enabled parameters and never sets self._skills_config[key]. When _get_or_create_interpreter() later reads _skills_config.get(key, (None, None)) at line 164, it always gets the default, so caller-supplied skills configuration is ignored.

The unit tests work around this by manually setting executor._skills_config["thread-A"] = (...) before calling the interpreter method, but real callers won't do this.

🐛 Proposed fix to add skills config parameters and storage
     def create_sandbox_tools(
         self,
         thread_id: Optional[str] = None,
+        cuga_folder: Optional[str] = None,
+        skills_enabled: Optional[bool] = None,
     ) -> list[StructuredTool]:
         """Return the run_command StructuredTool bound to thread_id.

         Filesystem tools (read/write/list/edit/...) now come from the
         consolidated ``filesystem`` package via a ``RemoteSandboxBackend``
         (wired in ``cuga_lite_graph``); they are no longer produced here.
         """
+        key = thread_id or "_default"
+        self._skills_config[key] = (cuga_folder, skills_enabled)
         return [
             StructuredTool.from_function(
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@src/cuga/backend/cuga_graph/nodes/cuga_lite/executors/opensandbox/opensandbox_executor.py`
around lines 282 - 302, create_sandbox_tools currently ignores caller-provided
skills overrides so _get_or_create_interpreter always sees defaults; change
create_sandbox_tools(self, thread_id: Optional[str] = None) to accept the skills
params (e.g. add cuga_folder: Optional[str]=None, skills_enabled:
Optional[bool]=None), compute the key used by _get_or_create_interpreter (same
thread_id logic), and store the tuple into self._skills_config[key] =
(cuga_folder, skills_enabled) before returning the StructuredTool; keep the
StructuredTool.from_function call using self.create_run_command_tool(thread_id)
unchanged so only the skills config storage is added.
🧹 Nitpick comments (1)
src/cuga/backend/cuga_graph/nodes/cuga_lite/executors/opensandbox/opensandbox_executor.py (1)

91-97: 💤 Low value

Mutable class-level dictionaries are shared across all instances.

These dictionaries are defined at class level, meaning all OpenSandboxExecutor instances share the same cache. While this may be intentional for singleton-style usage, it's fragile if multiple instances are created. Consider using ClassVar type hints to make the intent explicit, or initialize these in __init__ if instance isolation is needed.

💡 Suggested type annotation to document intent
+from typing import Any, Callable, ClassVar, Optional
+
 class OpenSandboxExecutor(RemoteExecutor):
     """Provides a persistent sandbox with shell + filesystem tools."""

     # Interpreter cache: thread_id -> CodeInterpreter instance
-    _sandboxes: dict[str, Any] = {}
+    _sandboxes: ClassVar[dict[str, Any]] = {}
     # Per-thread skills config requested on the current invocation
-    _skills_config: dict[str, tuple[Optional[str], Optional[bool]]] = {}
+    _skills_config: ClassVar[dict[str, tuple[Optional[str], Optional[bool]]]] = {}
     # Skills config that was actually applied when the sandbox was created
-    _active_skills_config: dict[str, tuple[Optional[str], Optional[bool]]] = {}
+    _active_skills_config: ClassVar[dict[str, tuple[Optional[str], Optional[bool]]]] = {}
     # Per-key asyncio locks to prevent concurrent sandbox creation for the same thread
-    _locks: dict[str, asyncio.Lock] = {}
+    _locks: ClassVar[dict[str, asyncio.Lock]] = {}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@src/cuga/backend/cuga_graph/nodes/cuga_lite/executors/opensandbox/opensandbox_executor.py`
around lines 91 - 97, The class-level mutable dicts _sandboxes, _skills_config,
_active_skills_config, and _locks on OpenSandboxExecutor are shared across all
instances; either make the intent explicit by annotating them as ClassVar[...]
(to document shared, singleton-style state) or move their initialization into
the instance __init__ (self._sandboxes, self._skills_config,
self._active_skills_config, self._locks) to ensure instance isolation; update
type hints accordingly and ensure all methods that reference these symbols use
self.<name> when you switch to instance attributes.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In
`@src/cuga/backend/cuga_graph/nodes/cuga_lite/executors/opensandbox/opensandbox_executor.py`:
- Around line 282-302: create_sandbox_tools currently ignores caller-provided
skills overrides so _get_or_create_interpreter always sees defaults; change
create_sandbox_tools(self, thread_id: Optional[str] = None) to accept the skills
params (e.g. add cuga_folder: Optional[str]=None, skills_enabled:
Optional[bool]=None), compute the key used by _get_or_create_interpreter (same
thread_id logic), and store the tuple into self._skills_config[key] =
(cuga_folder, skills_enabled) before returning the StructuredTool; keep the
StructuredTool.from_function call using self.create_run_command_tool(thread_id)
unchanged so only the skills config storage is added.

---

Nitpick comments:
In
`@src/cuga/backend/cuga_graph/nodes/cuga_lite/executors/opensandbox/opensandbox_executor.py`:
- Around line 91-97: The class-level mutable dicts _sandboxes, _skills_config,
_active_skills_config, and _locks on OpenSandboxExecutor are shared across all
instances; either make the intent explicit by annotating them as ClassVar[...]
(to document shared, singleton-style state) or move their initialization into
the instance __init__ (self._sandboxes, self._skills_config,
self._active_skills_config, self._locks) to ensure instance isolation; update
type hints accordingly and ensure all methods that reference these symbols use
self.<name> when you switch to instance attributes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7fa69b50-f1ff-4626-a376-59fb19beb610

📥 Commits

Reviewing files that changed from the base of the PR and between 7e2734f and 3cac208.

📒 Files selected for processing (17)
  • src/cuga/backend/cuga_graph/nodes/cuga_lite/executors/local/local_sandbox_executor.py
  • src/cuga/backend/cuga_graph/nodes/cuga_lite/executors/native/native_sandbox_executor.py
  • src/cuga/backend/cuga_graph/nodes/cuga_lite/executors/opensandbox/opensandbox_executor.py
  • src/cuga/backend/cuga_graph/nodes/cuga_lite/prompt_utils.py
  • src/cuga/backend/skills/loader.py
  • src/cuga/backend/skills/registry.py
  • src/cuga/sdk.py
  • tests/e2e/__init__.py
  • tests/e2e/conftest.py
  • tests/e2e/demo_outputs/.gitignore
  • tests/e2e/test_skills_e2e.py
  • tests/e2e/test_skills_llm_e2e.py
  • tests/e2e/test_skills_presentation_e2e.py
  • tests/e2e/test_skills_real_e2e.py
  • tests/e2e/test_skills_sdk_e2e.py
  • tests/unit/test_opensandbox_executor.py
  • tests/unit/test_skill_loader.py
💤 Files with no reviewable changes (10)
  • tests/e2e/demo_outputs/.gitignore
  • tests/unit/test_opensandbox_executor.py
  • tests/unit/test_skill_loader.py
  • src/cuga/sdk.py
  • tests/e2e/test_skills_presentation_e2e.py
  • tests/e2e/test_skills_sdk_e2e.py
  • tests/e2e/test_skills_llm_e2e.py
  • tests/e2e/conftest.py
  • tests/e2e/test_skills_real_e2e.py
  • tests/e2e/test_skills_e2e.py
✅ Files skipped from review due to trivial changes (1)
  • src/cuga/backend/cuga_graph/nodes/cuga_lite/prompt_utils.py

- Add cuga_folder and skills_enabled optional params (matches local/native)
- Store (cuga_folder, skills_enabled) into _skills_config[key] so that
  _get_or_create_interpreter reads the caller-supplied overrides instead
  of always falling back to settings defaults

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@src/cuga/backend/cuga_graph/nodes/cuga_lite/executors/opensandbox/opensandbox_executor.py`:
- Around line 294-295: When updating _skills_config for a given key (computed
from thread_id or "_default"), first check if a live interpreter exists in
self._interpreters for that same key and compare its active config to the
incoming (cuga_folder, skills_enabled); if they differ, emit a warning that the
new config will be ignored until release_sandbox() is called (so callers know
the per-call override won’t take effect). Implement this check alongside the
existing assignment in the same place where _skills_config is updated,
referencing _get_or_create_interpreter, _interpreters, _skills_config, and
release_sandbox in the message so the log is clear and matches the tested
behavior.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 51b9f8ed-9ea4-479c-9814-31377c333be9

📥 Commits

Reviewing files that changed from the base of the PR and between 3cac208 and f675da3.

📒 Files selected for processing (1)
  • src/cuga/backend/cuga_graph/nodes/cuga_lite/executors/opensandbox/opensandbox_executor.py

Comment on lines +294 to +295
key = thread_id or "_default"
self._skills_config[key] = (cuga_folder, skills_enabled)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Warn when a live sandbox's skills config changes.

Line 295 updates _skills_config, but _get_or_create_interpreter() returns the cached live interpreter on Lines 131-135 before any new config can be applied. That means a different cuga_folder/skills_enabled is silently ignored until release_sandbox() is called, which breaks the per-call override behavior and the stale-config warning path exercised in tests/unit/test_opensandbox_executor.py:251-278.

Suggested fix
         key = thread_id or "_default"
-        self._skills_config[key] = (cuga_folder, skills_enabled)
+        requested = (cuga_folder, skills_enabled)
+        active = self._active_skills_config.get(key)
+        if key in self._sandboxes and active is not None and requested != active:
+            logger.warning(
+                "[OpenSandboxExecutor] Skills config changed for a live sandbox; "
+                "call release_sandbox() before requesting different "
+                "cuga_folder/skills_enabled values."
+            )
+        self._skills_config[key] = requested
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@src/cuga/backend/cuga_graph/nodes/cuga_lite/executors/opensandbox/opensandbox_executor.py`
around lines 294 - 295, When updating _skills_config for a given key (computed
from thread_id or "_default"), first check if a live interpreter exists in
self._interpreters for that same key and compare its active config to the
incoming (cuga_folder, skills_enabled); if they differ, emit a warning that the
new config will be ignored until release_sandbox() is called (so callers know
the per-call override won’t take effect). Implement this check alongside the
existing assignment in the same place where _skills_config is updated,
referencing _get_or_create_interpreter, _interpreters, _skills_config, and
release_sandbox in the message so the log is clear and matches the tested
behavior.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Skills pipeline — 10 reliability, security, and SDK wiring issues [Test] Add e2e tests for skills components

1 participant