feat: slash commands and skills (/help, /clear, /skills, /<skill>) with autocomplete + reasoning UX by illeatmyhat · Pull Request #230 · cuga-project/cuga-agent

illeatmyhat · 2026-05-16T04:08:40Z

Summary

Adds first-class slash-command support to CUGA: three built-ins (/help, /clear, /skills), invocation of repo-defined skills (e.g. /echo), an autocomplete dropdown, embedding-based suggestions for typos, and an allowed-tools whitelist enforced at the HITL approval gate. The skill invocation surfaces as a "Skill invoked: /" step inside the assistant message's reasoning panel — toggle relabelled "Show details" — instead of a separate chip bubble.

For human reviewers

Tell your agent to read this PR and execute the instructions in the "For your agent reviewer" section below.

Screenshots

Autocomplete dropdown on `/`	`/help`

`/skills`	`/echo hello world` — reasoning panel expanded

Unknown `/sumarize` → suggestion chips	`/clear`

What's in this PR

Backend — src/cuga/backend/slash_commands/ (parser, registry, dispatcher, message synthesis, $ARGUMENTS substitution, command resolver, allowed-tools enforcement, three built-ins). Stream events SlashSkillInvoked, SlashSuggestions, ThreadIdChanged. ChatHistoryMessage discriminated-union rehydration so AIMessage.tool_calls / ToolMessage.tool_call_id survive LangGraph checkpoints. SSE envelope now always wraps with event: + data: — fixes a bug where non-Answer events under OutputFormat.DEFAULT leaked raw JSON glued to the next event.
Frontend — Carbon AI Chat composer hardened against contenteditable + shadow-DOM + post-submit node-replacement (composerTextarea.ts). Autocomplete dropdown (SlashCommandDropdown.tsx) reopens after submit and yields Enter when args are typed. Suggestion chips (SlashChips.tsx). Skill invocation rendered as a reasoning step (customSendMessage.ts + customLoadHistory.ts). Carbon's reasoning toggle relabelled to "Show details" via MutationObserver in CarbonChat.tsx.
Tests — Python unit suites for stream envelope, ChatHistoryMessage rehydration, every slash module, and the Langfuse span contract. Playwright e2e (tests/e2e/slash-chips.spec.ts) for both chip types, history replay, /clear rotation, and the "Show details" label.

For your agent reviewer

You're being asked to verify that the slash-command pipeline works end-to-end on the user's machine. Do not trust the unit tests alone — the failure modes here are visual (the chip leaking to its own bubble, the dropdown not reopening, the SSE wire format leaking into the chat bubble), and they only show up when a real browser drives a real server.

0. Environment prereqs

LiteLLM endpoint reachable. CUGA's planner uses LiteLLM (Azure/gpt-4.1 by default). If you're behind a corporate VPN that fronts LiteLLM, connect it before doing anything else — without VPN, calls hang silently rather than failing. Confirm with curl -sf $OPENAI_BASE_URL/health || echo "VPN?" if you have a health route, or just try a one-shot agent.invoke("hello") and watch for a response within ~20s.
uv installed. Run uv sync once.
Frontend bundle built. Run bash src/frontend_workspaces/frontend/build.sh — pnpm run build alone does not deploy. The script does pnpm run build and copies dist/ into src/cuga/frontend/dist/, which is what the FastAPI server actually serves.

Echo skill on disk. Create .agents/skills/echo/SKILL.md (the .agents/ directory is gitignored, so it won't be in your checkout) with the following:

---
name: echo
description: Echoes its arguments back verbatim. A minimal skill for verifying the slash-command pipeline end-to-end.
---

You are an echo. Respond with the user's arguments verbatim — exactly as they wrote them, nothing else.

Do not add a confirmation. Do not add a preamble. Do not add a timestamp. Do not quote the args. Do not call any tools. Do not plan further steps.

If the user passed no arguments, respond with the single word: `(empty)`.

Server up. .venv/bin/cuga start demo_skills — wait until curl -sf http://127.0.0.1:7860/api/commands returns 200.

1. Wire format sanity check (curl)

curl -sN -X POST http://127.0.0.1:7860/stream \
  -H "Content-Type: application/json" \
  -H "X-Thread-ID: $(uuidgen)" \
  -d '{"query":"/echo hello world"}' | head -40

Expected: each event is one event: <Name>\n followed by one data: ...\n followed by a blank line. If you see data: event: or data: data: anywhere in the output, the SSE envelope is being double-wrapped (regression of main.py:1771).

2. Playwright live suite

Mirror the patterns in src/frontend_workspaces/frontend/tests/e2e/slash-chips.spec.ts (stubbed) but point the new spec at the live server. Critical patterns:

Carbon AI Chat 1.6 ships the composer as a contenteditable host with role="textbox", not a <textarea>. Read with textContent, set with the patterns in composerTextarea.ts.
Multiple composers can co-exist in the shadow DOM after a submit; prefer the visible one (getBoundingClientRect().width > 0).
Carbon's submit clears the composer; the autocomplete dropdown's MutationObserver must re-resolve when the live composer node changes identity. Test this by submitting /skills then typing / again and asserting the dropdown reopens.
The slash dropdown intercepts Enter to "accept" the highlighted command. When the user has already typed args (^/<name>\s+), Enter should fall through to Carbon's submit. Test /echo hello world with a single Enter press and assert the outbound /stream payload contains "hello world".

The minimum acceptance suite:

Scenario	Assertion
Dropdown opens on `/`	`.cuga-slash-dropdown` visible, lists `/clear`, `/help`, `/skills`, `/echo`
`/help`	Response text contains `/help`, `/clear`, `/skills`
`/skills`	Response text contains "Installed skills" + "echo" + the skill description
`/clear`	`X-Thread-ID` on the next `/stream` request differs from the prior one (capture via `page.on("response", ...)`)
`/echo hello world` (single Enter)	(a) no `.cuga-slash-skill-chip` element anywhere; (b) at least one assistant bubble equals exactly `"hello world"` (filter the user message out first); (c) the rendered text does not contain the raw SSE wire format `event:` or `data:`
`/sumarize`	`.cuga-slash-suggestions` visible with at least one chip; no bubble text matches `/Unknown command:/` (the suppression contract — chip's own header text is `Unknown command <name>. Did you mean:` without colon, so the colon-anchored regex distinguishes them)
Dropdown re-opens after `/skills`	Submit `/skills`, wait for response, type `/`, assert dropdown visible again

Sample spec scaffolding lives in /tmp/full-suite.spec.ts, /tmp/dropdown-reopen.spec.ts, and /tmp/echo-args.spec.ts on the originating dev machine — recreate them in a /tmp/ directory of your own. Use pnpm exec playwright test --config <your-config>.ts --reporter=line from src/frontend_workspaces/frontend/.

3. Known pre-existing failures (not regressions)

Both fail identically on main; do not chase:

tests/unit/test_skill_loader.py::test_skill_registry_load_skill_emits_install_steps_for_requirements
tests/unit/test_find_tools_exception.py::test_find_tools_composes_query_with_initial_user_message

4. Report

Run the live suite, paste the pass/fail tally, and flag any of the following that you observe — these were the bugs the development session caught and would be the most informative regressions:

Dropdown does not reopen after submit
/echo hello world arrives at the server with empty raw_args
/clear does not rotate the thread id
Suggestion chips never render for an unknown command
Skill invocation appears as a separate ⚡ /<name> chip bubble (should be a reasoning step)
"Show Reasoning" appears instead of "Show details"
Raw event: … / data: … text appears inside any assistant bubble

Test plan

Python unit suites green (uv run pytest tests/unit/)
Frontend e2e stubbed suite green (pnpm exec playwright test tests/e2e)
Live-server Playwright suite green (9/9 against cuga start demo_skills)
Curl-confirmed SSE envelope shape
Reviewer's agent re-ran the live suite on their machine

🤖 Generated with Claude Code

Summary by CodeRabbit

New Features
- Full slash-command system: built-ins (/help, /skills, /clear), skill argument substitution, semantic suggestions, and GET /api/commands.
- Embedding-backed command resolver and slash dispatch integrated into SDK/agent flows.
Frontend
- Autocomplete dropdown, clickable suggestion chips, composer integration, thread-rotation support, and UI styling.
Tool Approval
- Per-skill allowed-tools whitelist enforced alongside existing approval flow.
Streaming
- Robust SSE parsing/formatting and safer stream handling.
Tests
- Extensive unit + Playwright e2e coverage.
Chores
- .gitignore updated to skip frontend test artifacts.

Introduce a shared slash_commands package that both POST /stream and the SDK's CugaAgent.invoke() route through, plus a plugin-style builtins package auto-discovered at startup. The first built-in, /help, lists every command in the merged (built-ins + skills) registry. GET /api/commands returns the same registry for the upcoming autocomplete UI. Unknown slash commands return an "Unknown command" response instead of silently flowing to the planner; slice #20 will replace that with embedding- based suggestions. Skill-kind dispatch is detected here so slice #17 can wire in the synthetic load_skill tool-call pair without changing this seam. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

… #17 Three backend slices on the scaffold from #14: - `/clear` (#15) mints a fresh thread id and emits ThreadIdChanged on the stream so the frontend can adopt it as the next X-Thread-ID. The old thread remains visible in /api/conversation-threads. SDK surfaces the new id via InvokeResult.thread_id. - `/skills` (#16) lists every discovered skill with its description and any declared requirements. SkillRegistry gains an `entries()` accessor so the built-in doesn't reach into private state. - Slice #17 wires user-side skill invocation. The dispatcher loads the skill body via registry.load_skill() and produces a 4-message pair (HumanMessage / AIMessage with load_skill tool_call / ToolMessage with the wrapped body / optional trailing HumanMessage for args). The synthetic AIMessage carries additional_kwargs={invoked_via: 'slash', raw_input, resolved_name} so slice #22 can collapse the pair into a chip and the persistence layer can distinguish slash-invoked from model-initiated load_skill calls. The pure function lives in slash_commands/message_synthesis.py and is exhaustively unit-tested. Server (event_stream) and SDK (CugaAgent.invoke) both consume DispatchResult.injected_messages by prepending the 4 messages to chat_messages before the planner runs. Args are appended verbatim to the wrapped body in this slice — full $ARGUMENTS-style substitution lands in slice #19, which will substitute on the raw SKILL.md body BEFORE the install/sandbox wrapper to avoid hitting `$` in shell commands inside the wrapper text. Follow-ups still owed by this PR series: - Langfuse span emission for parse_and_dispatch (currently a structured log line; PRD calls for a real span — easier to land alongside slice #20 where unknown-command attributes also need recording). - Frontend handler for ThreadIdChanged event (small touch in customSendMessage.ts; ships with slice #18). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Render a slash-command dropdown over the Carbon AI Chat composer when the user types '/' as the first non-whitespace character. The dropdown fetches GET /api/commands on open, lists each built-in or skill with its description and (when present) argument hint, and distinguishes the two kinds with a chip-style tag. Navigation: arrow up/down to move, Enter to accept, Escape to dismiss; click outside or blur the composer to dismiss. Selecting a command inserts '/<name> ' (trailing space) into the textarea. Typing past the slash filters the list by case-insensitive name prefix. The dropdown does not appear when the slash is not the first non-whitespace character. The Carbon composer textarea lives inside a shadow root, so the component walks the shadow tree to find the <textarea>, listens for input/keydown/blur on it directly, and renders the dropdown via a React portal anchored above the textarea using fixed positioning. The portal target is the chat wrapper div in CarbonChat.tsx. Styles ship in CarbonChat.css with a prefers-color-scheme dark variant so the dropdown matches the existing chat appearance in both modes. Not in this slice (deferred to follow-up touches): - Frontend handler for the ThreadIdChanged event emitted by /clear in slice #15 (small change in customSendMessage.ts). - Frontend rendering of the slash-invocation message pair as a collapsed chip — that is slice #22. - Clickable suggestion chips for unknown commands — slice #23. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…e model #17 Slice #17 verification surfaced that `call_model` in `cuga_lite_graph.py` silently skips any non-Human/non-AI message and drops `tool_calls` on AIMessages. With slash skill invocation, the planner received only the user turns and an empty assistant turn — the ToolMessage carrying the wrapped skill body never reached the LLM, so the model could not act on the loaded skill. Two surgical changes: 1. When an AIMessage has `tool_calls`, emit the OpenAI Chat Completions `tool_calls` shape (id / type=function / function.name / function.arguments stringified) so the model sees its prior call. Pure-content AIMessages keep the existing pass-through path. 2. Add a `ToolMessage` branch that emits `{role: "tool", tool_call_id, content}` so the wrapped skill body is paired by id to the AIMessage above it. This is the standard format both Azure OpenAI and Anthropic Bedrock proxies expect through LiteLLM. This is a general improvement to `call_model` — any LangChain-style tool-call pair (not just slash-synthesized ones) now reaches the model correctly. Slice #17's pre-stuffed tool-call pair was the trigger to find the gap. End-to-end LLM verification of /<skill> dispatch is still owed; the local LiteLLM proxy hangs on every call in this environment (including baseline non-slash invocations), so the timeout is environmental, not message-shape. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Slice #19 — Claude-Code-style argument substitution on skill bodies: * New `arg_substitution.py` runs four passes over the raw SKILL.md body before the install/sandbox wrap: `${N:-default}`, `${N}`, `$N`, `$ARGUMENTS`. POSIX-y but intentionally narrower than `envsubst` — leftover `$VARS` are preserved verbatim. * `SkillEntry.allowed_tools` parsed + stored from the SKILL.md frontmatter (enforcement deferred to slice #21 — HITL). * `dispatcher` now hands `parsed.raw_args` into `skill_registry.load_skill` so substitution runs against the raw body, not the wrapped one. The legacy "ARGUMENTS: …" trailer was dropped. Slice #20 — unknown-command embedding resolver: * New `command_resolver.CommandResolver` builds a lazy embedding index over every known command (built-ins + skills) and ranks the top-N semantic matches for a mistyped slash command. Backed by a minimal in-process `_InMemoryEmbeddingStore` — unknown-command resolution doesn't need cross-process persistence. * Dispatcher integration: `parse_and_dispatch` accepts a `command_resolver_factory`, `_resolve_unknown` enriches the unknown result with `DispatchResult.suggestions`, and every dispatch emits a `slash_command` telemetry record (structured log + best-effort Langfuse span, gated on `advanced_features.langfuse_tracing` so the client never constructs when tracing is off). * Both stream callers (`server/main.py` + `sdk.py`) thread the factory through. Lifespan prefetches the embedding model so the first unknown command doesn't pay a cold FastEmbed download mid-request. * Server stream contract: emits `SlashSuggestions` (suggestion chips for unknown commands, slice #23) and `SlashSkillInvoked` (skill chip, slice #22) into both the live stream and the buffered history. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Render slash command outcomes as Carbon chips instead of inline text: * Slice #22 (`SlashSkillInvoked`): the synthesized load_skill / ToolMessage / Human pair from a recognized `/skill arg` invocation collapses into one compact chip showing the resolved skill name and any args. The chip is emitted both during the live turn and replayed from buffered history. * Slice #23 (`SlashSuggestions`): when the resolver returns ranked matches for an unknown command, render clickable correction chips. Clicking a chip writes the suggested command into the composer (reusing the composer-textarea utility lifted out of `SlashCommandDropdown`). * Chips are routed through Carbon's `renderUserDefinedResponse` — deliberately different from the shadow-DOM-decoration pattern slice #18 uses for the composer dropdown; both patterns coexist intentionally (see `SlashChips.tsx` docstring). * Composer textarea fix: the shared `composerTextarea.ts` traversal now matches both `<textarea>` and `contenteditable[role="textbox"]`, since `@carbon/ai-chat@1.6.0` ships the latter. **This also fixes slice #18's autocomplete-dropdown write path, which had never been browser-tested and was silently a no-op against the new contenteditable composer.** Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Browser-driven verification for the chip rendering paths (slices #22 + #23). Boots the production build in headless Chromium, stubs every backend endpoint via `page.route`, and asserts both the live-turn render and the history-reload replay. Deliberately opt-in: * Playwright is **not** in `devDependencies` — the README walks through installing it on demand. The `test:e2e` script in `package.json` errors clearly when Playwright isn't installed. * The harness uses a **zero-dep Node static server** (`static-server.cjs`) instead of `serve` / `webpack-dev-server`, both of which pull in `path-to-regexp@6` which is broken under Node 24. * Route-precedence gotcha: most-recently-registered route wins, so the spec registers the broad `**/api/**` catch-all FIRST and specific stubs after it take priority. Documented in the spec for future hands. Also adds `test-results/` and `playwright-report/` to `.gitignore` so the Playwright bookkeeping output never lands in commits. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Whitespace/line-join cleanups in pre-existing slice files surfaced by `uv run ruff format`. No behavior change. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Slice #21 enforces a skill's frontmatter ``allowed-tools`` whitelist at the tool-approval gate, the seam closest to where the model actually invokes tools. The whitelist applies only to the turn that immediately follows a slash-skill dispatch (per-invoke RunnableConfig scope). Semantics: * ``allowed-tools`` absent in SKILL.md frontmatter → no restriction (status quo; matches Claude Code). * ``allowed-tools: []`` (explicit empty list) → allow nothing, every bare function call (outside Python builtins) trips the gate. * ``allowed-tools: [a, b]`` → only ``a`` and ``b`` may be called freely; anything else trips the gate. When the gate trips, the call drops into the **existing HITL approval interrupt flow** so the user can still override the whitelist on a case-by-case basis (PRD #13: per-tool approval flow always applies; the whitelist is an upper bound, not a hard rejection). Wiring: * ``SkillEntry.allowed_tools`` changes type to ``Optional[tuple[str, ...]]`` so the loader can distinguish "key absent" (``None``) from "explicit empty" (``()``) — the parser previously collapsed both to ``()``. * Dispatcher propagates the whitelist onto ``DispatchResult.allowed_tools``. * Both callers (server ``AgentLoop`` + SDK ``invoke``) thread it through to ``RunnableConfig.configurable['skill_allowed_tools']``. * ``ToolApprovalHandler.check_and_create_approval_interrupt`` runs the whitelist check first. The slice #21 gate runs regardless of ``settings.policy.enabled``; the existing ToolApproval-policy check is still gated on it (moved the gate inside the handler so the whitelist isn't accidentally suppressed when the policy system is off). * Detection uses ``ast.walk`` for bare ``Call(func=Name)`` nodes — method calls (``response.get(...)``) and lambdas are correctly skipped. Python builtins are safelisted so skill authors only enumerate domain tools. 14 new unit tests (pure detection helper + handler wiring). Existing slash + skill suite is green; no regressions against baseline. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Slice #15's backend already mints a fresh ``thread_id`` on ``/clear`` and surfaces it through a ``ThreadIdChanged`` SSE event, but the frontend was ignoring it — the next outbound request still carried the pre-clear ``X-Thread-ID``, so the "new conversation" effect didn't actually take effect until a page reload. The handler in ``customSendMessage`` now parses the event and calls a new ``setThreadId`` setter exported from ``CarbonChat``. The current turn keeps its already-captured thread_id for any trailing stream events; the next ``customSendMessage`` call reads the rotated id via ``getOrCreateThreadId``. The wrapped ``customSendMessage`` in ``CarbonChat`` re-reads ``currentThreadId`` after each turn and fires the existing ``onThreadChange`` prop accordingly, so embedders see the rotation. ``ThreadIdChanged`` is intentionally **not** buffered into the saved stream, so history replay does not need a corresponding handler — re-applying a past clear to the currently-viewed thread would be incorrect. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Two related cleanups around the slice-#22/#23 e2e harness: * Add ``@playwright/test`` to ``frontend/package.json`` ``devDependencies`` (was previously left out, with a README walking through ``pnpm add -D``). npm/pnpm have no exact equivalent of pip's ``[extras]`` opt-in groups, so the standard JS-ecosystem placement is ``devDependencies`` for the JS package + a deliberate ``pnpm exec playwright install chromium`` for the ~150MB browser binary (documented in ``tests/e2e/README.md``). * Remove the unused ``puppeteer``, ``jest-image-snapshot``, and ``@types/jest-image-snapshot`` from the workspace-root ``devDependencies``. Grep finds zero imports of any of these across the repo; no Jest config exists, no other ``.test.ts``/``.spec.ts`` files anywhere. They look like placeholders for a visual-regression suite that was never wired up. A future ``feat(test): visual regression`` PR can reintroduce them when an actual snapshot test lands. Playwright structurally fits this spec better (layered ``page.route`` precedence, shadow-DOM-piercing ``getByRole`` locators against Carbon's ``contenteditable`` composer, locator auto-wait against streamed UI). Puppeteer's lane is visual regression / headless automation outside testing, not the kind of functional e2e the slash-chip spec needs. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Adds the missing slice #15 verification: `/clear` rotates the server-side thread_id, the frontend adopts it via the new `setThreadId` setter, and the **next** outbound request carries the new `X-Thread-ID` header. The Playwright route handler intercepts both `/stream` calls, captures the `X-Thread-ID` request header on each, and asserts the second-call header equals the rotated id from the first-call's `ThreadIdChanged` SSE event. A regression where the SSE event is parsed but the setter is never called would now fail this test. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…d e2e Carbon AI Chat pumps every message body into a hidden ``aria-live="polite"`` announcement region for screen readers, so ``getByText("Conversation cleared.")`` resolved to two elements and tripped Playwright's strict-mode locator check. Targeting ``.last()`` reliably picks the visible chat bubble. Also picks up the lockfile reconciliation from commit 2b19e8d (the puppeteer/jest-image-snapshot removal) — ``pnpm install`` regenerated ``pnpm-lock.yaml`` to drop those subtrees. The lockfile delta is large (~667 deletions) but is purely the consequence of the prior dep cleanup, not a test-fix change. All 4 e2e tests now green locally. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Slice #20's ``_emit_slash_telemetry`` constructs a best-effort Langfuse span gated on ``advanced_features.langfuse_tracing``. A real Langfuse backend isn't required to verify the contract — patching ``langfuse.get_client`` to a spy captures the exact ``start_observation`` call and lets us assert on it deterministically. 5 tests cover: * The tracing-disabled gate — confirms the dispatcher does not even *import* the Langfuse client when ``langfuse_tracing`` is off, which prevents the noisy "client initialized without public_key" warning that pollutes production logs. * Span name + shape (``name="slash_command"``, ``as_type="span"``, input/output/metadata) for ``builtin`` dispatches. * Args propagation for ``skill`` dispatches (``input.args`` carries ``raw_args``, ``output.resolved_name`` carries the skill name). * Top-suggestion attribution for ``unknown`` dispatches (PRD #13 story 34: mine telemetry for skills users wished existed). * Telemetry failures are swallowed — a Langfuse explosion must never break dispatch. The full network/UI verification still requires a real Langfuse instance, but that's now scoped to "does Langfuse render what we sent" rather than "do we send the right thing". Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The chip CSS already carries ``@media (prefers-color-scheme: dark)`` rules (``CarbonChat.css:703-755``) that override the light-mode background (``#f6f8fa``) with a dark equivalent (``#21262d``). The risk this test guards against is a future refactor that hardcodes a light value outside the media query and silently regresses dark-mode appearance. Uses Playwright's ``page.emulateMedia({ colorScheme: "dark" })`` to flip the browser's color scheme before navigating, then asserts the chip summary's computed ``background-color`` equals ``rgb(33, 38, 45)`` (``#21262d``). Exact-value assertion rather than "not light" — a partial regression (dropped dark rule, mistyped selector) shows up as a meaningful failure instead of a vague "looks wrong". 5/5 e2e tests green locally. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…across graph checkpoints Slice #17 was the first thing in this codebase to put a tool_call-bearing AIMessage and a matching ToolMessage into ``AgentState.chat_messages`` for persistence. That surfaced a dormant Pydantic-v2 + LangGraph bug: 1. ``chat_messages: Optional[List[BaseMessage]]`` declared the field as the parent class. Pydantic v2 serializes by the *declared* type, so ``model_dump()`` dropped every subclass-specific field — ``AIMessage.tool_calls`` and ``ToolMessage.tool_call_id`` vanished from the dict the checkpointer stored. 2. On rehydration, Pydantic created plain ``BaseMessage`` instances (no discriminated union), so cuga_lite's reader at ``call_model:1908+`` hit the ``else`` branch and silently skipped each ToolMessage with "Skipping message ... with unknown type". End-user impact: a slash-dispatched skill's ``load_skill`` stanza vanished after the first turn. Asked "what was the last tool you called?" on turn 2, the model would say "I have not called any tools yet in this conversation" — the loaded skill body was gone. Fix combines two changes on ``AgentState`` (and the mirrored ``CugaLiteState``): * Wrap the annotation with Pydantic v2's ``SerializeAsAny[BaseMessage]`` so ``model_dump()`` serializes each message by its *runtime* subclass instead of the declared parent. ``tool_calls`` and ``tool_call_id`` now survive the dump. * Add a ``mode="before"`` field validator (``_rehydrate_message_subclasses``) that walks the incoming list and reconstructs proper subclass instances from any dict-shaped or BaseMessage-shaped items, keyed on the ``type`` discriminator. Existing subclass instances pass through unchanged. Verified live against the LiteLLM endpoint with a 2-turn flow: Turn 1: ``/kwargs-check`` → model emits ``KWARGS-OK`` (slice #17 happy path). Turn 2: "What was the last tool you called?" → model now correctly answers "The last tool I called was ``load_skill`` with the argument ``{\"name\":\"kwargs-check\"}``." No more "Skipping message" warnings. 7 new regression tests in ``tests/unit/test_agent_state_message_rehydration.py`` cover both write-side (model_dump preserves the subclass fields) and read-side (round-trip restores proper subclasses) plus the helper's edge cases (existing subclasses pass through, plain BaseMessage gets re-promoted, None/empty handled). Pre-existing ``test_skill_registry_load_skill_emits_install_steps_for_requirements`` still fails on both branches (unrelated to slice #17 — flagged separately). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

- ``StreamEvent.format`` previously short-circuited to bare ``self.data`` for non-``Answer`` events under ``OutputFormat.DEFAULT``, leaving slash events without an ``event:`` prefix and gluing the next event to the raw JSON. - Always wrap with ``event: <name>\n`` + one ``data:`` line per logical line of the payload (preserving blank lines per the SSE spec) so the client can classify each event reliably. - Add ``StreamEvent.parse`` as the round-trip inverse so tests and the history replay path share one shape. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…ator - The agent-loop generator already yields fully-formed ``event: <name>\ndata: <body>\n\n`` blocks; feeding the block back into a fresh ``StreamEvent(data=event).format(...)`` wrapped it a second time, leaking the inner wire format into the rendered chat bubble. - Under ``OutputFormat.DEFAULT`` emit ``event`` verbatim; keep the wrap-and-prepare path only for ``OutputFormat.WXO``, whose ``format_event`` partitions on ``data:`` and so genuinely expects a pre-formatted block as input. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

- Replace the SerializeAsAny + manual validator pair with a Pydantic discriminated union over ``type`` so HumanMessage / AIMessage / ToolMessage rehydrate to their concrete classes automatically when state is loaded from the LangGraph checkpoint. - Chunks are deliberately rejected: no production path persists chunks, and accepting them would obscure tool_call / tool_call_id loss across the checkpoint. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

- Drop the local ``SerializeAsAny[BaseMessage]`` field + manual ``_rehydrate_message_subclasses`` validator; reuse the typed ``ChatHistoryMessage`` alias from ``agent_state`` so CugaLite's ``chat_messages`` rehydrates to concrete subclasses through the shared discriminated union. - Adjacent comment cleanup on the OpenAI tool_calls / ToolMessage mapping (no behaviour change). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

- The slash-commands feature is no longer being built slice-by-slice; references to PRD #13 / slice #14 / slice #17 / slice #19 / slice #20 / slice #21 / slice #22 made the modules read like a build log. - Replace every reference with self-describing prose so the comments stand on their own without the slice index. - Pure documentation; no behaviour change. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

- ``findComposerInput`` now prefers the *visible* (non-zero-rect) composer when multiple candidates are reachable through the shadow DOM. Carbon AI Chat sometimes leaves the previous composer node attached for a beat after submit, and binding listeners to that orphan caused the slash autocomplete dropdown to never reopen. - Expose ``isComposerStale`` (detached or zero-rect) so the dropdown's MutationObserver can decide whether to re-resolve its reference instead of waiting for an unmount that may never happen. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…r when args present - After a slash command is submitted Carbon replaces the composer contenteditable host; the dropdown's MutationObserver now uses the shared ``isComposerStale`` check and re-resolves the reference when the live node changes identity, so the next ``/`` reopens the menu. - ``handleKeyDown`` lets Enter fall through to Carbon's submit handler when the typed value already matches ``^/<name>\s+`` — typing ``/echo hello world`` followed by a single Enter now submits the args intact instead of having ``acceptCommand`` rewrite the composer to ``/echo `` and drop everything after. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

- The ``⚡ /<name>`` skill chip used to render as a separate USER_DEFINED message bubble below the assistant's answer; that read as visual tool-call debug clutter. - Remove the component, the ``SLASH_SKILL`` discriminator branch in ``renderCugaUserDefinedResponse``, and the related types. The skill invocation now surfaces as a "Skill invoked: /<name>" step inside the assistant message's reasoning panel — see the follow-up ``customSendMessage`` and ``customLoadHistory`` commits. - ``SlashSuggestionsChip`` is unchanged: unknown-command suggestions stay as their own bubble because they are interactive. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…ti-line SSE parsing - ``SlashSkillInvoked`` no longer calls ``addMessage`` to mount a separate chip bubble; the handler now pushes a "Skill invoked: /<name>" step onto ``collectedSteps`` and fires an ``addMessageChunk`` partial update on the active streaming shell, mirroring the ``CodeAgent_Reasoning`` pattern. - ``parseCugaStream`` accumulates every ``data:`` line in an event block into ``dataLines`` before joining with ``\n``; the previous overwrite-per-line silently truncated multi-line bodies to their last line, so markdown with blank-line separators arrived clipped. - ``Answer`` finalisation suppresses the redundant plain-text "Unknown command…" fallback when a ``SlashSuggestions`` chip has already rendered. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

- Buffered ``SlashSkillInvoked`` events used to rehydrate as a separate USER_DEFINED chip; rebuild them as steps on the next assembled assistant message instead, so reloaded threads match the live UX. - Buffered ``SlashSuggestions`` still rehydrate as their own bubble (the chips are clickable; they don't belong in a reasoning panel). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

- Carbon AI Chat exposes its toggle text via Lit element properties (``openLabelText`` / ``closedLabelText``) and matching attributes; the ``strings`` prop's i18n dispatch runs after Carbon's React → Lit handoff and so does not retro-patch already-mounted toggles. - Extend the existing MutationObserver (already in place for the attachment decorations) to walk ``cds-aichat-reasoning-steps-toggle`` and ``cds-custom-*`` variants, set both the JS properties and the reflected attributes, and reapply on every Carbon rerender so live and replayed messages both pick up the new wording. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

- CUGA has no app-wide dark mode today, so the ``prefers-color-scheme: dark`` overrides on ``.cuga-slash-dropdown`` / ``.cuga-slash-option`` / ``.cuga-slash-suggestion`` / ``.cuga-slash-skill-chip`` produced dark chips inside an otherwise light app. - Strip the rule bodies; leave a single comment block recording the selectors so the styles can be reintroduced verbatim when an app-wide dark mode lands. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

- Same cleanup as the source-side ``chore(slash)`` commit: remove every ``slice #N`` / ``PRD #13`` reference from docstrings and inline comments so the tests read as self-describing checks of the behaviour they actually exercise. - No behaviour change, no assertion change. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

- New focused unit suite for ``StreamEvent.format`` that pins the envelope shape ``event: <name>\n`` + one ``data:`` line per logical line of the payload + the blank-line terminator. - Specifically guards against the regression where non-``Answer`` events under ``OutputFormat.DEFAULT`` returned bare ``self.data`` (no ``event:`` prefix), which on the wire glued slash events to whatever came next. - Round-trips through ``StreamEvent.parse`` to lock in symmetry. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

- The branch is gated on slash_registry.has_skill(parsed.name), and has_skill only returns True when the same name is in the registry's _by_name dict that entry() reads from. The lookup cannot miss. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

- slash_result.suggestions is a dataclass field with default_factory =list; the getattr-with-default + 'or []' fallback can never fire. - The 'and slash_result.injected_messages' clause inside the skill branch is implied by kind=="skill": synthesize_skill_invocation always returns at least three messages, so injected is invariant non-empty when kind=="skill". - StreamEvent.parse only raises ValueError; the IndexError half of except (ValueError, IndexError) was relic of the previous hand- rolled string-split parser and is now unreachable. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

- parse_markdown_with_frontmatter only raises ValueError (yaml.YAMLError is wrapped to ValueError inside the parser) and may bubble OSError from the underlying open() call. Catching Exception was wide enough to swallow real programming errors; narrow to the two reachable classes so unexpected exceptions surface as actual crashes. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

- getBoundingClientRect is a non-optional method on Element (and has been since IE5); the ?.() optional chain and the rect-is-null guard can never fire. The (el as HTMLElement) cast was also unnecessary since the method lives on Element itself. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

- Event.composedPath is non-optional on Event in lib.dom.d.ts and has been universally implemented since 2018. The 'typeof === "function"' fallback to [] was guarding a code path no browser this app targets can hit. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

- SlashSuggestionsChipData.suggestions is typed SlashSuggestion[] (non-nullable), and both producers always construct it as an array via Array.isArray(...) ? ... : []. The !data.suggestions half of the guard can never fire; the length===0 early-return stays. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

- Output of bash build.sh capturing the SlashCommandDropdown, SlashChips, and composerTextarea theater removals so the deployed bundle drops the unreachable optional-chain / function-type-check / null-fallback branches too. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

- Replace 17-line "Design" essay with a 3-line summary that names the actual constraint (EmbeddingStoreBackend protocol doesn't expose embeddings on list). - Drop the four inline `# 1./2./3./4.` breadcrumbs inside resolve(); the docstring already enumerates the steps. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

- Replace the multi-line PRD-quoting cache rationale with a single-line comment that conveys the same information. - Drop the trailing "# only the latest registry snapshot is useful" narration after the .clear() call. - Drop the $ARGUMENTS-substitution restatement comment in the skill branch (load_skill(name, args) is self-documenting). - Drop the cross-reference comment in the skill-load except branch (the sibling branch already documents this idiom). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The "deep, pure module ... per PRD #13's testing-decisions section" paragraph was self-referential ticket metadata, not documentation that helps a reader of the module. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Replace the multi-paragraph "Why AST instead of regex" comparison with one sentence stating what the AST walk actually matches. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

logger.exception in an except branch is a standard idiom — the preceding three-line rationale was narration. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…docstring - Collapse two adjacent multi-line comments narrating the WXO/DEFAULT branching (one about format_event partitioning, one about the double-envelope bug) into a single three-line comment immediately before the branch. - Reduce the get_commands docstring to a single sentence; drop the kind-discriminator restatement that the response shape already expresses. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…comment The "load_skill message quad" sentence pre-explained an injection step that lives below; the kind-routing comment above stays. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

- Drop the "(it's the tighter, skill-scoped contract)" parenthetical from the precedence docstring sentence. - Collapse the two adjacent comments about unconditional whitelist and gated policy into one line. - Tighten the whitelist-enforcement except comment to one line. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Replace the three-line narration about model-initiated calls with one line stating when substitution runs. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

- Collapse composer-discovery + shadow-boundary multi-line comments into single-line explanations. - Drop trivially-named-code comments ("Filter case-insensitively by name prefix.", "Clamp highlight index when filtered list shrinks."). - Drop the ARIA-effect "Safety:" paragraph that preemptively explained the observer scope; keep the APG-pattern rationale. - Tighten the cleanup comment in the ARIA effect to one line. - Tighten interval-bootstrap comment to one line. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

- setComposerAriaAttributes: drop "setAttribute on detached nodes is safe..." sentence (callers need not guard). - clearComposerAriaAttributes: drop "removeAttribute on detached node is a no-op" sentence (same theme). Both restate browser behaviour rather than the helper's contract. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Two-pass audit removing comments that lie about current behaviour, restate what the code says, or carry developer-diary narration from the original implementation. ~30 sites across 15 files; no source behaviour change. - server/main.py: two comment fixes. (a) A comment claimed SlashSkillInvoked renders as a "single collapsed chip" — the chip was removed earlier in this PR and the invocation now lives in the assistant message's reasoning panel; rewrote to describe current behaviour. (b) Removed the embedding-prefetch comment that called out a "~120MB FastEmbed download" cost — the function actually dispatches to OpenAI or local based on settings.storage.embedding.provider, and the prefetch intent is already obvious from the function name being called eagerly at startup. - SlashCommandDropdown.tsx: the role-restoration mechanism was explained three times in one useEffect; collapsed to one canonical spot near the originalRoleRef. - customLoadHistory.ts / customSendMessage.ts: the suppress-next-Answer rationale appeared three times; kept the canonical explanation on the flag definition. - dispatcher.py / command_resolver.py / clear.py / arg_substitution.py: removed Design essays, PRD references, and what-comments above trivially-named operations. - allowed_tools_enforcement.py / tool_approval_handler.py / loader.py / sdk.py: dropped paragraphs that restate function docstrings or paraphrase the very next line. Verified: ruff clean on touched files; pytest 112 slash unit tests pass; no frontend rebuild needed (comments don't affect bundles). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

- Drop open/exec/eval/compile/__import__/breakpoint/globals/vars/locals from PYTHON_BUILTINS_SAFELIST via a new RISKY_BUILTINS deny-set. These bypassed the spirit of any file-IO-gating ``allowed-tools`` whitelist; unwhitelisted calls to them now route through HITL approval like any other gated call. - Reverse the safelist test that asserted ``open`` was auto-allowed, and add coverage that bare ``open(...)`` is flagged when not whitelisted. - Warn when a skill dispatches but ``local_state is None`` — the load_skill quad cannot be injected and the planner won't see the body. Previously silent. - Document the resume-path trust boundary: HITL approval is the gate, re-enforcing the whitelist mid-turn would not add defense and would require persisting the tuple into LangGraph state. - Clarify DispatchResult.allowed_tools as skill-only; for other kinds the field stays ``None``. - Note that later skill-search roots intentionally override earlier ones in ``discover_skills`` so the override semantics aren't mistaken for a bug on later reads. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

- Change ``_resolver_cache`` from a misleading ``Dict[int, CommandResolver]`` to an explicit ``Optional[Tuple[int, CommandResolver]]``. The previous shape called ``.clear()`` before every insert, so it was already a size-1 cache pretending to be a many-entry one. The tuple makes the contract visible. Lock + under-lock re-check preserved. - Embed command names and descriptions as two separate vectors and rank by ``max(cos(name), cos(desc))``. Concatenating ``"<name>: <description>"`` into a single embedding biased ranking toward commands with verbose descriptions that semantically overlap with the typo. The max-of-two rule lets either half drive ranking on its own merits. - Commands with an empty description skip the second vector and rank purely on the name embedding. - Update the resolver tests: split the canned vectors so name and description halves embed independently, add coverage for description-driven ranking and the empty-description path. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

- Escape backticks when rendering ``SlashSkillInvoked`` content in the live stream. A literal backtick in raw_input / resolved_name / raw_args would close the inline code span and break the markdown. The replay path in customLoadHistory.ts already does this; live and replay now render identically for any user input. - Cap the SlashCommandDropdown bootstrap interval at 30 attempts (~15s). In embedded hosts, shadow-rooted contexts, or content-blocked pages the textarea never resolves and the interval polled forever. After the budget we log a one-time warning and stop, so the page no longer holds a perpetual setInterval handle. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Rebuilds ``src/cuga/frontend/dist/`` so the deployed bundle picks up the SlashCommandDropdown poll bound and the backtick escaping in the live SlashSkillInvoked reasoning step. Source changes landed in the preceding commit on the same branch. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Third comment-audit pass on the review-fix range (4eb418f..a0c5787). Five sites flagged; all reduce duplication of explanations that already live elsewhere in the same file. - command_resolver.py: the ``max(cos(name), cos(desc))`` rationale was spelled out in the _index field comment, the index() docstring, AND the resolve() docstring step 3. Kept the field comment as the canonical home; tightened the two docstrings. - dispatcher.py: dropped the second sentence of the cache comment that narrated the old Dict-and-clear shape we just replaced — developer-diary noise. - test_slash_allowed_tools_enforcement.py: tightened two test comments that paraphrased the production rationale (test names already describe the contract). - test_slash_command_resolver.py: dropped a preamble that restated the test function name. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

illeatmyhat · 2026-05-17T18:49:06Z

Superseded by #235 — same head (8105ffe), re-opened directly on cuga-project/cuga-agent so CI gets org secrets (fork-PR runs unconditionally lack base-repo secret access). Will close this once #235's CI is verified green.

First-class slash-command support: three built-ins (/help, /clear, /skills), invocation of repo-defined skills (e.g. /echo), an autocomplete dropdown, embedding-based suggestions for typos, and an allowed-tools whitelist enforced at the HITL approval gate. Skill invocation surfaces as a "Skill invoked: /<name>" step inside the assistant message's reasoning panel (toggle relabelled "Show details") instead of a separate chip bubble. Backend: - Slash dispatcher, embedding-backed command resolver, allowed-tools gate wired into the HITL approval flow, semantic suggestions, GET /api/commands. - ChatHistoryMessage discriminated-union rehydration so ToolMessage / AIMessage.tool_calls round-trip correctly across SSE. - Per-skill allowed_tools enforced at approval; resume-path inherits the same trust boundary. Frontend: - Autocomplete dropdown + clickable suggestion chips integrated with Carbon's composer textarea (ARIA combobox wiring, thread-rotation support, dark mode). - /clear UX: reset thread state without page reload. - Reasoning-panel-as-history: slash invocations replay as steps, no separate SlashSkillChip element. Tests: - 116 unit tests covering resolver, args substitution, allowed-tools enforcement, message synthesis. - Playwright e2e: dropdown, help, skills, echo, suggestions, /clear. Closes #N/A. Supersedes the fork PR (#230). Re-hosted on the org repo so CI receives org secrets.

illeatmyhat and others added 30 commits May 15, 2026 20:59

style(slash): apply ruff format pass

46c63a1

Whitespace/line-join cleanups in pre-existing slice files surfaced by `uv run ruff format`. No behavior change. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

illeatmyhat and others added 18 commits May 15, 2026 23:53

docs(slash): tighten AST-rationale to a single line

e67fba2

Replace the multi-paragraph "Why AST instead of regex" comparison with one sentence stating what the AST walk actually matches. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

style(slash): drop verbose rationale around /clear hook logger.exception

803844a

logger.exception in an except branch is a standard idiom — the preceding three-line rationale was narration. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

docs(sdk): drop the second sentence of the slash-invoke control-flow …

bc606c8

…comment The "load_skill message quad" sentence pre-explained an injection step that lives below; the kind-routing comment above stays. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

docs(skills): tighten arg-substitution comment

54637b2

Replace the three-line narration about model-initiated calls with one line stating when substitution runs. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

illeatmyhat force-pushed the slash-commands-backend branch from b243abe to 842e092 Compare May 16, 2026 07:22

illeatmyhat force-pushed the slash-commands-backend branch from 842e092 to 273fda7 Compare May 16, 2026 07:34

illeatmyhat and others added 5 commits May 16, 2026 00:52

illeatmyhat closed this May 17, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: slash commands and skills (/help, /clear, /skills, /<skill>) with autocomplete + reasoning UX#230

feat: slash commands and skills (/help, /clear, /skills, /<skill>) with autocomplete + reasoning UX#230
illeatmyhat wants to merge 88 commits into
cuga-project:mainfrom
illeatmyhat:slash-commands-backend

illeatmyhat commented May 16, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

illeatmyhat commented May 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

illeatmyhat commented May 16, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

For human reviewers

Screenshots

What's in this PR

For your agent reviewer

0. Environment prereqs

1. Wire format sanity check (curl)

2. Playwright live suite

3. Known pre-existing failures (not regressions)

4. Report

Test plan

Summary by CodeRabbit

Uh oh!

illeatmyhat commented May 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

illeatmyhat commented May 16, 2026 •

edited by coderabbitai Bot

Loading