feat: slash commands and skills (/help, /clear, /skills, /<skill>) with autocomplete + reasoning UX#230
Closed
illeatmyhat wants to merge 88 commits into
Closed
Conversation
Introduce a shared slash_commands package that both POST /stream and the SDK's CugaAgent.invoke() route through, plus a plugin-style builtins package auto-discovered at startup. The first built-in, /help, lists every command in the merged (built-ins + skills) registry. GET /api/commands returns the same registry for the upcoming autocomplete UI. Unknown slash commands return an "Unknown command" response instead of silently flowing to the planner; slice #20 will replace that with embedding- based suggestions. Skill-kind dispatch is detected here so slice #17 can wire in the synthetic load_skill tool-call pair without changing this seam. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… #17 Three backend slices on the scaffold from #14: - `/clear` (#15) mints a fresh thread id and emits ThreadIdChanged on the stream so the frontend can adopt it as the next X-Thread-ID. The old thread remains visible in /api/conversation-threads. SDK surfaces the new id via InvokeResult.thread_id. - `/skills` (#16) lists every discovered skill with its description and any declared requirements. SkillRegistry gains an `entries()` accessor so the built-in doesn't reach into private state. - Slice #17 wires user-side skill invocation. The dispatcher loads the skill body via registry.load_skill() and produces a 4-message pair (HumanMessage / AIMessage with load_skill tool_call / ToolMessage with the wrapped body / optional trailing HumanMessage for args). The synthetic AIMessage carries additional_kwargs={invoked_via: 'slash', raw_input, resolved_name} so slice #22 can collapse the pair into a chip and the persistence layer can distinguish slash-invoked from model-initiated load_skill calls. The pure function lives in slash_commands/message_synthesis.py and is exhaustively unit-tested. Server (event_stream) and SDK (CugaAgent.invoke) both consume DispatchResult.injected_messages by prepending the 4 messages to chat_messages before the planner runs. Args are appended verbatim to the wrapped body in this slice — full $ARGUMENTS-style substitution lands in slice #19, which will substitute on the raw SKILL.md body BEFORE the install/sandbox wrapper to avoid hitting `$` in shell commands inside the wrapper text. Follow-ups still owed by this PR series: - Langfuse span emission for parse_and_dispatch (currently a structured log line; PRD calls for a real span — easier to land alongside slice #20 where unknown-command attributes also need recording). - Frontend handler for ThreadIdChanged event (small touch in customSendMessage.ts; ships with slice #18). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Render a slash-command dropdown over the Carbon AI Chat composer when the user types '/' as the first non-whitespace character. The dropdown fetches GET /api/commands on open, lists each built-in or skill with its description and (when present) argument hint, and distinguishes the two kinds with a chip-style tag. Navigation: arrow up/down to move, Enter to accept, Escape to dismiss; click outside or blur the composer to dismiss. Selecting a command inserts '/<name> ' (trailing space) into the textarea. Typing past the slash filters the list by case-insensitive name prefix. The dropdown does not appear when the slash is not the first non-whitespace character. The Carbon composer textarea lives inside a shadow root, so the component walks the shadow tree to find the <textarea>, listens for input/keydown/blur on it directly, and renders the dropdown via a React portal anchored above the textarea using fixed positioning. The portal target is the chat wrapper div in CarbonChat.tsx. Styles ship in CarbonChat.css with a prefers-color-scheme dark variant so the dropdown matches the existing chat appearance in both modes. Not in this slice (deferred to follow-up touches): - Frontend handler for the ThreadIdChanged event emitted by /clear in slice #15 (small change in customSendMessage.ts). - Frontend rendering of the slash-invocation message pair as a collapsed chip — that is slice #22. - Clickable suggestion chips for unknown commands — slice #23. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…e model #17
Slice #17 verification surfaced that `call_model` in `cuga_lite_graph.py`
silently skips any non-Human/non-AI message and drops `tool_calls` on
AIMessages. With slash skill invocation, the planner received only the user
turns and an empty assistant turn — the ToolMessage carrying the wrapped
skill body never reached the LLM, so the model could not act on the loaded
skill.
Two surgical changes:
1. When an AIMessage has `tool_calls`, emit the OpenAI Chat Completions
`tool_calls` shape (id / type=function / function.name / function.arguments
stringified) so the model sees its prior call. Pure-content AIMessages keep
the existing pass-through path.
2. Add a `ToolMessage` branch that emits `{role: "tool", tool_call_id,
content}` so the wrapped skill body is paired by id to the AIMessage above
it. This is the standard format both Azure OpenAI and Anthropic Bedrock
proxies expect through LiteLLM.
This is a general improvement to `call_model` — any LangChain-style tool-call
pair (not just slash-synthesized ones) now reaches the model correctly. Slice
#17's pre-stuffed tool-call pair was the trigger to find the gap.
End-to-end LLM verification of /<skill> dispatch is still owed; the local
LiteLLM proxy hangs on every call in this environment (including baseline
non-slash invocations), so the timeout is environmental, not message-shape.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Slice #19 — Claude-Code-style argument substitution on skill bodies: * New `arg_substitution.py` runs four passes over the raw SKILL.md body before the install/sandbox wrap: `${N:-default}`, `${N}`, `$N`, `$ARGUMENTS`. POSIX-y but intentionally narrower than `envsubst` — leftover `$VARS` are preserved verbatim. * `SkillEntry.allowed_tools` parsed + stored from the SKILL.md frontmatter (enforcement deferred to slice #21 — HITL). * `dispatcher` now hands `parsed.raw_args` into `skill_registry.load_skill` so substitution runs against the raw body, not the wrapped one. The legacy "ARGUMENTS: …" trailer was dropped. Slice #20 — unknown-command embedding resolver: * New `command_resolver.CommandResolver` builds a lazy embedding index over every known command (built-ins + skills) and ranks the top-N semantic matches for a mistyped slash command. Backed by a minimal in-process `_InMemoryEmbeddingStore` — unknown-command resolution doesn't need cross-process persistence. * Dispatcher integration: `parse_and_dispatch` accepts a `command_resolver_factory`, `_resolve_unknown` enriches the unknown result with `DispatchResult.suggestions`, and every dispatch emits a `slash_command` telemetry record (structured log + best-effort Langfuse span, gated on `advanced_features.langfuse_tracing` so the client never constructs when tracing is off). * Both stream callers (`server/main.py` + `sdk.py`) thread the factory through. Lifespan prefetches the embedding model so the first unknown command doesn't pay a cold FastEmbed download mid-request. * Server stream contract: emits `SlashSuggestions` (suggestion chips for unknown commands, slice #23) and `SlashSkillInvoked` (skill chip, slice #22) into both the live stream and the buffered history. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Render slash command outcomes as Carbon chips instead of inline text: * Slice #22 (`SlashSkillInvoked`): the synthesized load_skill / ToolMessage / Human pair from a recognized `/skill arg` invocation collapses into one compact chip showing the resolved skill name and any args. The chip is emitted both during the live turn and replayed from buffered history. * Slice #23 (`SlashSuggestions`): when the resolver returns ranked matches for an unknown command, render clickable correction chips. Clicking a chip writes the suggested command into the composer (reusing the composer-textarea utility lifted out of `SlashCommandDropdown`). * Chips are routed through Carbon's `renderUserDefinedResponse` — deliberately different from the shadow-DOM-decoration pattern slice #18 uses for the composer dropdown; both patterns coexist intentionally (see `SlashChips.tsx` docstring). * Composer textarea fix: the shared `composerTextarea.ts` traversal now matches both `<textarea>` and `contenteditable[role="textbox"]`, since `@carbon/ai-chat@1.6.0` ships the latter. **This also fixes slice #18's autocomplete-dropdown write path, which had never been browser-tested and was silently a no-op against the new contenteditable composer.** Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Browser-driven verification for the chip rendering paths (slices #22 + #23). Boots the production build in headless Chromium, stubs every backend endpoint via `page.route`, and asserts both the live-turn render and the history-reload replay. Deliberately opt-in: * Playwright is **not** in `devDependencies` — the README walks through installing it on demand. The `test:e2e` script in `package.json` errors clearly when Playwright isn't installed. * The harness uses a **zero-dep Node static server** (`static-server.cjs`) instead of `serve` / `webpack-dev-server`, both of which pull in `path-to-regexp@6` which is broken under Node 24. * Route-precedence gotcha: most-recently-registered route wins, so the spec registers the broad `**/api/**` catch-all FIRST and specific stubs after it take priority. Documented in the spec for future hands. Also adds `test-results/` and `playwright-report/` to `.gitignore` so the Playwright bookkeeping output never lands in commits. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Whitespace/line-join cleanups in pre-existing slice files surfaced by `uv run ruff format`. No behavior change. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Slice #21 enforces a skill's frontmatter ``allowed-tools`` whitelist at the tool-approval gate, the seam closest to where the model actually invokes tools. The whitelist applies only to the turn that immediately follows a slash-skill dispatch (per-invoke RunnableConfig scope). Semantics: * ``allowed-tools`` absent in SKILL.md frontmatter → no restriction (status quo; matches Claude Code). * ``allowed-tools: []`` (explicit empty list) → allow nothing, every bare function call (outside Python builtins) trips the gate. * ``allowed-tools: [a, b]`` → only ``a`` and ``b`` may be called freely; anything else trips the gate. When the gate trips, the call drops into the **existing HITL approval interrupt flow** so the user can still override the whitelist on a case-by-case basis (PRD #13: per-tool approval flow always applies; the whitelist is an upper bound, not a hard rejection). Wiring: * ``SkillEntry.allowed_tools`` changes type to ``Optional[tuple[str, ...]]`` so the loader can distinguish "key absent" (``None``) from "explicit empty" (``()``) — the parser previously collapsed both to ``()``. * Dispatcher propagates the whitelist onto ``DispatchResult.allowed_tools``. * Both callers (server ``AgentLoop`` + SDK ``invoke``) thread it through to ``RunnableConfig.configurable['skill_allowed_tools']``. * ``ToolApprovalHandler.check_and_create_approval_interrupt`` runs the whitelist check first. The slice #21 gate runs regardless of ``settings.policy.enabled``; the existing ToolApproval-policy check is still gated on it (moved the gate inside the handler so the whitelist isn't accidentally suppressed when the policy system is off). * Detection uses ``ast.walk`` for bare ``Call(func=Name)`` nodes — method calls (``response.get(...)``) and lambdas are correctly skipped. Python builtins are safelisted so skill authors only enumerate domain tools. 14 new unit tests (pure detection helper + handler wiring). Existing slash + skill suite is green; no regressions against baseline. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Slice #15's backend already mints a fresh ``thread_id`` on ``/clear`` and surfaces it through a ``ThreadIdChanged`` SSE event, but the frontend was ignoring it — the next outbound request still carried the pre-clear ``X-Thread-ID``, so the "new conversation" effect didn't actually take effect until a page reload. The handler in ``customSendMessage`` now parses the event and calls a new ``setThreadId`` setter exported from ``CarbonChat``. The current turn keeps its already-captured thread_id for any trailing stream events; the next ``customSendMessage`` call reads the rotated id via ``getOrCreateThreadId``. The wrapped ``customSendMessage`` in ``CarbonChat`` re-reads ``currentThreadId`` after each turn and fires the existing ``onThreadChange`` prop accordingly, so embedders see the rotation. ``ThreadIdChanged`` is intentionally **not** buffered into the saved stream, so history replay does not need a corresponding handler — re-applying a past clear to the currently-viewed thread would be incorrect. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two related cleanups around the slice-#22/#23 e2e harness: * Add ``@playwright/test`` to ``frontend/package.json`` ``devDependencies`` (was previously left out, with a README walking through ``pnpm add -D``). npm/pnpm have no exact equivalent of pip's ``[extras]`` opt-in groups, so the standard JS-ecosystem placement is ``devDependencies`` for the JS package + a deliberate ``pnpm exec playwright install chromium`` for the ~150MB browser binary (documented in ``tests/e2e/README.md``). * Remove the unused ``puppeteer``, ``jest-image-snapshot``, and ``@types/jest-image-snapshot`` from the workspace-root ``devDependencies``. Grep finds zero imports of any of these across the repo; no Jest config exists, no other ``.test.ts``/``.spec.ts`` files anywhere. They look like placeholders for a visual-regression suite that was never wired up. A future ``feat(test): visual regression`` PR can reintroduce them when an actual snapshot test lands. Playwright structurally fits this spec better (layered ``page.route`` precedence, shadow-DOM-piercing ``getByRole`` locators against Carbon's ``contenteditable`` composer, locator auto-wait against streamed UI). Puppeteer's lane is visual regression / headless automation outside testing, not the kind of functional e2e the slash-chip spec needs. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds the missing slice #15 verification: `/clear` rotates the server-side thread_id, the frontend adopts it via the new `setThreadId` setter, and the **next** outbound request carries the new `X-Thread-ID` header. The Playwright route handler intercepts both `/stream` calls, captures the `X-Thread-ID` request header on each, and asserts the second-call header equals the rotated id from the first-call's `ThreadIdChanged` SSE event. A regression where the SSE event is parsed but the setter is never called would now fail this test. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…d e2e
Carbon AI Chat pumps every message body into a hidden ``aria-live="polite"``
announcement region for screen readers, so ``getByText("Conversation
cleared.")`` resolved to two elements and tripped Playwright's strict-mode
locator check. Targeting ``.last()`` reliably picks the visible chat bubble.
Also picks up the lockfile reconciliation from commit 2b19e8d (the
puppeteer/jest-image-snapshot removal) — ``pnpm install`` regenerated
``pnpm-lock.yaml`` to drop those subtrees. The lockfile delta is large
(~667 deletions) but is purely the consequence of the prior dep cleanup,
not a test-fix change.
All 4 e2e tests now green locally.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Slice #20's ``_emit_slash_telemetry`` constructs a best-effort Langfuse span gated on ``advanced_features.langfuse_tracing``. A real Langfuse backend isn't required to verify the contract — patching ``langfuse.get_client`` to a spy captures the exact ``start_observation`` call and lets us assert on it deterministically. 5 tests cover: * The tracing-disabled gate — confirms the dispatcher does not even *import* the Langfuse client when ``langfuse_tracing`` is off, which prevents the noisy "client initialized without public_key" warning that pollutes production logs. * Span name + shape (``name="slash_command"``, ``as_type="span"``, input/output/metadata) for ``builtin`` dispatches. * Args propagation for ``skill`` dispatches (``input.args`` carries ``raw_args``, ``output.resolved_name`` carries the skill name). * Top-suggestion attribution for ``unknown`` dispatches (PRD #13 story 34: mine telemetry for skills users wished existed). * Telemetry failures are swallowed — a Langfuse explosion must never break dispatch. The full network/UI verification still requires a real Langfuse instance, but that's now scoped to "does Langfuse render what we sent" rather than "do we send the right thing". Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The chip CSS already carries ``@media (prefers-color-scheme: dark)`` rules
(``CarbonChat.css:703-755``) that override the light-mode background
(``#f6f8fa``) with a dark equivalent (``#21262d``). The risk this test
guards against is a future refactor that hardcodes a light value outside
the media query and silently regresses dark-mode appearance.
Uses Playwright's ``page.emulateMedia({ colorScheme: "dark" })`` to flip
the browser's color scheme before navigating, then asserts the chip
summary's computed ``background-color`` equals ``rgb(33, 38, 45)``
(``#21262d``). Exact-value assertion rather than "not light" — a partial
regression (dropped dark rule, mistyped selector) shows up as a meaningful
failure instead of a vague "looks wrong".
5/5 e2e tests green locally.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…across graph checkpoints
Slice #17 was the first thing in this codebase to put a tool_call-bearing
AIMessage and a matching ToolMessage into ``AgentState.chat_messages`` for
persistence. That surfaced a dormant Pydantic-v2 + LangGraph bug:
1. ``chat_messages: Optional[List[BaseMessage]]`` declared the field as
the parent class. Pydantic v2 serializes by the *declared* type, so
``model_dump()`` dropped every subclass-specific field —
``AIMessage.tool_calls`` and ``ToolMessage.tool_call_id`` vanished
from the dict the checkpointer stored.
2. On rehydration, Pydantic created plain ``BaseMessage`` instances
(no discriminated union), so cuga_lite's reader at
``call_model:1908+`` hit the ``else`` branch and silently skipped
each ToolMessage with "Skipping message ... with unknown type".
End-user impact: a slash-dispatched skill's ``load_skill`` stanza
vanished after the first turn. Asked "what was the last tool you
called?" on turn 2, the model would say "I have not called any tools
yet in this conversation" — the loaded skill body was gone.
Fix combines two changes on ``AgentState`` (and the mirrored
``CugaLiteState``):
* Wrap the annotation with Pydantic v2's ``SerializeAsAny[BaseMessage]``
so ``model_dump()`` serializes each message by its *runtime* subclass
instead of the declared parent. ``tool_calls`` and ``tool_call_id``
now survive the dump.
* Add a ``mode="before"`` field validator (``_rehydrate_message_subclasses``)
that walks the incoming list and reconstructs proper subclass
instances from any dict-shaped or BaseMessage-shaped items, keyed on
the ``type`` discriminator. Existing subclass instances pass through
unchanged.
Verified live against the LiteLLM endpoint with a 2-turn flow:
Turn 1: ``/kwargs-check`` → model emits ``KWARGS-OK`` (slice #17 happy path).
Turn 2: "What was the last tool you called?" → model now correctly
answers "The last tool I called was ``load_skill`` with the
argument ``{\"name\":\"kwargs-check\"}``." No more "Skipping
message" warnings.
7 new regression tests in ``tests/unit/test_agent_state_message_rehydration.py``
cover both write-side (model_dump preserves the subclass fields) and
read-side (round-trip restores proper subclasses) plus the helper's
edge cases (existing subclasses pass through, plain BaseMessage gets
re-promoted, None/empty handled).
Pre-existing ``test_skill_registry_load_skill_emits_install_steps_for_requirements``
still fails on both branches (unrelated to slice #17 — flagged separately).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- ``StreamEvent.format`` previously short-circuited to bare ``self.data`` for non-``Answer`` events under ``OutputFormat.DEFAULT``, leaving slash events without an ``event:`` prefix and gluing the next event to the raw JSON. - Always wrap with ``event: <name>\n`` + one ``data:`` line per logical line of the payload (preserving blank lines per the SSE spec) so the client can classify each event reliably. - Add ``StreamEvent.parse`` as the round-trip inverse so tests and the history replay path share one shape. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ator - The agent-loop generator already yields fully-formed ``event: <name>\ndata: <body>\n\n`` blocks; feeding the block back into a fresh ``StreamEvent(data=event).format(...)`` wrapped it a second time, leaking the inner wire format into the rendered chat bubble. - Under ``OutputFormat.DEFAULT`` emit ``event`` verbatim; keep the wrap-and-prepare path only for ``OutputFormat.WXO``, whose ``format_event`` partitions on ``data:`` and so genuinely expects a pre-formatted block as input. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Replace the SerializeAsAny + manual validator pair with a Pydantic discriminated union over ``type`` so HumanMessage / AIMessage / ToolMessage rehydrate to their concrete classes automatically when state is loaded from the LangGraph checkpoint. - Chunks are deliberately rejected: no production path persists chunks, and accepting them would obscure tool_call / tool_call_id loss across the checkpoint. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Drop the local ``SerializeAsAny[BaseMessage]`` field + manual ``_rehydrate_message_subclasses`` validator; reuse the typed ``ChatHistoryMessage`` alias from ``agent_state`` so CugaLite's ``chat_messages`` rehydrates to concrete subclasses through the shared discriminated union. - Adjacent comment cleanup on the OpenAI tool_calls / ToolMessage mapping (no behaviour change). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- The slash-commands feature is no longer being built slice-by-slice; references to PRD #13 / slice #14 / slice #17 / slice #19 / slice #20 / slice #21 / slice #22 made the modules read like a build log. - Replace every reference with self-describing prose so the comments stand on their own without the slice index. - Pure documentation; no behaviour change. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- ``findComposerInput`` now prefers the *visible* (non-zero-rect) composer when multiple candidates are reachable through the shadow DOM. Carbon AI Chat sometimes leaves the previous composer node attached for a beat after submit, and binding listeners to that orphan caused the slash autocomplete dropdown to never reopen. - Expose ``isComposerStale`` (detached or zero-rect) so the dropdown's MutationObserver can decide whether to re-resolve its reference instead of waiting for an unmount that may never happen. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…r when args present - After a slash command is submitted Carbon replaces the composer contenteditable host; the dropdown's MutationObserver now uses the shared ``isComposerStale`` check and re-resolves the reference when the live node changes identity, so the next ``/`` reopens the menu. - ``handleKeyDown`` lets Enter fall through to Carbon's submit handler when the typed value already matches ``^/<name>\s+`` — typing ``/echo hello world`` followed by a single Enter now submits the args intact instead of having ``acceptCommand`` rewrite the composer to ``/echo `` and drop everything after. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- The ``⚡ /<name>`` skill chip used to render as a separate USER_DEFINED message bubble below the assistant's answer; that read as visual tool-call debug clutter. - Remove the component, the ``SLASH_SKILL`` discriminator branch in ``renderCugaUserDefinedResponse``, and the related types. The skill invocation now surfaces as a "Skill invoked: /<name>" step inside the assistant message's reasoning panel — see the follow-up ``customSendMessage`` and ``customLoadHistory`` commits. - ``SlashSuggestionsChip`` is unchanged: unknown-command suggestions stay as their own bubble because they are interactive. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ti-line SSE parsing - ``SlashSkillInvoked`` no longer calls ``addMessage`` to mount a separate chip bubble; the handler now pushes a "Skill invoked: /<name>" step onto ``collectedSteps`` and fires an ``addMessageChunk`` partial update on the active streaming shell, mirroring the ``CodeAgent_Reasoning`` pattern. - ``parseCugaStream`` accumulates every ``data:`` line in an event block into ``dataLines`` before joining with ``\n``; the previous overwrite-per-line silently truncated multi-line bodies to their last line, so markdown with blank-line separators arrived clipped. - ``Answer`` finalisation suppresses the redundant plain-text "Unknown command…" fallback when a ``SlashSuggestions`` chip has already rendered. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Buffered ``SlashSkillInvoked`` events used to rehydrate as a separate USER_DEFINED chip; rebuild them as steps on the next assembled assistant message instead, so reloaded threads match the live UX. - Buffered ``SlashSuggestions`` still rehydrate as their own bubble (the chips are clickable; they don't belong in a reasoning panel). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Carbon AI Chat exposes its toggle text via Lit element properties (``openLabelText`` / ``closedLabelText``) and matching attributes; the ``strings`` prop's i18n dispatch runs after Carbon's React → Lit handoff and so does not retro-patch already-mounted toggles. - Extend the existing MutationObserver (already in place for the attachment decorations) to walk ``cds-aichat-reasoning-steps-toggle`` and ``cds-custom-*`` variants, set both the JS properties and the reflected attributes, and reapply on every Carbon rerender so live and replayed messages both pick up the new wording. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- CUGA has no app-wide dark mode today, so the ``prefers-color-scheme: dark`` overrides on ``.cuga-slash-dropdown`` / ``.cuga-slash-option`` / ``.cuga-slash-suggestion`` / ``.cuga-slash-skill-chip`` produced dark chips inside an otherwise light app. - Strip the rule bodies; leave a single comment block recording the selectors so the styles can be reintroduced verbatim when an app-wide dark mode lands. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Same cleanup as the source-side ``chore(slash)`` commit: remove every ``slice #N`` / ``PRD #13`` reference from docstrings and inline comments so the tests read as self-describing checks of the behaviour they actually exercise. - No behaviour change, no assertion change. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- New focused unit suite for ``StreamEvent.format`` that pins the envelope shape ``event: <name>\n`` + one ``data:`` line per logical line of the payload + the blank-line terminator. - Specifically guards against the regression where non-``Answer`` events under ``OutputFormat.DEFAULT`` returned bare ``self.data`` (no ``event:`` prefix), which on the wire glued slash events to whatever came next. - Round-trips through ``StreamEvent.parse`` to lock in symmetry. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- The branch is gated on slash_registry.has_skill(parsed.name), and has_skill only returns True when the same name is in the registry's _by_name dict that entry() reads from. The lookup cannot miss. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- slash_result.suggestions is a dataclass field with default_factory =list; the getattr-with-default + 'or []' fallback can never fire. - The 'and slash_result.injected_messages' clause inside the skill branch is implied by kind=="skill": synthesize_skill_invocation always returns at least three messages, so injected is invariant non-empty when kind=="skill". - StreamEvent.parse only raises ValueError; the IndexError half of except (ValueError, IndexError) was relic of the previous hand- rolled string-split parser and is now unreachable. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- parse_markdown_with_frontmatter only raises ValueError (yaml.YAMLError is wrapped to ValueError inside the parser) and may bubble OSError from the underlying open() call. Catching Exception was wide enough to swallow real programming errors; narrow to the two reachable classes so unexpected exceptions surface as actual crashes. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- getBoundingClientRect is a non-optional method on Element (and has been since IE5); the ?.() optional chain and the rect-is-null guard can never fire. The (el as HTMLElement) cast was also unnecessary since the method lives on Element itself. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Event.composedPath is non-optional on Event in lib.dom.d.ts and has been universally implemented since 2018. The 'typeof === "function"' fallback to [] was guarding a code path no browser this app targets can hit. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- SlashSuggestionsChipData.suggestions is typed SlashSuggestion[] (non-nullable), and both producers always construct it as an array via Array.isArray(...) ? ... : []. The !data.suggestions half of the guard can never fire; the length===0 early-return stays. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Output of bash build.sh capturing the SlashCommandDropdown, SlashChips, and composerTextarea theater removals so the deployed bundle drops the unreachable optional-chain / function-type-check / null-fallback branches too. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Replace 17-line "Design" essay with a 3-line summary that names the actual constraint (EmbeddingStoreBackend protocol doesn't expose embeddings on list). - Drop the four inline `# 1./2./3./4.` breadcrumbs inside resolve(); the docstring already enumerates the steps. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Replace the multi-line PRD-quoting cache rationale with a single-line comment that conveys the same information. - Drop the trailing "# only the latest registry snapshot is useful" narration after the .clear() call. - Drop the $ARGUMENTS-substitution restatement comment in the skill branch (load_skill(name, args) is self-documenting). - Drop the cross-reference comment in the skill-load except branch (the sibling branch already documents this idiom). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The "deep, pure module ... per PRD #13's testing-decisions section" paragraph was self-referential ticket metadata, not documentation that helps a reader of the module. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replace the multi-paragraph "Why AST instead of regex" comparison with one sentence stating what the AST walk actually matches. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
logger.exception in an except branch is a standard idiom — the preceding three-line rationale was narration. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…docstring - Collapse two adjacent multi-line comments narrating the WXO/DEFAULT branching (one about format_event partitioning, one about the double-envelope bug) into a single three-line comment immediately before the branch. - Reduce the get_commands docstring to a single sentence; drop the kind-discriminator restatement that the response shape already expresses. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…comment The "load_skill message quad" sentence pre-explained an injection step that lives below; the kind-routing comment above stays. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Drop the "(it's the tighter, skill-scoped contract)" parenthetical from the precedence docstring sentence. - Collapse the two adjacent comments about unconditional whitelist and gated policy into one line. - Tighten the whitelist-enforcement except comment to one line. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replace the three-line narration about model-initiated calls with one line stating when substitution runs. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Collapse composer-discovery + shadow-boundary multi-line comments
into single-line explanations.
- Drop trivially-named-code comments ("Filter case-insensitively by
name prefix.", "Clamp highlight index when filtered list shrinks.").
- Drop the ARIA-effect "Safety:" paragraph that preemptively explained
the observer scope; keep the APG-pattern rationale.
- Tighten the cleanup comment in the ARIA effect to one line.
- Tighten interval-bootstrap comment to one line.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- setComposerAriaAttributes: drop "setAttribute on detached nodes is safe..." sentence (callers need not guard). - clearComposerAriaAttributes: drop "removeAttribute on detached node is a no-op" sentence (same theme). Both restate browser behaviour rather than the helper's contract. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
b243abe to
842e092
Compare
Two-pass audit removing comments that lie about current behaviour, restate what the code says, or carry developer-diary narration from the original implementation. ~30 sites across 15 files; no source behaviour change. - server/main.py: two comment fixes. (a) A comment claimed SlashSkillInvoked renders as a "single collapsed chip" — the chip was removed earlier in this PR and the invocation now lives in the assistant message's reasoning panel; rewrote to describe current behaviour. (b) Removed the embedding-prefetch comment that called out a "~120MB FastEmbed download" cost — the function actually dispatches to OpenAI or local based on settings.storage.embedding.provider, and the prefetch intent is already obvious from the function name being called eagerly at startup. - SlashCommandDropdown.tsx: the role-restoration mechanism was explained three times in one useEffect; collapsed to one canonical spot near the originalRoleRef. - customLoadHistory.ts / customSendMessage.ts: the suppress-next-Answer rationale appeared three times; kept the canonical explanation on the flag definition. - dispatcher.py / command_resolver.py / clear.py / arg_substitution.py: removed Design essays, PRD references, and what-comments above trivially-named operations. - allowed_tools_enforcement.py / tool_approval_handler.py / loader.py / sdk.py: dropped paragraphs that restate function docstrings or paraphrase the very next line. Verified: ruff clean on touched files; pytest 112 slash unit tests pass; no frontend rebuild needed (comments don't affect bundles). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
842e092 to
273fda7
Compare
- Drop open/exec/eval/compile/__import__/breakpoint/globals/vars/locals from PYTHON_BUILTINS_SAFELIST via a new RISKY_BUILTINS deny-set. These bypassed the spirit of any file-IO-gating ``allowed-tools`` whitelist; unwhitelisted calls to them now route through HITL approval like any other gated call. - Reverse the safelist test that asserted ``open`` was auto-allowed, and add coverage that bare ``open(...)`` is flagged when not whitelisted. - Warn when a skill dispatches but ``local_state is None`` — the load_skill quad cannot be injected and the planner won't see the body. Previously silent. - Document the resume-path trust boundary: HITL approval is the gate, re-enforcing the whitelist mid-turn would not add defense and would require persisting the tuple into LangGraph state. - Clarify DispatchResult.allowed_tools as skill-only; for other kinds the field stays ``None``. - Note that later skill-search roots intentionally override earlier ones in ``discover_skills`` so the override semantics aren't mistaken for a bug on later reads. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Change ``_resolver_cache`` from a misleading ``Dict[int, CommandResolver]`` to an explicit ``Optional[Tuple[int, CommandResolver]]``. The previous shape called ``.clear()`` before every insert, so it was already a size-1 cache pretending to be a many-entry one. The tuple makes the contract visible. Lock + under-lock re-check preserved. - Embed command names and descriptions as two separate vectors and rank by ``max(cos(name), cos(desc))``. Concatenating ``"<name>: <description>"`` into a single embedding biased ranking toward commands with verbose descriptions that semantically overlap with the typo. The max-of-two rule lets either half drive ranking on its own merits. - Commands with an empty description skip the second vector and rank purely on the name embedding. - Update the resolver tests: split the canned vectors so name and description halves embed independently, add coverage for description-driven ranking and the empty-description path. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Escape backticks when rendering ``SlashSkillInvoked`` content in the live stream. A literal backtick in raw_input / resolved_name / raw_args would close the inline code span and break the markdown. The replay path in customLoadHistory.ts already does this; live and replay now render identically for any user input. - Cap the SlashCommandDropdown bootstrap interval at 30 attempts (~15s). In embedded hosts, shadow-rooted contexts, or content-blocked pages the textarea never resolves and the interval polled forever. After the budget we log a one-time warning and stop, so the page no longer holds a perpetual setInterval handle. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Rebuilds ``src/cuga/frontend/dist/`` so the deployed bundle picks up the SlashCommandDropdown poll bound and the backtick escaping in the live SlashSkillInvoked reasoning step. Source changes landed in the preceding commit on the same branch. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Third comment-audit pass on the review-fix range (4eb418f..a0c5787). Five sites flagged; all reduce duplication of explanations that already live elsewhere in the same file. - command_resolver.py: the ``max(cos(name), cos(desc))`` rationale was spelled out in the _index field comment, the index() docstring, AND the resolve() docstring step 3. Kept the field comment as the canonical home; tightened the two docstrings. - dispatcher.py: dropped the second sentence of the cache comment that narrated the old Dict-and-clear shape we just replaced — developer-diary noise. - test_slash_allowed_tools_enforcement.py: tightened two test comments that paraphrased the production rationale (test names already describe the contract). - test_slash_command_resolver.py: dropped a preamble that restated the test function name. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Collaborator
Author
illeatmyhat
added a commit
that referenced
this pull request
May 17, 2026
First-class slash-command support: three built-ins (/help, /clear, /skills), invocation of repo-defined skills (e.g. /echo), an autocomplete dropdown, embedding-based suggestions for typos, and an allowed-tools whitelist enforced at the HITL approval gate. Skill invocation surfaces as a "Skill invoked: /<name>" step inside the assistant message's reasoning panel (toggle relabelled "Show details") instead of a separate chip bubble. Backend: - Slash dispatcher, embedding-backed command resolver, allowed-tools gate wired into the HITL approval flow, semantic suggestions, GET /api/commands. - ChatHistoryMessage discriminated-union rehydration so ToolMessage / AIMessage.tool_calls round-trip correctly across SSE. - Per-skill allowed_tools enforced at approval; resume-path inherits the same trust boundary. Frontend: - Autocomplete dropdown + clickable suggestion chips integrated with Carbon's composer textarea (ARIA combobox wiring, thread-rotation support, dark mode). - /clear UX: reset thread state without page reload. - Reasoning-panel-as-history: slash invocations replay as steps, no separate SlashSkillChip element. Tests: - 116 unit tests covering resolver, args substitution, allowed-tools enforcement, message synthesis. - Playwright e2e: dropdown, help, skills, echo, suggestions, /clear. Closes #N/A. Supersedes the fork PR (#230). Re-hosted on the org repo so CI receives org secrets.
illeatmyhat
added a commit
that referenced
this pull request
May 20, 2026
First-class slash-command support: three built-ins (/help, /clear, /skills), invocation of repo-defined skills (e.g. /echo), an autocomplete dropdown, embedding-based suggestions for typos, and an allowed-tools whitelist enforced at the HITL approval gate. Skill invocation surfaces as a "Skill invoked: /<name>" step inside the assistant message's reasoning panel (toggle relabelled "Show details") instead of a separate chip bubble. Backend: - Slash dispatcher, embedding-backed command resolver, allowed-tools gate wired into the HITL approval flow, semantic suggestions, GET /api/commands. - ChatHistoryMessage discriminated-union rehydration so ToolMessage / AIMessage.tool_calls round-trip correctly across SSE. - Per-skill allowed_tools enforced at approval; resume-path inherits the same trust boundary. Frontend: - Autocomplete dropdown + clickable suggestion chips integrated with Carbon's composer textarea (ARIA combobox wiring, thread-rotation support, dark mode). - /clear UX: reset thread state without page reload. - Reasoning-panel-as-history: slash invocations replay as steps, no separate SlashSkillChip element. Tests: - 116 unit tests covering resolver, args substitution, allowed-tools enforcement, message synthesis. - Playwright e2e: dropdown, help, skills, echo, suggestions, /clear. Closes #N/A. Supersedes the fork PR (#230). Re-hosted on the org repo so CI receives org secrets.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds first-class slash-command support to CUGA: three built-ins (
/help,/clear,/skills), invocation of repo-defined skills (e.g./echo), an autocomplete dropdown, embedding-based suggestions for typos, and anallowed-toolswhitelist enforced at the HITL approval gate. The skill invocation surfaces as a "Skill invoked: /" step inside the assistant message's reasoning panel — toggle relabelled "Show details" — instead of a separate chip bubble.For human reviewers
Tell your agent to read this PR and execute the instructions in the "For your agent reviewer" section below.
Screenshots
//help/skills/echo hello world— reasoning panel expanded/sumarize→ suggestion chips/clearWhat's in this PR
src/cuga/backend/slash_commands/(parser, registry, dispatcher, message synthesis,$ARGUMENTSsubstitution, command resolver, allowed-tools enforcement, three built-ins). Stream eventsSlashSkillInvoked,SlashSuggestions,ThreadIdChanged. ChatHistoryMessage discriminated-union rehydration soAIMessage.tool_calls/ToolMessage.tool_call_idsurvive LangGraph checkpoints. SSE envelope now always wraps withevent:+data:— fixes a bug where non-Answerevents underOutputFormat.DEFAULTleaked raw JSON glued to the next event.contenteditable+ shadow-DOM + post-submit node-replacement (composerTextarea.ts). Autocomplete dropdown (SlashCommandDropdown.tsx) reopens after submit and yields Enter when args are typed. Suggestion chips (SlashChips.tsx). Skill invocation rendered as a reasoning step (customSendMessage.ts+customLoadHistory.ts). Carbon's reasoning toggle relabelled to "Show details" via MutationObserver inCarbonChat.tsx.tests/e2e/slash-chips.spec.ts) for both chip types, history replay,/clearrotation, and the "Show details" label.For your agent reviewer
You're being asked to verify that the slash-command pipeline works end-to-end on the user's machine. Do not trust the unit tests alone — the failure modes here are visual (the chip leaking to its own bubble, the dropdown not reopening, the SSE wire format leaking into the chat bubble), and they only show up when a real browser drives a real server.
0. Environment prereqs
curl -sf $OPENAI_BASE_URL/health || echo "VPN?"if you have a health route, or just try a one-shotagent.invoke("hello")and watch for a response within ~20s.uvinstalled. Runuv synconce.bash src/frontend_workspaces/frontend/build.sh—pnpm run buildalone does not deploy. The script doespnpm run buildand copiesdist/intosrc/cuga/frontend/dist/, which is what the FastAPI server actually serves..agents/skills/echo/SKILL.md(the.agents/directory is gitignored, so it won't be in your checkout) with the following:.venv/bin/cuga start demo_skills— wait untilcurl -sf http://127.0.0.1:7860/api/commandsreturns 200.1. Wire format sanity check (curl)
Expected: each event is one
event: <Name>\nfollowed by onedata: ...\nfollowed by a blank line. If you seedata: event:ordata: data:anywhere in the output, the SSE envelope is being double-wrapped (regression ofmain.py:1771).2. Playwright live suite
Mirror the patterns in
src/frontend_workspaces/frontend/tests/e2e/slash-chips.spec.ts(stubbed) but point the new spec at the live server. Critical patterns:contenteditablehost withrole="textbox", not a<textarea>. Read withtextContent, set with the patterns incomposerTextarea.ts.getBoundingClientRect().width > 0).MutationObservermust re-resolve when the live composer node changes identity. Test this by submitting/skillsthen typing/again and asserting the dropdown reopens.^/<name>\s+), Enter should fall through to Carbon's submit. Test/echo hello worldwith a single Enter press and assert the outbound/streampayload contains"hello world".The minimum acceptance suite:
/.cuga-slash-dropdownvisible, lists/clear,/help,/skills,/echo/help/help,/clear,/skills/skills/clearX-Thread-IDon the next/streamrequest differs from the prior one (capture viapage.on("response", ...))/echo hello world(single Enter).cuga-slash-skill-chipelement anywhere; (b) at least one assistant bubble equals exactly"hello world"(filter the user message out first); (c) the rendered text does not contain the raw SSE wire formatevent:ordata:/sumarize.cuga-slash-suggestionsvisible with at least one chip; no bubble text matches/Unknown command:/(the suppression contract — chip's own header text isUnknown command <name>. Did you mean:without colon, so the colon-anchored regex distinguishes them)/skills/skills, wait for response, type/, assert dropdown visible againSample spec scaffolding lives in
/tmp/full-suite.spec.ts,/tmp/dropdown-reopen.spec.ts, and/tmp/echo-args.spec.tson the originating dev machine — recreate them in a/tmp/directory of your own. Usepnpm exec playwright test --config <your-config>.ts --reporter=linefromsrc/frontend_workspaces/frontend/.3. Known pre-existing failures (not regressions)
Both fail identically on
main; do not chase:tests/unit/test_skill_loader.py::test_skill_registry_load_skill_emits_install_steps_for_requirementstests/unit/test_find_tools_exception.py::test_find_tools_composes_query_with_initial_user_message4. Report
Run the live suite, paste the pass/fail tally, and flag any of the following that you observe — these were the bugs the development session caught and would be the most informative regressions:
/echo hello worldarrives at the server with emptyraw_args/cleardoes not rotate the thread id⚡ /<name>chip bubble (should be a reasoning step)event: …/data: …text appears inside any assistant bubbleTest plan
uv run pytest tests/unit/)pnpm exec playwright test tests/e2e)cuga start demo_skills)🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Frontend
Tool Approval
Streaming
Tests
Chores