Skip to content

feat: slash commands and skills (/help, /clear, /skills, /<skill>) with autocomplete + reasoning UX#230

Closed
illeatmyhat wants to merge 88 commits into
cuga-project:mainfrom
illeatmyhat:slash-commands-backend
Closed

feat: slash commands and skills (/help, /clear, /skills, /<skill>) with autocomplete + reasoning UX#230
illeatmyhat wants to merge 88 commits into
cuga-project:mainfrom
illeatmyhat:slash-commands-backend

Conversation

@illeatmyhat

@illeatmyhat illeatmyhat commented May 16, 2026

Copy link
Copy Markdown
Collaborator

Summary

Adds first-class slash-command support to CUGA: three built-ins (/help, /clear, /skills), invocation of repo-defined skills (e.g. /echo), an autocomplete dropdown, embedding-based suggestions for typos, and an allowed-tools whitelist enforced at the HITL approval gate. The skill invocation surfaces as a "Skill invoked: /" step inside the assistant message's reasoning panel — toggle relabelled "Show details" — instead of a separate chip bubble.

For human reviewers

Tell your agent to read this PR and execute the instructions in the "For your agent reviewer" section below.

Screenshots

Autocomplete dropdown on / /help
dropdown help
/skills /echo hello world — reasoning panel expanded
skills echo
Unknown /sumarize → suggestion chips /clear
suggestions clear

What's in this PR

  • Backendsrc/cuga/backend/slash_commands/ (parser, registry, dispatcher, message synthesis, $ARGUMENTS substitution, command resolver, allowed-tools enforcement, three built-ins). Stream events SlashSkillInvoked, SlashSuggestions, ThreadIdChanged. ChatHistoryMessage discriminated-union rehydration so AIMessage.tool_calls / ToolMessage.tool_call_id survive LangGraph checkpoints. SSE envelope now always wraps with event: + data: — fixes a bug where non-Answer events under OutputFormat.DEFAULT leaked raw JSON glued to the next event.
  • Frontend — Carbon AI Chat composer hardened against contenteditable + shadow-DOM + post-submit node-replacement (composerTextarea.ts). Autocomplete dropdown (SlashCommandDropdown.tsx) reopens after submit and yields Enter when args are typed. Suggestion chips (SlashChips.tsx). Skill invocation rendered as a reasoning step (customSendMessage.ts + customLoadHistory.ts). Carbon's reasoning toggle relabelled to "Show details" via MutationObserver in CarbonChat.tsx.
  • Tests — Python unit suites for stream envelope, ChatHistoryMessage rehydration, every slash module, and the Langfuse span contract. Playwright e2e (tests/e2e/slash-chips.spec.ts) for both chip types, history replay, /clear rotation, and the "Show details" label.

For your agent reviewer

You're being asked to verify that the slash-command pipeline works end-to-end on the user's machine. Do not trust the unit tests alone — the failure modes here are visual (the chip leaking to its own bubble, the dropdown not reopening, the SSE wire format leaking into the chat bubble), and they only show up when a real browser drives a real server.

0. Environment prereqs

  1. LiteLLM endpoint reachable. CUGA's planner uses LiteLLM (Azure/gpt-4.1 by default). If you're behind a corporate VPN that fronts LiteLLM, connect it before doing anything else — without VPN, calls hang silently rather than failing. Confirm with curl -sf $OPENAI_BASE_URL/health || echo "VPN?" if you have a health route, or just try a one-shot agent.invoke("hello") and watch for a response within ~20s.
  2. uv installed. Run uv sync once.
  3. Frontend bundle built. Run bash src/frontend_workspaces/frontend/build.shpnpm run build alone does not deploy. The script does pnpm run build and copies dist/ into src/cuga/frontend/dist/, which is what the FastAPI server actually serves.
  4. Echo skill on disk. Create .agents/skills/echo/SKILL.md (the .agents/ directory is gitignored, so it won't be in your checkout) with the following:
    ---
    name: echo
    description: Echoes its arguments back verbatim. A minimal skill for verifying the slash-command pipeline end-to-end.
    ---
    
    You are an echo. Respond with the user's arguments verbatim — exactly as they wrote them, nothing else.
    
    Do not add a confirmation. Do not add a preamble. Do not add a timestamp. Do not quote the args. Do not call any tools. Do not plan further steps.
    
    If the user passed no arguments, respond with the single word: `(empty)`.
  5. Server up. .venv/bin/cuga start demo_skills — wait until curl -sf http://127.0.0.1:7860/api/commands returns 200.

1. Wire format sanity check (curl)

curl -sN -X POST http://127.0.0.1:7860/stream \
  -H "Content-Type: application/json" \
  -H "X-Thread-ID: $(uuidgen)" \
  -d '{"query":"/echo hello world"}' | head -40

Expected: each event is one event: <Name>\n followed by one data: ...\n followed by a blank line. If you see data: event: or data: data: anywhere in the output, the SSE envelope is being double-wrapped (regression of main.py:1771).

2. Playwright live suite

Mirror the patterns in src/frontend_workspaces/frontend/tests/e2e/slash-chips.spec.ts (stubbed) but point the new spec at the live server. Critical patterns:

  • Carbon AI Chat 1.6 ships the composer as a contenteditable host with role="textbox", not a <textarea>. Read with textContent, set with the patterns in composerTextarea.ts.
  • Multiple composers can co-exist in the shadow DOM after a submit; prefer the visible one (getBoundingClientRect().width > 0).
  • Carbon's submit clears the composer; the autocomplete dropdown's MutationObserver must re-resolve when the live composer node changes identity. Test this by submitting /skills then typing / again and asserting the dropdown reopens.
  • The slash dropdown intercepts Enter to "accept" the highlighted command. When the user has already typed args (^/<name>\s+), Enter should fall through to Carbon's submit. Test /echo hello world with a single Enter press and assert the outbound /stream payload contains "hello world".

The minimum acceptance suite:

Scenario Assertion
Dropdown opens on / .cuga-slash-dropdown visible, lists /clear, /help, /skills, /echo
/help Response text contains /help, /clear, /skills
/skills Response text contains "Installed skills" + "echo" + the skill description
/clear X-Thread-ID on the next /stream request differs from the prior one (capture via page.on("response", ...))
/echo hello world (single Enter) (a) no .cuga-slash-skill-chip element anywhere; (b) at least one assistant bubble equals exactly "hello world" (filter the user message out first); (c) the rendered text does not contain the raw SSE wire format event: or data:
/sumarize .cuga-slash-suggestions visible with at least one chip; no bubble text matches /Unknown command:/ (the suppression contract — chip's own header text is Unknown command <name>. Did you mean: without colon, so the colon-anchored regex distinguishes them)
Dropdown re-opens after /skills Submit /skills, wait for response, type /, assert dropdown visible again

Sample spec scaffolding lives in /tmp/full-suite.spec.ts, /tmp/dropdown-reopen.spec.ts, and /tmp/echo-args.spec.ts on the originating dev machine — recreate them in a /tmp/ directory of your own. Use pnpm exec playwright test --config <your-config>.ts --reporter=line from src/frontend_workspaces/frontend/.

3. Known pre-existing failures (not regressions)

Both fail identically on main; do not chase:

  • tests/unit/test_skill_loader.py::test_skill_registry_load_skill_emits_install_steps_for_requirements
  • tests/unit/test_find_tools_exception.py::test_find_tools_composes_query_with_initial_user_message

4. Report

Run the live suite, paste the pass/fail tally, and flag any of the following that you observe — these were the bugs the development session caught and would be the most informative regressions:

  • Dropdown does not reopen after submit
  • /echo hello world arrives at the server with empty raw_args
  • /clear does not rotate the thread id
  • Suggestion chips never render for an unknown command
  • Skill invocation appears as a separate ⚡ /<name> chip bubble (should be a reasoning step)
  • "Show Reasoning" appears instead of "Show details"
  • Raw event: … / data: … text appears inside any assistant bubble

Test plan

  • Python unit suites green (uv run pytest tests/unit/)
  • Frontend e2e stubbed suite green (pnpm exec playwright test tests/e2e)
  • Live-server Playwright suite green (9/9 against cuga start demo_skills)
  • Curl-confirmed SSE envelope shape
  • Reviewer's agent re-ran the live suite on their machine

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Full slash-command system: built-ins (/help, /skills, /clear), skill argument substitution, semantic suggestions, and GET /api/commands.
    • Embedding-backed command resolver and slash dispatch integrated into SDK/agent flows.
  • Frontend

    • Autocomplete dropdown, clickable suggestion chips, composer integration, thread-rotation support, and UI styling.
  • Tool Approval

    • Per-skill allowed-tools whitelist enforced alongside existing approval flow.
  • Streaming

    • Robust SSE parsing/formatting and safer stream handling.
  • Tests

    • Extensive unit + Playwright e2e coverage.
  • Chores

    • .gitignore updated to skip frontend test artifacts.

illeatmyhat and others added 30 commits May 15, 2026 20:59
Introduce a shared slash_commands package that both POST /stream and the
SDK's CugaAgent.invoke() route through, plus a plugin-style builtins
package auto-discovered at startup. The first built-in, /help, lists every
command in the merged (built-ins + skills) registry. GET /api/commands
returns the same registry for the upcoming autocomplete UI.

Unknown slash commands return an "Unknown command" response instead of
silently flowing to the planner; slice #20 will replace that with embedding-
based suggestions. Skill-kind dispatch is detected here so slice #17 can
wire in the synthetic load_skill tool-call pair without changing this seam.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… #17

Three backend slices on the scaffold from #14:

- `/clear` (#15) mints a fresh thread id and emits ThreadIdChanged on the
  stream so the frontend can adopt it as the next X-Thread-ID. The old thread
  remains visible in /api/conversation-threads. SDK surfaces the new id via
  InvokeResult.thread_id.

- `/skills` (#16) lists every discovered skill with its description and any
  declared requirements. SkillRegistry gains an `entries()` accessor so the
  built-in doesn't reach into private state.

- Slice #17 wires user-side skill invocation. The dispatcher loads the skill
  body via registry.load_skill() and produces a 4-message pair
  (HumanMessage / AIMessage with load_skill tool_call / ToolMessage with the
  wrapped body / optional trailing HumanMessage for args). The synthetic
  AIMessage carries additional_kwargs={invoked_via: 'slash', raw_input,
  resolved_name} so slice #22 can collapse the pair into a chip and the
  persistence layer can distinguish slash-invoked from model-initiated
  load_skill calls. The pure function lives in
  slash_commands/message_synthesis.py and is exhaustively unit-tested.

Server (event_stream) and SDK (CugaAgent.invoke) both consume
DispatchResult.injected_messages by prepending the 4 messages to chat_messages
before the planner runs.

Args are appended verbatim to the wrapped body in this slice — full
$ARGUMENTS-style substitution lands in slice #19, which will substitute on
the raw SKILL.md body BEFORE the install/sandbox wrapper to avoid hitting `$`
in shell commands inside the wrapper text.

Follow-ups still owed by this PR series:
- Langfuse span emission for parse_and_dispatch (currently a structured log
  line; PRD calls for a real span — easier to land alongside slice #20 where
  unknown-command attributes also need recording).
- Frontend handler for ThreadIdChanged event (small touch in
  customSendMessage.ts; ships with slice #18).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Render a slash-command dropdown over the Carbon AI Chat composer when the
user types '/' as the first non-whitespace character. The dropdown fetches
GET /api/commands on open, lists each built-in or skill with its description
and (when present) argument hint, and distinguishes the two kinds with a
chip-style tag.

Navigation: arrow up/down to move, Enter to accept, Escape to dismiss; click
outside or blur the composer to dismiss. Selecting a command inserts
'/<name> ' (trailing space) into the textarea. Typing past the slash filters
the list by case-insensitive name prefix. The dropdown does not appear when
the slash is not the first non-whitespace character.

The Carbon composer textarea lives inside a shadow root, so the component
walks the shadow tree to find the <textarea>, listens for input/keydown/blur
on it directly, and renders the dropdown via a React portal anchored above
the textarea using fixed positioning. The portal target is the chat wrapper
div in CarbonChat.tsx.

Styles ship in CarbonChat.css with a prefers-color-scheme dark variant so
the dropdown matches the existing chat appearance in both modes.

Not in this slice (deferred to follow-up touches):
- Frontend handler for the ThreadIdChanged event emitted by /clear in slice
  #15 (small change in customSendMessage.ts).
- Frontend rendering of the slash-invocation message pair as a collapsed
  chip — that is slice #22.
- Clickable suggestion chips for unknown commands — slice #23.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…e model #17

Slice #17 verification surfaced that `call_model` in `cuga_lite_graph.py`
silently skips any non-Human/non-AI message and drops `tool_calls` on
AIMessages. With slash skill invocation, the planner received only the user
turns and an empty assistant turn — the ToolMessage carrying the wrapped
skill body never reached the LLM, so the model could not act on the loaded
skill.

Two surgical changes:

1. When an AIMessage has `tool_calls`, emit the OpenAI Chat Completions
   `tool_calls` shape (id / type=function / function.name / function.arguments
   stringified) so the model sees its prior call. Pure-content AIMessages keep
   the existing pass-through path.

2. Add a `ToolMessage` branch that emits `{role: "tool", tool_call_id,
   content}` so the wrapped skill body is paired by id to the AIMessage above
   it. This is the standard format both Azure OpenAI and Anthropic Bedrock
   proxies expect through LiteLLM.

This is a general improvement to `call_model` — any LangChain-style tool-call
pair (not just slash-synthesized ones) now reaches the model correctly. Slice
#17's pre-stuffed tool-call pair was the trigger to find the gap.

End-to-end LLM verification of /<skill> dispatch is still owed; the local
LiteLLM proxy hangs on every call in this environment (including baseline
non-slash invocations), so the timeout is environmental, not message-shape.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Slice #19 — Claude-Code-style argument substitution on skill bodies:
  * New `arg_substitution.py` runs four passes over the raw SKILL.md body
    before the install/sandbox wrap: `${N:-default}`, `${N}`, `$N`,
    `$ARGUMENTS`. POSIX-y but intentionally narrower than `envsubst` —
    leftover `$VARS` are preserved verbatim.
  * `SkillEntry.allowed_tools` parsed + stored from the SKILL.md frontmatter
    (enforcement deferred to slice #21 — HITL).
  * `dispatcher` now hands `parsed.raw_args` into `skill_registry.load_skill`
    so substitution runs against the raw body, not the wrapped one. The
    legacy "ARGUMENTS: …" trailer was dropped.

Slice #20 — unknown-command embedding resolver:
  * New `command_resolver.CommandResolver` builds a lazy embedding index over
    every known command (built-ins + skills) and ranks the top-N semantic
    matches for a mistyped slash command. Backed by a minimal in-process
    `_InMemoryEmbeddingStore` — unknown-command resolution doesn't need
    cross-process persistence.
  * Dispatcher integration: `parse_and_dispatch` accepts a
    `command_resolver_factory`, `_resolve_unknown` enriches the unknown
    result with `DispatchResult.suggestions`, and every dispatch emits a
    `slash_command` telemetry record (structured log + best-effort Langfuse
    span, gated on `advanced_features.langfuse_tracing` so the client never
    constructs when tracing is off).
  * Both stream callers (`server/main.py` + `sdk.py`) thread the factory
    through. Lifespan prefetches the embedding model so the first unknown
    command doesn't pay a cold FastEmbed download mid-request.
  * Server stream contract: emits `SlashSuggestions` (suggestion chips for
    unknown commands, slice #23) and `SlashSkillInvoked` (skill chip,
    slice #22) into both the live stream and the buffered history.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Render slash command outcomes as Carbon chips instead of inline text:

  * Slice #22 (`SlashSkillInvoked`): the synthesized load_skill / ToolMessage /
    Human pair from a recognized `/skill arg` invocation collapses into one
    compact chip showing the resolved skill name and any args. The chip is
    emitted both during the live turn and replayed from buffered history.
  * Slice #23 (`SlashSuggestions`): when the resolver returns ranked matches
    for an unknown command, render clickable correction chips. Clicking a
    chip writes the suggested command into the composer (reusing the
    composer-textarea utility lifted out of `SlashCommandDropdown`).
  * Chips are routed through Carbon's `renderUserDefinedResponse` —
    deliberately different from the shadow-DOM-decoration pattern slice #18
    uses for the composer dropdown; both patterns coexist intentionally
    (see `SlashChips.tsx` docstring).
  * Composer textarea fix: the shared `composerTextarea.ts` traversal now
    matches both `<textarea>` and `contenteditable[role="textbox"]`, since
    `@carbon/ai-chat@1.6.0` ships the latter. **This also fixes slice #18's
    autocomplete-dropdown write path, which had never been browser-tested
    and was silently a no-op against the new contenteditable composer.**

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Browser-driven verification for the chip rendering paths (slices #22 + #23).
Boots the production build in headless Chromium, stubs every backend
endpoint via `page.route`, and asserts both the live-turn render and the
history-reload replay.

Deliberately opt-in:

  * Playwright is **not** in `devDependencies` — the README walks through
    installing it on demand. The `test:e2e` script in `package.json` errors
    clearly when Playwright isn't installed.
  * The harness uses a **zero-dep Node static server** (`static-server.cjs`)
    instead of `serve` / `webpack-dev-server`, both of which pull in
    `path-to-regexp@6` which is broken under Node 24.
  * Route-precedence gotcha: most-recently-registered route wins, so the
    spec registers the broad `**/api/**` catch-all FIRST and specific stubs
    after it take priority. Documented in the spec for future hands.

Also adds `test-results/` and `playwright-report/` to `.gitignore` so the
Playwright bookkeeping output never lands in commits.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Whitespace/line-join cleanups in pre-existing slice files surfaced by
`uv run ruff format`. No behavior change.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Slice #21 enforces a skill's frontmatter ``allowed-tools`` whitelist at the
tool-approval gate, the seam closest to where the model actually invokes
tools. The whitelist applies only to the turn that immediately follows a
slash-skill dispatch (per-invoke RunnableConfig scope).

Semantics:

  * ``allowed-tools`` absent in SKILL.md frontmatter → no restriction
    (status quo; matches Claude Code).
  * ``allowed-tools: []`` (explicit empty list) → allow nothing, every bare
    function call (outside Python builtins) trips the gate.
  * ``allowed-tools: [a, b]`` → only ``a`` and ``b`` may be called freely;
    anything else trips the gate.

When the gate trips, the call drops into the **existing HITL approval
interrupt flow** so the user can still override the whitelist on a
case-by-case basis (PRD #13: per-tool approval flow always applies; the
whitelist is an upper bound, not a hard rejection).

Wiring:

  * ``SkillEntry.allowed_tools`` changes type to ``Optional[tuple[str, ...]]``
    so the loader can distinguish "key absent" (``None``) from "explicit empty"
    (``()``) — the parser previously collapsed both to ``()``.
  * Dispatcher propagates the whitelist onto ``DispatchResult.allowed_tools``.
  * Both callers (server ``AgentLoop`` + SDK ``invoke``) thread it through to
    ``RunnableConfig.configurable['skill_allowed_tools']``.
  * ``ToolApprovalHandler.check_and_create_approval_interrupt`` runs the
    whitelist check first. The slice #21 gate runs regardless of
    ``settings.policy.enabled``; the existing ToolApproval-policy check is
    still gated on it (moved the gate inside the handler so the whitelist
    isn't accidentally suppressed when the policy system is off).
  * Detection uses ``ast.walk`` for bare ``Call(func=Name)`` nodes — method
    calls (``response.get(...)``) and lambdas are correctly skipped. Python
    builtins are safelisted so skill authors only enumerate domain tools.

14 new unit tests (pure detection helper + handler wiring). Existing
slash + skill suite is green; no regressions against baseline.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Slice #15's backend already mints a fresh ``thread_id`` on ``/clear`` and
surfaces it through a ``ThreadIdChanged`` SSE event, but the frontend was
ignoring it — the next outbound request still carried the pre-clear
``X-Thread-ID``, so the "new conversation" effect didn't actually take
effect until a page reload.

The handler in ``customSendMessage`` now parses the event and calls a new
``setThreadId`` setter exported from ``CarbonChat``. The current turn keeps
its already-captured thread_id for any trailing stream events; the next
``customSendMessage`` call reads the rotated id via ``getOrCreateThreadId``.
The wrapped ``customSendMessage`` in ``CarbonChat`` re-reads
``currentThreadId`` after each turn and fires the existing
``onThreadChange`` prop accordingly, so embedders see the rotation.

``ThreadIdChanged`` is intentionally **not** buffered into the saved stream,
so history replay does not need a corresponding handler — re-applying a
past clear to the currently-viewed thread would be incorrect.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two related cleanups around the slice-#22/#23 e2e harness:

  * Add ``@playwright/test`` to ``frontend/package.json`` ``devDependencies``
    (was previously left out, with a README walking through ``pnpm add -D``).
    npm/pnpm have no exact equivalent of pip's ``[extras]`` opt-in groups,
    so the standard JS-ecosystem placement is ``devDependencies`` for the
    JS package + a deliberate ``pnpm exec playwright install chromium`` for
    the ~150MB browser binary (documented in ``tests/e2e/README.md``).

  * Remove the unused ``puppeteer``, ``jest-image-snapshot``, and
    ``@types/jest-image-snapshot`` from the workspace-root ``devDependencies``.
    Grep finds zero imports of any of these across the repo; no Jest config
    exists, no other ``.test.ts``/``.spec.ts`` files anywhere. They look
    like placeholders for a visual-regression suite that was never wired
    up. A future ``feat(test): visual regression`` PR can reintroduce them
    when an actual snapshot test lands.

Playwright structurally fits this spec better (layered ``page.route``
precedence, shadow-DOM-piercing ``getByRole`` locators against Carbon's
``contenteditable`` composer, locator auto-wait against streamed UI).
Puppeteer's lane is visual regression / headless automation outside
testing, not the kind of functional e2e the slash-chip spec needs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds the missing slice #15 verification: `/clear` rotates the server-side
thread_id, the frontend adopts it via the new `setThreadId` setter, and
the **next** outbound request carries the new `X-Thread-ID` header.

The Playwright route handler intercepts both `/stream` calls, captures the
`X-Thread-ID` request header on each, and asserts the second-call header
equals the rotated id from the first-call's `ThreadIdChanged` SSE event.
A regression where the SSE event is parsed but the setter is never called
would now fail this test.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…d e2e

Carbon AI Chat pumps every message body into a hidden ``aria-live="polite"``
announcement region for screen readers, so ``getByText("Conversation
cleared.")`` resolved to two elements and tripped Playwright's strict-mode
locator check. Targeting ``.last()`` reliably picks the visible chat bubble.

Also picks up the lockfile reconciliation from commit 2b19e8d (the
puppeteer/jest-image-snapshot removal) — ``pnpm install`` regenerated
``pnpm-lock.yaml`` to drop those subtrees. The lockfile delta is large
(~667 deletions) but is purely the consequence of the prior dep cleanup,
not a test-fix change.

All 4 e2e tests now green locally.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Slice #20's ``_emit_slash_telemetry`` constructs a best-effort Langfuse
span gated on ``advanced_features.langfuse_tracing``. A real Langfuse
backend isn't required to verify the contract — patching ``langfuse.get_client``
to a spy captures the exact ``start_observation`` call and lets us assert
on it deterministically.

5 tests cover:

  * The tracing-disabled gate — confirms the dispatcher does not even
    *import* the Langfuse client when ``langfuse_tracing`` is off, which
    prevents the noisy "client initialized without public_key" warning
    that pollutes production logs.
  * Span name + shape (``name="slash_command"``, ``as_type="span"``,
    input/output/metadata) for ``builtin`` dispatches.
  * Args propagation for ``skill`` dispatches (``input.args`` carries
    ``raw_args``, ``output.resolved_name`` carries the skill name).
  * Top-suggestion attribution for ``unknown`` dispatches (PRD #13
    story 34: mine telemetry for skills users wished existed).
  * Telemetry failures are swallowed — a Langfuse explosion must never
    break dispatch.

The full network/UI verification still requires a real Langfuse instance,
but that's now scoped to "does Langfuse render what we sent" rather than
"do we send the right thing".

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The chip CSS already carries ``@media (prefers-color-scheme: dark)`` rules
(``CarbonChat.css:703-755``) that override the light-mode background
(``#f6f8fa``) with a dark equivalent (``#21262d``). The risk this test
guards against is a future refactor that hardcodes a light value outside
the media query and silently regresses dark-mode appearance.

Uses Playwright's ``page.emulateMedia({ colorScheme: "dark" })`` to flip
the browser's color scheme before navigating, then asserts the chip
summary's computed ``background-color`` equals ``rgb(33, 38, 45)``
(``#21262d``). Exact-value assertion rather than "not light" — a partial
regression (dropped dark rule, mistyped selector) shows up as a meaningful
failure instead of a vague "looks wrong".

5/5 e2e tests green locally.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…across graph checkpoints

Slice #17 was the first thing in this codebase to put a tool_call-bearing
AIMessage and a matching ToolMessage into ``AgentState.chat_messages`` for
persistence. That surfaced a dormant Pydantic-v2 + LangGraph bug:

  1. ``chat_messages: Optional[List[BaseMessage]]`` declared the field as
     the parent class. Pydantic v2 serializes by the *declared* type, so
     ``model_dump()`` dropped every subclass-specific field —
     ``AIMessage.tool_calls`` and ``ToolMessage.tool_call_id`` vanished
     from the dict the checkpointer stored.
  2. On rehydration, Pydantic created plain ``BaseMessage`` instances
     (no discriminated union), so cuga_lite's reader at
     ``call_model:1908+`` hit the ``else`` branch and silently skipped
     each ToolMessage with "Skipping message ... with unknown type".

End-user impact: a slash-dispatched skill's ``load_skill`` stanza
vanished after the first turn. Asked "what was the last tool you
called?" on turn 2, the model would say "I have not called any tools
yet in this conversation" — the loaded skill body was gone.

Fix combines two changes on ``AgentState`` (and the mirrored
``CugaLiteState``):

  * Wrap the annotation with Pydantic v2's ``SerializeAsAny[BaseMessage]``
    so ``model_dump()`` serializes each message by its *runtime* subclass
    instead of the declared parent. ``tool_calls`` and ``tool_call_id``
    now survive the dump.
  * Add a ``mode="before"`` field validator (``_rehydrate_message_subclasses``)
    that walks the incoming list and reconstructs proper subclass
    instances from any dict-shaped or BaseMessage-shaped items, keyed on
    the ``type`` discriminator. Existing subclass instances pass through
    unchanged.

Verified live against the LiteLLM endpoint with a 2-turn flow:

  Turn 1: ``/kwargs-check`` → model emits ``KWARGS-OK`` (slice #17 happy path).
  Turn 2: "What was the last tool you called?" → model now correctly
          answers "The last tool I called was ``load_skill`` with the
          argument ``{\"name\":\"kwargs-check\"}``." No more "Skipping
          message" warnings.

7 new regression tests in ``tests/unit/test_agent_state_message_rehydration.py``
cover both write-side (model_dump preserves the subclass fields) and
read-side (round-trip restores proper subclasses) plus the helper's
edge cases (existing subclasses pass through, plain BaseMessage gets
re-promoted, None/empty handled).

Pre-existing ``test_skill_registry_load_skill_emits_install_steps_for_requirements``
still fails on both branches (unrelated to slice #17 — flagged separately).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- ``StreamEvent.format`` previously short-circuited to bare ``self.data``
  for non-``Answer`` events under ``OutputFormat.DEFAULT``, leaving slash
  events without an ``event:`` prefix and gluing the next event to the
  raw JSON.
- Always wrap with ``event: <name>\n`` + one ``data:`` line per logical
  line of the payload (preserving blank lines per the SSE spec) so the
  client can classify each event reliably.
- Add ``StreamEvent.parse`` as the round-trip inverse so tests and the
  history replay path share one shape.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ator

- The agent-loop generator already yields fully-formed
  ``event: <name>\ndata: <body>\n\n`` blocks; feeding the block back
  into a fresh ``StreamEvent(data=event).format(...)`` wrapped it a
  second time, leaking the inner wire format into the rendered chat
  bubble.
- Under ``OutputFormat.DEFAULT`` emit ``event`` verbatim; keep the
  wrap-and-prepare path only for ``OutputFormat.WXO``, whose
  ``format_event`` partitions on ``data:`` and so genuinely expects a
  pre-formatted block as input.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Replace the SerializeAsAny + manual validator pair with a Pydantic
  discriminated union over ``type`` so HumanMessage / AIMessage /
  ToolMessage rehydrate to their concrete classes automatically when
  state is loaded from the LangGraph checkpoint.
- Chunks are deliberately rejected: no production path persists chunks,
  and accepting them would obscure tool_call / tool_call_id loss across
  the checkpoint.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Drop the local ``SerializeAsAny[BaseMessage]`` field + manual
  ``_rehydrate_message_subclasses`` validator; reuse the typed
  ``ChatHistoryMessage`` alias from ``agent_state`` so CugaLite's
  ``chat_messages`` rehydrates to concrete subclasses through the
  shared discriminated union.
- Adjacent comment cleanup on the OpenAI tool_calls / ToolMessage
  mapping (no behaviour change).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- The slash-commands feature is no longer being built slice-by-slice;
  references to PRD #13 / slice #14 / slice #17 / slice #19 / slice #20
  / slice #21 / slice #22 made the modules read like a build log.
- Replace every reference with self-describing prose so the comments
  stand on their own without the slice index.
- Pure documentation; no behaviour change.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- ``findComposerInput`` now prefers the *visible* (non-zero-rect)
  composer when multiple candidates are reachable through the shadow
  DOM. Carbon AI Chat sometimes leaves the previous composer node
  attached for a beat after submit, and binding listeners to that
  orphan caused the slash autocomplete dropdown to never reopen.
- Expose ``isComposerStale`` (detached or zero-rect) so the dropdown's
  MutationObserver can decide whether to re-resolve its reference
  instead of waiting for an unmount that may never happen.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…r when args present

- After a slash command is submitted Carbon replaces the composer
  contenteditable host; the dropdown's MutationObserver now uses the
  shared ``isComposerStale`` check and re-resolves the reference when
  the live node changes identity, so the next ``/`` reopens the menu.
- ``handleKeyDown`` lets Enter fall through to Carbon's submit handler
  when the typed value already matches ``^/<name>\s+`` — typing
  ``/echo hello world`` followed by a single Enter now submits the args
  intact instead of having ``acceptCommand`` rewrite the composer to
  ``/echo `` and drop everything after.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- The ``⚡ /<name>`` skill chip used to render as a separate USER_DEFINED
  message bubble below the assistant's answer; that read as visual
  tool-call debug clutter.
- Remove the component, the ``SLASH_SKILL`` discriminator branch in
  ``renderCugaUserDefinedResponse``, and the related types. The skill
  invocation now surfaces as a "Skill invoked: /<name>" step inside
  the assistant message's reasoning panel — see the follow-up
  ``customSendMessage`` and ``customLoadHistory`` commits.
- ``SlashSuggestionsChip`` is unchanged: unknown-command suggestions
  stay as their own bubble because they are interactive.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ti-line SSE parsing

- ``SlashSkillInvoked`` no longer calls ``addMessage`` to mount a
  separate chip bubble; the handler now pushes a
  "Skill invoked: /<name>" step onto ``collectedSteps`` and fires an
  ``addMessageChunk`` partial update on the active streaming shell,
  mirroring the ``CodeAgent_Reasoning`` pattern.
- ``parseCugaStream`` accumulates every ``data:`` line in an event
  block into ``dataLines`` before joining with ``\n``; the previous
  overwrite-per-line silently truncated multi-line bodies to their
  last line, so markdown with blank-line separators arrived clipped.
- ``Answer`` finalisation suppresses the redundant plain-text
  "Unknown command…" fallback when a ``SlashSuggestions`` chip has
  already rendered.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Buffered ``SlashSkillInvoked`` events used to rehydrate as a separate
  USER_DEFINED chip; rebuild them as steps on the next assembled
  assistant message instead, so reloaded threads match the live UX.
- Buffered ``SlashSuggestions`` still rehydrate as their own bubble
  (the chips are clickable; they don't belong in a reasoning panel).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Carbon AI Chat exposes its toggle text via Lit element properties
  (``openLabelText`` / ``closedLabelText``) and matching attributes;
  the ``strings`` prop's i18n dispatch runs after Carbon's React →
  Lit handoff and so does not retro-patch already-mounted toggles.
- Extend the existing MutationObserver (already in place for the
  attachment decorations) to walk ``cds-aichat-reasoning-steps-toggle``
  and ``cds-custom-*`` variants, set both the JS properties and the
  reflected attributes, and reapply on every Carbon rerender so live
  and replayed messages both pick up the new wording.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- CUGA has no app-wide dark mode today, so the
  ``prefers-color-scheme: dark`` overrides on
  ``.cuga-slash-dropdown`` / ``.cuga-slash-option`` /
  ``.cuga-slash-suggestion`` / ``.cuga-slash-skill-chip`` produced
  dark chips inside an otherwise light app.
- Strip the rule bodies; leave a single comment block recording the
  selectors so the styles can be reintroduced verbatim when an
  app-wide dark mode lands.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Same cleanup as the source-side ``chore(slash)`` commit: remove
  every ``slice #N`` / ``PRD #13`` reference from docstrings and
  inline comments so the tests read as self-describing checks of the
  behaviour they actually exercise.
- No behaviour change, no assertion change.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- New focused unit suite for ``StreamEvent.format`` that pins the
  envelope shape ``event: <name>\n`` + one ``data:`` line per logical
  line of the payload + the blank-line terminator.
- Specifically guards against the regression where non-``Answer``
  events under ``OutputFormat.DEFAULT`` returned bare ``self.data``
  (no ``event:`` prefix), which on the wire glued slash events to
  whatever came next.
- Round-trips through ``StreamEvent.parse`` to lock in symmetry.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
illeatmyhat and others added 18 commits May 15, 2026 23:53
- The branch is gated on slash_registry.has_skill(parsed.name), and
  has_skill only returns True when the same name is in the registry's
  _by_name dict that entry() reads from. The lookup cannot miss.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- slash_result.suggestions is a dataclass field with default_factory
  =list; the getattr-with-default + 'or []' fallback can never fire.
- The 'and slash_result.injected_messages' clause inside the skill
  branch is implied by kind=="skill": synthesize_skill_invocation
  always returns at least three messages, so injected is invariant
  non-empty when kind=="skill".
- StreamEvent.parse only raises ValueError; the IndexError half of
  except (ValueError, IndexError) was relic of the previous hand-
  rolled string-split parser and is now unreachable.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- parse_markdown_with_frontmatter only raises ValueError (yaml.YAMLError
  is wrapped to ValueError inside the parser) and may bubble OSError
  from the underlying open() call. Catching Exception was wide enough
  to swallow real programming errors; narrow to the two reachable
  classes so unexpected exceptions surface as actual crashes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- getBoundingClientRect is a non-optional method on Element (and has
  been since IE5); the ?.() optional chain and the rect-is-null guard
  can never fire. The (el as HTMLElement) cast was also unnecessary
  since the method lives on Element itself.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Event.composedPath is non-optional on Event in lib.dom.d.ts and has
  been universally implemented since 2018. The 'typeof === "function"'
  fallback to [] was guarding a code path no browser this app targets
  can hit.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- SlashSuggestionsChipData.suggestions is typed SlashSuggestion[]
  (non-nullable), and both producers always construct it as an array
  via Array.isArray(...) ? ... : []. The !data.suggestions half of
  the guard can never fire; the length===0 early-return stays.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Output of bash build.sh capturing the SlashCommandDropdown,
  SlashChips, and composerTextarea theater removals so the deployed
  bundle drops the unreachable optional-chain / function-type-check
  / null-fallback branches too.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Replace 17-line "Design" essay with a 3-line summary that names the
  actual constraint (EmbeddingStoreBackend protocol doesn't expose
  embeddings on list).
- Drop the four inline `# 1./2./3./4.` breadcrumbs inside resolve();
  the docstring already enumerates the steps.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Replace the multi-line PRD-quoting cache rationale with a single-line
  comment that conveys the same information.
- Drop the trailing "# only the latest registry snapshot is useful"
  narration after the .clear() call.
- Drop the $ARGUMENTS-substitution restatement comment in the skill
  branch (load_skill(name, args) is self-documenting).
- Drop the cross-reference comment in the skill-load except branch
  (the sibling branch already documents this idiom).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The "deep, pure module ... per PRD #13's testing-decisions section"
paragraph was self-referential ticket metadata, not documentation that
helps a reader of the module.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replace the multi-paragraph "Why AST instead of regex" comparison with
one sentence stating what the AST walk actually matches.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
logger.exception in an except branch is a standard idiom — the
preceding three-line rationale was narration.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…docstring

- Collapse two adjacent multi-line comments narrating the
  WXO/DEFAULT branching (one about format_event partitioning, one
  about the double-envelope bug) into a single three-line comment
  immediately before the branch.
- Reduce the get_commands docstring to a single sentence; drop the
  kind-discriminator restatement that the response shape already
  expresses.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…comment

The "load_skill message quad" sentence pre-explained an injection
step that lives below; the kind-routing comment above stays.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Drop the "(it's the tighter, skill-scoped contract)" parenthetical
  from the precedence docstring sentence.
- Collapse the two adjacent comments about unconditional whitelist
  and gated policy into one line.
- Tighten the whitelist-enforcement except comment to one line.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replace the three-line narration about model-initiated calls with one
line stating when substitution runs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Collapse composer-discovery + shadow-boundary multi-line comments
  into single-line explanations.
- Drop trivially-named-code comments ("Filter case-insensitively by
  name prefix.", "Clamp highlight index when filtered list shrinks.").
- Drop the ARIA-effect "Safety:" paragraph that preemptively explained
  the observer scope; keep the APG-pattern rationale.
- Tighten the cleanup comment in the ARIA effect to one line.
- Tighten interval-bootstrap comment to one line.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- setComposerAriaAttributes: drop "setAttribute on detached nodes is
  safe..." sentence (callers need not guard).
- clearComposerAriaAttributes: drop "removeAttribute on detached node
  is a no-op" sentence (same theme).

Both restate browser behaviour rather than the helper's contract.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@illeatmyhat illeatmyhat force-pushed the slash-commands-backend branch from b243abe to 842e092 Compare May 16, 2026 07:22
Two-pass audit removing comments that lie about current behaviour,
restate what the code says, or carry developer-diary narration from
the original implementation. ~30 sites across 15 files; no source
behaviour change.

- server/main.py: two comment fixes. (a) A comment claimed
  SlashSkillInvoked renders as a "single collapsed chip" — the chip
  was removed earlier in this PR and the invocation now lives in
  the assistant message's reasoning panel; rewrote to describe
  current behaviour. (b) Removed the embedding-prefetch comment
  that called out a "~120MB FastEmbed download" cost — the function
  actually dispatches to OpenAI or local based on
  settings.storage.embedding.provider, and the prefetch intent is
  already obvious from the function name being called eagerly at
  startup.
- SlashCommandDropdown.tsx: the role-restoration mechanism was
  explained three times in one useEffect; collapsed to one
  canonical spot near the originalRoleRef.
- customLoadHistory.ts / customSendMessage.ts: the
  suppress-next-Answer rationale appeared three times; kept the
  canonical explanation on the flag definition.
- dispatcher.py / command_resolver.py / clear.py /
  arg_substitution.py: removed Design essays, PRD references, and
  what-comments above trivially-named operations.
- allowed_tools_enforcement.py / tool_approval_handler.py /
  loader.py / sdk.py: dropped paragraphs that restate function
  docstrings or paraphrase the very next line.

Verified: ruff clean on touched files; pytest 112 slash unit tests
pass; no frontend rebuild needed (comments don't affect bundles).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@illeatmyhat illeatmyhat force-pushed the slash-commands-backend branch from 842e092 to 273fda7 Compare May 16, 2026 07:34
illeatmyhat and others added 5 commits May 16, 2026 00:52
- Drop open/exec/eval/compile/__import__/breakpoint/globals/vars/locals
  from PYTHON_BUILTINS_SAFELIST via a new RISKY_BUILTINS deny-set. These
  bypassed the spirit of any file-IO-gating ``allowed-tools`` whitelist;
  unwhitelisted calls to them now route through HITL approval like any
  other gated call.
- Reverse the safelist test that asserted ``open`` was auto-allowed, and
  add coverage that bare ``open(...)`` is flagged when not whitelisted.
- Warn when a skill dispatches but ``local_state is None`` — the
  load_skill quad cannot be injected and the planner won't see the body.
  Previously silent.
- Document the resume-path trust boundary: HITL approval is the gate,
  re-enforcing the whitelist mid-turn would not add defense and would
  require persisting the tuple into LangGraph state.
- Clarify DispatchResult.allowed_tools as skill-only; for other kinds
  the field stays ``None``.
- Note that later skill-search roots intentionally override earlier ones
  in ``discover_skills`` so the override semantics aren't mistaken for a
  bug on later reads.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Change ``_resolver_cache`` from a misleading ``Dict[int, CommandResolver]``
  to an explicit ``Optional[Tuple[int, CommandResolver]]``. The previous
  shape called ``.clear()`` before every insert, so it was already a size-1
  cache pretending to be a many-entry one. The tuple makes the contract
  visible. Lock + under-lock re-check preserved.
- Embed command names and descriptions as two separate vectors and rank by
  ``max(cos(name), cos(desc))``. Concatenating ``"<name>: <description>"``
  into a single embedding biased ranking toward commands with verbose
  descriptions that semantically overlap with the typo. The max-of-two rule
  lets either half drive ranking on its own merits.
- Commands with an empty description skip the second vector and rank purely
  on the name embedding.
- Update the resolver tests: split the canned vectors so name and
  description halves embed independently, add coverage for
  description-driven ranking and the empty-description path.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Escape backticks when rendering ``SlashSkillInvoked`` content in the live
  stream. A literal backtick in raw_input / resolved_name / raw_args would
  close the inline code span and break the markdown. The replay path in
  customLoadHistory.ts already does this; live and replay now render
  identically for any user input.
- Cap the SlashCommandDropdown bootstrap interval at 30 attempts (~15s).
  In embedded hosts, shadow-rooted contexts, or content-blocked pages the
  textarea never resolves and the interval polled forever. After the
  budget we log a one-time warning and stop, so the page no longer holds
  a perpetual setInterval handle.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Rebuilds ``src/cuga/frontend/dist/`` so the deployed bundle picks up the
SlashCommandDropdown poll bound and the backtick escaping in the live
SlashSkillInvoked reasoning step. Source changes landed in the preceding
commit on the same branch.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Third comment-audit pass on the review-fix range (4eb418f..a0c5787).
Five sites flagged; all reduce duplication of explanations that
already live elsewhere in the same file.

- command_resolver.py: the ``max(cos(name), cos(desc))`` rationale
  was spelled out in the _index field comment, the index() docstring,
  AND the resolve() docstring step 3. Kept the field comment as the
  canonical home; tightened the two docstrings.
- dispatcher.py: dropped the second sentence of the cache comment
  that narrated the old Dict-and-clear shape we just replaced —
  developer-diary noise.
- test_slash_allowed_tools_enforcement.py: tightened two test
  comments that paraphrased the production rationale (test names
  already describe the contract).
- test_slash_command_resolver.py: dropped a preamble that restated
  the test function name.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@illeatmyhat

Copy link
Copy Markdown
Collaborator Author

Superseded by #235 — same head (8105ffe), re-opened directly on cuga-project/cuga-agent so CI gets org secrets (fork-PR runs unconditionally lack base-repo secret access). Will close this once #235's CI is verified green.

illeatmyhat added a commit that referenced this pull request May 17, 2026
First-class slash-command support: three built-ins (/help, /clear, /skills),
invocation of repo-defined skills (e.g. /echo), an autocomplete dropdown,
embedding-based suggestions for typos, and an allowed-tools whitelist
enforced at the HITL approval gate. Skill invocation surfaces as a "Skill
invoked: /<name>" step inside the assistant message's reasoning panel
(toggle relabelled "Show details") instead of a separate chip bubble.

Backend:
- Slash dispatcher, embedding-backed command resolver, allowed-tools gate
  wired into the HITL approval flow, semantic suggestions, GET /api/commands.
- ChatHistoryMessage discriminated-union rehydration so ToolMessage /
  AIMessage.tool_calls round-trip correctly across SSE.
- Per-skill allowed_tools enforced at approval; resume-path inherits the
  same trust boundary.

Frontend:
- Autocomplete dropdown + clickable suggestion chips integrated with
  Carbon's composer textarea (ARIA combobox wiring, thread-rotation
  support, dark mode).
- /clear UX: reset thread state without page reload.
- Reasoning-panel-as-history: slash invocations replay as steps, no
  separate SlashSkillChip element.

Tests:
- 116 unit tests covering resolver, args substitution, allowed-tools
  enforcement, message synthesis.
- Playwright e2e: dropdown, help, skills, echo, suggestions, /clear.

Closes #N/A. Supersedes the fork PR (#230). Re-hosted on the org repo
so CI receives org secrets.
illeatmyhat added a commit that referenced this pull request May 20, 2026
First-class slash-command support: three built-ins (/help, /clear, /skills),
invocation of repo-defined skills (e.g. /echo), an autocomplete dropdown,
embedding-based suggestions for typos, and an allowed-tools whitelist
enforced at the HITL approval gate. Skill invocation surfaces as a "Skill
invoked: /<name>" step inside the assistant message's reasoning panel
(toggle relabelled "Show details") instead of a separate chip bubble.

Backend:
- Slash dispatcher, embedding-backed command resolver, allowed-tools gate
  wired into the HITL approval flow, semantic suggestions, GET /api/commands.
- ChatHistoryMessage discriminated-union rehydration so ToolMessage /
  AIMessage.tool_calls round-trip correctly across SSE.
- Per-skill allowed_tools enforced at approval; resume-path inherits the
  same trust boundary.

Frontend:
- Autocomplete dropdown + clickable suggestion chips integrated with
  Carbon's composer textarea (ARIA combobox wiring, thread-rotation
  support, dark mode).
- /clear UX: reset thread state without page reload.
- Reasoning-panel-as-history: slash invocations replay as steps, no
  separate SlashSkillChip element.

Tests:
- 116 unit tests covering resolver, args substitution, allowed-tools
  enforcement, message synthesis.
- Playwright e2e: dropdown, help, skills, echo, suggestions, /clear.

Closes #N/A. Supersedes the fork PR (#230). Re-hosted on the org repo
so CI receives org secrets.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant