17 commits
0f9001e
docs: add responder (interactive skills) design spec #303
adamdougal May 29, 2026
e940402
docs: add responder (interactive skills) implementation plan #303
adamdougal May 29, 2026
25964ea
feat: add inputs.responder config model #303
adamdougal May 29, 2026
de29bda
feat: validate inputs.responder fields and mutual exclusivity #303
adamdougal May 29, 2026
c32374c
feat: add responder decision types and tools #303
adamdougal May 29, 2026
6732d92
feat: add responder Classifier with persistent session #303
adamdougal May 29, 2026
5740353
feat: add ResponderInfo to RunResult #303
adamdougal May 29, 2026
0670c60
refactor: add injectable responder classifier factory to runner #303
adamdougal May 29, 2026
8c7bedc
feat: drive interactive skills via responder loop #303
adamdougal May 29, 2026
8571614
fix: use persistent responder session with explicit teardown #303
adamdougal May 29, 2026
4cf3968
feat: add inputs.responder to task JSON schema #303
adamdougal May 29, 2026
1355263
docs: document inputs.responder for interactive skills #303
adamdougal May 29, 2026
5b5b0c8
feat: surface responder outcome in dashboard #303
adamdougal May 29, 2026
60e3645
fix: rebuild dashboard bundle and fix lint misspelling #303
adamdougal May 29, 2026
11a66e5
fix: free responder session usage collector on DeleteSession #303
adamdougal May 29, 2026
59c3d51
chore: remove implementation plan
adamdougal May 29, 2026
f539f0b
fix(responder): surface tool errors and drop dead cap-exhausted flag
Copilot Jun 1, 2026
28 changes: 28 additions & 0 deletions README.md
Expand Up @@ -1193,6 +1193,34 @@ Waza automatically removes each worktree on engine shutdown (`git worktree remov

**Out of scope today** (tracked separately): HTTPS / SSH clone strategies, submodules, Git LFS, and auto-detecting "the repo this test is running in" without an explicit `source`.

## Responder (interactive skills)

For skills that ask follow-up questions, configure a `responder` — an LLM that plays the user and answers the skill's questions. It is mutually exclusive with `follow_up_prompts`.

```yaml
# task.yaml
inputs:
  prompt: "Add a new agent to my application"
  responder:
    model: gpt-4o          # optional; defaults to config.model
    instructions: |
      The agent you want is "research-agent" with system instructions
      "Search the web and summarise findings", tools web_search + url_fetch,
      and no handoffs. Answer the skill's questions consistently with this.
      If you genuinely can't infer an answer, abstain.
    max_followups: 8
```

After each agent turn the responder either **replies** (the answer is sent back, continuing the conversation), **stops** (the agent is done), or **abstains** — which fails the run with a distinct `abstained` outcome, signalling the brief is too vague. If `max_followups` is reached while the agent is still asking questions, the loop stops with outcome `cap_exhausted` and graders evaluate the final state. Each task carries its own responder, so the same skill can be tested against several target configurations.

**Fields** (under `inputs.responder`):

| Field | Required | Description |
|---|---|---|
| `instructions` | yes | The target configuration the responder represents and the rule for abstaining. |
| `max_followups` | yes | Maximum number of responder replies before the loop stops (`>= 1`). |
| `model` | no | Model used for the responder LLM. Defaults to the eval-level `config.model`. |

## CI/CD Integration

Waza is designed to work seamlessly with CI/CD pipelines.
345 changes: 345 additions & 0 deletions docs/plans/2026-05-29-responder-interactive-skills.md
@@ -0,0 +1,345 @@
# Design: Responder — driving interactive skills via a surrogate LLM

- **Issue:** [#303](https://github.com/microsoft/waza/issues/303)
- **Status:** Approved design, ready for implementation planning
- **Date:** 2026-05-29

## Problem

A growing category of Agent Skills is inherently multi-turn: when the agent
doesn't have everything it needs from the initial prompt, it pauses to ask the
user follow-up questions before completing the task.

**Concrete example — `configure-agent`.** Invoked with "Add a new agent to my
application", the skill must gather a name, system instructions, the set of
tools, and any handoffs before it can generate the agent definition. None of
these can be inferred from the initial prompt, and the structured Q&A *is* part
of the skill's value.

Today, evaluating such a skill in waza forces a bad trade-off: either pre-bake
every answer into the initial prompt (collapsing the skill into a degenerate
one-shot version that no longer tests what ships) or evaluate manually. The
existing static `follow_up_prompts` mechanism only works when the questions —
and their order — are known in advance, which defeats the purpose of testing the
skill's adaptive questioning.

## Goal

Add a **responder**: an LLM-backed surrogate user, configured per task, that
answers the skill's follow-up questions consistently with a described target
configuration. After each agent turn produces chat text, the responder
classifies it into one of three outcomes:

- **Reply** → send the responder's answer back as a new user prompt, decrement a
follow-up budget, and continue.
- **Stop** → the agent is done (no further questions); exit the loop cleanly.
- **Abstain** → the responder explicitly could not answer. Abort the run with a
distinct failure classification, signalling that the brief/skill is too vague
— *not* a transient model timeout or network blip.

## Scope

### In scope

- Per-task `responder` config under `inputs` (sibling to `follow_up_prompts`).
- A responder component that maintains a persistent surrogate-user session and
classifies each agent message via structured tool-calling.
- Runner integration: a responder-driven follow-up loop reusing the existing
agent session and workspace.
- Distinct result tagging for abstain and cap-exhaustion, surfaced in logs,
results JSON, and the dashboard.
- Schema, validation, tests, and documentation.

### Out of scope (possible follow-ups)

- Eval-level responder defaults shared across tasks (each task is self-contained
for now; if many tasks share one target config, the block repeats).
- Per-field override/merge semantics between eval-level and task-level config.

## User-facing surface

The responder is configured per task under `inputs`, alongside the existing
`follow_up_prompts` field. The two are mutually exclusive.

```yaml
inputs:
  prompt: "Add a new agent to my application"
  responder:
    model: gpt-4o          # optional; defaults to config.model
    instructions: |
      You are configuring a new agent inside an agentic application.
      The agent you want to create has:
        - name: research-agent
        - system instructions: "Search the web and summarise findings on the
          topic the user provides."
        - tools: web_search, url_fetch
        - handoffs: none
      Answer the skill's questions consistently with this configuration,
      regardless of the order in which the skill asks for each piece.
      If you genuinely can't infer an answer from the above, abstain.
    max_followups: 8
```

Because the responder lives per task, the same skill can be exercised against
several target configurations (a research agent, a customer-support agent, a
triage agent with handoffs) by giving each task its own `responder.instructions`
— this is exactly the robustness testing the issue calls for, achieved without
any eval-level override machinery.

### Configuration fields

| Field | Required | Default | Notes |
|----------------|----------|------------------|-----------------------------------------|
| `model` | no | `config.model` | Model used for the responder LLM. |
| `instructions` | yes | — | Describes the target config + abstain rule. |
| `max_followups`| yes | — | Must be `>= 1`. Caps responder replies. |

## Architecture

The design reuses two patterns already proven in the codebase:

1. **LLM-backed classification via structured tool-calling**, as used by the
prompt grader (`internal/graders/prompt_grader.go`), against the narrow
`Executor` interface (`Execute(ctx, *ExecutionRequest)`).
2. **Multi-turn agent follow-ups via session + workspace reuse**, as used by the
existing static follow-up loop (`executeFollowUps` in
`internal/orchestration/runner.go`), which resumes the agent session by
passing `SessionID` and `WorkspaceDir` on each `Execute`.

The responder owns classification; the runner owns the loop and all agent
follow-up plumbing (per-turn timeout, event/usage/tool-call merging).

```mermaid
sequenceDiagram
    participant R as Runner (executeResponderLoop)
    participant A as Agent session (engine)
    participant C as Responder Classifier
    participant S as Responder session (engine)

    R->>A: initial prompt (Execute)
    A-->>R: agent chat text (FinalOutput)
    loop while budget > 0
        R->>C: Classify(agentMessage)
        C->>S: agent question (Execute, persistent session + decision tools)
        S-->>C: tool call: respond / stop / abstain
        C-->>R: Decision
        alt reply
            R->>A: follow-up = answer (Execute, reuse SessionID + WorkspaceDir)
            A-->>R: agent chat text
            Note over R: budget--
        else stop
            Note over R: outcome = stopped; break
        else abstain
            Note over R: outcome = abstained (StatusError); break
        end
    end
    Note over R: budget exhausted while still replying → outcome = cap_exhausted
    R->>R: run graders against final state
```

### Component 1: Config model (`internal/models`)

A new `ResponderConfig` carried on `TaskStimulus` (the `inputs` block):

```go
type ResponderConfig struct {
	Model        string `yaml:"model,omitempty" json:"model,omitempty"`
	Instructions string `yaml:"instructions" json:"instructions"`
	MaxFollowups int    `yaml:"max_followups" json:"max_followups"`
}

// TaskStimulus gains:
//   Responder *ResponderConfig `yaml:"responder,omitempty" json:"responder,omitempty"`
```

**Validation** (in `TestCase.Validate`, surfaced by `LoadTestCase`):

- If `Responder != nil`:
  - `Instructions` must be non-empty.
  - `MaxFollowups >= 1`.
  - `FollowUps` must be empty (mutual exclusivity; clear error message naming
    both fields).

### Component 2: Responder package (`internal/responder`)

```go
type DecisionKind int
const (
	DecisionReply DecisionKind = iota
	DecisionStop
	DecisionAbstain
)

type Decision struct {
	Kind   DecisionKind
	Answer string // set when Kind == DecisionReply
	Reason string // set when Kind == DecisionAbstain
}

// Executor is the narrow execution surface the responder needs (same shape as
// graders.Executor), enabling unit tests with a fake executor.
type Executor interface {
	Execute(ctx context.Context, req *execution.ExecutionRequest) (*execution.ExecutionResponse, error)
}

type Classifier struct {
	exec         Executor
	model        string
	instructions string
	sessionID    string // empty until the first Classify creates the session
}

func New(exec Executor, cfg models.ResponderConfig, defaultModel string) *Classifier
func (c *Classifier) Classify(ctx context.Context, agentMessage string) (Decision, error)
```

`Classify` behaviour:

- **Persistent session.** The first call creates the responder session (no
resume `SessionID`); the returned `SessionID` is stored and passed on every
subsequent call so the responder accumulates the back-and-forth like a real
user. The session is owned by the engine and cleaned up at `Shutdown`.
- **First message** carries the responder `instructions` as a preamble plus the
agent's first question and a directive to answer by calling exactly one
decision tool. **Later messages** carry only the agent's latest question
(instructions persist in session context).
- **Structured output** via three tools whose handlers capture the decision:
  - `respond(answer: string)` → `DecisionReply`
  - `stop()` → `DecisionStop`
  - `abstain(reason: string)` → `DecisionAbstain`
- Request uses `NoSkills: true`, `MessageMode: MessageModeEnqueue`,
`Streaming: true`. The responder session does **not** use the agent's
workspace.
- If no decision tool is called (responder malfunction), `Classify` returns an
error — distinct from abstain.

### Component 3: Runner loop (`internal/orchestration/runner.go`)

In `executeRun`, after the initial `Execute`:

- If `tc.Stimulus.Responder != nil` → `executeResponderLoop`.
- Else if `len(tc.Stimulus.FollowUps) > 0` → existing `executeFollowUps`.

`executeResponderLoop` mirrors `executeFollowUps` plumbing (build request via
`buildExecutionRequest`, set `Message`/`SessionID`/`WorkspaceDir`, apply per-turn
timeout, merge `Events`/`ToolCalls`/`SkillInvocations`/`DurationMs`/`FinalOutput`/
`WorkspaceFiles`/`Usage` into `resp`). Pseudocode:

```
classifier := responder.New(r.engine, *tc.Stimulus.Responder, r.spec.Config.ModelID)
left := tc.Stimulus.Responder.MaxFollowups
sent := 0
outcome := "completed"
for left > 0 {
    decision, err := classifier.Classify(ctx, resp.FinalOutput)
    if err != nil {
        resp.ErrorMsg = "responder error: " + err.Error()
        outcome = "error"
        break
    }
    switch decision.Kind {
    case Reply:
        // send agent follow-up using decision.Answer (reuse SessionID + WorkspaceDir)
        // merge follow-up response into resp; on error set resp.ErrorMsg + goto done
        sent++; left--
        log: responder replied (turn sent, budget left)
    case Stop:
        outcome = "stopped"; goto done
    case Abstain:
        resp.ErrorMsg = "responder abstained: " + decision.Reason
        outcome = "abstained"; goto done
    }
}
// left is only decremented after a reply, so left == 0 here means the
// budget ran out while the responder was still answering questions.
if left == 0 {
    outcome = "cap_exhausted"
    log warning: responder budget exhausted while agent still asking
}
done:
attach ResponderInfo{FollowupsSent: sent, Outcome: outcome, Reason: ...} to the run
```
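
For readers who want something they can run, here is a compact Go rendering of the same control flow with the engine replaced by two injected functions. Everything outside the decision and result types is an assumption: `classifyFunc`, `followUpFunc`, and `runLoop` are hypothetical names, and an agent follow-up failure is returned as a separate error rather than folded into the responder outcome, since the pseudocode does not assign one.

```go
package main

import "fmt"

type DecisionKind int

const (
	DecisionReply DecisionKind = iota
	DecisionStop
	DecisionAbstain
)

type Decision struct {
	Kind   DecisionKind
	Answer string
	Reason string
}

type ResponderInfo struct {
	FollowupsSent int
	Outcome       string
	Reason        string
}

// Hypothetical seams standing in for Classifier.Classify and the agent
// follow-up Execute call.
type classifyFunc func(agentMessage string) (Decision, error)
type followUpFunc func(answer string) (nextAgentMessage string, err error)

// runLoop assumes budget >= 1, which load-time validation guarantees.
func runLoop(first string, budget int, classify classifyFunc, followUp followUpFunc) (ResponderInfo, error) {
	info := ResponderInfo{Outcome: "completed"}
	msg := first
	for left := budget; left > 0; left-- {
		d, err := classify(msg)
		if err != nil {
			info.Outcome, info.Reason = "error", err.Error()
			return info, nil
		}
		switch d.Kind {
		case DecisionStop:
			info.Outcome = "stopped"
			return info, nil
		case DecisionAbstain:
			info.Outcome, info.Reason = "abstained", d.Reason
			return info, nil
		}
		// Reply: forward the answer to the agent and read its next message.
		next, err := followUp(d.Answer)
		if err != nil {
			return info, err // agent failure, not a responder outcome
		}
		msg = next
		info.FollowupsSent++
	}
	// The budget ran out while the responder was still replying.
	info.Outcome = "cap_exhausted"
	return info, nil
}

func main() {
	script := []Decision{
		{Kind: DecisionReply, Answer: "research-agent"},
		{Kind: DecisionReply, Answer: "web_search, url_fetch"},
		{Kind: DecisionStop},
	}
	turn := 0
	classify := func(string) (Decision, error) {
		d := script[turn]
		turn++
		return d, nil
	}
	followUp := func(answer string) (string, error) {
		return "Noted: " + answer + ". Next question?", nil
	}
	info, err := runLoop("What should the agent be called?", 8, classify, followUp)
	fmt.Println(info.Outcome, info.FollowupsSent, err)
}
```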

Verbose mode emits per-turn progress events (reusing the existing
`EventAgentPrompt` / `EventAgentResponse` style) so `-v` runs show the
responder's answers and the agent's replies.

### Component 4: Results & reporting (`internal/models/outcome.go`)

```go
type ResponderInfo struct {
FollowupsSent int `json:"followups_sent"`
Outcome string `json:"outcome"` // completed|stopped|abstained|cap_exhausted|error
Reason string `json:"reason,omitempty"`
}

// RunResult gains:
// Responder *ResponderInfo `json:"responder,omitempty"`
```

Status mapping:

| Responder outcome | `RunResult.Status` | `ErrorMsg` | Notes |
|-------------------|--------------------|------------------------------------|-------|
| `completed` | unchanged (graded) | — | Agent finished; graders decide pass/fail. |
| `stopped` | unchanged (graded) | — | Responder signalled done. |
| `abstained` | `StatusError` | `responder abstained: <reason>` | Distinct, filterable; separate from timeouts/network errors. |
| `cap_exhausted` | unchanged (graded) | — | Logged + surfaced; graders judge the end state. |
| `error` | `StatusError` | `responder error: <msg>` | Responder malfunction (no decision / session failure). |

Because abstain reuses `StatusError` but is tagged via `Responder.Outcome`,
reports and the dashboard can distinguish a vague-brief abstain from a genuine
error. The dashboard (`web/`) surfaces `responder.outcome` (and reason) so
abstain and cap-exhaustion are visible per run.
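
The status mapping in the table can be sketched as one function. The `Status` type, the string value of `StatusError`, and the names `applyOutcome` and `isVagueBrief` are assumptions made for illustration; only the outcome strings and message prefixes come from the table.

```go
package main

import "fmt"

// Status is a hypothetical stand-in for the run status type.
type Status string

const (
	StatusPassed Status = "passed" // assumed value, for the example only
	StatusError  Status = "error"  // assumed value
)

// applyOutcome returns the final status and error message for a run, given
// the status the graders produced and the responder outcome.
func applyOutcome(graded Status, outcome, reason string) (Status, string) {
	switch outcome {
	case "abstained":
		return StatusError, "responder abstained: " + reason
	case "error":
		return StatusError, "responder error: " + reason
	default: // completed, stopped, cap_exhausted: graders decide
		return graded, ""
	}
}

// isVagueBrief shows how a report could separate abstains from other errors.
func isVagueBrief(status Status, outcome string) bool {
	return status == StatusError && outcome == "abstained"
}

func main() {
	status, msg := applyOutcome(StatusPassed, "abstained", "handoffs not specified")
	fmt.Println(status, msg, isVagueBrief(status, "abstained"))

	status, msg = applyOutcome(StatusPassed, "cap_exhausted", "")
	fmt.Println(status, msg == "", isVagueBrief(status, "cap_exhausted"))
}
```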

## Error handling & edge cases

- **No decision tool called** → `Classify` error → run `error` outcome
(`StatusError`), distinct from abstain.
- **Responder session creation/Execute failure** → propagated as run `error`.
- **Agent follow-up Execute failure** → mirrors `executeFollowUps`: set
`resp.ErrorMsg`, stop the loop.
- **`max_followups` exhausted while agent still asking** → `cap_exhausted`; loop
stops, run proceeds to grading, warning logged.
- **Mutual exclusivity** of `responder` and `follow_up_prompts` enforced at load
time with a clear error.
- **Context cancellation / task timeout** honoured on every responder and agent
turn via the existing per-turn timeout pattern.

## Testing

- **`internal/responder`** — fake `Executor` invoking decision-tool handlers:
reply / stop / abstain / no-decision-error; persistent-session resumption
(second `Classify` passes the stored `SessionID`); first-vs-later message
shape (instructions preamble only on first call); model defaulting.
- **`internal/orchestration`** — mock engine + injectable classifier (or fake
executor): reply → agent follow-up sent with reused session/workspace; stop;
abstain → `StatusError` + `Responder.Outcome == "abstained"`; cap exhaustion →
graded + `Responder.Outcome == "cap_exhausted"`; mutual-exclusivity rejection.
- **`internal/models`** — validation: missing instructions, `max_followups < 1`,
both `responder` and `follow_up_prompts` set.
- **Schema** — `internal/validation` and `internal/projectconfig` parity tests
for the new `responder` field.
- All existing tests remain green; `go test ./...` and `golangci-lint run` pass.

## Documentation

Per `AGENTS.md`:

- `README.md` — responder section + YAML example in the eval/inputs docs.
- `site/src/content/docs/` — eval-YAML reference entry for `inputs.responder`
and a short guide on testing interactive skills; build with `npm run build`.
- Schema files kept in sync.
- Dashboard (`web/`) — surface `responder.outcome`/`reason`; regenerate
screenshots if UI changes.
- Reference issue #303 in commits; update tracking issue #66 if applicable.

## Rationale

- **Per-task placement** mirrors `follow_up_prompts`, keeps each task
self-contained, makes mutual-exclusivity checking local, and directly serves
the "vary the target config across tasks" use case — without any eval-level
override/merge complexity.
- **Runner owns the loop, responder owns classification** keeps the responder
small and unit-testable, and reuses the battle-tested agent follow-up plumbing
rather than duplicating it.
- **Persistent responder session** models a real user who remembers prior
answers, avoiding contradictory or repeated responses across turns.
- **Abstain as tagged `StatusError`** satisfies the issue's requirement that a
vague-brief abstain be reportable separately from transient errors, without
introducing a new top-level status value that every report/consumer would need
to learn.