diff --git a/README.md b/README.md new file mode 100644 index 0000000..f49a57f --- /dev/null +++ b/README.md @@ -0,0 +1,167 @@ +# ToolGate + +ToolGate is an MCP gateway that enforces policy on every tool call an AI agent makes — logging decisions, requiring human approval for sensitive operations, and surfacing clean errors when upstream services fail. + +## Prerequisites + +- Docker + Docker Compose +- Go 1.22+ +- `ANTHROPIC_API_KEY` set in your environment (or in `.env`) + +## Quick start — resilience demo UI + +The demo UI lets you run three fault-injection scenarios against a live stack and watch the audit trail update in real time. + +### 1. Build the gateway binary + +The compose stack mounts a pre-built binary instead of compiling inside Docker: + +```bash +make build-compose-bins +``` + +### 2. Start the full stack + +```bash +source .env # loads ANTHROPIC_API_KEY and optional overrides +docker compose up -d --wait +``` + +Services started: + +| Service | Host port | Purpose | +|---|---|---| +| `gateway` | 18080 | ToolGate MCP gateway | +| `localstripe` | 18420 | Fake Stripe API | +| `localstripe-mcp` | 18421 | MCP server wrapping localstripe | +| `eval-trigger` | 18086 | Python agent that the eval runner drives | +| `mock-slack` | 18090 | Fake Slack (receives approval requests) | +| `postgres` | 15432 | Audit log store | + +### 3. Start the eval runner UI + +```bash +POSTGRES_DSN="postgres://gateway:gateway@127.0.0.1:15432/gateway?sslmode=disable" \ +AGENT_URL="http://127.0.0.1:18086" \ +go run ./cmd/eval-runner --serve evalsuite/resilience.yaml +``` + +Open **http://localhost:8099** in your browser. + +--- + +## Running the three scenarios + +Each scenario requires a specific stack state. The **Stack Health** panel in the UI shows the current state of each service — use **Refresh Health** before running. + +### Scenario 1 — MCP Crash + +**What it tests:** Gateway surfaces a clean `upstream_error` when the upstream MCP server is unavailable. 
+ +**Required state:** Gateway up, MCP down, Slack any, Postgres up. + +```bash +# Warm the gateway capability cache while MCP is healthy +SESSION=$(curl -s -D - -X POST http://localhost:18080/mcp \ + -H "Content-Type: application/json" \ + -d '{"jsonrpc":"2.0","id":0,"method":"initialize","params":{"protocolVersion":"2025-03-26","capabilities":{},"clientInfo":{"name":"warmup","version":"1.0"}}}' \ + | grep -i "^Mcp-Session-Id:" | awk '{print $2}' | tr -d '\r\n') +curl -s -X POST http://localhost:18080/mcp \ + -H "Content-Type: application/json" \ + -H "Mcp-Session-Id: $SESSION" \ + -d '{"jsonrpc":"2.0","id":1,"method":"tools/list","params":{}}' > /dev/null + +# Inject the fault +docker compose stop localstripe-mcp +``` + +Click **MCP Crash → Run Scenario**. + +**Expected result:** `list_recent_charges → allow → upstream_error` — the gateway served the tool list from its capability cache and recorded the upstream failure. + +### Scenario 2 — Retry Storm + +**What it tests:** Budget limiter stops an agent from hammering a downed service. + +**Required state:** Gateway up, MCP down (carry over from Scenario 1). + +No additional setup needed. Click **Retry Storm → Run Scenario**. + +**Expected result:** Five `allow` decisions followed by `budgetExceeded`. + +### Scenario 3 — Approval Timeout + +**What it tests:** An `approvalRequired` decision expires gracefully when Slack is unreachable. + +**Required state:** Gateway up, MCP up, Slack down, Postgres up. 
+ +```bash +# Restore MCP +docker compose start localstripe-mcp + +# Wait for it to become healthy, then seed demo charges for alice@example.com +until docker inspect toolgate-localstripe-mcp-1 \ + --format '{{.State.Health.Status}}' 2>/dev/null | grep -q healthy; do sleep 2; done + +docker exec toolgate-eval-trigger-1 python3 -c " +import asyncio, sys +sys.path.insert(0, '/app') +from demo_webapp.stripe_client import StripeClient +from demo_webapp.seed import seed_demo_customer + +async def main(): + client = StripeClient('http://localstripe:8420', 'sk_test_12345') + cust = await client.find_customer_by_email('alice@example.com') + if cust is None: + cust = await client.create_customer('alice@example.com', 'Alice') + await seed_demo_customer(client, cust['id']) + await client.aclose() + +asyncio.run(main()) +" + +# Re-warm gateway after MCP restart +SESSION=$(curl -s -D - -X POST http://localhost:18080/mcp \ + -H "Content-Type: application/json" \ + -d '{"jsonrpc":"2.0","id":0,"method":"initialize","params":{"protocolVersion":"2025-03-26","capabilities":{},"clientInfo":{"name":"warmup","version":"1.0"}}}' \ + | grep -i "^Mcp-Session-Id:" | awk '{print $2}' | tr -d '\r\n') +curl -s -X POST http://localhost:18080/mcp \ + -H "Content-Type: application/json" \ + -H "Mcp-Session-Id: $SESSION" \ + -d '{"jsonrpc":"2.0","id":1,"method":"tools/list","params":{}}' > /dev/null + +# Stop Slack +docker compose stop mock-slack +``` + +Click **Approval Timeout → Run Scenario**. The case waits ~15 s for the approval TTL to expire. + +**Expected result:** `list_recent_charges → allow`, `create_refund → approvalRequired → expired`. + +--- + +## Scripted end-to-end run + +To run all three scenarios headlessly in one shot: + +```bash +make demo-resilience +``` + +This script manages the full Docker lifecycle, runs each scenario in sequence, and tears down the stack on exit. 
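The expected results listed for each scenario are ordered tool-call sequences checked against the audit trace. The following is a minimal sketch of that kind of in-order check, assuming a gapped-subsequence rule like the one the `TestEvaluateMustIncludeAllowsGaps` test name suggests. `traceRow` and `includesInOrder` are hypothetical names used only for illustration, not the eval runner's actual types or functions, and the sample rows are made-up data shaped like the Approval Timeout result.

```go
package main

import "fmt"

// traceRow is a local stand-in for one audit-log row (hypothetical type);
// only the two fields used in this sketch are modelled.
type traceRow struct {
	ToolName string
	Decision string
}

// includesInOrder reports whether every tool name in want appears in trace
// in the same relative order. Unrelated rows may sit between matches.
func includesInOrder(trace []traceRow, want []string) bool {
	next := 0
	for _, row := range trace {
		if next < len(want) && row.ToolName == want[next] {
			next++
		}
	}
	return next == len(want)
}

func main() {
	// Illustrative rows only; a real trace comes from the audit log.
	trace := []traceRow{
		{ToolName: "list_recent_charges", Decision: "allow"},
		{ToolName: "create_refund", Decision: "approvalRequired"},
		{ToolName: "create_refund", Decision: "expired"},
	}
	fmt.Println(includesInOrder(trace, []string{"list_recent_charges", "create_refund"}))
	fmt.Println(includesInOrder(trace, []string{"create_refund", "list_recent_charges"}))
}
```

A single forward pass is enough here because the check only cares about relative order, so extra rows such as repeated `create_refund` entries do not affect the outcome.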
+ +--- + +## Gateway capability cache + +The gateway caches the last successful `initialize` and `tools/list` responses from the upstream MCP server. When the upstream is unavailable, it serves tool metadata from this cache so agents can still discover tools — requests then fail with `upstream_error` at the call site rather than at tool-list time. + +**Important:** the cache is populated the first time a successful `tools/list` reaches the gateway. Always warm it (see Scenario 1 setup above) before stopping the MCP server. + +--- + +## Teardown + +```bash +docker compose down -v # stops all services and removes volumes +``` diff --git a/cmd/eval-runner/evaluator.go b/cmd/eval-runner/evaluator.go index 9702202..77cd914 100644 --- a/cmd/eval-runner/evaluator.go +++ b/cmd/eval-runner/evaluator.go @@ -3,7 +3,7 @@ package main import "strings" func Evaluate(c EvalCase, trace []TraceRow) CaseResult { - result := CaseResult{Name: c.Name} + result := CaseResult{Name: c.Name, Trace: trace} failures := make([]CheckFailure, 0) failures = append(failures, evaluateMustInclude(c.MustInclude, trace)...) 
diff --git a/cmd/eval-runner/evaluator_test.go b/cmd/eval-runner/evaluator_test.go index 371cc16..39ac362 100644 --- a/cmd/eval-runner/evaluator_test.go +++ b/cmd/eval-runner/evaluator_test.go @@ -25,12 +25,53 @@ func TestEvaluatePassesWhenAllChecksMatch(t *testing.T) { Name: "small-refund-allow", Passed: true, Failures: nil, + Trace: trace, } if !reflect.DeepEqual(got, want) { t.Fatalf("Evaluate() = %#v, want %#v", got, want) } } +func TestEvaluateIncludesTraceOnPass(t *testing.T) { + testCase := EvalCase{ + Name: "trace-pass", + MustInclude: []string{"lookup_customer"}, + PolicyOutcome: "allow", + } + trace := []TraceRow{ + {ToolName: "lookup_customer", Decision: "allow", Arguments: json.RawMessage(`{"customer":"abc"}`)}, + } + + got := Evaluate(testCase, trace) + + if !got.Passed { + t.Fatalf("Evaluate() Passed = false, want true; failures = %#v", got.Failures) + } + if !reflect.DeepEqual(got.Trace, trace) { + t.Fatalf("Trace = %#v, want %#v", got.Trace, trace) + } +} + +func TestEvaluateIncludesTraceOnFailure(t *testing.T) { + testCase := EvalCase{ + Name: "trace-fail", + MustInclude: []string{"create_refund"}, + PolicyOutcome: "allow", + } + trace := []TraceRow{ + {ToolName: "lookup_customer", Decision: "allow"}, + } + + got := Evaluate(testCase, trace) + + if got.Passed { + t.Fatal("Evaluate() Passed = true, want false") + } + if !reflect.DeepEqual(got.Trace, trace) { + t.Fatalf("Trace = %#v, want %#v", got.Trace, trace) + } +} + func TestEvaluateMustIncludeAllowsGaps(t *testing.T) { testCase := EvalCase{ Name: "gapped-subsequence", diff --git a/cmd/eval-runner/execution.go b/cmd/eval-runner/execution.go new file mode 100644 index 0000000..0dcbd77 --- /dev/null +++ b/cmd/eval-runner/execution.go @@ -0,0 +1,36 @@ +package main + +import "context" + +func runEvalCase(ctx context.Context, runner caseExecutor, testCase EvalCase) CaseResult { + trace, err := runner.Run(ctx, testCase) + if err != nil { + return CaseResult{ + Name: testCase.Name, + Failures: 
[]CheckFailure{{ + Check: "run", + Expected: "case completes successfully", + Observed: err.Error(), + }}, + } + } + + return Evaluate(testCase, trace) +} + +func summarizeResults(results []CaseResult) evalResponse { + passCount := 0 + for _, result := range results { + if result.Passed { + passCount++ + } + } + + return evalResponse{ + Passed: passCount == len(results), + PassCount: passCount, + TotalCount: len(results), + Cases: results, + Report: GenerateReport(results), + } +} diff --git a/cmd/eval-runner/scenarios.go b/cmd/eval-runner/scenarios.go new file mode 100644 index 0000000..063aac2 --- /dev/null +++ b/cmd/eval-runner/scenarios.go @@ -0,0 +1,444 @@ +package main + +import ( + "bytes" + "context" + "encoding/json" + "errors" + "fmt" + "log/slog" + "net/http" + "net/url" + "os" + "strings" + "time" + + "github.com/jackc/pgx/v5/pgxpool" +) + +const defaultGatewayMCPURL = "http://localhost:18080/mcp" + +const mcpCrashSuiteYAML = `cases: + - name: mcp-server-down + input: "Show me my recent charges." + mustInclude: + - list_recent_charges + policyOutcome: upstream_error +` + +const approvalTimeoutSuiteYAML = `cases: + - name: approval-timeout-slack-down + input: > + List recent charges for alice@example.com, then issue a full refund on + the first non-refunded charge with reason requested_by_customer. + Do not ask for confirmation — proceed directly. 
+ mustInclude: + - list_recent_charges + - create_refund + policyOutcome: expired +` + +type scenarioDeps struct { + pool *pgxpool.Pool + newRunner runnerFactory + newRetryStorm func(gatewayURL string) scenarioCaseExecutor + defaultAgentURL string + defaultGatewayMCPURL string +} + +type scenarioCaseExecutor interface { + Run(ctx context.Context) CaseResult +} + +type scenarioCaseExecutorFunc func(context.Context) CaseResult + +func (f scenarioCaseExecutorFunc) Run(ctx context.Context) CaseResult { + return f(ctx) +} + +type retryStormExecutor struct { + gatewayMCPURL string + pool *pgxpool.Pool + client *http.Client + initialize func(context.Context) (string, error) + callGateway func(context.Context, string, string) (string, error) + queryTrace func(context.Context, string) ([]TraceRow, error) + newTurnID func() string + pollInterval time.Duration + pollTimeout time.Duration +} + +func makeScenarioStreamHandler(deps scenarioDeps) http.HandlerFunc { + return func(w http.ResponseWriter, r *http.Request) { + var body struct { + ScenarioID string `json:"scenario_id"` + AgentURL string `json:"agent_url"` + GatewayMCPURL string `json:"gateway_mcp_url"` + } + if err := json.NewDecoder(r.Body).Decode(&body); err != nil { + http.Error(w, fmt.Sprintf("invalid request: %v", err), http.StatusBadRequest) + return + } + if _, ok := w.(http.Flusher); !ok { + http.Error(w, "streaming unsupported", http.StatusInternalServerError) + return + } + + switch body.ScenarioID { + case "mcp-crash", "approval-timeout": + agentURL, err := resolveAbsoluteURL(body.AgentURL, deps.defaultAgentURL) + if err != nil { + http.Error(w, "missing or invalid agent_url", http.StatusBadRequest) + return + } + if deps.newRunner == nil { + http.Error(w, "runner unavailable", http.StatusInternalServerError) + return + } + suiteYAML := mcpCrashSuiteYAML + if body.ScenarioID == "approval-timeout" { + suiteYAML = approvalTimeoutSuiteYAML + } + suite, err := LoadSuiteFromReader(strings.NewReader(suiteYAML)) + 
if err != nil { + http.Error(w, fmt.Sprintf("invalid scenario suite: %v", err), http.StatusInternalServerError) + return + } + w.Header().Set("Content-Type", "text/event-stream") + w.Header().Set("Cache-Control", "no-cache") + streamEvalSuite(r.Context(), w, deps.newRunner(agentURL), suite.Cases) + case "retry-storm": + gatewayURL, err := resolveGatewayMCPURL(body.GatewayMCPURL, deps.defaultGatewayMCPURL) + if err != nil { + http.Error(w, "missing or invalid gateway_mcp_url", http.StatusBadRequest) + return + } + if deps.newRetryStorm == nil { + http.Error(w, "retry storm executor unavailable", http.StatusInternalServerError) + return + } + w.Header().Set("Content-Type", "text/event-stream") + w.Header().Set("Cache-Control", "no-cache") + + if err := writeSSE(w, "case_start", caseStartEvent{Name: "retry-storm-budget", Index: 0, Total: 1}); err != nil { + return + } + result := deps.newRetryStorm(gatewayURL).Run(r.Context()) + if err := writeSSE(w, "case_result", caseResultEvent{Index: 0, Total: 1, Result: result}); err != nil { + return + } + _ = writeSSE(w, "summary", summarizeResults([]CaseResult{result})) + default: + http.Error(w, "unknown scenario_id", http.StatusBadRequest) + } + } +} + +// warmGatewayCapCache calls initialize + tools/list on the gateway so the +// capability cache is populated before localstripe-mcp is stopped for the +// MCP Crash scenario. Failures are logged and ignored — the cache may already +// be warm from a prior run. 
+func warmGatewayCapCache(gatewayMCPURL string) { + client := &http.Client{Timeout: 5 * time.Second} + + initBody, _ := json.Marshal(map[string]any{ + "jsonrpc": "2.0", "id": 0, "method": "initialize", + "params": map[string]any{ + "protocolVersion": "2025-03-26", + "capabilities": map[string]any{}, + "clientInfo": map[string]any{"name": "eval-warmup", "version": "1.0"}, + }, + }) + req, err := http.NewRequest(http.MethodPost, gatewayMCPURL, bytes.NewReader(initBody)) + if err != nil { + slog.Warn("gateway warmup: build initialize request", "err", err) + return + } + req.Header.Set("Content-Type", "application/json") + resp, err := client.Do(req) + if err != nil { + slog.Warn("gateway warmup: initialize failed", "err", err) + return + } + sessionID := resp.Header.Get("Mcp-Session-Id") + _ = resp.Body.Close() + if resp.StatusCode != http.StatusOK || sessionID == "" { + slog.Warn("gateway warmup: initialize returned unexpected status or no session id", "status", resp.StatusCode, "has_session_id", sessionID != "") + return + } + + listBody, _ := json.Marshal(map[string]any{ + "jsonrpc": "2.0", "id": 1, "method": "tools/list", "params": map[string]any{}, + }) + req2, err := http.NewRequest(http.MethodPost, gatewayMCPURL, bytes.NewReader(listBody)) + if err != nil { + slog.Warn("gateway warmup: build tools/list request", "err", err) + return + } + req2.Header.Set("Content-Type", "application/json") + req2.Header.Set("Mcp-Session-Id", sessionID) + resp2, err := client.Do(req2) + if err != nil { + slog.Warn("gateway warmup: tools/list failed", "err", err) + return + } + _ = resp2.Body.Close() + slog.Info("gateway warmup: capability cache primed", "gateway", gatewayMCPURL) +} + +func resolveAbsoluteURL(requestValue, fallback string) (string, error) { + candidate := strings.TrimSpace(requestValue) + if candidate == "" { + candidate = strings.TrimSpace(fallback) + } + if candidate == "" { + return "", errors.New("missing url") + } + parsed, err := url.Parse(candidate) + if err != nil || parsed.Scheme == "" || parsed.Host == 
"" { + return "", errors.New("invalid url") + } + if parsed.Scheme != "http" && parsed.Scheme != "https" { + return "", errors.New("invalid url") + } + return candidate, nil +} + +func resolveGatewayMCPURL(requestValue, fallback string) (string, error) { + if strings.TrimSpace(requestValue) != "" { + return resolveAbsoluteURL(requestValue, "") + } + if resolved, err := resolveAbsoluteURL("", fallback); err == nil { + return resolved, nil + } + if resolved, err := resolveAbsoluteURL("", os.Getenv("GATEWAY_MCP_URL")); err == nil { + return resolved, nil + } + return resolveAbsoluteURL("", defaultGatewayMCPURL) +} + +func newRetryStormExecutor(gatewayURL string, pool *pgxpool.Pool) scenarioCaseExecutor { + exec := &retryStormExecutor{ + gatewayMCPURL: gatewayURL, + pool: pool, + client: &http.Client{ + Timeout: caseRunnerHTTPTimeout, + }, + pollInterval: auditPollInterval, + pollTimeout: auditPollTimeout, + newTurnID: func() string { + return fmt.Sprintf("retry-storm-%d", time.Now().UnixNano()) + }, + } + exec.initialize = exec.defaultInitialize + exec.callGateway = exec.defaultCallGateway + exec.queryTrace = exec.defaultQueryTrace + return exec +} + +func (e *retryStormExecutor) Run(ctx context.Context) CaseResult { + result := CaseResult{Name: "retry-storm-budget"} + if e.newTurnID == nil { + e.newTurnID = func() string { + return fmt.Sprintf("retry-storm-%d", time.Now().UnixNano()) + } + } + if e.pollInterval == 0 { + e.pollInterval = auditPollInterval + } + if e.pollTimeout == 0 { + e.pollTimeout = auditPollTimeout + } + sessionID, err := e.initialize(ctx) + if err != nil { + result.Failures = []CheckFailure{{ + Check: "run", + Expected: "retry storm completes successfully", + Observed: err.Error(), + }} + return result + } + if strings.TrimSpace(sessionID) == "" { + result.Failures = []CheckFailure{{ + Check: "run", + Expected: "Mcp-Session-Id response header", + Observed: "(empty session id)", + }} + return result + } + + turnID := e.newTurnID() + budgetResponse 
:= false + for i := 0; i < 6; i++ { + respBody, err := e.callGateway(ctx, sessionID, turnID) + if err != nil { + result.Failures = []CheckFailure{{ + Check: "run", + Expected: "gateway tool call succeeds", + Observed: err.Error(), + }} + return result + } + if strings.Contains(strings.ToLower(respBody), "budget") { + budgetResponse = true + break + } + } + + if !budgetResponse { + result.Failures = []CheckFailure{{ + Check: "policyOutcome", + Expected: "budgetExceeded", + Observed: "no budget limiter response after 6 calls", + }} + return result + } + + deadline := time.Now().Add(e.pollTimeout) + var trace []TraceRow + for { + trace, err = e.queryTrace(ctx, sessionID) + if err != nil { + result.Failures = []CheckFailure{{ + Check: "run", + Expected: "audit query succeeds", + Observed: err.Error(), + }} + return result + } + result.Trace = trace + if hasDecision(trace, "budgetExceeded") { + result.Passed = true + return result + } + if time.Now().After(deadline) { + result.Failures = []CheckFailure{{ + Check: "policyOutcome", + Expected: "budgetExceeded", + Observed: lastDecision(trace), + }} + return result + } + + select { + case <-ctx.Done(): + result.Failures = []CheckFailure{{ + Check: "run", + Expected: "context remains active", + Observed: ctx.Err().Error(), + }} + return result + case <-time.After(e.pollInterval): + } + } +} + +func hasDecision(trace []TraceRow, decision string) bool { + for _, row := range trace { + if row.Decision == decision { + return true + } + } + return false +} + +func lastDecision(trace []TraceRow) string { + if len(trace) == 0 { + return "(empty trace)" + } + return trace[len(trace)-1].Decision +} + +func (e *retryStormExecutor) defaultInitialize(ctx context.Context) (string, error) { + payload := map[string]any{ + "jsonrpc": "2.0", + "id": 0, + "method": "initialize", + "params": map[string]any{ + "protocolVersion": "2025-03-26", + "capabilities": map[string]any{}, + "clientInfo": map[string]any{ + "name": "retry-storm-ui", + 
"version": "1.0", + }, + }, + } + body, err := json.Marshal(payload) + if err != nil { + return "", fmt.Errorf("marshal initialize request: %w", err) + } + req, err := http.NewRequestWithContext(ctx, http.MethodPost, e.gatewayMCPURL, bytes.NewReader(body)) + if err != nil { + return "", fmt.Errorf("build initialize request: %w", err) + } + req.Header.Set("Content-Type", "application/json") + resp, err := e.client.Do(req) + if err != nil { + return "", err + } + defer func() { _ = resp.Body.Close() }() + if resp.StatusCode != http.StatusOK { + return "", fmt.Errorf("initialize returned HTTP %d: %s", resp.StatusCode, firstBytes(resp.Body, 256)) + } + return resp.Header.Get("Mcp-Session-Id"), nil +} + +func (e *retryStormExecutor) defaultCallGateway(ctx context.Context, sessionID, turnID string) (string, error) { + payload := map[string]any{ + "jsonrpc": "2.0", + "id": 1, + "method": "tools/call", + "params": map[string]any{ + "name": "list_recent_charges", + "arguments": map[string]any{}, + }, + } + body, err := json.Marshal(payload) + if err != nil { + return "", fmt.Errorf("marshal tools/call request: %w", err) + } + req, err := http.NewRequestWithContext(ctx, http.MethodPost, e.gatewayMCPURL, bytes.NewReader(body)) + if err != nil { + return "", fmt.Errorf("build tools/call request: %w", err) + } + req.Header.Set("Content-Type", "application/json") + req.Header.Set("Mcp-Session-Id", sessionID) + req.Header.Set("X-Mcp-Turn-Id", turnID) + resp, err := e.client.Do(req) + if err != nil { + return "", err + } + defer func() { _ = resp.Body.Close() }() + return firstBytes(resp.Body, 4096), nil +} + +func (e *retryStormExecutor) defaultQueryTrace(ctx context.Context, sessionID string) ([]TraceRow, error) { + if e.pool == nil { + return nil, errors.New("postgres pool is nil") + } + rows, err := e.pool.Query( + ctx, + `SELECT tool_name, decision, arguments + FROM audit_log + WHERE session_id = $1 + ORDER BY decided_at ASC`, + sessionID, + ) + if err != nil { + return nil, 
err + } + defer rows.Close() + + var trace []TraceRow + for rows.Next() { + var row TraceRow + if err := rows.Scan(&row.ToolName, &row.Decision, &row.Arguments); err != nil { + return nil, err + } + trace = append(trace, row) + } + if err := rows.Err(); err != nil { + return nil, err + } + return trace, nil +} diff --git a/cmd/eval-runner/scenarios_test.go b/cmd/eval-runner/scenarios_test.go new file mode 100644 index 0000000..ebb4331 --- /dev/null +++ b/cmd/eval-runner/scenarios_test.go @@ -0,0 +1,296 @@ +package main + +import ( + "context" + "errors" + "net/http" + "net/http/httptest" + "strings" + "testing" + "time" +) + +func TestScenarioStreamRequiresAgentURLForYAMLScenario(t *testing.T) { + req := httptest.NewRequest(http.MethodPost, "/run-scenario/stream", strings.NewReader(`{"scenario_id":"mcp-crash"}`)) + rec := httptest.NewRecorder() + + makeScenarioStreamHandler(scenarioDeps{})(rec, req) + + if rec.Code != http.StatusBadRequest { + t.Fatalf("status = %d, want 400", rec.Code) + } +} + +func TestScenarioStreamRetryStormRequiresGatewayURLOnly(t *testing.T) { + req := httptest.NewRequest(http.MethodPost, "/run-scenario/stream", strings.NewReader(`{"scenario_id":"retry-storm","gateway_mcp_url":"://bad"}`)) + rec := httptest.NewRecorder() + + makeScenarioStreamHandler(scenarioDeps{})(rec, req) + + if rec.Code != http.StatusBadRequest { + t.Fatalf("status = %d, want 400", rec.Code) + } +} + +func TestScenarioStreamRetryStormUsesGatewayDefaultWithoutAgentURL(t *testing.T) { + ran := false + req := httptest.NewRequest(http.MethodPost, "/run-scenario/stream", strings.NewReader(`{"scenario_id":"retry-storm"}`)) + rec := httptest.NewRecorder() + + makeScenarioStreamHandler(scenarioDeps{ + defaultGatewayMCPURL: "http://gateway.example/mcp", + newRetryStorm: func(gatewayURL string) scenarioCaseExecutor { + if gatewayURL != "http://gateway.example/mcp" { + t.Fatalf("gatewayURL = %q, want default", gatewayURL) + } + return scenarioCaseExecutorFunc(func(context.Context) 
CaseResult { + ran = true + return CaseResult{Name: "retry-storm-budget", Passed: true} + }) + }, + })(rec, req) + + if rec.Code != http.StatusOK { + t.Fatalf("status = %d, want 200: %s", rec.Code, rec.Body.String()) + } + if !ran { + t.Fatal("retry storm executor did not run") + } +} + +func TestScenarioStreamRetryStormRequestGatewayURLWins(t *testing.T) { + req := httptest.NewRequest(http.MethodPost, "/run-scenario/stream", strings.NewReader(`{"scenario_id":"retry-storm","gateway_mcp_url":"http://request.example/mcp"}`)) + rec := httptest.NewRecorder() + + makeScenarioStreamHandler(scenarioDeps{ + defaultGatewayMCPURL: "http://default.example/mcp", + newRetryStorm: func(gatewayURL string) scenarioCaseExecutor { + if gatewayURL != "http://request.example/mcp" { + t.Fatalf("gatewayURL = %q, want request override", gatewayURL) + } + return scenarioCaseExecutorFunc(func(context.Context) CaseResult { + return CaseResult{Name: "retry-storm-budget", Passed: true} + }) + }, + })(rec, req) + + if rec.Code != http.StatusOK { + t.Fatalf("status = %d, want 200: %s", rec.Code, rec.Body.String()) + } +} + +func TestScenarioStreamRetryStormUsesEnvGatewayURL(t *testing.T) { + t.Setenv("GATEWAY_MCP_URL", "http://env.example/mcp") + req := httptest.NewRequest(http.MethodPost, "/run-scenario/stream", strings.NewReader(`{"scenario_id":"retry-storm"}`)) + rec := httptest.NewRecorder() + + makeScenarioStreamHandler(scenarioDeps{ + newRetryStorm: func(gatewayURL string) scenarioCaseExecutor { + if gatewayURL != "http://env.example/mcp" { + t.Fatalf("gatewayURL = %q, want env fallback", gatewayURL) + } + return scenarioCaseExecutorFunc(func(context.Context) CaseResult { + return CaseResult{Name: "retry-storm-budget", Passed: true} + }) + }, + })(rec, req) + + if rec.Code != http.StatusOK { + t.Fatalf("status = %d, want 200: %s", rec.Code, rec.Body.String()) + } +} + +func TestScenarioStreamRetryStormUsesHardcodedGatewayFallback(t *testing.T) { + req := 
httptest.NewRequest(http.MethodPost, "/run-scenario/stream", strings.NewReader(`{"scenario_id":"retry-storm"}`)) + rec := httptest.NewRecorder() + + makeScenarioStreamHandler(scenarioDeps{ + newRetryStorm: func(gatewayURL string) scenarioCaseExecutor { + if gatewayURL != "http://localhost:18080/mcp" { + t.Fatalf("gatewayURL = %q, want hardcoded fallback", gatewayURL) + } + return scenarioCaseExecutorFunc(func(context.Context) CaseResult { + return CaseResult{Name: "retry-storm-budget", Passed: true} + }) + }, + })(rec, req) + + if rec.Code != http.StatusOK { + t.Fatalf("status = %d, want 200: %s", rec.Code, rec.Body.String()) + } +} + +func TestScenarioStreamYAMLScenarioUsesDefaultAgentURL(t *testing.T) { + req := httptest.NewRequest(http.MethodPost, "/run-scenario/stream", strings.NewReader(`{"scenario_id":"mcp-crash"}`)) + rec := httptest.NewRecorder() + + makeScenarioStreamHandler(scenarioDeps{ + defaultAgentURL: "http://agent.example", + newRunner: func(agentURL string) caseExecutor { + if agentURL != "http://agent.example" { + t.Fatalf("agentURL = %q, want default", agentURL) + } + return serveStubRunner{ + traces: map[string][]TraceRow{"mcp-server-down": {{ToolName: "list_recent_charges", Decision: "upstream_error"}}}, + errs: map[string]error{}, + } + }, + })(rec, req) + + if rec.Code != http.StatusOK { + t.Fatalf("status = %d, want 200: %s", rec.Code, rec.Body.String()) + } +} + +func TestScenarioStreamRejectsUnknownScenario(t *testing.T) { + req := httptest.NewRequest(http.MethodPost, "/run-scenario/stream", strings.NewReader(`{"scenario_id":"unknown","agent_url":"http://agent.example"}`)) + rec := httptest.NewRecorder() + + makeScenarioStreamHandler(scenarioDeps{})(rec, req) + + if rec.Code != http.StatusBadRequest { + t.Fatalf("status = %d, want 400", rec.Code) + } +} + +func TestRetryStormExecutorPollsUntilBudgetExceeded(t *testing.T) { + attempts := 0 + exec := retryStormExecutor{ + gatewayMCPURL: "http://gateway.example/mcp", + callGateway: 
func(context.Context, string, string) (string, error) { + return `{"error":{"message":"budget exceeded"}}`, nil + }, + queryTrace: func(context.Context, string) ([]TraceRow, error) { + attempts++ + if attempts < 3 { + return []TraceRow{{ToolName: "list_recent_charges", Decision: "upstream_error"}}, nil + } + return []TraceRow{ + {ToolName: "list_recent_charges", Decision: "upstream_error"}, + {ToolName: "list_recent_charges", Decision: "budgetExceeded"}, + }, nil + }, + initialize: func(context.Context) (string, error) { return "session-1", nil }, + pollInterval: time.Millisecond, + pollTimeout: 50 * time.Millisecond, + } + + got := exec.Run(context.Background()) + + if !got.Passed { + t.Fatalf("Passed = false, want true; failures = %#v", got.Failures) + } + if attempts < 3 { + t.Fatalf("query attempts = %d, want polling", attempts) + } +} + +func TestRetryStormExecutorFailsWhenInitializeFails(t *testing.T) { + exec := retryStormExecutor{ + gatewayMCPURL: "http://gateway.example/mcp", + initialize: func(context.Context) (string, error) { + return "", errors.New("initialize failed") + }, + } + + got := exec.Run(context.Background()) + + if got.Passed { + t.Fatal("Passed = true, want false") + } + if len(got.Failures) == 0 || got.Failures[0].Check != "run" { + t.Fatalf("Failures = %#v, want run failure", got.Failures) + } +} + +func TestRetryStormExecutorFailsWhenInitializeReturnsNoSession(t *testing.T) { + exec := retryStormExecutor{ + gatewayMCPURL: "http://gateway.example/mcp", + initialize: func(context.Context) (string, error) { + return "", nil + }, + } + + got := exec.Run(context.Background()) + + if got.Passed { + t.Fatal("Passed = true, want false") + } + if len(got.Failures) == 0 || got.Failures[0].Check != "run" { + t.Fatalf("Failures = %#v, want run failure", got.Failures) + } +} + +func TestRetryStormExecutorFailsWhenAuditQueryFails(t *testing.T) { + exec := retryStormExecutor{ + gatewayMCPURL: "http://gateway.example/mcp", + initialize: 
func(context.Context) (string, error) { return "session-1", nil }, + callGateway: func(context.Context, string, string) (string, error) { + return `{"error":{"message":"budget exceeded"}}`, nil + }, + queryTrace: func(context.Context, string) ([]TraceRow, error) { + return nil, errors.New("select failed") + }, + pollInterval: time.Millisecond, + pollTimeout: 5 * time.Millisecond, + } + + got := exec.Run(context.Background()) + + if got.Passed { + t.Fatal("Passed = true, want false") + } + if len(got.Failures) == 0 || got.Failures[0].Check != "run" { + t.Fatalf("Failures = %#v, want run failure", got.Failures) + } +} + +func TestRetryStormExecutorFailsAfterSixNonBudgetResponses(t *testing.T) { + calls := 0 + exec := retryStormExecutor{ + gatewayMCPURL: "http://gateway.example/mcp", + initialize: func(context.Context) (string, error) { return "session-1", nil }, + callGateway: func(context.Context, string, string) (string, error) { + calls++ + return `{"error":{"message":"upstream unavailable"}}`, nil + }, + queryTrace: func(context.Context, string) ([]TraceRow, error) { + return []TraceRow{{ToolName: "list_recent_charges", Decision: "upstream_error"}}, nil + }, + pollInterval: time.Millisecond, + pollTimeout: 5 * time.Millisecond, + } + + got := exec.Run(context.Background()) + + if got.Passed { + t.Fatal("Passed = true, want false") + } + if calls != 6 { + t.Fatalf("calls = %d, want 6", calls) + } +} + +func TestRetryStormExecutorFailsWhenBudgetNeverAppears(t *testing.T) { + exec := retryStormExecutor{ + gatewayMCPURL: "http://gateway.example/mcp", + initialize: func(context.Context) (string, error) { return "session-1", nil }, + callGateway: func(context.Context, string, string) (string, error) { + return `{"error":{"message":"budget exceeded"}}`, nil + }, + queryTrace: func(context.Context, string) ([]TraceRow, error) { + return []TraceRow{{ToolName: "list_recent_charges", Decision: "upstream_error"}}, nil + }, + pollInterval: time.Millisecond, + pollTimeout: 5 * 
time.Millisecond, + } + + got := exec.Run(context.Background()) + + if got.Passed { + t.Fatal("Passed = true, want false") + } + if len(got.Failures) == 0 || got.Failures[0].Check != "policyOutcome" { + t.Fatalf("Failures = %#v, want policyOutcome failure", got.Failures) + } +} diff --git a/cmd/eval-runner/serve.go b/cmd/eval-runner/serve.go index 06a004a..ba8a0aa 100644 --- a/cmd/eval-runner/serve.go +++ b/cmd/eval-runner/serve.go @@ -85,10 +85,31 @@ func serve(suitePath string) error { }) http.HandleFunc("POST /run-eval/custom", makeCustomEvalHandler(pool)) + http.HandleFunc("POST /run-eval/custom/stream", makeCustomEvalStreamHandler(func(agentURL string) caseExecutor { + return NewCaseRunner(agentURL, pool) + })) + http.HandleFunc("POST /run-scenario/stream", makeScenarioStreamHandler(scenarioDeps{ + pool: pool, + defaultAgentURL: cfg.AgentURL, + defaultGatewayMCPURL: os.Getenv("GATEWAY_MCP_URL"), + newRunner: func(agentURL string) caseExecutor { + return NewCaseRunner(agentURL, pool) + }, + newRetryStorm: func(gatewayURL string) scenarioCaseExecutor { + return newRetryStormExecutor(gatewayURL, pool) + }, + })) http.HandleFunc("GET /healthz", func(w http.ResponseWriter, r *http.Request) { w.WriteHeader(http.StatusOK) }) + http.HandleFunc("GET /stack-health", makeStackHealthHandler(stackHealthDeps{pool: pool})) + + gatewayMCPURL := os.Getenv("GATEWAY_MCP_URL") + if gatewayMCPURL == "" { + gatewayMCPURL = defaultGatewayMCPURL + } + go warmGatewayCapCache(gatewayMCPURL) slog.Info("eval server listening", "port", port) return http.ListenAndServe(":"+port, nil) @@ -128,43 +149,18 @@ func makeEvalHandler(runner caseExecutor, suite *EvalSuite, _ *pgxpool.Pool) htt return func(w http.ResponseWriter, r *http.Request) { results := make([]CaseResult, 0, len(suite.Cases)) for _, testCase := range suite.Cases { - trace, err := runner.Run(r.Context(), testCase) - result := CaseResult{Name: testCase.Name} - if err != nil { - result.Failures = []CheckFailure{{ - Check: "run", - 
Expected: "case completes successfully", - Observed: err.Error(), - }} - } else { - result = Evaluate(testCase, trace) - } - results = append(results, result) - } - - passCount := 0 - for _, r := range results { - if r.Passed { - passCount++ - } + results = append(results, runEvalCase(r.Context(), runner, testCase)) } - report := GenerateReport(results) + resp := summarizeResults(results) if r.Header.Get("Accept") == "application/json" { - resp := evalResponse{ - Passed: passCount == len(results), - PassCount: passCount, - TotalCount: len(results), - Cases: results, - Report: report, - } w.Header().Set("Content-Type", "application/json") _ = json.NewEncoder(w).Encode(resp) return } w.Header().Set("Content-Type", "text/plain") - _, _ = fmt.Fprint(w, report) + _, _ = fmt.Fprint(w, resp.Report) } } diff --git a/cmd/eval-runner/serve_test.go b/cmd/eval-runner/serve_test.go new file mode 100644 index 0000000..052dbc9 --- /dev/null +++ b/cmd/eval-runner/serve_test.go @@ -0,0 +1,133 @@ +package main + +import ( + "context" + "encoding/json" + "errors" + "net/http" + "net/http/httptest" + "strings" + "testing" +) + +type serveStubRunner struct { + traces map[string][]TraceRow + errs map[string]error +} + +func (s serveStubRunner) Run(_ context.Context, c EvalCase) ([]TraceRow, error) { + if err := s.errs[c.Name]; err != nil { + return nil, err + } + return s.traces[c.Name], nil +} + +func TestRunEvalCaseReturnsRunFailure(t *testing.T) { + runner := serveStubRunner{errs: map[string]error{"bad": errors.New("agent down")}} + + got := runEvalCase(context.Background(), runner, EvalCase{Name: "bad", Input: "x"}) + + if got.Passed { + t.Fatal("Passed = true, want false") + } + if len(got.Failures) != 1 || got.Failures[0].Check != "run" { + t.Fatalf("Failures = %#v, want run failure", got.Failures) + } + if got.Trace != nil { + t.Fatalf("Trace = %#v, want nil", got.Trace) + } +} + +func TestCustomEvalJSONIncludesTrace(t *testing.T) { + runnerTrace := []TraceRow{{ToolName: 
"lookup_customer", Decision: "allow"}} + runner := serveStubRunner{ + traces: map[string][]TraceRow{"lookup": runnerTrace}, + errs: map[string]error{}, + } + suite := &EvalSuite{Cases: []EvalCase{{ + Name: "lookup", + Input: "lookup", + MustInclude: []string{"lookup_customer"}, + PolicyOutcome: "allow", + }}} + req := httptest.NewRequest(http.MethodPost, "/run-eval", nil) + req.Header.Set("Accept", "application/json") + rec := httptest.NewRecorder() + + makeEvalHandler(runner, suite, nil)(rec, req) + + if rec.Code != http.StatusOK { + t.Fatalf("status = %d, want 200: %s", rec.Code, rec.Body.String()) + } + var body evalResponse + if err := json.NewDecoder(rec.Body).Decode(&body); err != nil { + t.Fatalf("Decode(response): %v", err) + } + if len(body.Cases) != 1 || len(body.Cases[0].Trace) != 1 { + t.Fatalf("cases = %#v, want trace in response", body.Cases) + } + if body.Cases[0].Trace[0].ToolName != "lookup_customer" { + t.Fatalf("trace = %#v, want lookup_customer", body.Cases[0].Trace) + } +} + +func TestCustomEvalRejectsMissingAgentURL(t *testing.T) { + req := httptest.NewRequest(http.MethodPost, "/run-eval/custom", strings.NewReader(`{"suite":"cases: []"}`)) + rec := httptest.NewRecorder() + + makeCustomEvalHandler(nil)(rec, req) + + if rec.Code != http.StatusBadRequest { + t.Fatalf("status = %d, want 400", rec.Code) + } +} + +func TestCustomEvalStreamEmitsCaseEventsAndSummary(t *testing.T) { + body := `{"agent_url":"http://agent.example","suite":"cases:\n - name: lookup\n input: lookup\n mustInclude:\n - lookup_customer\n policyOutcome: allow\n"}` + req := httptest.NewRequest(http.MethodPost, "/run-eval/custom/stream", strings.NewReader(body)) + rec := httptest.NewRecorder() + + makeCustomEvalStreamHandler(func(agentURL string) caseExecutor { + if agentURL != "http://agent.example" { + t.Fatalf("agentURL = %q, want http://agent.example", agentURL) + } + return serveStubRunner{ + traces: map[string][]TraceRow{"lookup": {{ToolName: "lookup_customer", Decision: 
"allow"}}}, + errs: map[string]error{}, + } + })(rec, req) + + if rec.Code != http.StatusOK { + t.Fatalf("status = %d, want 200: %s", rec.Code, rec.Body.String()) + } + got := rec.Body.String() + for _, want := range []string{"event: case_start", `"name":"lookup"`, "event: case_result", "event: summary"} { + if !strings.Contains(got, want) { + t.Fatalf("stream = %q, missing %q", got, want) + } + } + if strings.Index(got, "event: case_start") > strings.Index(got, "event: case_result") { + t.Fatalf("case_start should precede case_result: %q", got) + } +} + +func TestCustomEvalStreamContinuesAfterRunnerError(t *testing.T) { + body := `{"agent_url":"http://agent.example","suite":"cases:\n - name: bad\n input: bad\n policyOutcome: allow\n - name: good\n input: good\n mustInclude:\n - lookup_customer\n policyOutcome: allow\n"}` + req := httptest.NewRequest(http.MethodPost, "/run-eval/custom/stream", strings.NewReader(body)) + rec := httptest.NewRecorder() + + makeCustomEvalStreamHandler(func(string) caseExecutor { + return serveStubRunner{ + traces: map[string][]TraceRow{"good": {{ToolName: "lookup_customer", Decision: "allow"}}}, + errs: map[string]error{"bad": errors.New("agent down")}, + } + })(rec, req) + + got := rec.Body.String() + if strings.Count(got, "event: case_result") != 2 { + t.Fatalf("case_result count = %d, want 2 in %q", strings.Count(got, "event: case_result"), got) + } + if !strings.Contains(got, `"pass_count":1`) { + t.Fatalf("stream = %q, want one pass in summary", got) + } +} diff --git a/cmd/eval-runner/stack_health.go b/cmd/eval-runner/stack_health.go new file mode 100644 index 0000000..82b3559 --- /dev/null +++ b/cmd/eval-runner/stack_health.go @@ -0,0 +1,82 @@ +package main + +import ( + "context" + "encoding/json" + "net" + "net/http" + "time" + + "github.com/jackc/pgx/v5/pgxpool" +) + +const stackHealthProbeTimeout = 750 * time.Millisecond + +type stackHealthResponse struct { + Services []stackHealthService `json:"services"` +} + +type 
stackHealthService struct { + Name string `json:"name"` + Status string `json:"status"` + Detail string `json:"detail,omitempty"` +} + +type stackHealthDeps struct { + pool *pgxpool.Pool + httpClient *http.Client +} + +func makeStackHealthHandler(deps stackHealthDeps) http.HandlerFunc { + return func(w http.ResponseWriter, r *http.Request) { + w.Header().Set("Content-Type", "application/json") + _ = json.NewEncoder(w).Encode(stackHealthResponse{ + Services: []stackHealthService{ + probeHTTPService(deps.httpClient, "Gateway", "http://localhost:18080/mcp"), + probeTCPService("MCP", "127.0.0.1:18421"), + probeHTTPService(deps.httpClient, "Slack", "http://localhost:18090/healthz"), + probePostgresService(deps.pool), + }, + }) + } +} + +func probeHTTPService(client *http.Client, name, target string) stackHealthService { + if client == nil { + client = &http.Client{Timeout: stackHealthProbeTimeout} + } + req, err := http.NewRequest(http.MethodGet, target, nil) + if err != nil { + return stackHealthService{Name: name, Status: "unknown", Detail: err.Error()} + } + resp, err := client.Do(req) + if err != nil { + return stackHealthService{Name: name, Status: "down", Detail: target} + } + defer func() { _ = resp.Body.Close() }() + if resp.StatusCode >= 200 && resp.StatusCode < 300 { + return stackHealthService{Name: name, Status: "up", Detail: target} + } + return stackHealthService{Name: name, Status: "down", Detail: target} +} + +func probeTCPService(name, target string) stackHealthService { + conn, err := net.DialTimeout("tcp", target, stackHealthProbeTimeout) + if err != nil { + return stackHealthService{Name: name, Status: "down", Detail: target} + } + _ = conn.Close() + return stackHealthService{Name: name, Status: "up", Detail: target} +} + +func probePostgresService(pool *pgxpool.Pool) stackHealthService { + if pool == nil { + return stackHealthService{Name: "Postgres", Status: "unknown", Detail: "pool unavailable"} + } + ctx, cancel := 
context.WithTimeout(context.Background(), stackHealthProbeTimeout) + defer cancel() + if err := pool.Ping(ctx); err != nil { + return stackHealthService{Name: "Postgres", Status: "down", Detail: "configured DSN unreachable"} + } + return stackHealthService{Name: "Postgres", Status: "up", Detail: "configured DSN reachable"} +} diff --git a/cmd/eval-runner/stack_health_test.go b/cmd/eval-runner/stack_health_test.go new file mode 100644 index 0000000..bfdd0ff --- /dev/null +++ b/cmd/eval-runner/stack_health_test.go @@ -0,0 +1,26 @@ +package main + +import ( + "encoding/json" + "net/http" + "net/http/httptest" + "testing" +) + +func TestStackHealthReturnsJSONWhenProbesFail(t *testing.T) { + req := httptest.NewRequest(http.MethodGet, "/stack-health", nil) + rec := httptest.NewRecorder() + + makeStackHealthHandler(stackHealthDeps{})(rec, req) + + if rec.Code != http.StatusOK { + t.Fatalf("status = %d, want 200", rec.Code) + } + var body stackHealthResponse + if err := json.NewDecoder(rec.Body).Decode(&body); err != nil { + t.Fatalf("Decode(response): %v", err) + } + if len(body.Services) == 0 { + t.Fatal("services is empty, want health rows") + } +} diff --git a/cmd/eval-runner/stream.go b/cmd/eval-runner/stream.go new file mode 100644 index 0000000..4f52d15 --- /dev/null +++ b/cmd/eval-runner/stream.go @@ -0,0 +1,93 @@ +package main + +import ( + "context" + "encoding/json" + "fmt" + "net/http" + "strings" +) + +type caseStartEvent struct { + Name string `json:"name"` + Index int `json:"index"` + Total int `json:"total"` +} + +type caseResultEvent struct { + Index int `json:"index"` + Total int `json:"total"` + Result CaseResult `json:"result"` +} + +type runnerFactory func(agentURL string) caseExecutor + +func writeSSE(w http.ResponseWriter, event string, payload any) error { + data, err := json.Marshal(payload) + if err != nil { + return err + } + if _, err := fmt.Fprintf(w, "event: %s\ndata: %s\n\n", event, data); err != nil { + return err + } + + flusher, ok := 
w.(http.Flusher) + if !ok { + return fmt.Errorf("streaming unsupported") + } + flusher.Flush() + return nil +} + +func streamEvalSuite(ctx context.Context, w http.ResponseWriter, runner caseExecutor, cases []EvalCase) { + results := make([]CaseResult, 0, len(cases)) + total := len(cases) + for index, testCase := range cases { + if err := writeSSE(w, "case_start", caseStartEvent{Name: testCase.Name, Index: index, Total: total}); err != nil { + return + } + result := runEvalCase(ctx, runner, testCase) + results = append(results, result) + if err := writeSSE(w, "case_result", caseResultEvent{Index: index, Total: total, Result: result}); err != nil { + return + } + } + _ = writeSSE(w, "summary", summarizeResults(results)) +} + +func makeCustomEvalStreamHandler(newRunner runnerFactory) http.HandlerFunc { + return func(w http.ResponseWriter, r *http.Request) { + var body struct { + Suite string `json:"suite"` + AgentURL string `json:"agent_url"` + } + if err := json.NewDecoder(r.Body).Decode(&body); err != nil { + http.Error(w, fmt.Sprintf("invalid request: %v", err), http.StatusBadRequest) + return + } + if strings.TrimSpace(body.AgentURL) == "" { + http.Error(w, "missing agent_url", http.StatusBadRequest) + return + } + if strings.TrimSpace(body.Suite) == "" { + http.Error(w, "missing suite", http.StatusBadRequest) + return + } + + suite, err := LoadSuiteFromReader(strings.NewReader(body.Suite)) + if err != nil { + http.Error(w, fmt.Sprintf("invalid suite: %v", err), http.StatusBadRequest) + return + } + + if _, ok := w.(http.Flusher); !ok { + http.Error(w, "streaming unsupported", http.StatusInternalServerError) + return + } + + w.Header().Set("Content-Type", "text/event-stream") + w.Header().Set("Cache-Control", "no-cache") + + streamEvalSuite(r.Context(), w, newRunner(body.AgentURL), suite.Cases) + } +} diff --git a/cmd/eval-runner/types.go b/cmd/eval-runner/types.go index 8782278..327b305 100644 --- a/cmd/eval-runner/types.go +++ b/cmd/eval-runner/types.go @@ -16,9 
+16,9 @@ type EvalSuite struct { } type TraceRow struct { - ToolName string - Decision string - Arguments json.RawMessage + ToolName string `json:"tool_name"` + Decision string `json:"decision"` + Arguments json.RawMessage `json:"arguments,omitempty"` } type CheckFailure struct { @@ -30,5 +30,6 @@ type CheckFailure struct { type CaseResult struct { Name string `json:"name"` Passed bool `json:"passed"` - Failures []CheckFailure `json:"failures"` + Failures []CheckFailure `json:"failures,omitempty"` + Trace []TraceRow `json:"trace,omitempty"` } diff --git a/cmd/eval-runner/ui.html b/cmd/eval-runner/ui.html index 5361e7e..6bf1e7b 100644 --- a/cmd/eval-runner/ui.html +++ b/cmd/eval-runner/ui.html @@ -3,160 +3,630 @@
-Base URL of the agent to evaluate (must expose a /trigger endpoint).
Operator Surface
++ Choose a prepared scenario, stream case results as they complete, and inspect the audit trail behind each verdict. +
| Case | -Status | -Failures | -
|---|
Preset runs are server-owned. Custom YAML is a separate mode.
+Required for YAML-backed presets and Custom YAML mode.
+Used by the Retry Storm preset.
+Expected outages should be visible before you run a scenario.
+Preset plans are read-only. Switch to Custom YAML to edit.
+Rows update as streamed case events arrive.
+