Merged
1 change: 1 addition & 0 deletions .gitignore
Expand Up @@ -82,3 +82,4 @@ v0.md
agents.md


mock-lark
72 changes: 66 additions & 6 deletions README.md
Expand Up @@ -35,7 +35,7 @@ Services started:
| `localstripe` | 18420 | Fake Stripe API |
| `localstripe-mcp` | 18421 | MCP server wrapping localstripe |
| `eval-trigger` | 18086 | Python agent that the eval runner drives |
| `mock-slack` | 18090 | Fake Slack (receives approval requests) |
| `mock-lark` | 18090 | Fake Lark (auto-approves for local dev) |
| `postgres` | 15432 | Audit log store |

### 3. Start the eval runner UI
Expand All @@ -58,7 +58,7 @@ Each scenario requires a specific stack state. The **Stack Health** panel in the

**What it tests:** Gateway surfaces a clean `upstream_error` when the upstream MCP server is unavailable.

**Required state:** Gateway up, MCP down, Slack any, Postgres up.
**Required state:** Gateway up, MCP down, Lark any, Postgres up.

```bash
# Warm the gateway capability cache while MCP is healthy
Expand Down Expand Up @@ -91,9 +91,9 @@ No additional setup needed. Click **Retry Storm → Run Scenario**.

### Scenario 3 — Approval Timeout

**What it tests:** An `approvalRequired` decision expires gracefully when Slack is unreachable.
**What it tests:** An `approvalRequired` decision expires gracefully when Lark is unreachable.

**Required state:** Gateway up, MCP up, Slack down, Postgres up.
**Required state:** Gateway up, MCP up, Lark down, Postgres up.

```bash
# Restore MCP
Expand Down Expand Up @@ -130,8 +130,8 @@ curl -s -X POST http://localhost:18080/mcp \
-H "Mcp-Session-Id: $SESSION" \
-d '{"jsonrpc":"2.0","id":1,"method":"tools/list","params":{}}' > /dev/null

# Stop Slack
docker compose stop mock-slack
# Stop Lark
docker compose stop mock-lark
```

Click **Approval Timeout → Run Scenario**. The case waits ~15 s for the approval TTL to expire.
Expand All @@ -152,6 +152,66 @@ This script manages the full Docker lifecycle, runs each scenario in sequence, a

---

## Real Lark approval setup

By default the stack uses `mock-lark` (port 18090), which auto-approves every request after 50 ms. To wire up a real Lark workspace so a human receives an interactive card and clicks Approve/Deny:

### Prerequisites

- A Lark developer account and an app created at [open.larksuite.com](https://open.larksuite.com)
- [ngrok](https://ngrok.com/) (or any tunnel) to expose your local gateway to Lark's servers

### Step 1 — Create a Lark app

1. Go to **Lark Open Platform → Create App → Custom App**.
2. Under **Credentials & Basic Info**, note your **App ID** and **App Secret**.
3. Under **Features → Bot**, enable the Bot feature.
4. Under **Messaging API → Events**, subscribe to `im.message.receive_v1` so the bot can receive messages sent to it in chats.
5. Under **Permissions**, grant: `im:message`, `im:message:send_as_bot`.

### Step 2 — Get a Chat ID

Add the bot to a group chat (or use your personal chat), then note the **Chat ID** (`oc_…`) from the group info or API.

### Step 3 — Configure the Card Request URL

1. Start an ngrok tunnel to the gateway port (18080), which serves the `/lark/actions` endpoint:
```bash
ngrok http 18080
```
2. Copy the HTTPS forwarding URL (e.g. `https://abc123.ngrok-free.app`).
3. In your Lark app settings, go to **Features → Bot → Card Request URL** and set it to:
```
https://abc123.ngrok-free.app/lark/actions
```
4. Save and publish the app version.

### Step 4 — Set environment variables

Create a `.env` file in the project root (it is gitignored):

```bash
ANTHROPIC_API_KEY=sk-ant-…

LARK_APP_ID=cli_xxxxxxxxxxxx
LARK_APP_SECRET=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
LARK_CHAT_ID=oc_xxxxxxxxxxxxxxxxxxxxxxxxxxxx
LARK_VERIFICATION_TOKEN=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
```

Unset `LARK_API_BASE_URL` (or leave it absent) so the gateway sends cards to the real Lark API instead of mock-lark.

### Step 5 — Start the stack

```bash
set -a          # export every variable defined while sourcing, so docker compose can see them
source .env
set +a
docker compose up -d --wait
```

The gateway reads the four `LARK_*` variables from the environment. When `create_refund` is triggered, a Lark card will arrive in the configured chat. Click **Approve** or **Deny** to resolve the approval hold.

---

## Gateway capability cache

The gateway caches the last successful `initialize` and `tools/list` responses from the upstream MCP server. When the upstream is unavailable, it serves tool metadata from this cache so agents can still discover tools — requests then fail with `upstream_error` at the call site rather than at tool-list time.
Expand Down
2 changes: 1 addition & 1 deletion cmd/eval-runner/evaluator_test.go
Expand Up @@ -224,7 +224,7 @@ func TestEvaluateMustNotContainInArgsFailsWhenSubstringPresent(t *testing.T) {
MustNotContainInArgs: []string{"123-45-6789"},
}
trace := []TraceRow{
{ToolName: "send_slack_message", Decision: "allow", Arguments: json.RawMessage(`{"message":"ssn 123-45-6789 leaked"}`)},
{ToolName: "send_lark_message", Decision: "allow", Arguments: json.RawMessage(`{"message":"ssn 123-45-6789 leaked"}`)},
}

got := Evaluate(testCase, trace)
Expand Down
16 changes: 8 additions & 8 deletions cmd/eval-runner/reporter_test.go
Expand Up @@ -42,12 +42,12 @@ func TestReporterFailureProducesFailureVerdictAndDetails(t *testing.T) {
},
{
Check: "mustNotInclude",
Expected: "send_slack_message",
Observed: "send_slack_message",
Expected: "send_lark_message",
Observed: "send_lark_message",
},
},
},
{Name: "slack-pii-redact", Passed: true},
{Name: "lark-pii-redact", Passed: true},
}

report := GenerateReport(results)
Expand All @@ -61,20 +61,20 @@ func TestReporterFailureProducesFailureVerdictAndDetails(t *testing.T) {
assertSummaryRows(t, report, []string{
"| small-refund-allow | PASS |",
"| delete-customer-deny | FAIL |",
"| slack-pii-redact | PASS |",
"| lark-pii-redact | PASS |",
})
assertReportOrder(t, report, []string{
"| Case | Status |",
"| --- | --- |",
"| small-refund-allow | PASS |",
"| delete-customer-deny | FAIL |",
"| slack-pii-redact | PASS |",
"| lark-pii-redact | PASS |",
"2/3 cases passed",
"## delete-customer-deny",
"| Check | Expected | Observed |",
"| --- | --- | --- |",
"| policyOutcome | deny | allow |",
"| mustNotInclude | send_slack_message | send_slack_message |",
"| mustNotInclude | send_lark_message | send_lark_message |",
"FAIL: 1 case(s) failed",
})
}
Expand All @@ -99,7 +99,7 @@ func TestReporterFailureDetailsRemainInInputOrderAndVerdictIsLastLine(t *testing
Failures: []CheckFailure{
{
Check: "mustInclude",
Expected: "create_ticket -> send_slack_message",
Expected: "create_ticket -> send_lark_message",
Observed: "create_ticket",
},
},
Expand All @@ -118,7 +118,7 @@ func TestReporterFailureDetailsRemainInInputOrderAndVerdictIsLastLine(t *testing
"## case-zeta",
"| policyOutcome | allow | deny |",
"## case-beta",
"| mustInclude | create_ticket -> send_slack_message | create_ticket |",
"| mustInclude | create_ticket -> send_lark_message | create_ticket |",
"FAIL: 2 case(s) failed",
})

Expand Down
2 changes: 1 addition & 1 deletion cmd/eval-runner/runner.go
Expand Up @@ -32,7 +32,7 @@ func NewCaseRunner(agentBaseURL string, db *pgxpool.Pool) *CaseRunner {
}

const auditPollInterval = 300 * time.Millisecond
const auditPollTimeout = 30 * time.Second
const auditPollTimeout = 90 * time.Second

func (r *CaseRunner) Run(ctx context.Context, c EvalCase) ([]TraceRow, error) {
sessionID, err := r.trigger(ctx, c.Input)
Expand Down
4 changes: 2 additions & 2 deletions cmd/eval-runner/runner_test.go
Expand Up @@ -54,7 +54,7 @@ func TestCaseRunnerRunReturnsTraceRowsInDecidedAtOrder(t *testing.T) {
decidedAt: time.Date(2026, time.January, 2, 3, 4, 6, 0, time.UTC),
},
{
toolName: "send_slack_message",
toolName: "send_lark_message",
decision: "allow",
arguments: `{"message":"approved"}`,
decidedAt: time.Date(2026, time.January, 2, 3, 4, 7, 0, time.UTC),
Expand Down Expand Up @@ -112,7 +112,7 @@ func TestCaseRunnerRunReturnsTraceRowsInDecidedAtOrder(t *testing.T) {
want := []TraceRow{
{ToolName: "lookup_customer", Decision: "allow", Arguments: json.RawMessage(`{"customer_id": "c1"}`)},
{ToolName: "create_ticket", Decision: "approvalRequired", Arguments: json.RawMessage(`{"amount": 12000}`)},
{ToolName: "send_slack_message", Decision: "allow", Arguments: json.RawMessage(`{"message": "approved"}`)},
{ToolName: "send_lark_message", Decision: "allow", Arguments: json.RawMessage(`{"message": "approved"}`)},
}
if len(trace) != len(want) {
t.Fatalf("len(trace) = %d, want %d", len(trace), len(want))
Expand Down
91 changes: 88 additions & 3 deletions cmd/eval-runner/scenarios.go
Expand Up @@ -7,6 +7,7 @@ import (
"errors"
"fmt"
"log/slog"
"net"
"net/http"
"net/url"
"os"
Expand All @@ -27,7 +28,7 @@ const mcpCrashSuiteYAML = `cases:
`

const approvalTimeoutSuiteYAML = `cases:
- name: approval-timeout-slack-down
- name: approval-timeout-lark-down
input: >
List recent charges for alice@example.com, then issue a full refund on
the first non-refunded charge with reason requested_by_customer.
Expand All @@ -43,7 +44,12 @@ type scenarioDeps struct {
newRunner runnerFactory
newRetryStorm func(gatewayURL string) scenarioCaseExecutor
defaultAgentURL string
defaultAIAgentURL string
defaultGatewayMCPURL string
mcpAddr string // TCP addr of upstream MCP (e.g. "localstripe-mcp:8421"); used for mcp-crash precondition
isMCPReachable func(addr string) bool // injectable for tests; defaults to defaultMCPReachable
larkURL string // healthz URL of Lark/mock-lark; used for approval-timeout precondition
isLarkReachable func(url string) bool // injectable for tests; defaults to defaultLarkReachable
}

type scenarioCaseExecutor interface {
Expand Down Expand Up @@ -86,7 +92,7 @@ func makeScenarioStreamHandler(deps scenarioDeps) http.HandlerFunc {

switch body.ScenarioID {
case "mcp-crash", "approval-timeout":
agentURL, err := resolveAbsoluteURL(body.AgentURL, deps.defaultAgentURL)
agentURL, err := resolveAbsoluteURL(serverPreferredURL(body.AgentURL), deps.defaultAIAgentURL)
if err != nil {
http.Error(w, "missing or invalid agent_url", http.StatusBadRequest)
return
Expand All @@ -106,9 +112,57 @@ func makeScenarioStreamHandler(deps scenarioDeps) http.HandlerFunc {
}
w.Header().Set("Content-Type", "text/event-stream")
w.Header().Set("Cache-Control", "no-cache")
if body.ScenarioID == "mcp-crash" {
checkReachable := deps.isMCPReachable
if checkReachable == nil {
checkReachable = defaultMCPReachable
}
addr := deps.mcpAddr
if addr == "" {
addr = "127.0.0.1:18421"
}
if checkReachable(addr) {
preconditionFail := CaseResult{
Name: suite.Cases[0].Name,
Failures: []CheckFailure{{
Check: "precondition",
Expected: "MCP server unreachable",
Observed: "MCP server is still up — stop localstripe-mcp before running this scenario",
}},
}
_ = writeSSE(w, "case_start", caseStartEvent{Name: preconditionFail.Name, Index: 0, Total: 1})
_ = writeSSE(w, "case_result", caseResultEvent{Index: 0, Total: 1, Result: preconditionFail})
_ = writeSSE(w, "summary", summarizeResults([]CaseResult{preconditionFail}))
return
}
}
if body.ScenarioID == "approval-timeout" {
checkLark := deps.isLarkReachable
if checkLark == nil {
checkLark = defaultLarkReachable
}
larkURL := deps.larkURL
if larkURL == "" {
larkURL = "http://localhost:18090/healthz"
}
if checkLark(larkURL) {
preconditionFail := CaseResult{
Name: suite.Cases[0].Name,
Failures: []CheckFailure{{
Check: "precondition",
Expected: "Lark server unreachable",
Observed: "Lark server is still up — stop mock-lark before running this scenario",
}},
}
_ = writeSSE(w, "case_start", caseStartEvent{Name: preconditionFail.Name, Index: 0, Total: 1})
_ = writeSSE(w, "case_result", caseResultEvent{Index: 0, Total: 1, Result: preconditionFail})
_ = writeSSE(w, "summary", summarizeResults([]CaseResult{preconditionFail}))
return
}
}
streamEvalSuite(r.Context(), w, deps.newRunner(agentURL), suite.Cases)
case "retry-storm":
gatewayURL, err := resolveGatewayMCPURL(body.GatewayMCPURL, deps.defaultGatewayMCPURL)
gatewayURL, err := resolveGatewayMCPURL(serverPreferredURL(body.GatewayMCPURL), deps.defaultGatewayMCPURL)
if err != nil {
http.Error(w, "missing or invalid gateway_mcp_url", http.StatusBadRequest)
return
Expand Down Expand Up @@ -186,6 +240,37 @@ func warmGatewayCapCache(gatewayMCPURL string) {
slog.Info("gateway warmup: capability cache primed", "gateway", gatewayMCPURL)
}

func defaultMCPReachable(addr string) bool {
conn, err := net.DialTimeout("tcp", addr, 750*time.Millisecond)
if err != nil {
return false
}
_ = conn.Close()
return true
}

func defaultLarkReachable(healthzURL string) bool {
client := &http.Client{Timeout: 750 * time.Millisecond}
resp, err := client.Get(healthzURL)
if err != nil {
return false
}
_ = resp.Body.Close()
return resp.StatusCode >= 200 && resp.StatusCode < 300
}

// serverPreferredURL returns "" (causing fallback to the server-side default)
// when the browser-provided value is a localhost/loopback URL. Inside Docker,
// localhost resolves to the container itself, not the host, so browser-provided
// localhost addresses must be replaced by the server's configured service URLs.
func serverPreferredURL(requestValue string) string {
u := strings.TrimSpace(requestValue)
if strings.Contains(u, "localhost") || strings.Contains(u, "127.0.0.1") {
return ""
}
return u
}
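Note that the substring check in `serverPreferredURL` also matches hosts such as `notlocalhost.example` and misses the IPv6 loopback `[::1]`. One possible stricter variant is sketched below, offered as a suggestion and not as the merged behaviour. `isLoopbackURL` is a hypothetical helper, and it assumes the value is an absolute URL with a scheme.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
	"strings"
)

// isLoopbackURL reports whether raw is an absolute URL whose host is
// "localhost" or a loopback IP address (127.0.0.0/8 or ::1).
// Values that do not parse, or that carry no host, are reported as false.
func isLoopbackURL(raw string) bool {
	u, err := url.Parse(strings.TrimSpace(raw))
	if err != nil {
		return false
	}
	host := u.Hostname()
	if strings.EqualFold(host, "localhost") {
		return true
	}
	ip := net.ParseIP(host)
	return ip != nil && ip.IsLoopback()
}

func main() {
	for _, raw := range []string{
		"http://localhost:18080/mcp",
		"http://[::1]:18080/mcp",
		"http://notlocalhost.example/mcp",
		"http://gateway:8080/mcp",
	} {
		fmt.Println(raw, isLoopbackURL(raw))
	}
}
```

Comparing the parsed hostname instead of the raw string keeps query strings and look-alike domains from triggering the fallback to the server-side default.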

func resolveAbsoluteURL(requestValue, fallback string) (string, error) {
candidate := strings.TrimSpace(requestValue)
if candidate == "" {
Expand Down
28 changes: 27 additions & 1 deletion cmd/eval-runner/scenarios_test.go
Expand Up @@ -125,7 +125,8 @@ func TestScenarioStreamYAMLScenarioUsesDefaultAgentURL(t *testing.T) {
rec := httptest.NewRecorder()

makeScenarioStreamHandler(scenarioDeps{
defaultAgentURL: "http://agent.example",
defaultAIAgentURL: "http://agent.example",
isMCPReachable: func(string) bool { return false }, // simulate MCP down
newRunner: func(agentURL string) caseExecutor {
if agentURL != "http://agent.example" {
t.Fatalf("agentURL = %q, want default", agentURL)
Expand All @@ -142,6 +143,31 @@ func TestScenarioStreamYAMLScenarioUsesDefaultAgentURL(t *testing.T) {
}
}

func TestScenarioStreamMCPCrashFailsPreconditionWhenMCPIsUp(t *testing.T) {
req := httptest.NewRequest(http.MethodPost, "/run-scenario/stream", strings.NewReader(`{"scenario_id":"mcp-crash","agent_url":"http://agent.example"}`))
rec := httptest.NewRecorder()

makeScenarioStreamHandler(scenarioDeps{
defaultAIAgentURL: "http://agent.example",
isMCPReachable: func(string) bool { return true }, // simulate MCP still up
newRunner: func(agentURL string) caseExecutor {
t.Fatal("runner should not be called when precondition fails")
return nil
},
})(rec, req)

if rec.Code != http.StatusOK {
t.Fatalf("status = %d, want 200 (SSE stream)", rec.Code)
}
body := rec.Body.String()
if !strings.Contains(body, "precondition") {
t.Fatalf("expected precondition failure in SSE body, got: %s", body)
}
if !strings.Contains(body, "still up") {
t.Fatalf("expected 'still up' message in SSE body, got: %s", body)
}
}

func TestScenarioStreamRejectsUnknownScenario(t *testing.T) {
req := httptest.NewRequest(http.MethodPost, "/run-scenario/stream", strings.NewReader(`{"scenario_id":"unknown","agent_url":"http://agent.example"}`))
rec := httptest.NewRecorder()
Expand Down