Skip to content

performance-test-mcp rather than just assuming the mcp is going to be...#100

Open
brkastner wants to merge 10 commits into
mainfrom
plan/performance-test-mcp
Open

performance-test-mcp rather than just assuming the mcp is going to be...#100
brkastner wants to merge 10 commits into
mainfrom
plan/performance-test-mcp

Conversation

@brkastner
Copy link
Copy Markdown
Contributor

summary

  • description: performance-test-mcp

rather than just assuming the mcp is going to be quicker than letting the agent use standard bash calls, yeah it's great for consistency, but I would like to actually benchmark how fast or slow mcp approach is for read file calls and grep calls. to make sure that while maybe a little bit of extra latency is okay, in the interest of hardening what agents do, as far as bash commands go specifically, and forcing them into newer models and having caching in front and code aware searching. that's worth a little latency, but I would like to actually do a full benchmark on both gpt and claude to see just how well or poorly the mcp server performs for these tasks compared to traditional tooling used by harnesses

  • goal: Build a reproducible benchmark suite that measures kasmos MCP stdio tool latency (read_file, grep, find_files) against direct function calls (simulating Claude Code built-in tools) and raw Bash subprocess calls, to determine whether the MCP-first policy's latency cost is acceptable relative to its caching, consistency, and code-awareness benefits.
  • architecture: A Go benchmark package (internal/mcpserver/bench/) using go test -bench that starts a real kas mcp subprocess via stdio transport, exercises tools across file sizes and search patterns, and compares against direct function calls (representing built-in tool overhead) and shell-command baselines (representing Bash tool overhead). A KAS_MCP_NOCACHE=1 toggle isolates application-level cache benefit. Results are emitted as structured JSON with per-operation, per-arm latency percentiles and overhead multipliers. A supplementary shell script and markdown guide provide empirical validation guidance for a live Claude Code session.
  • tech stack: Go testing.B benchmarks, mcp-go client with stdio transport, ristretto cache (toggled via KAS_MCP_NOCACHE), synthetic runtime-generated fixtures, rg/fd CLI tools
  • review cycle: 1

tasks

  • 1. benchmark harness, runtime fixtures, and report plumbing
  • 2. MCP stdio benchmark arm
  • 3. direct and Bash baseline arms
  • 4. live-session validation kit

reviewer notes

Approved. Round 2 — both Round 1 blocking findings resolved.

verified fixes

  • mcp_bench_test.go — private sampleCollector removed; MCP arm now calls addBenchSample(sc.key, "mcp_cold"/"mcp_warm", ...) into the shared package-level map. flushBenchReport() called at end of BenchmarkMCP. Cross-arm JSON report now correct.
  • report.go:BuildReport — overhead ratios now use mcp_warm P50 as the MCP numerator (falls back to legacy "mcp" arm for backward compat). MCPvsDirect and MCPvsBash produce meaningful values.
  • report_test.go — test data updated to mcp_cold/mcp_warm; legacy_mcp_arm_fallback table case added.

self-fixed (no coder action needed)

  • mcp_bench_test.go:107 — stale doc comment referenced sampleCollector after removal; updated to name addBenchSample/benchSamples.

acceptance criteria

  • scope: pass
  • no_scope_creep: pass
  • requirements_met: pass
  • wave_isolation_ok: pass
  • plan_goal_achieved: pass

verification

go test ./internal/mcpserver/bench/... -run 'TestSmoke_StdioClientListsRequiredTools|TestBuildReport' -v — 16 tests PASS
gofmt -l ./internal/mcpserver/bench/ — no output
go vet ./internal/mcpserver/bench/... — no output

changes

.agents/skills/kasmos-architect/SKILL.md
.agents/skills/kasmos-coder/SKILL.md
.agents/skills/kasmos-fixer/SKILL.md
.agents/skills/kasmos-lifecycle/SKILL.md
.agents/skills/kasmos-master/SKILL.md
.agents/skills/kasmos-planner/SKILL.md
.agents/skills/kasmos-reviewer/SKILL.md
.claude/agents/coder.md
.claude/agents/reviewer.md
internal/mcpserver/bench/SESSION_BENCH.md
internal/mcpserver/bench/baseline_bench_test.go
internal/mcpserver/bench/helpers_test.go
internal/mcpserver/bench/mcp_bench_test.go
internal/mcpserver/bench/report.go
internal/mcpserver/bench/report_test.go
internal/mcpserver/bench/session_bench.sh
internal/mcpserver/bench/smoke_test.go

commits

a2c22a7 fix: wire MCP arm to shared collector, update BuildReport to use mcp_warm for ratios (reviewer self-fix)
37cd0ce fix: gofmt alignment, jq field names, and doc key format (reviewer self-fix)
4393ec2 feat(task-4): live-session validation kit (session_bench.sh + SESSION_BENCH.md)
46d570e feat(task-3): direct and Bash baseline arms
1537ba9 feat(task-2): MCP stdio benchmark arm with cold/warm sub-benchmarks
bd4e4d6 feat(task-1): benchmark harness, runtime fixtures, and report plumbing

stats

.agents/skills/kasmos-architect/SKILL.md | 16 +-
.agents/skills/kasmos-coder/SKILL.md | 6 +-
.agents/skills/kasmos-fixer/SKILL.md | 8 +-
.agents/skills/kasmos-lifecycle/SKILL.md | 4 +-
.agents/skills/kasmos-master/SKILL.md | 4 +-
.agents/skills/kasmos-planner/SKILL.md | 15 +-
.agents/skills/kasmos-reviewer/SKILL.md | 6 +-
.claude/agents/coder.md | 2 +-
.claude/agents/reviewer.md | 2 +-
internal/mcpserver/bench/SESSION_BENCH.md | 161 ++++++++++
internal/mcpserver/bench/baseline_bench_test.go | 392 ++++++++++++++++++++++++
internal/mcpserver/bench/helpers_test.go | 261 ++++++++++++++++
internal/mcpserver/bench/mcp_bench_test.go | 174 +++++++++++
internal/mcpserver/bench/report.go | 161 ++++++++++
internal/mcpserver/bench/report_test.go | 205 +++++++++++++
internal/mcpserver/bench/session_bench.sh | 135 ++++++++
internal/mcpserver/bench/smoke_test.go | 34 ++
17 files changed, 1555 insertions(+), 31 deletions(-)

- helpers_test.go: scenario catalog (8 scenarios), TestMain builds kas binary,
  generates deterministic fixture tree, verifies rg/fd in PATH; newMCPStdioClient
  uses transport.NewStdioWithOptions + WithCommandFunc with fixtureRoot as cmd.Dir
- report.go: LatencyStats/ArmStats/OperationReport/BenchmarkReport structs,
  ComputeLatencyStats (min/p50/p95/p99/max/mean), Multiplier with zero-denom guard,
  BuildReport with stable arm ordering, WriteReport following orchestration/cache.go style
- report_test.go: table-driven tests covering percentile math, multiplier math,
  zero-denominator handling, arm ordering, and round-trip JSON serialization
- smoke_test.go: TestSmoke_StdioClientListsRequiredTools asserts read_file/grep/find_files
Implements BenchmarkMCP with 16 sub-benchmarks (8 scenarios × 2 arms)
covering read_file, grep, and find_files tools in both cold (nocache)
and warm (cache-primed) configurations.

- buildMCPRequest maps scenario fields to exact handler arg names
- cold arm: nocache=true, validates once before timed loop
- warm arm: prime call fills ristretto cache before b.ResetTimer()
- sampleCollector accumulates ns samples and writes BenchmarkReport JSON
  to KAS_MCP_BENCH_REPORT on flush via b.Cleanup
- validateResult checks IsError, non-empty content, BENCH_MARKER_UNIQUE
  for grep_narrow

Verification: KAS_MCP_BENCH_REPORT=/tmp/performance-test-mcp.json go test
./internal/mcpserver/bench -run '^$' -bench '^BenchmarkMCP$'
-benchtime=1x -count=1 → PASS (4.7s, all 16 sub-benchmarks)
- baseline_bench_test.go: shared cross-arm collector (benchSamples, addBenchSample,
  flushBenchReport) using Task-1 BuildReport/WriteReport API; Task 2 (mcp arm) adds
  arm 'mcp' to the same map via addBenchSample
- BenchmarkDirect: directReadFileWindow mirrors fstools/read.go:37-75;
  parseRgJSONBench mirrors fstools/grep.go:58-97 with MaxGrepMatches cap;
  parseFDOutputBench mirrors fstools/find.go:16-31 with MaxFindResults cap;
  rg/fd invoked directly without shell for no-transport baseline
- BenchmarkBash: sh -lc wrappers around cat -n (read_file), rg (grep), fd (find_files)
  to measure shell spawn + command-parsing overhead; arm label 'bash'
- Both arms use same scenario keys as helpers_test.go for report join by key
- Fail-fast checks ensure no scenario silently returns empty output
- verify: KAS_MCP_BENCH_REPORT=/tmp/performance-test-mcp.json go test
  ./internal/mcpserver/bench -run '^$' -bench '^(BenchmarkDirect|BenchmarkBash)$'
  -benchtime=1x -count=1 — all 16 sub-benchmarks PASS
…warm for ratios (reviewer self-fix)

- Remove private sampleCollector from mcp_bench_test.go; MCP arm now calls
  addBenchSample(key, "mcp_cold"/"mcp_warm", ...) into the package-level map
- Add flushBenchReport() call at the end of BenchmarkMCP so all three arms
  flush the same shared state
- Update BuildReport to search for "mcp_warm" as primary MCP numerator,
  with fallback to "mcp" for backward compatibility
- Update report_test.go to use mcp_cold/mcp_warm arm names; add
  legacy_mcp_arm_fallback table case
- Fix stale "sampleCollector" reference in BenchmarkMCP doc comment
Copilot AI review requested due to automatic review settings April 8, 2026 14:36
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a reproducible benchmark suite under internal/mcpserver/bench/ to measure latency overhead of kasmos MCP stdio tools (read_file, grep, find_files) vs direct in-process baselines and shell-command baselines, plus a live-session validation kit.

Changes:

  • Introduces Go benchmark harness + JSON reporting for per-scenario latency percentiles and overhead multipliers.
  • Adds MCP stdio benchmark arm (cold/warm), plus direct and sh -lc baseline arms and a smoke test for required tools.
  • Adds a shell script + markdown guide to time comparable operations in a live Claude Code session.

Reviewed changes

Copilot reviewed 22 out of 23 changed files in this pull request and generated 14 comments.

Show a summary per file
File Description
.agents/skills/kasmos-architect/SKILL.md Updates architect skill guidance (MCP task/signal usage).
.agents/skills/kasmos-coder/SKILL.md Updates coder skill guidance (task retrieval / transitions).
.agents/skills/kasmos-fixer/SKILL.md Updates fixer skill guidance (task tooling / lifecycle ops).
.agents/skills/kasmos-lifecycle/SKILL.md Updates lifecycle overview guidance for task/signal tools.
.agents/skills/kasmos-master/SKILL.md Updates master reviewer workflow guidance.
.agents/skills/kasmos-planner/SKILL.md Updates planner storage/signaling guidance.
.agents/skills/kasmos-reviewer/SKILL.md Updates reviewer signaling guidance.
.claude/agents/coder.md Adjusts Claude “coder” agent frontmatter/config.
.claude/agents/reviewer.md Adjusts Claude “reviewer” agent frontmatter/config.
internal/mcpserver/bench/SESSION_BENCH.md Manual live-session benchmark procedure and JSON comparison guide.
internal/mcpserver/bench/baseline_bench_test.go Direct + bash baseline benchmark arms and shared sample collection/report flush.
internal/mcpserver/bench/helpers_test.go TestMain harness, fixture generation, and MCP stdio client helper.
internal/mcpserver/bench/mcp_bench_test.go MCP stdio benchmark arm (cold/warm) with lightweight result validation.
internal/mcpserver/bench/report.go Latency stats computation, overhead ratio logic, and JSON report writer.
internal/mcpserver/bench/report_test.go Unit tests for stats, ordering, ratios, and report serialization.
internal/mcpserver/bench/session_bench.sh Shell-side timing harness + optional Go benchmark invocation and jq summary.
internal/mcpserver/bench/smoke_test.go Smoke test that kas mcp exposes required tools.
Comments suppressed due to low confidence (1)

.agents/skills/kasmos-coder/SKILL.md:103

  • The managed-mode instructions now omit project: "$KASMOS_PROJECT" from task_show. In multi-repo mode the MCP tools require a project selection (see internal/mcpserver/tasktools/tasktools.go:410-420), and this change also creates scaffold drift vs internal/initcmd/scaffold/templates/skills/kasmos-coder/SKILL.md. Please keep the runtime skill and scaffold template synchronized and preserve the project arg guidance.
You execute the full plan sequentially, wave by wave:

1. Use MCP `task_show` (filename: "<plan-file>") to read the stored plan. Check its `**Size**` field to determine wave structure.
2. Create a todo list with one item per task (all `pending`), then begin.
3. Execute one wave at a time: implement each task (TDD), commit, then proceed to next task.
4. After each wave completes, self-review before starting the next wave.
5. When all waves complete, write the implement-finished sentinel (see **Signaling**).

Comment thread .claude/agents/coder.md
Comment on lines 1 to 5
---
name: coder
description: Implementation agent for writing and modifying code
model: claude-sonnet-4-6
model:
---
Copy link

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

model: is now empty, which makes this agent config invalid and also introduces scaffold drift vs internal/initcmd/scaffold/templates/claude/agents/coder.md (uses model: {{MODEL}}). This will likely break scaffold_sync expectations and any harness that requires an explicit model; restore a concrete model or the templated {{MODEL}} form consistently across runtime + scaffold copies.

Copilot generated this review using guidance from repository custom instructions.
Comment on lines 1 to 5
---
name: reviewer
description: Code review agent for quality and spec compliance
model: claude-sonnet-4-6
model:
---
Copy link

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

model: is now empty, which makes this agent config invalid and introduces scaffold drift vs internal/initcmd/scaffold/templates/claude/agents/reviewer.md (uses model: {{MODEL}}). Please restore a concrete model or keep the templated {{MODEL}} form and ensure the scaffold/template copy is updated in lockstep.

Copilot generated this review using guidance from repository custom instructions.
Comment on lines +73 to +82
- updated plan markdown written with:
use MCP `task_update_content` (filename: "<plan-file>", content: "<full enriched plan>", project: "$KASMOS_PROJECT") to persist the rewritten plan.
use MCP `task_update_content` (filename: "<plan-file>", content: "<full enriched plan>") to persist the rewritten plan.
- metadata JSON written to `.kasmos/cache/<plan-file>-architect.json` using the raw plan filename slug (for example `.kasmos/cache/skill-prompt-rewrites-architect.json`)

### required commands

1. use MCP `task_show` (filename: "<plan-file>", project: "$KASMOS_PROJECT") to read the latest plan
2. use MCP `task_update_content` (filename: "<plan-file>", content: "<full enriched plan>", project: "$KASMOS_PROJECT") to persist the rewritten plan.
1. use MCP `task_show` (filename: "<plan-file>") to read the latest plan
2. use MCP `task_update_content` (filename: "<plan-file>", content: "<full enriched plan>") to persist the rewritten plan.
3. mkdir -p .kasmos/cache
4. use MCP `signal_create` (signal_type: "elaborator-finished", plan_file: "<plan-file>", project: "$KASMOS_PROJECT") after the round-trip check succeeds.
4. use MCP `signal_create` (signal_type: "elaborator-finished", plan_file: "<plan-file>") after the round-trip check succeeds.
Copy link

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These skill instructions removed the project: "$KASMOS_PROJECT" argument from MCP task/signal calls, but the MCP server still supports/needs project in multi-repo mode (e.g. internal/mcpserver/tasktools/tasktools.go:416-420, internal/mcpserver/signaltools/signaltools.go:82-88). This also creates scaffold drift vs internal/initcmd/scaffold/templates/skills/kasmos-architect/SKILL.md; please keep runtime skill + scaffold template in sync and retain/describe the project arg where required.

Copilot generated this review using guidance from repository custom instructions.
Comment on lines 290 to 321
For each ghost entry:
1. Show: plan name, status, branch
2. Confirm: "force-set ghost plan `<name>` to cancelled?"
3. Use MCP `task_transition` (`filename: "<name>"`, `event: "cancel"`, `project: "$KASMOS_PROJECT"`) when a valid
3. Use MCP `task_transition` (`filename: "<name>"`, `event: "cancel"`) when a valid
lifecycle event exists; reserve force-style overrides for confirmed recovery work only.

---

## Available Task-Store Tools

Use MCP task-store tools for plan/task reads and lifecycle mutations. Keep `kas`, not `kq`,
for execution paths that genuinely require the CLI (for example `kas task implement`).

### `task_list`

Use MCP `task_list` to inspect task state. It returns status, branch, and topic data and supports
the same lifecycle filters used during triage and cleanup. Always include `project: "$KASMOS_PROJECT"`.
the same lifecycle filters used during triage and cleanup.

- use MCP `task_list` with no status filter for all tasks
- use MCP `task_list` with status `implementing` for active implementation plans
- use MCP `task_list` with status `ready` for plans waiting to start

### `task_show`

Use MCP `task_show` (`filename: "my-plan.md"`, `project: "$KASMOS_PROJECT"`) to read the stored plan markdown from the task
Use MCP `task_show` (`filename: "my-plan.md"`) to read the stored plan markdown from the task
store without relying on disk copies.

### `task_transition`

Use MCP `task_transition` (`filename: "2026-02-27-my-plan.md"`,
`event: "review_approved"`, `project: "$KASMOS_PROJECT"`) to apply named FSM events. Preferred over force-style recovery
`event: "review_approved"`) to apply named FSM events. Preferred over force-style recovery
overrides whenever a valid event exists.
Copy link

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This section drops the project: "$KASMOS_PROJECT" guidance for task_transition / task_list / task_show, but the MCP server exposes project as the routing knob and requires it in multi-repo mode (internal/mcpserver/tasktools/tasktools.go:410-451). This also diverges from internal/initcmd/scaffold/templates/skills/kasmos-fixer/SKILL.md; please update both copies together and retain the project argument guidance for correctness.

Copilot generated this review using guidance from repository custom instructions.
Comment on lines 24 to 45
@@ -41,7 +41,7 @@ Examples:
3. The sentinel file is consumed (deleted) after processing — do not rely on it persisting
4. Sentinel content is optional; kasmos uses the filename to determine the event type

**Emitting a signal (agent side):** use MCP `signal_create` (including `project: "$KASMOS_PROJECT"`) to emit the equivalent
**Emitting a signal (agent side):** use MCP `signal_create` to emit the equivalent
`planner-finished-2026-02-27-feature.md` signal as the last action before yielding control.
Copy link

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The lifecycle skill now states task_show and signal_create without the project argument. However, both task tools and signal tools accept project and require it in multi-repo mode (internal/mcpserver/tasktools/tasktools.go:416-420, internal/mcpserver/signaltools/signaltools.go:82-88). This change also creates scaffold drift vs internal/initcmd/scaffold/templates/skills/kasmos-lifecycle/SKILL.md; please keep runtime + scaffold copies in sync and restore the project guidance.

Copilot generated this review using guidance from repository custom instructions.
jq -r '
.operations[] |
" \(.key)\n" +
" mcp p50=\(.arms[] | select(.arm=="mcp") | .latency.p50_ns / 1e6 | round)ms" +
Copy link

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The jq summary is still selecting .arms[] | select(.arm=="mcp"), but the report now emits mcp_cold / mcp_warm arms (see report.go and mcp_bench_test.go). As written, this prints nothing / errors for MCP. Update the selector (likely to mcp_warm, with fallback to legacy mcp if you want backward compat).

Suggested change
" mcp p50=\(.arms[] | select(.arm=="mcp") | .latency.p50_ns / 1e6 | round)ms" +
" mcp p50=\(([.arms[] | select(.arm=="mcp_warm" or .arm=="mcp") | .latency.p50_ns][0] / 1e6) | round)ms" +

Copilot uses AI. Check for mistakes.
Comment on lines +56 to +62
for i in $(seq 1 "$runs"); do
local t0 t1 elapsed
t0=$(date +%s%N)
"$@" > /dev/null 2>&1
t1=$(date +%s%N)
elapsed=$(( (t1 - t0) / 1000000 )) # ns → ms
(( elapsed < min )) && min=$elapsed
Copy link

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

date +%s%N is GNU-date specific; on macOS/BSD date typically prints a literal N, which will break the arithmetic expansion under set -euo pipefail. If this script is expected to run on macOS, switch to a portable nanosecond source (e.g. python -c 'import time; print(time.time_ns())') or detect GNU gdate/fallback to millisecond timing.

Copilot uses AI. Check for mistakes.
Comment on lines +113 to +123
Key fields to compare:

| JSON field | meaning |
|------------|---------|
| `.operations[].key` | scenario identifier, e.g. `read_small`, `grep_narrow` |
| `.arms[] where .arm=="mcp" | .latency.p50_ns` | MCP p50 latency (ns) |
| `.arms[] where .arm=="direct" | .latency.p50_ns` | direct function-call p50 |
| `.arms[] where .arm=="bash" | .latency.p50_ns` | shell-subprocess p50 |
| `.mcp_vs_direct` | multiplier: MCP p50 / direct p50 |
| `.mcp_vs_bash` | multiplier: MCP p50 / bash p50 |

Copy link

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doc still references a single MCP arm (.arm=="mcp"), but the JSON report uses mcp_cold / mcp_warm. Please update the field references (and the example jq in Step 4) to use the new arm names and clarify which one the overhead ratios are based on.

Copilot uses AI. Check for mistakes.
Comment on lines +130 to +137
## Step 4 — re-run with cache disabled (optional)

To isolate the ristretto cache benefit, re-run the Go suite with:

```sh
KAS_MCP_NOCACHE=1 KAS_MCP_BENCH_REPORT=/tmp/report_nocache.json \
go test ./internal/mcpserver/bench/... -run '^$' -bench . -benchtime=1x
```
Copy link

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Step 4 suggests re-running with KAS_MCP_NOCACHE=1, but (with the current stdio client env construction) that environment variable will also leak into the “warm” MCP arm unless it is explicitly unset/overridden (see helpers_test.go:newMCPStdioClient). Given you already have explicit mcp_cold vs mcp_warm arms, consider removing or rewording this step, or ensure the warm arm forcibly enables caching even when the parent env sets KAS_MCP_NOCACHE=1.

Copilot uses AI. Check for mistakes.
Comment on lines +163 to +165
// Every file contains:
// - exactly one BENCH_MARKER_UNIQUE occurrence (in the first generated file only
// if unique=true, otherwise in the package declaration comment)
Copy link

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment above generateGoFile says files without includeUniqueMarker still contain BENCH_MARKER_UNIQUE “in the package declaration comment”, but generateGoFile only emits that marker when includeUniqueMarker==true. Please update the comment to match the actual fixture generation behavior (i.e., only the chosen file contains the unique marker).

Suggested change
// Every file contains:
// - exactly one BENCH_MARKER_UNIQUE occurrence (in the first generated file only
// if unique=true, otherwise in the package declaration comment)
// Generated files contain:
// - a BENCH_MARKER_UNIQUE occurrence only when includeUniqueMarker=true

Copilot uses AI. Check for mistakes.
…nch fixes

- Skip bench suite gracefully when rg/fd not in PATH (CI) instead of os.Exit(1)
- Restore project: "$KASMOS_PROJECT" in all skill MCP examples (reverted by PR branch)
- Restore model fields in .claude/agents/coder.md and reviewer.md
- Fix session_bench.sh: portable nanosecond timestamps (macOS compat), jq arm name mcp→mcp_warm
- Fix SESSION_BENCH.md: reference mcp_cold/mcp_warm arms instead of legacy "mcp"
- Fix helpers_test.go: explicitly set KAS_MCP_NOCACHE=0 to prevent env leak, fix stale comment
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 14 changed files in this pull request and generated 6 comments.

Comment on lines +347 to +352
var shellCmd string
if sc.glob != "" {
shellCmd = fmt.Sprintf("rg --no-messages -g %q %q %q", sc.glob, sc.pattern, searchPath)
} else {
shellCmd = fmt.Sprintf("rg --no-messages %q %q", sc.pattern, searchPath)
}
Copy link

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the Bash benchmark arm, building the rg command via fmt.Sprintf(... %q ...) will double-escape backslashes in regex patterns (e.g. fmt\.Errorf), which changes the pattern semantics and can cause scenarios like grep_filtered to return zero matches (and fail the benchmark) or measure the wrong workload. Build the sh -lc command using POSIX-safe single-quote escaping (or otherwise ensure the pattern is passed to rg unchanged) so the Bash arm exercises the intended regex.

Copilot uses AI. Check for mistakes.
Comment on lines +94 to +100
// Verify required CLI tools — skip the entire suite if missing (CI may not have rg/fd).
for _, tool := range []string{"rg", "fd"} {
if _, err := exec.LookPath(tool); err != nil {
fmt.Fprintf(os.Stderr, "bench: required tool %q not found in PATH, skipping bench suite\n", tool)
os.Exit(0)
}
}
Copy link

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TestMain exits with status 0 when rg/fd aren't present, which skips all tests in this package (including unit tests like report_test.go that don't depend on those external binaries). Instead of os.Exit(0) here, keep m.Run() executing and have only the specific benchmarks/tests that require rg/fd call Skip when the tools are missing (or gate scenario subsets accordingly).

Copilot uses AI. Check for mistakes.
Comment on lines +217 to +222
env := []string{}
if nocache {
env = append(env, "KAS_MCP_NOCACHE=1")
} else {
env = append(env, "KAS_MCP_NOCACHE=0")
}
Copy link

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

newMCPStdioClient always sets KAS_MCP_NOCACHE=0 for the warm client, which means users cannot actually disable caching for the MCP subprocess via KAS_MCP_NOCACHE=1 as described in the docs/script. Either (a) stop forcing KAS_MCP_NOCACHE=0 and only set the variable when nocache=true (so the parent env can disable cache globally), or (b) update the documentation/report fields to reflect that cache is controlled solely by the benchmark arm selection, not the environment variable.

Copilot uses AI. Check for mistakes.
REPORT_FILE="$(mktemp /tmp/mcp_bench_report_XXXXXX.json)"
echo "── Go benchmark suite ────────────────────────────────────────────────"
echo "report will be written to: ${REPORT_FILE}"
echo "(set KAS_MCP_NOCACHE=1 to disable the ristretto cache arm)"
Copy link

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The script claims KAS_MCP_NOCACHE=1 will “disable the ristretto cache arm”, but the Go benchmark currently runs both mcp_cold and mcp_warm and explicitly forces KAS_MCP_NOCACHE=0 for the warm subprocess. Update this message (or adjust the Go bench to honor the env var) so users don’t get misleading guidance when trying to run a cache-disabled suite.

Suggested change
echo "(set KAS_MCP_NOCACHE=1 to disable the ristretto cache arm)"
echo "(note: this Go suite still runs the warm/cache-enabled arm; use --skip-go-bench for cache-disabled shell timings only)"

Copilot uses AI. Check for mistakes.
Comment on lines +118 to +121
| `.arms[] where .arm=="mcp_warm" | .latency.p50_ns` | MCP warm-path p50 latency (ns) |
| `.arms[] where .arm=="mcp_cold" | .latency.p50_ns` | MCP cold-path p50 latency (ns) |
| `.arms[] where .arm=="direct" | .latency.p50_ns` | direct function-call p50 |
| `.arms[] where .arm=="bash" | .latency.p50_ns` | shell-subprocess p50 |
Copy link

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the JSON field table, several inline-code entries are missing closing backticks (e.g. the .arms[] where .arm=="mcp_warm" | .latency.p50_ns rows). This breaks Markdown rendering and makes the field references harder to read; close the backticks consistently for each code span.

Suggested change
| `.arms[] where .arm=="mcp_warm" | .latency.p50_ns` | MCP warm-path p50 latency (ns) |
| `.arms[] where .arm=="mcp_cold" | .latency.p50_ns` | MCP cold-path p50 latency (ns) |
| `.arms[] where .arm=="direct" | .latency.p50_ns` | direct function-call p50 |
| `.arms[] where .arm=="bash" | .latency.p50_ns` | shell-subprocess p50 |
| <code>.arms[] where .arm=="mcp_warm" | .latency.p50_ns</code> | MCP warm-path p50 latency (ns) |
| <code>.arms[] where .arm=="mcp_cold" | .latency.p50_ns</code> | MCP cold-path p50 latency (ns) |
| <code>.arms[] where .arm=="direct" | .latency.p50_ns</code> | direct function-call p50 |
| <code>.arms[] where .arm=="bash" | .latency.p50_ns</code> | shell-subprocess p50 |

Copilot uses AI. Check for mistakes.
Comment on lines +131 to +138
## Step 4 — re-run with cache disabled (optional)

To isolate the ristretto cache benefit, re-run the Go suite with:

```sh
KAS_MCP_NOCACHE=1 KAS_MCP_BENCH_REPORT=/tmp/report_nocache.json \
go test ./internal/mcpserver/bench/... -run '^$' -bench . -benchtime=1x
```
Copy link

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Step 4 instructs re-running the suite with KAS_MCP_NOCACHE=1 to isolate cache benefit, but the benchmark code currently implements cache isolation via separate mcp_cold/mcp_warm arms and (for warm) forces KAS_MCP_NOCACHE=0 in the subprocess env. Either update these instructions to reference the cold/warm arms in the same report, or change the benchmark so setting KAS_MCP_NOCACHE=1 actually disables caching for the MCP subprocess.

Copilot uses AI. Check for mistakes.
… alignment

- Replace %q with single-quote escaping in bash bench arms to avoid
  double-escaping regex metacharacters (e.g. fmt\.Errorf)
- Use per-test skipIfNoBenchTools instead of os.Exit(0) in TestMain so
  report_test.go still runs when rg/fd are missing (CI)
- Remove KAS_MCP_NOCACHE=0 forcing — let parent env control caching as
  documented; the suite already has separate mcp_cold/mcp_warm arms
- Fix SESSION_BENCH.md: restructured JSON field table to avoid pipe-in-
  backtick markdown breakage, updated Step 4 to reference built-in
  cold/warm arms instead of the obsolete KAS_MCP_NOCACHE env toggle
- Fix session_bench.sh message to match actual behavior
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 14 changed files in this pull request and generated 6 comments.

Comment on lines +115 to +124
// Create deterministic fixture tree inside the package directory so the
// kas mcp subprocess can access files through its sandbox rules.
fixtureRoot = filepath.Join(packageDir, "testdata", "fixtures")
if err := createFixtures(fixtureRoot); err != nil {
fmt.Fprintf(os.Stderr, "bench: createFixtures: %v\n", err)
os.Exit(1)
}

code := m.Run()

Copy link

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TestMain creates fixtures under internal/mcpserver/bench/testdata/fixtures (which are checked in by this PR). Because createFixtures unconditionally rewrites these files, running go test will dirty the working tree and can clobber local changes. Consider generating fixtures into a temp directory (e.g. os.MkdirTemp) and pointing fixtureRoot/cmd.Dir at that, or skipping writes when the expected fixture files already exist and match the generated content.

Suggested change
// Create deterministic fixture tree inside the package directory so the
// kas mcp subprocess can access files through its sandbox rules.
fixtureRoot = filepath.Join(packageDir, "testdata", "fixtures")
if err := createFixtures(fixtureRoot); err != nil {
fmt.Fprintf(os.Stderr, "bench: createFixtures: %v\n", err)
os.Exit(1)
}
code := m.Run()
// Create a deterministic fixture tree in a temporary directory under
// testdata so the kas mcp subprocess can access files through its sandbox
// rules without rewriting checked-in fixtures.
fixtureParent := filepath.Join(packageDir, "testdata")
fixtureRoot, err = os.MkdirTemp(fixtureParent, "fixtures-")
if err != nil {
fmt.Fprintf(os.Stderr, "bench: MkdirTemp fixtures: %v\n", err)
os.Exit(1)
}
if err := createFixtures(fixtureRoot); err != nil {
fmt.Fprintf(os.Stderr, "bench: createFixtures: %v\n", err)
_ = os.RemoveAll(fixtureRoot)
os.Exit(1)
}
code := m.Run()
if err := os.RemoveAll(fixtureRoot); err != nil {
fmt.Fprintf(os.Stderr, "bench: cleanup fixtures: %v\n", err)
}

Copilot uses AI. Check for mistakes.
Comment on lines +58 to +64
report := &BenchmarkReport{
GoVersion: runtime.Version(),
GOOS: runtime.GOOS,
GOARCH: runtime.GOARCH,
NoCache: os.Getenv("KAS_MCP_NOCACHE") == "1",
Operations: ops,
}
Copy link

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The report-level NoCache flag is derived from the parent process environment (KAS_MCP_NOCACHE), but the cold-cache behavior here is controlled per-subprocess via newMCPStdioClient(nocache=true). As a result, nocache in the JSON report will be false even when mcp_cold samples are present, which can mislead consumers. Suggest deriving this field from the collected arms (e.g., set true only when the report contains mcp_cold samples and no mcp_warm samples), or remove the field now that cold/warm are represented explicitly as separate arms.

Copilot uses AI. Check for mistakes.
Comment on lines +255 to +270
start := time.Now()
out, err := exec.Command("rg", args...).Output()
elapsed := time.Since(start)
// rg exit 1 = no matches; exit 2 = error with possible partial output.
// We still want to parse whatever came back.
if err != nil && len(out) == 0 {
b.Fatalf("direct grep %s: no output (rg error: %v)", sc.key, err)
}
matches, parseErr := parseRgJSONBench(out)
if parseErr != nil {
b.Fatalf("direct grep %s: parse: %v", sc.key, parseErr)
}
if len(matches) == 0 {
b.Fatalf("direct grep %s: unexpectedly empty matches (fixture generation issue?)", sc.key)
}
addBenchSample(sc.key, "direct", elapsed.Nanoseconds())
Copy link

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The timed interval for the direct grep arm measures only the rg subprocess execution (exec.Command(...).Output()), but excludes parsing/processing (parseRgJSONBench) even though MCP timings include server-side parsing and JSON encoding. This makes the cross-arm ratios skewed. Consider moving the end timestamp to after parsing/validation so the direct arm reflects end-to-end caller-visible latency for a structured grep result.

Copilot uses AI. Check for mistakes.
Comment on lines +287 to +295
start := time.Now()
out, _ := exec.Command("fd", args...).Output() // errors treated as empty
elapsed := time.Since(start)
files := parseFDOutputBench(out)
if len(files) == 0 {
b.Fatalf("direct find_files %s: unexpectedly empty output", sc.key)
}
addBenchSample(sc.key, "direct", elapsed.Nanoseconds())
}
Copy link

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to direct grep, the direct find_files timing stops immediately after fd returns and excludes parsing/capping (parseFDOutputBench). MCP timings include this processing inside the tool handler, so the comparison is not apples-to-apples. Include the parse step in the measured interval (or explicitly document that direct timings are subprocess-only).

Copilot uses AI. Check for mistakes.
Comment on lines +326 to +329
// cat -n numbers every line; aligns with the numbered-line output that
// agents read from the built-in Read tool.
sq := func(s string) string { return "'" + strings.ReplaceAll(s, "'", "'\\''") + "'" }
shellCmd := fmt.Sprintf("cat -n %s", sq(path))
Copy link

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bash read_file baseline uses cat -n <file> (entire file) while the MCP/direct read_file scenarios read a bounded window via from/lines (e.g. 50/200/500 lines). Because the amount of data read differs across arms, the resulting latencies and overhead multipliers aren’t directly comparable. Consider either (a) updating the bash command to read the same line window as the scenario, or (b) changing the read_file scenarios to read the full file to match cat -n.

Suggested change
// cat -n numbers every line; aligns with the numbered-line output that
// agents read from the built-in Read tool.
sq := func(s string) string { return "'" + strings.ReplaceAll(s, "'", "'\\''") + "'" }
shellCmd := fmt.Sprintf("cat -n %s", sq(path))
// Keep numbered output aligned with the built-in Read tool. For scenarios
// that specify a bounded line window, read the same slice instead of the
// entire file so benchmark arms perform comparable work.
sq := func(s string) string { return "'" + strings.ReplaceAll(s, "'", "'\\''") + "'" }
shellCmd := fmt.Sprintf("cat -n %s", sq(path))
if sc.lines > 0 {
from := sc.from
if from < 1 {
from = 1
}
shellCmd = fmt.Sprintf(
`awk 'NR >= %d && NR < %d { printf("%%6d\t%%s\n", NR, $0) }' %s`,
from,
from+sc.lines,
sq(path),
)
}

Copilot uses AI. Check for mistakes.
Comment on lines +90 to +103
// Build the real kas binary once.
kasBinaryPath = filepath.Join(packageDir, "testdata", "kas_bench")
if err := os.MkdirAll(filepath.Dir(kasBinaryPath), 0o755); err != nil {
fmt.Fprintf(os.Stderr, "bench: mkdir testdata: %v\n", err)
os.Exit(1)
}
buildCmd := exec.Command("go", "build", "-o", kasBinaryPath, ".")
buildCmd.Dir = repoRoot
buildCmd.Stdout = os.Stderr
buildCmd.Stderr = os.Stderr
if err := buildCmd.Run(); err != nil {
fmt.Fprintf(os.Stderr, "bench: go build kas: %v\n", err)
os.Exit(1)
}
Copy link

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TestMain unconditionally builds the kas binary (and generates fixtures) for every go test invocation of this package, including unit-only runs like -run TestBuildReport. This can significantly slow down go test ./... and makes lightweight report/unit tests pay the subprocess setup cost. Consider gating the binary build/fixture generation behind an env var (e.g. only when running benchmarks/smoke tests), or splitting report-only tests into a separate package without TestMain side effects.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants