performance-test-mcp — rather than just assuming the mcp is going to be... #100

brkastner wants to merge 10 commits into
Conversation
- helpers_test.go: scenario catalog (8 scenarios); TestMain builds the kas binary, generates a deterministic fixture tree, and verifies rg/fd are in PATH; newMCPStdioClient uses transport.NewStdioWithOptions + WithCommandFunc with fixtureRoot as cmd.Dir
- report.go: LatencyStats/ArmStats/OperationReport/BenchmarkReport structs, ComputeLatencyStats (min/p50/p95/p99/max/mean), Multiplier with zero-denom guard, BuildReport with stable arm ordering, WriteReport following orchestration/cache.go style
- report_test.go: table-driven tests covering percentile math, multiplier math, zero-denominator handling, arm ordering, and round-trip JSON serialization
- smoke_test.go: TestSmoke_StdioClientListsRequiredTools asserts read_file/grep/find_files
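The percentile math ComputeLatencyStats performs can be sketched with a nearest-rank pick over sorted samples. This is an illustrative sketch only — the merged implementation may round or interpolate differently, and `percentileNS` is a hypothetical name:

```go
package main

import (
	"fmt"
	"sort"
)

// percentileNS picks the nearest-rank percentile from nanosecond samples.
// Illustrative only; the real ComputeLatencyStats may differ in detail.
func percentileNS(samples []int64, p float64) int64 {
	if len(samples) == 0 {
		return 0
	}
	sorted := append([]int64(nil), samples...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	idx := int(float64(len(sorted)-1) * p / 100.0)
	return sorted[idx]
}

func main() {
	samples := []int64{900, 100, 500, 300, 700}
	fmt.Println(percentileNS(samples, 50), percentileNS(samples, 95)) // 500 700
}
```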
Implements BenchmarkMCP with 16 sub-benchmarks (8 scenarios × 2 arms) covering the read_file, grep, and find_files tools in both cold (nocache) and warm (cache-primed) configurations.
- buildMCPRequest maps scenario fields to exact handler arg names
- cold arm: nocache=true, validates once before the timed loop
- warm arm: a prime call fills the ristretto cache before b.ResetTimer()
- sampleCollector accumulates ns samples and writes the BenchmarkReport JSON to KAS_MCP_BENCH_REPORT on flush via b.Cleanup
- validateResult checks IsError, non-empty content, and BENCH_MARKER_UNIQUE for grep_narrow

Verification: KAS_MCP_BENCH_REPORT=/tmp/performance-test-mcp.json go test ./internal/mcpserver/bench -run '^$' -bench '^BenchmarkMCP$' -benchtime=1x -count=1 → PASS (4.7s, all 16 sub-benchmarks)
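The warm-arm shape described above (one untimed prime call, then `b.ResetTimer()` before the measured loop) can be sketched as follows; `call` stands in for a real MCP tool invocation, and this is not the merged harness code:

```go
package main

import (
	"fmt"
	"testing"
)

// benchmarkWarm primes once (untimed, fills the cache), then resets the
// timer so only the steady-state calls are measured.
func benchmarkWarm(b *testing.B, call func() error) {
	if err := call(); err != nil { // prime call, excluded from timing
		b.Fatal(err)
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		if err := call(); err != nil {
			b.Fatal(err)
		}
	}
}

func main() {
	calls := 0
	res := testing.Benchmark(func(b *testing.B) {
		benchmarkWarm(b, func() error { calls++; return nil })
	})
	fmt.Println(res.N > 0, calls > res.N) // prime adds at least one extra call
}
```

`testing.Benchmark` lets the pattern run outside `go test` for demonstration; inside the suite the same body sits directly in a `Benchmark*` function.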
- baseline_bench_test.go: shared cross-arm collector (benchSamples, addBenchSample, flushBenchReport) using the Task-1 BuildReport/WriteReport API; Task 2 (mcp arm) adds arm 'mcp' to the same map via addBenchSample
- BenchmarkDirect: directReadFileWindow mirrors fstools/read.go:37-75; parseRgJSONBench mirrors fstools/grep.go:58-97 with the MaxGrepMatches cap; parseFDOutputBench mirrors fstools/find.go:16-31 with the MaxFindResults cap; rg/fd are invoked directly without a shell for a no-transport baseline
- BenchmarkBash: sh -lc wrappers around cat -n (read_file), rg (grep), and fd (find_files) to measure shell spawn + command-parsing overhead; arm label 'bash'
- Both arms use the same scenario keys as helpers_test.go so reports join by key
- Fail-fast checks ensure no scenario silently returns empty output
- Verify: KAS_MCP_BENCH_REPORT=/tmp/performance-test-mcp.json go test ./internal/mcpserver/bench -run '^$' -bench '^(BenchmarkDirect|BenchmarkBash)$' -benchtime=1x -count=1 — all 16 sub-benchmarks PASS
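The shared cross-arm collector keyed by scenario and arm can be sketched like this; the names mirror the commit description (`benchSamples`, `addBenchSample`) but the actual map shape is an assumption:

```go
package main

import "fmt"

// benchSamples: scenario key → arm label → nanosecond samples.
// Illustrative shape for the shared package-level collector.
var benchSamples = map[string]map[string][]int64{}

func addBenchSample(key, arm string, ns int64) {
	if benchSamples[key] == nil {
		benchSamples[key] = map[string][]int64{}
	}
	benchSamples[key][arm] = append(benchSamples[key][arm], ns)
}

func main() {
	addBenchSample("grep_narrow", "direct", 1200)
	addBenchSample("grep_narrow", "bash", 5400)
	addBenchSample("grep_narrow", "bash", 5100)
	// arms recorded for the scenario, samples in the bash arm:
	fmt.Println(len(benchSamples["grep_narrow"]), len(benchSamples["grep_narrow"]["bash"])) // 2 2
}
```

Because every arm writes into the same map, a single flush can emit one report joining all arms by scenario key.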
…warm for ratios (reviewer self-fix)
- Remove the private sampleCollector from mcp_bench_test.go; the MCP arm now calls addBenchSample(key, "mcp_cold"/"mcp_warm", ...) into the package-level map
- Add a flushBenchReport() call at the end of BenchmarkMCP so all three arms flush the same shared state
- Update BuildReport to search for "mcp_warm" as the primary MCP numerator, with fallback to "mcp" for backward compatibility
- Update report_test.go to use the mcp_cold/mcp_warm arm names; add a legacy_mcp_arm_fallback table case
- Fix the stale "sampleCollector" reference in the BenchmarkMCP doc comment
Pull request overview
Adds a reproducible benchmark suite under internal/mcpserver/bench/ to measure latency overhead of kasmos MCP stdio tools (read_file, grep, find_files) vs direct in-process baselines and shell-command baselines, plus a live-session validation kit.
Changes:
- Introduces Go benchmark harness + JSON reporting for per-scenario latency percentiles and overhead multipliers.
- Adds MCP stdio benchmark arm (cold/warm), plus direct and `sh -lc` baseline arms and a smoke test for required tools.
- Adds a shell script + markdown guide to time comparable operations in a live Claude Code session.
Reviewed changes
Copilot reviewed 22 out of 23 changed files in this pull request and generated 14 comments.
| File | Description |
|---|---|
| .agents/skills/kasmos-architect/SKILL.md | Updates architect skill guidance (MCP task/signal usage). |
| .agents/skills/kasmos-coder/SKILL.md | Updates coder skill guidance (task retrieval / transitions). |
| .agents/skills/kasmos-fixer/SKILL.md | Updates fixer skill guidance (task tooling / lifecycle ops). |
| .agents/skills/kasmos-lifecycle/SKILL.md | Updates lifecycle overview guidance for task/signal tools. |
| .agents/skills/kasmos-master/SKILL.md | Updates master reviewer workflow guidance. |
| .agents/skills/kasmos-planner/SKILL.md | Updates planner storage/signaling guidance. |
| .agents/skills/kasmos-reviewer/SKILL.md | Updates reviewer signaling guidance. |
| .claude/agents/coder.md | Adjusts Claude "coder" agent frontmatter/config. |
| .claude/agents/reviewer.md | Adjusts Claude "reviewer" agent frontmatter/config. |
| internal/mcpserver/bench/SESSION_BENCH.md | Manual live-session benchmark procedure and JSON comparison guide. |
| internal/mcpserver/bench/baseline_bench_test.go | Direct + bash baseline benchmark arms and shared sample collection/report flush. |
| internal/mcpserver/bench/helpers_test.go | TestMain harness, fixture generation, and MCP stdio client helper. |
| internal/mcpserver/bench/mcp_bench_test.go | MCP stdio benchmark arm (cold/warm) with lightweight result validation. |
| internal/mcpserver/bench/report.go | Latency stats computation, overhead ratio logic, and JSON report writer. |
| internal/mcpserver/bench/report_test.go | Unit tests for stats, ordering, ratios, and report serialization. |
| internal/mcpserver/bench/session_bench.sh | Shell-side timing harness + optional Go benchmark invocation and jq summary. |
| internal/mcpserver/bench/smoke_test.go | Smoke test that `kas mcp` exposes required tools. |
Comments suppressed due to low confidence (1)
.agents/skills/kasmos-coder/SKILL.md:103
- The managed-mode instructions now omit `project: "$KASMOS_PROJECT"` from `task_show`. In multi-repo mode the MCP tools require a project selection (see `internal/mcpserver/tasktools/tasktools.go:410-420`), and this change also creates scaffold drift vs `internal/initcmd/scaffold/templates/skills/kasmos-coder/SKILL.md`. Please keep the runtime skill and scaffold template synchronized and preserve the `project` arg guidance.
You execute the full plan sequentially, wave by wave:
1. Use MCP `task_show` (filename: "<plan-file>") to read the stored plan. Check its `**Size**` field to determine wave structure.
2. Create a todo list with one item per task (all `pending`), then begin.
3. Execute one wave at a time: implement each task (TDD), commit, then proceed to next task.
4. After each wave completes, self-review before starting the next wave.
5. When all waves complete, write the implement-finished sentinel (see **Signaling**).
```diff
 ---
 name: coder
 description: Implementation agent for writing and modifying code
-model: claude-sonnet-4-6
+model:
 ---
```
model: is now empty, which makes this agent config invalid and also introduces scaffold drift vs internal/initcmd/scaffold/templates/claude/agents/coder.md (uses model: {{MODEL}}). This will likely break scaffold_sync expectations and any harness that requires an explicit model; restore a concrete model or the templated {{MODEL}} form consistently across runtime + scaffold copies.
```diff
 ---
 name: reviewer
 description: Code review agent for quality and spec compliance
-model: claude-sonnet-4-6
+model:
 ---
```
model: is now empty, which makes this agent config invalid and introduces scaffold drift vs internal/initcmd/scaffold/templates/claude/agents/reviewer.md (uses model: {{MODEL}}). Please restore a concrete model or keep the templated {{MODEL}} form and ensure the scaffold/template copy is updated in lockstep.
```diff
 - updated plan markdown written with:
-  use MCP `task_update_content` (filename: "<plan-file>", content: "<full enriched plan>", project: "$KASMOS_PROJECT") to persist the rewritten plan.
+  use MCP `task_update_content` (filename: "<plan-file>", content: "<full enriched plan>") to persist the rewritten plan.
 - metadata JSON written to `.kasmos/cache/<plan-file>-architect.json` using the raw plan filename slug (for example `.kasmos/cache/skill-prompt-rewrites-architect.json`)

 ### required commands

-1. use MCP `task_show` (filename: "<plan-file>", project: "$KASMOS_PROJECT") to read the latest plan
-2. use MCP `task_update_content` (filename: "<plan-file>", content: "<full enriched plan>", project: "$KASMOS_PROJECT") to persist the rewritten plan.
+1. use MCP `task_show` (filename: "<plan-file>") to read the latest plan
+2. use MCP `task_update_content` (filename: "<plan-file>", content: "<full enriched plan>") to persist the rewritten plan.
 3. mkdir -p .kasmos/cache
-4. use MCP `signal_create` (signal_type: "elaborator-finished", plan_file: "<plan-file>", project: "$KASMOS_PROJECT") after the round-trip check succeeds.
+4. use MCP `signal_create` (signal_type: "elaborator-finished", plan_file: "<plan-file>") after the round-trip check succeeds.
```
These skill instructions removed the project: "$KASMOS_PROJECT" argument from MCP task/signal calls, but the MCP server still supports/needs project in multi-repo mode (e.g. internal/mcpserver/tasktools/tasktools.go:416-420, internal/mcpserver/signaltools/signaltools.go:82-88). This also creates scaffold drift vs internal/initcmd/scaffold/templates/skills/kasmos-architect/SKILL.md; please keep runtime skill + scaffold template in sync and retain/describe the project arg where required.
```diff
 For each ghost entry:
 1. Show: plan name, status, branch
 2. Confirm: "force-set ghost plan `<name>` to cancelled?"
-3. Use MCP `task_transition` (`filename: "<name>"`, `event: "cancel"`, `project: "$KASMOS_PROJECT"`) when a valid
+3. Use MCP `task_transition` (`filename: "<name>"`, `event: "cancel"`) when a valid
    lifecycle event exists; reserve force-style overrides for confirmed recovery work only.

 ---

 ## Available Task-Store Tools

 Use MCP task-store tools for plan/task reads and lifecycle mutations. Keep `kas`, not `kq`,
 for execution paths that genuinely require the CLI (for example `kas task implement`).

 ### `task_list`

 Use MCP `task_list` to inspect task state. It returns status, branch, and topic data and supports
-the same lifecycle filters used during triage and cleanup. Always include `project: "$KASMOS_PROJECT"`.
+the same lifecycle filters used during triage and cleanup.

 - use MCP `task_list` with no status filter for all tasks
 - use MCP `task_list` with status `implementing` for active implementation plans
 - use MCP `task_list` with status `ready` for plans waiting to start

 ### `task_show`

-Use MCP `task_show` (`filename: "my-plan.md"`, `project: "$KASMOS_PROJECT"`) to read the stored plan markdown from the task
+Use MCP `task_show` (`filename: "my-plan.md"`) to read the stored plan markdown from the task
 store without relying on disk copies.

 ### `task_transition`

 Use MCP `task_transition` (`filename: "2026-02-27-my-plan.md"`,
-`event: "review_approved"`, `project: "$KASMOS_PROJECT"`) to apply named FSM events. Preferred over force-style recovery
+`event: "review_approved"`) to apply named FSM events. Preferred over force-style recovery
 overrides whenever a valid event exists.
```
This section drops the project: "$KASMOS_PROJECT" guidance for task_transition / task_list / task_show, but the MCP server exposes project as the routing knob and requires it in multi-repo mode (internal/mcpserver/tasktools/tasktools.go:410-451). This also diverges from internal/initcmd/scaffold/templates/skills/kasmos-fixer/SKILL.md; please update both copies together and retain the project argument guidance for correctness.
```diff
 @@ -41,7 +41,7 @@ Examples:
 3. The sentinel file is consumed (deleted) after processing — do not rely on it persisting
 4. Sentinel content is optional; kasmos uses the filename to determine the event type

-**Emitting a signal (agent side):** use MCP `signal_create` (including `project: "$KASMOS_PROJECT"`) to emit the equivalent
+**Emitting a signal (agent side):** use MCP `signal_create` to emit the equivalent
 `planner-finished-2026-02-27-feature.md` signal as the last action before yielding control.
```
The lifecycle skill now states task_show and signal_create without the project argument. However, both task tools and signal tools accept project and require it in multi-repo mode (internal/mcpserver/tasktools/tasktools.go:416-420, internal/mcpserver/signaltools/signaltools.go:82-88). This change also creates scaffold drift vs internal/initcmd/scaffold/templates/skills/kasmos-lifecycle/SKILL.md; please keep runtime + scaffold copies in sync and restore the project guidance.
```sh
jq -r '
  .operations[] |
  "  \(.key)\n" +
  "    mcp p50=\(.arms[] | select(.arm=="mcp") | .latency.p50_ns / 1e6 | round)ms" +
```
The jq summary is still selecting .arms[] | select(.arm=="mcp"), but the report now emits mcp_cold / mcp_warm arms (see report.go and mcp_bench_test.go). As written, this prints nothing / errors for MCP. Update the selector (likely to mcp_warm, with fallback to legacy mcp if you want backward compat).
```diff
-    "    mcp p50=\(.arms[] | select(.arm=="mcp") | .latency.p50_ns / 1e6 | round)ms" +
+    "    mcp p50=\(([.arms[] | select(.arm=="mcp_warm" or .arm=="mcp") | .latency.p50_ns][0] / 1e6) | round)ms" +
```
```sh
for i in $(seq 1 "$runs"); do
  local t0 t1 elapsed
  t0=$(date +%s%N)
  "$@" > /dev/null 2>&1
  t1=$(date +%s%N)
  elapsed=$(( (t1 - t0) / 1000000 ))  # ns → ms
  (( elapsed < min )) && min=$elapsed
```
date +%s%N is GNU-date specific; on macOS/BSD date typically prints a literal N, which will break the arithmetic expansion under set -euo pipefail. If this script is expected to run on macOS, switch to a portable nanosecond source (e.g. python -c 'import time; print(time.time_ns())') or detect GNU gdate/fallback to millisecond timing.
```markdown
Key fields to compare:

| JSON field | meaning |
|------------|---------|
| `.operations[].key` | scenario identifier, e.g. `read_small`, `grep_narrow` |
| `.arms[] where .arm=="mcp" | .latency.p50_ns` | MCP p50 latency (ns) |
| `.arms[] where .arm=="direct" | .latency.p50_ns` | direct function-call p50 |
| `.arms[] where .arm=="bash" | .latency.p50_ns` | shell-subprocess p50 |
| `.mcp_vs_direct` | multiplier: MCP p50 / direct p50 |
| `.mcp_vs_bash` | multiplier: MCP p50 / bash p50 |
```
This doc still references a single MCP arm (.arm=="mcp"), but the JSON report uses mcp_cold / mcp_warm. Please update the field references (and the example jq in Step 4) to use the new arm names and clarify which one the overhead ratios are based on.
```markdown
## Step 4 — re-run with cache disabled (optional)

To isolate the ristretto cache benefit, re-run the Go suite with:

    KAS_MCP_NOCACHE=1 KAS_MCP_BENCH_REPORT=/tmp/report_nocache.json \
      go test ./internal/mcpserver/bench/... -run '^$' -bench . -benchtime=1x
```
Step 4 suggests re-running with KAS_MCP_NOCACHE=1, but (with the current stdio client env construction) that environment variable will also leak into the “warm” MCP arm unless it is explicitly unset/overridden (see helpers_test.go:newMCPStdioClient). Given you already have explicit mcp_cold vs mcp_warm arms, consider removing or rewording this step, or ensure the warm arm forcibly enables caching even when the parent env sets KAS_MCP_NOCACHE=1.
```go
// Every file contains:
//   - exactly one BENCH_MARKER_UNIQUE occurrence (in the first generated file only
//     if unique=true, otherwise in the package declaration comment)
```
The comment above generateGoFile says files without includeUniqueMarker still contain BENCH_MARKER_UNIQUE “in the package declaration comment”, but generateGoFile only emits that marker when includeUniqueMarker==true. Please update the comment to match the actual fixture generation behavior (i.e., only the chosen file contains the unique marker).
```diff
-// Every file contains:
-//   - exactly one BENCH_MARKER_UNIQUE occurrence (in the first generated file only
-//     if unique=true, otherwise in the package declaration comment)
+// Generated files contain:
+//   - a BENCH_MARKER_UNIQUE occurrence only when includeUniqueMarker=true
```
…nch fixes
- Skip bench suite gracefully when rg/fd are not in PATH (CI) instead of os.Exit(1)
- Restore project: "$KASMOS_PROJECT" in all skill MCP examples (reverted by PR branch)
- Restore model fields in .claude/agents/coder.md and reviewer.md
- Fix session_bench.sh: portable nanosecond timestamps (macOS compat), jq arm name mcp→mcp_warm
- Fix SESSION_BENCH.md: reference mcp_cold/mcp_warm arms instead of legacy "mcp"
- Fix helpers_test.go: explicitly set KAS_MCP_NOCACHE=0 to prevent env leak, fix stale comment
```go
var shellCmd string
if sc.glob != "" {
	shellCmd = fmt.Sprintf("rg --no-messages -g %q %q %q", sc.glob, sc.pattern, searchPath)
} else {
	shellCmd = fmt.Sprintf("rg --no-messages %q %q", sc.pattern, searchPath)
}
```
In the Bash benchmark arm, building the rg command via `fmt.Sprintf(... %q ...)` will double-escape backslashes in regex patterns (e.g. `fmt\.Errorf`), which changes the pattern semantics and can cause scenarios like grep_filtered to return zero matches (and fail the benchmark) or measure the wrong workload. Build the `sh -lc` command using POSIX-safe single-quote escaping (or otherwise ensure the pattern is passed to rg unchanged) so the Bash arm exercises the intended regex.
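A minimal sketch of the POSIX single-quote escaping suggested here; `shQuote` is an illustrative helper, not code from the PR:

```go
package main

import (
	"fmt"
	"strings"
)

// shQuote wraps s in single quotes so sh -lc passes it to rg byte-for-byte;
// embedded single quotes use the standard '\'' splice.
func shQuote(s string) string {
	return "'" + strings.ReplaceAll(s, "'", `'\''`) + "'"
}

func main() {
	// %q would emit a doubled backslash ("fmt\\.Errorf"); single quotes do not.
	fmt.Println(shQuote(`fmt\.Errorf`)) // 'fmt\.Errorf'
	fmt.Println(shQuote("it's"))        // 'it'\''s'
}
```

The same helper works for glob and path arguments, keeping all three `rg`/`fd`/`cat` command builders consistent.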
```go
// Verify required CLI tools — skip the entire suite if missing (CI may not have rg/fd).
for _, tool := range []string{"rg", "fd"} {
	if _, err := exec.LookPath(tool); err != nil {
		fmt.Fprintf(os.Stderr, "bench: required tool %q not found in PATH, skipping bench suite\n", tool)
		os.Exit(0)
	}
}
```
TestMain exits with status 0 when rg/fd aren't present, which skips all tests in this package (including unit tests like report_test.go that don't depend on those external binaries). Instead of os.Exit(0) here, keep m.Run() executing and have only the specific benchmarks/tests that require rg/fd call Skip when the tools are missing (or gate scenario subsets accordingly).
```go
env := []string{}
if nocache {
	env = append(env, "KAS_MCP_NOCACHE=1")
} else {
	env = append(env, "KAS_MCP_NOCACHE=0")
}
```
newMCPStdioClient always sets KAS_MCP_NOCACHE=0 for the warm client, which means users cannot actually disable caching for the MCP subprocess via KAS_MCP_NOCACHE=1 as described in the docs/script. Either (a) stop forcing KAS_MCP_NOCACHE=0 and only set the variable when nocache=true (so the parent env can disable cache globally), or (b) update the documentation/report fields to reflect that cache is controlled solely by the benchmark arm selection, not the environment variable.
```sh
REPORT_FILE="$(mktemp /tmp/mcp_bench_report_XXXXXX.json)"
echo "── Go benchmark suite ────────────────────────────────────────────────"
echo "report will be written to: ${REPORT_FILE}"
echo "(set KAS_MCP_NOCACHE=1 to disable the ristretto cache arm)"
```
The script claims KAS_MCP_NOCACHE=1 will “disable the ristretto cache arm”, but the Go benchmark currently runs both mcp_cold and mcp_warm and explicitly forces KAS_MCP_NOCACHE=0 for the warm subprocess. Update this message (or adjust the Go bench to honor the env var) so users don’t get misleading guidance when trying to run a cache-disabled suite.
```diff
-echo "(set KAS_MCP_NOCACHE=1 to disable the ristretto cache arm)"
+echo "(note: this Go suite still runs the warm/cache-enabled arm; use --skip-go-bench for cache-disabled shell timings only)"
```
```markdown
| `.arms[] where .arm=="mcp_warm" | .latency.p50_ns` | MCP warm-path p50 latency (ns) |
| `.arms[] where .arm=="mcp_cold" | .latency.p50_ns` | MCP cold-path p50 latency (ns) |
| `.arms[] where .arm=="direct" | .latency.p50_ns` | direct function-call p50 |
| `.arms[] where .arm=="bash" | .latency.p50_ns` | shell-subprocess p50 |
```
In the JSON field table, several inline-code entries are missing closing backticks (e.g. the .arms[] where .arm=="mcp_warm" | .latency.p50_ns rows). This breaks Markdown rendering and makes the field references harder to read; close the backticks consistently for each code span.
```diff
-| `.arms[] where .arm=="mcp_warm" | .latency.p50_ns` | MCP warm-path p50 latency (ns) |
-| `.arms[] where .arm=="mcp_cold" | .latency.p50_ns` | MCP cold-path p50 latency (ns) |
-| `.arms[] where .arm=="direct" | .latency.p50_ns` | direct function-call p50 |
-| `.arms[] where .arm=="bash" | .latency.p50_ns` | shell-subprocess p50 |
+| <code>.arms[] where .arm=="mcp_warm" | .latency.p50_ns</code> | MCP warm-path p50 latency (ns) |
+| <code>.arms[] where .arm=="mcp_cold" | .latency.p50_ns</code> | MCP cold-path p50 latency (ns) |
+| <code>.arms[] where .arm=="direct" | .latency.p50_ns</code> | direct function-call p50 |
+| <code>.arms[] where .arm=="bash" | .latency.p50_ns</code> | shell-subprocess p50 |
```
```markdown
## Step 4 — re-run with cache disabled (optional)

To isolate the ristretto cache benefit, re-run the Go suite with:

    KAS_MCP_NOCACHE=1 KAS_MCP_BENCH_REPORT=/tmp/report_nocache.json \
      go test ./internal/mcpserver/bench/... -run '^$' -bench . -benchtime=1x
```
Step 4 instructs re-running the suite with KAS_MCP_NOCACHE=1 to isolate cache benefit, but the benchmark code currently implements cache isolation via separate mcp_cold/mcp_warm arms and (for warm) forces KAS_MCP_NOCACHE=0 in the subprocess env. Either update these instructions to reference the cold/warm arms in the same report, or change the benchmark so setting KAS_MCP_NOCACHE=1 actually disables caching for the MCP subprocess.
… alignment
- Replace %q with single-quote escaping in bash bench arms to avoid double-escaping regex metacharacters (e.g. fmt\.Errorf)
- Use per-test skipIfNoBenchTools instead of os.Exit(0) in TestMain so report_test.go still runs when rg/fd are missing (CI)
- Remove KAS_MCP_NOCACHE=0 forcing — let parent env control caching as documented; the suite already has separate mcp_cold/mcp_warm arms
- Fix SESSION_BENCH.md: restructured the JSON field table to avoid pipe-in-backtick markdown breakage, updated Step 4 to reference built-in cold/warm arms instead of the obsolete KAS_MCP_NOCACHE env toggle
- Fix session_bench.sh message to match actual behavior
```go
// Create deterministic fixture tree inside the package directory so the
// kas mcp subprocess can access files through its sandbox rules.
fixtureRoot = filepath.Join(packageDir, "testdata", "fixtures")
if err := createFixtures(fixtureRoot); err != nil {
	fmt.Fprintf(os.Stderr, "bench: createFixtures: %v\n", err)
	os.Exit(1)
}

code := m.Run()
```
TestMain creates fixtures under internal/mcpserver/bench/testdata/fixtures (which are checked in by this PR). Because createFixtures unconditionally rewrites these files, running go test will dirty the working tree and can clobber local changes. Consider generating fixtures into a temp directory (e.g. os.MkdirTemp) and pointing fixtureRoot/cmd.Dir at that, or skipping writes when the expected fixture files already exist and match the generated content.
```diff
-// Create deterministic fixture tree inside the package directory so the
-// kas mcp subprocess can access files through its sandbox rules.
-fixtureRoot = filepath.Join(packageDir, "testdata", "fixtures")
-if err := createFixtures(fixtureRoot); err != nil {
-	fmt.Fprintf(os.Stderr, "bench: createFixtures: %v\n", err)
-	os.Exit(1)
-}
-
-code := m.Run()
+// Create a deterministic fixture tree in a temporary directory under
+// testdata so the kas mcp subprocess can access files through its sandbox
+// rules without rewriting checked-in fixtures.
+fixtureParent := filepath.Join(packageDir, "testdata")
+fixtureRoot, err = os.MkdirTemp(fixtureParent, "fixtures-")
+if err != nil {
+	fmt.Fprintf(os.Stderr, "bench: MkdirTemp fixtures: %v\n", err)
+	os.Exit(1)
+}
+if err := createFixtures(fixtureRoot); err != nil {
+	fmt.Fprintf(os.Stderr, "bench: createFixtures: %v\n", err)
+	_ = os.RemoveAll(fixtureRoot)
+	os.Exit(1)
+}
+code := m.Run()
+if err := os.RemoveAll(fixtureRoot); err != nil {
+	fmt.Fprintf(os.Stderr, "bench: cleanup fixtures: %v\n", err)
+}
```
```go
report := &BenchmarkReport{
	GoVersion:  runtime.Version(),
	GOOS:       runtime.GOOS,
	GOARCH:     runtime.GOARCH,
	NoCache:    os.Getenv("KAS_MCP_NOCACHE") == "1",
	Operations: ops,
}
```
The report-level NoCache flag is derived from the parent process environment (KAS_MCP_NOCACHE), but the cold-cache behavior here is controlled per-subprocess via newMCPStdioClient(nocache=true). As a result, nocache in the JSON report will be false even when mcp_cold samples are present, which can mislead consumers. Suggest deriving this field from the collected arms (e.g., set true only when the report contains mcp_cold samples and no mcp_warm samples), or remove the field now that cold/warm are represented explicitly as separate arms.
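Deriving the flag from the collected arms rather than the parent env could look like this; `noCacheFromArms` and the presence map are hypothetical, sketching the suggestion rather than merged code:

```go
package main

import "fmt"

// noCacheFromArms reports a cache-disabled run only when cold samples exist
// and no warm samples do, instead of trusting the parent environment.
func noCacheFromArms(armsPresent map[string]bool) bool {
	return armsPresent["mcp_cold"] && !armsPresent["mcp_warm"]
}

func main() {
	fmt.Println(noCacheFromArms(map[string]bool{"mcp_cold": true}))                   // cold-only run
	fmt.Println(noCacheFromArms(map[string]bool{"mcp_cold": true, "mcp_warm": true})) // normal suite
}
```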
```go
start := time.Now()
out, err := exec.Command("rg", args...).Output()
elapsed := time.Since(start)
// rg exit 1 = no matches; exit 2 = error with possible partial output.
// We still want to parse whatever came back.
if err != nil && len(out) == 0 {
	b.Fatalf("direct grep %s: no output (rg error: %v)", sc.key, err)
}
matches, parseErr := parseRgJSONBench(out)
if parseErr != nil {
	b.Fatalf("direct grep %s: parse: %v", sc.key, parseErr)
}
if len(matches) == 0 {
	b.Fatalf("direct grep %s: unexpectedly empty matches (fixture generation issue?)", sc.key)
}
addBenchSample(sc.key, "direct", elapsed.Nanoseconds())
```
The timed interval for the direct grep arm measures only the rg subprocess execution (exec.Command(...).Output()), but excludes parsing/processing (parseRgJSONBench) even though MCP timings include server-side parsing and JSON encoding. This makes the cross-arm ratios skewed. Consider moving the end timestamp to after parsing/validation so the direct arm reflects end-to-end caller-visible latency for a structured grep result.
```go
start := time.Now()
out, _ := exec.Command("fd", args...).Output() // errors treated as empty
elapsed := time.Since(start)
files := parseFDOutputBench(out)
if len(files) == 0 {
	b.Fatalf("direct find_files %s: unexpectedly empty output", sc.key)
}
addBenchSample(sc.key, "direct", elapsed.Nanoseconds())
}
```
Similar to direct grep, the direct find_files timing stops immediately after fd returns and excludes parsing/capping (parseFDOutputBench). MCP timings include this processing inside the tool handler, so the comparison is not apples-to-apples. Include the parse step in the measured interval (or explicitly document that direct timings are subprocess-only).
```go
// cat -n numbers every line; aligns with the numbered-line output that
// agents read from the built-in Read tool.
sq := func(s string) string { return "'" + strings.ReplaceAll(s, "'", "'\\''") + "'" }
shellCmd := fmt.Sprintf("cat -n %s", sq(path))
```
The bash read_file baseline uses cat -n <file> (entire file) while the MCP/direct read_file scenarios read a bounded window via from/lines (e.g. 50/200/500 lines). Because the amount of data read differs across arms, the resulting latencies and overhead multipliers aren’t directly comparable. Consider either (a) updating the bash command to read the same line window as the scenario, or (b) changing the read_file scenarios to read the full file to match cat -n.
```diff
-// cat -n numbers every line; aligns with the numbered-line output that
-// agents read from the built-in Read tool.
-sq := func(s string) string { return "'" + strings.ReplaceAll(s, "'", "'\\''") + "'" }
-shellCmd := fmt.Sprintf("cat -n %s", sq(path))
+// Keep numbered output aligned with the built-in Read tool. For scenarios
+// that specify a bounded line window, read the same slice instead of the
+// entire file so benchmark arms perform comparable work.
+sq := func(s string) string { return "'" + strings.ReplaceAll(s, "'", "'\\''") + "'" }
+shellCmd := fmt.Sprintf("cat -n %s", sq(path))
+if sc.lines > 0 {
+	from := sc.from
+	if from < 1 {
+		from = 1
+	}
+	shellCmd = fmt.Sprintf(
+		`awk 'NR >= %d && NR < %d { printf("%%6d\t%%s\n", NR, $0) }' %s`,
+		from,
+		from+sc.lines,
+		sq(path),
+	)
+}
```
```go
// Build the real kas binary once.
kasBinaryPath = filepath.Join(packageDir, "testdata", "kas_bench")
if err := os.MkdirAll(filepath.Dir(kasBinaryPath), 0o755); err != nil {
	fmt.Fprintf(os.Stderr, "bench: mkdir testdata: %v\n", err)
	os.Exit(1)
}
buildCmd := exec.Command("go", "build", "-o", kasBinaryPath, ".")
buildCmd.Dir = repoRoot
buildCmd.Stdout = os.Stderr
buildCmd.Stderr = os.Stderr
if err := buildCmd.Run(); err != nil {
	fmt.Fprintf(os.Stderr, "bench: go build kas: %v\n", err)
	os.Exit(1)
}
```
TestMain unconditionally builds the kas binary (and generates fixtures) for every go test invocation of this package, including unit-only runs like -run TestBuildReport. This can significantly slow down go test ./... and makes lightweight report/unit tests pay the subprocess setup cost. Consider gating the binary build/fixture generation behind an env var (e.g. only when running benchmarks/smoke tests), or splitting report-only tests into a separate package without TestMain side effects.
summary
rather than just assuming the mcp is going to be quicker than letting the agent use standard bash calls, yeah it's great for consistency, but I would like to actually benchmark how fast or slow mcp approach is for read file calls and grep calls. to make sure that while maybe a little bit of extra latency is okay, in the interest of hardening what agents do, as far as bash commands go specifically, and forcing them into newer models and having caching in front and code aware searching. that's worth a little latency, but I would like to actually do a full benchmark on both gpt and claude to see just how well or poorly the mcp server performs for these tasks compared to traditional tooling used by harnesses
Benchmarks the MCP tools (`read_file`, `grep`, `find_files`) against direct function calls (simulating Claude Code built-in tools) and raw Bash subprocess calls, to determine whether the MCP-first policy's latency cost is acceptable relative to its caching, consistency, and code-awareness benefits.

The benchmark suite (`internal/mcpserver/bench/`) uses `go test -bench` to start a real `kas mcp` subprocess via stdio transport, exercise the tools across file sizes and search patterns, and compare against direct function calls (representing built-in tool overhead) and shell-command baselines (representing Bash tool overhead). A `KAS_MCP_NOCACHE=1` toggle isolates the application-level cache benefit. Results are emitted as structured JSON with per-operation, per-arm latency percentiles and overhead multipliers. A supplementary shell script and markdown guide provide empirical validation guidance for a live Claude Code session.

Stack: `testing.B` benchmarks, `mcp-go` client with stdio transport, ristretto cache (toggled via `KAS_MCP_NOCACHE`), synthetic runtime-generated fixtures, `rg`/`fd` CLI tools.

tasks
reviewer notes
Approved. Round 2 — both Round 1 blocking findings resolved.
verified fixes
- `mcp_bench_test.go` — private `sampleCollector` removed; the MCP arm now calls `addBenchSample(sc.key, "mcp_cold"/"mcp_warm", ...)` into the shared package-level map. `flushBenchReport()` is called at the end of `BenchmarkMCP`. The cross-arm JSON report is now correct.
- `report.go:BuildReport` — overhead ratios now use the `mcp_warm` P50 as the MCP numerator (falling back to the legacy `"mcp"` arm for backward compatibility). `MCPvsDirect` and `MCPvsBash` produce meaningful values.
- `report_test.go` — test data updated to `mcp_cold`/`mcp_warm`; a `legacy_mcp_arm_fallback` table case added.

self-fixed (no coder action needed)
- `mcp_bench_test.go:107` — stale doc comment referenced `sampleCollector` after removal; updated to name `addBenchSample`/`benchSamples`.

acceptance criteria
verification
- `go test ./internal/mcpserver/bench/... -run 'TestSmoke_StdioClientListsRequiredTools|TestBuildReport' -v` — 16 tests PASS
- `gofmt -l ./internal/mcpserver/bench/` — no output
- `go vet ./internal/mcpserver/bench/...` — no output

changes
.agents/skills/kasmos-architect/SKILL.md
.agents/skills/kasmos-coder/SKILL.md
.agents/skills/kasmos-fixer/SKILL.md
.agents/skills/kasmos-lifecycle/SKILL.md
.agents/skills/kasmos-master/SKILL.md
.agents/skills/kasmos-planner/SKILL.md
.agents/skills/kasmos-reviewer/SKILL.md
.claude/agents/coder.md
.claude/agents/reviewer.md
internal/mcpserver/bench/SESSION_BENCH.md
internal/mcpserver/bench/baseline_bench_test.go
internal/mcpserver/bench/helpers_test.go
internal/mcpserver/bench/mcp_bench_test.go
internal/mcpserver/bench/report.go
internal/mcpserver/bench/report_test.go
internal/mcpserver/bench/session_bench.sh
internal/mcpserver/bench/smoke_test.go
commits
a2c22a7 fix: wire MCP arm to shared collector, update BuildReport to use mcp_warm for ratios (reviewer self-fix)
37cd0ce fix: gofmt alignment, jq field names, and doc key format (reviewer self-fix)
4393ec2 feat(task-4): live-session validation kit (session_bench.sh + SESSION_BENCH.md)
46d570e feat(task-3): direct and Bash baseline arms
1537ba9 feat(task-2): MCP stdio benchmark arm with cold/warm sub-benchmarks
bd4e4d6 feat(task-1): benchmark harness, runtime fixtures, and report plumbing
stats
.agents/skills/kasmos-architect/SKILL.md | 16 +-
.agents/skills/kasmos-coder/SKILL.md | 6 +-
.agents/skills/kasmos-fixer/SKILL.md | 8 +-
.agents/skills/kasmos-lifecycle/SKILL.md | 4 +-
.agents/skills/kasmos-master/SKILL.md | 4 +-
.agents/skills/kasmos-planner/SKILL.md | 15 +-
.agents/skills/kasmos-reviewer/SKILL.md | 6 +-
.claude/agents/coder.md | 2 +-
.claude/agents/reviewer.md | 2 +-
internal/mcpserver/bench/SESSION_BENCH.md | 161 ++++++++++
internal/mcpserver/bench/baseline_bench_test.go | 392 ++++++++++++++++++++++++
internal/mcpserver/bench/helpers_test.go | 261 ++++++++++++++++
internal/mcpserver/bench/mcp_bench_test.go | 174 +++++++++++
internal/mcpserver/bench/report.go | 161 ++++++++++
internal/mcpserver/bench/report_test.go | 205 +++++++++++++
internal/mcpserver/bench/session_bench.sh | 135 ++++++++
internal/mcpserver/bench/smoke_test.go | 34 ++
17 files changed, 1555 insertions(+), 31 deletions(-)