performance-test-mcp — rather than just assuming the mcp is going to be... #100

brkastner wants to merge 10 commits into
Conversation
- helpers_test.go: scenario catalog (8 scenarios); TestMain builds the kas binary, generates a deterministic fixture tree, and verifies rg/fd are in PATH; newMCPStdioClient uses transport.NewStdioWithOptions + WithCommandFunc with fixtureRoot as cmd.Dir
- report.go: LatencyStats/ArmStats/OperationReport/BenchmarkReport structs, ComputeLatencyStats (min/p50/p95/p99/max/mean), Multiplier with zero-denom guard, BuildReport with stable arm ordering, WriteReport following orchestration/cache.go style
- report_test.go: table-driven tests covering percentile math, multiplier math, zero-denominator handling, arm ordering, and round-trip JSON serialization
- smoke_test.go: TestSmoke_StdioClientListsRequiredTools asserts read_file/grep/find_files
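The percentile math ComputeLatencyStats performs can be sketched with a nearest-rank pick over sorted samples. This is an illustrative sketch only — the merged implementation may round or interpolate differently, and `percentileNS` is a hypothetical name:

```go
package main

import (
	"fmt"
	"sort"
)

// percentileNS picks the nearest-rank percentile from nanosecond samples.
// Illustrative only; the real ComputeLatencyStats may differ in detail.
func percentileNS(samples []int64, p float64) int64 {
	if len(samples) == 0 {
		return 0
	}
	sorted := append([]int64(nil), samples...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	idx := int(float64(len(sorted)-1) * p / 100.0)
	return sorted[idx]
}

func main() {
	samples := []int64{900, 100, 500, 300, 700}
	fmt.Println(percentileNS(samples, 50), percentileNS(samples, 95)) // 500 700
}
```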
Implements BenchmarkMCP with 16 sub-benchmarks (8 scenarios × 2 arms) covering the read_file, grep, and find_files tools in both cold (nocache) and warm (cache-primed) configurations.
- buildMCPRequest maps scenario fields to exact handler arg names
- cold arm: nocache=true, validates once before the timed loop
- warm arm: a prime call fills the ristretto cache before b.ResetTimer()
- sampleCollector accumulates ns samples and writes the BenchmarkReport JSON to KAS_MCP_BENCH_REPORT on flush via b.Cleanup
- validateResult checks IsError, non-empty content, and BENCH_MARKER_UNIQUE for grep_narrow

Verification: KAS_MCP_BENCH_REPORT=/tmp/performance-test-mcp.json go test ./internal/mcpserver/bench -run '^$' -bench '^BenchmarkMCP$' -benchtime=1x -count=1 → PASS (4.7s, all 16 sub-benchmarks)
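The warm-arm shape described above (one untimed prime call, then `b.ResetTimer()` before the measured loop) can be sketched as follows; `call` stands in for a real MCP tool invocation, and this is not the merged harness code:

```go
package main

import (
	"fmt"
	"testing"
)

// benchmarkWarm primes once (untimed, fills the cache), then resets the
// timer so only the steady-state calls are measured.
func benchmarkWarm(b *testing.B, call func() error) {
	if err := call(); err != nil { // prime call, excluded from timing
		b.Fatal(err)
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		if err := call(); err != nil {
			b.Fatal(err)
		}
	}
}

func main() {
	calls := 0
	res := testing.Benchmark(func(b *testing.B) {
		benchmarkWarm(b, func() error { calls++; return nil })
	})
	fmt.Println(res.N > 0, calls > res.N) // prime adds at least one extra call
}
```

`testing.Benchmark` lets the pattern run outside `go test` for demonstration; inside the suite the same body sits directly in a `Benchmark*` function.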
- baseline_bench_test.go: shared cross-arm collector (benchSamples, addBenchSample, flushBenchReport) using the Task-1 BuildReport/WriteReport API; Task 2 (mcp arm) adds arm 'mcp' to the same map via addBenchSample
- BenchmarkDirect: directReadFileWindow mirrors fstools/read.go:37-75; parseRgJSONBench mirrors fstools/grep.go:58-97 with the MaxGrepMatches cap; parseFDOutputBench mirrors fstools/find.go:16-31 with the MaxFindResults cap; rg/fd are invoked directly without a shell for a no-transport baseline
- BenchmarkBash: sh -lc wrappers around cat -n (read_file), rg (grep), and fd (find_files) to measure shell spawn + command-parsing overhead; arm label 'bash'
- Both arms use the same scenario keys as helpers_test.go so reports join by key
- Fail-fast checks ensure no scenario silently returns empty output
- Verify: KAS_MCP_BENCH_REPORT=/tmp/performance-test-mcp.json go test ./internal/mcpserver/bench -run '^$' -bench '^(BenchmarkDirect|BenchmarkBash)$' -benchtime=1x -count=1 — all 16 sub-benchmarks PASS
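The shared cross-arm collector keyed by scenario and arm can be sketched like this; the names mirror the commit description (`benchSamples`, `addBenchSample`) but the actual map shape is an assumption:

```go
package main

import "fmt"

// benchSamples: scenario key → arm label → nanosecond samples.
// Illustrative shape for the shared package-level collector.
var benchSamples = map[string]map[string][]int64{}

func addBenchSample(key, arm string, ns int64) {
	if benchSamples[key] == nil {
		benchSamples[key] = map[string][]int64{}
	}
	benchSamples[key][arm] = append(benchSamples[key][arm], ns)
}

func main() {
	addBenchSample("grep_narrow", "direct", 1200)
	addBenchSample("grep_narrow", "bash", 5400)
	addBenchSample("grep_narrow", "bash", 5100)
	// arms recorded for the scenario, samples in the bash arm:
	fmt.Println(len(benchSamples["grep_narrow"]), len(benchSamples["grep_narrow"]["bash"])) // 2 2
}
```

Because every arm writes into the same map, a single flush can emit one report joining all arms by scenario key.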
…warm for ratios (reviewer self-fix)
- Remove the private sampleCollector from mcp_bench_test.go; the MCP arm now calls addBenchSample(key, "mcp_cold"/"mcp_warm", ...) into the package-level map
- Add a flushBenchReport() call at the end of BenchmarkMCP so all three arms flush the same shared state
- Update BuildReport to search for "mcp_warm" as the primary MCP numerator, with fallback to "mcp" for backward compatibility
- Update report_test.go to use the mcp_cold/mcp_warm arm names; add a legacy_mcp_arm_fallback table case
- Fix the stale "sampleCollector" reference in the BenchmarkMCP doc comment
Pull request overview
Adds a reproducible benchmark suite under internal/mcpserver/bench/ to measure latency overhead of kasmos MCP stdio tools (read_file, grep, find_files) vs direct in-process baselines and shell-command baselines, plus a live-session validation kit.
Changes:
- Introduces Go benchmark harness + JSON reporting for per-scenario latency percentiles and overhead multipliers.
- Adds MCP stdio benchmark arm (cold/warm), plus direct and `sh -lc` baseline arms and a smoke test for required tools.
- Adds a shell script + markdown guide to time comparable operations in a live Claude Code session.
Reviewed changes
Copilot reviewed 22 out of 23 changed files in this pull request and generated 14 comments.
| File | Description |
|---|---|
| .agents/skills/kasmos-architect/SKILL.md | Updates architect skill guidance (MCP task/signal usage). |
| .agents/skills/kasmos-coder/SKILL.md | Updates coder skill guidance (task retrieval / transitions). |
| .agents/skills/kasmos-fixer/SKILL.md | Updates fixer skill guidance (task tooling / lifecycle ops). |
| .agents/skills/kasmos-lifecycle/SKILL.md | Updates lifecycle overview guidance for task/signal tools. |
| .agents/skills/kasmos-master/SKILL.md | Updates master reviewer workflow guidance. |
| .agents/skills/kasmos-planner/SKILL.md | Updates planner storage/signaling guidance. |
| .agents/skills/kasmos-reviewer/SKILL.md | Updates reviewer signaling guidance. |
| .claude/agents/coder.md | Adjusts Claude "coder" agent frontmatter/config. |
| .claude/agents/reviewer.md | Adjusts Claude "reviewer" agent frontmatter/config. |
| internal/mcpserver/bench/SESSION_BENCH.md | Manual live-session benchmark procedure and JSON comparison guide. |
| internal/mcpserver/bench/baseline_bench_test.go | Direct + bash baseline benchmark arms and shared sample collection/report flush. |
| internal/mcpserver/bench/helpers_test.go | TestMain harness, fixture generation, and MCP stdio client helper. |
| internal/mcpserver/bench/mcp_bench_test.go | MCP stdio benchmark arm (cold/warm) with lightweight result validation. |
| internal/mcpserver/bench/report.go | Latency stats computation, overhead ratio logic, and JSON report writer. |
| internal/mcpserver/bench/report_test.go | Unit tests for stats, ordering, ratios, and report serialization. |
| internal/mcpserver/bench/session_bench.sh | Shell-side timing harness + optional Go benchmark invocation and jq summary. |
| internal/mcpserver/bench/smoke_test.go | Smoke test that `kas mcp` exposes required tools. |
Comments suppressed due to low confidence (1)
.agents/skills/kasmos-coder/SKILL.md:103
- The managed-mode instructions now omit `project: "$KASMOS_PROJECT"` from `task_show`. In multi-repo mode the MCP tools require a project selection (see `internal/mcpserver/tasktools/tasktools.go:410-420`), and this change also creates scaffold drift vs `internal/initcmd/scaffold/templates/skills/kasmos-coder/SKILL.md`. Please keep the runtime skill and scaffold template synchronized and preserve the `project` arg guidance.
You execute the full plan sequentially, wave by wave:
1. Use MCP `task_show` (filename: "<plan-file>") to read the stored plan. Check its `**Size**` field to determine wave structure.
2. Create a todo list with one item per task (all `pending`), then begin.
3. Execute one wave at a time: implement each task (TDD), commit, then proceed to next task.
4. After each wave completes, self-review before starting the next wave.
5. When all waves complete, write the implement-finished sentinel (see **Signaling**).
```diff
 ---
 name: coder
 description: Implementation agent for writing and modifying code
-model: claude-sonnet-4-6
+model:
 ---
```
model: is now empty, which makes this agent config invalid and also introduces scaffold drift vs internal/initcmd/scaffold/templates/claude/agents/coder.md (uses model: {{MODEL}}). This will likely break scaffold_sync expectations and any harness that requires an explicit model; restore a concrete model or the templated {{MODEL}} form consistently across runtime + scaffold copies.
```diff
 ---
 name: reviewer
 description: Code review agent for quality and spec compliance
-model: claude-sonnet-4-6
+model:
 ---
```
model: is now empty, which makes this agent config invalid and introduces scaffold drift vs internal/initcmd/scaffold/templates/claude/agents/reviewer.md (uses model: {{MODEL}}). Please restore a concrete model or keep the templated {{MODEL}} form and ensure the scaffold/template copy is updated in lockstep.
```diff
 - updated plan markdown written with:
-  use MCP `task_update_content` (filename: "<plan-file>", content: "<full enriched plan>", project: "$KASMOS_PROJECT") to persist the rewritten plan.
+  use MCP `task_update_content` (filename: "<plan-file>", content: "<full enriched plan>") to persist the rewritten plan.
 - metadata JSON written to `.kasmos/cache/<plan-file>-architect.json` using the raw plan filename slug (for example `.kasmos/cache/skill-prompt-rewrites-architect.json`)

 ### required commands

-1. use MCP `task_show` (filename: "<plan-file>", project: "$KASMOS_PROJECT") to read the latest plan
-2. use MCP `task_update_content` (filename: "<plan-file>", content: "<full enriched plan>", project: "$KASMOS_PROJECT") to persist the rewritten plan.
+1. use MCP `task_show` (filename: "<plan-file>") to read the latest plan
+2. use MCP `task_update_content` (filename: "<plan-file>", content: "<full enriched plan>") to persist the rewritten plan.
 3. mkdir -p .kasmos/cache
-4. use MCP `signal_create` (signal_type: "elaborator-finished", plan_file: "<plan-file>", project: "$KASMOS_PROJECT") after the round-trip check succeeds.
+4. use MCP `signal_create` (signal_type: "elaborator-finished", plan_file: "<plan-file>") after the round-trip check succeeds.
```
These skill instructions removed the project: "$KASMOS_PROJECT" argument from MCP task/signal calls, but the MCP server still supports/needs project in multi-repo mode (e.g. internal/mcpserver/tasktools/tasktools.go:416-420, internal/mcpserver/signaltools/signaltools.go:82-88). This also creates scaffold drift vs internal/initcmd/scaffold/templates/skills/kasmos-architect/SKILL.md; please keep runtime skill + scaffold template in sync and retain/describe the project arg where required.
```diff
 For each ghost entry:
 1. Show: plan name, status, branch
 2. Confirm: "force-set ghost plan `<name>` to cancelled?"
-3. Use MCP `task_transition` (`filename: "<name>"`, `event: "cancel"`, `project: "$KASMOS_PROJECT"`) when a valid
+3. Use MCP `task_transition` (`filename: "<name>"`, `event: "cancel"`) when a valid
    lifecycle event exists; reserve force-style overrides for confirmed recovery work only.

 ---

 ## Available Task-Store Tools

 Use MCP task-store tools for plan/task reads and lifecycle mutations. Keep `kas`, not `kq`,
 for execution paths that genuinely require the CLI (for example `kas task implement`).

 ### `task_list`

 Use MCP `task_list` to inspect task state. It returns status, branch, and topic data and supports
-the same lifecycle filters used during triage and cleanup. Always include `project: "$KASMOS_PROJECT"`.
+the same lifecycle filters used during triage and cleanup.

 - use MCP `task_list` with no status filter for all tasks
 - use MCP `task_list` with status `implementing` for active implementation plans
 - use MCP `task_list` with status `ready` for plans waiting to start

 ### `task_show`

-Use MCP `task_show` (`filename: "my-plan.md"`, `project: "$KASMOS_PROJECT"`) to read the stored plan markdown from the task
+Use MCP `task_show` (`filename: "my-plan.md"`) to read the stored plan markdown from the task
 store without relying on disk copies.

 ### `task_transition`

 Use MCP `task_transition` (`filename: "2026-02-27-my-plan.md"`,
-`event: "review_approved"`, `project: "$KASMOS_PROJECT"`) to apply named FSM events. Preferred over force-style recovery
+`event: "review_approved"`) to apply named FSM events. Preferred over force-style recovery
 overrides whenever a valid event exists.
```
This section drops the project: "$KASMOS_PROJECT" guidance for task_transition / task_list / task_show, but the MCP server exposes project as the routing knob and requires it in multi-repo mode (internal/mcpserver/tasktools/tasktools.go:410-451). This also diverges from internal/initcmd/scaffold/templates/skills/kasmos-fixer/SKILL.md; please update both copies together and retain the project argument guidance for correctness.
```diff
 @@ -41,7 +41,7 @@ Examples:
 3. The sentinel file is consumed (deleted) after processing — do not rely on it persisting
 4. Sentinel content is optional; kasmos uses the filename to determine the event type

-**Emitting a signal (agent side):** use MCP `signal_create` (including `project: "$KASMOS_PROJECT"`) to emit the equivalent
+**Emitting a signal (agent side):** use MCP `signal_create` to emit the equivalent
 `planner-finished-2026-02-27-feature.md` signal as the last action before yielding control.
```
The lifecycle skill now states task_show and signal_create without the project argument. However, both task tools and signal tools accept project and require it in multi-repo mode (internal/mcpserver/tasktools/tasktools.go:416-420, internal/mcpserver/signaltools/signaltools.go:82-88). This change also creates scaffold drift vs internal/initcmd/scaffold/templates/skills/kasmos-lifecycle/SKILL.md; please keep runtime + scaffold copies in sync and restore the project guidance.
```sh
jq -r '
  .operations[] |
  "  \(.key)\n" +
  "    mcp p50=\(.arms[] | select(.arm=="mcp") | .latency.p50_ns / 1e6 | round)ms" +
```
The jq summary is still selecting .arms[] | select(.arm=="mcp"), but the report now emits mcp_cold / mcp_warm arms (see report.go and mcp_bench_test.go). As written, this prints nothing / errors for MCP. Update the selector (likely to mcp_warm, with fallback to legacy mcp if you want backward compat).
```diff
-    "    mcp p50=\(.arms[] | select(.arm=="mcp") | .latency.p50_ns / 1e6 | round)ms" +
+    "    mcp p50=\(([.arms[] | select(.arm=="mcp_warm" or .arm=="mcp") | .latency.p50_ns][0] / 1e6) | round)ms" +
```
```sh
for i in $(seq 1 "$runs"); do
  local t0 t1 elapsed
  t0=$(date +%s%N)
  "$@" > /dev/null 2>&1
  t1=$(date +%s%N)
  elapsed=$(( (t1 - t0) / 1000000 ))  # ns → ms
  (( elapsed < min )) && min=$elapsed
```
date +%s%N is GNU-date specific; on macOS/BSD date typically prints a literal N, which will break the arithmetic expansion under set -euo pipefail. If this script is expected to run on macOS, switch to a portable nanosecond source (e.g. python -c 'import time; print(time.time_ns())') or detect GNU gdate/fallback to millisecond timing.
```markdown
Key fields to compare:

| JSON field | meaning |
|------------|---------|
| `.operations[].key` | scenario identifier, e.g. `read_small`, `grep_narrow` |
| `.arms[] where .arm=="mcp" | .latency.p50_ns` | MCP p50 latency (ns) |
| `.arms[] where .arm=="direct" | .latency.p50_ns` | direct function-call p50 |
| `.arms[] where .arm=="bash" | .latency.p50_ns` | shell-subprocess p50 |
| `.mcp_vs_direct` | multiplier: MCP p50 / direct p50 |
| `.mcp_vs_bash` | multiplier: MCP p50 / bash p50 |
```
This doc still references a single MCP arm (.arm=="mcp"), but the JSON report uses mcp_cold / mcp_warm. Please update the field references (and the example jq in Step 4) to use the new arm names and clarify which one the overhead ratios are based on.
```markdown
## Step 4 — re-run with cache disabled (optional)

To isolate the ristretto cache benefit, re-run the Go suite with:

    KAS_MCP_NOCACHE=1 KAS_MCP_BENCH_REPORT=/tmp/report_nocache.json \
      go test ./internal/mcpserver/bench/... -run '^$' -bench . -benchtime=1x
```
Step 4 suggests re-running with KAS_MCP_NOCACHE=1, but (with the current stdio client env construction) that environment variable will also leak into the “warm” MCP arm unless it is explicitly unset/overridden (see helpers_test.go:newMCPStdioClient). Given you already have explicit mcp_cold vs mcp_warm arms, consider removing or rewording this step, or ensure the warm arm forcibly enables caching even when the parent env sets KAS_MCP_NOCACHE=1.
```go
// Every file contains:
//   - exactly one BENCH_MARKER_UNIQUE occurrence (in the first generated file only
//     if unique=true, otherwise in the package declaration comment)
```
The comment above generateGoFile says files without includeUniqueMarker still contain BENCH_MARKER_UNIQUE “in the package declaration comment”, but generateGoFile only emits that marker when includeUniqueMarker==true. Please update the comment to match the actual fixture generation behavior (i.e., only the chosen file contains the unique marker).
```diff
-// Every file contains:
-//   - exactly one BENCH_MARKER_UNIQUE occurrence (in the first generated file only
-//     if unique=true, otherwise in the package declaration comment)
+// Generated files contain:
+//   - a BENCH_MARKER_UNIQUE occurrence only when includeUniqueMarker=true
```
…nch fixes
- Skip bench suite gracefully when rg/fd are not in PATH (CI) instead of os.Exit(1)
- Restore project: "$KASMOS_PROJECT" in all skill MCP examples (reverted by PR branch)
- Restore model fields in .claude/agents/coder.md and reviewer.md
- Fix session_bench.sh: portable nanosecond timestamps (macOS compat), jq arm name mcp→mcp_warm
- Fix SESSION_BENCH.md: reference mcp_cold/mcp_warm arms instead of legacy "mcp"
- Fix helpers_test.go: explicitly set KAS_MCP_NOCACHE=0 to prevent env leak, fix stale comment
```go
var shellCmd string
if sc.glob != "" {
	shellCmd = fmt.Sprintf("rg --no-messages -g %q %q %q", sc.glob, sc.pattern, searchPath)
} else {
	shellCmd = fmt.Sprintf("rg --no-messages %q %q", sc.pattern, searchPath)
}
```
In the Bash benchmark arm, building the rg command via `fmt.Sprintf(... %q ...)` will double-escape backslashes in regex patterns (e.g. `fmt\.Errorf`), which changes the pattern semantics and can cause scenarios like grep_filtered to return zero matches (and fail the benchmark) or measure the wrong workload. Build the `sh -lc` command using POSIX-safe single-quote escaping (or otherwise ensure the pattern is passed to rg unchanged) so the Bash arm exercises the intended regex.
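A minimal sketch of the POSIX single-quote escaping suggested here; `shQuote` is an illustrative helper, not code from the PR:

```go
package main

import (
	"fmt"
	"strings"
)

// shQuote wraps s in single quotes so sh -lc passes it to rg byte-for-byte;
// embedded single quotes use the standard '\'' splice.
func shQuote(s string) string {
	return "'" + strings.ReplaceAll(s, "'", `'\''`) + "'"
}

func main() {
	// %q would emit a doubled backslash ("fmt\\.Errorf"); single quotes do not.
	fmt.Println(shQuote(`fmt\.Errorf`)) // 'fmt\.Errorf'
	fmt.Println(shQuote("it's"))        // 'it'\''s'
}
```

The same helper works for glob and path arguments, keeping all three `rg`/`fd`/`cat` command builders consistent.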
```go
// Verify required CLI tools — skip the entire suite if missing (CI may not have rg/fd).
for _, tool := range []string{"rg", "fd"} {
	if _, err := exec.LookPath(tool); err != nil {
		fmt.Fprintf(os.Stderr, "bench: required tool %q not found in PATH, skipping bench suite\n", tool)
		os.Exit(0)
	}
}
```
TestMain exits with status 0 when rg/fd aren't present, which skips all tests in this package (including unit tests like report_test.go that don't depend on those external binaries). Instead of os.Exit(0) here, keep m.Run() executing and have only the specific benchmarks/tests that require rg/fd call Skip when the tools are missing (or gate scenario subsets accordingly).
```go
env := []string{}
if nocache {
	env = append(env, "KAS_MCP_NOCACHE=1")
} else {
	env = append(env, "KAS_MCP_NOCACHE=0")
}
```
newMCPStdioClient always sets KAS_MCP_NOCACHE=0 for the warm client, which means users cannot actually disable caching for the MCP subprocess via KAS_MCP_NOCACHE=1 as described in the docs/script. Either (a) stop forcing KAS_MCP_NOCACHE=0 and only set the variable when nocache=true (so the parent env can disable cache globally), or (b) update the documentation/report fields to reflect that cache is controlled solely by the benchmark arm selection, not the environment variable.
```sh
REPORT_FILE="$(mktemp /tmp/mcp_bench_report_XXXXXX.json)"
echo "── Go benchmark suite ────────────────────────────────────────────────"
echo "report will be written to: ${REPORT_FILE}"
echo "(set KAS_MCP_NOCACHE=1 to disable the ristretto cache arm)"
```
The script claims KAS_MCP_NOCACHE=1 will “disable the ristretto cache arm”, but the Go benchmark currently runs both mcp_cold and mcp_warm and explicitly forces KAS_MCP_NOCACHE=0 for the warm subprocess. Update this message (or adjust the Go bench to honor the env var) so users don’t get misleading guidance when trying to run a cache-disabled suite.
```diff
-echo "(set KAS_MCP_NOCACHE=1 to disable the ristretto cache arm)"
+echo "(note: this Go suite still runs the warm/cache-enabled arm; use --skip-go-bench for cache-disabled shell timings only)"
```
```markdown
| `.arms[] where .arm=="mcp_warm" | .latency.p50_ns` | MCP warm-path p50 latency (ns) |
| `.arms[] where .arm=="mcp_cold" | .latency.p50_ns` | MCP cold-path p50 latency (ns) |
| `.arms[] where .arm=="direct" | .latency.p50_ns` | direct function-call p50 |
| `.arms[] where .arm=="bash" | .latency.p50_ns` | shell-subprocess p50 |
```
In the JSON field table, several inline-code entries are missing closing backticks (e.g. the .arms[] where .arm=="mcp_warm" | .latency.p50_ns rows). This breaks Markdown rendering and makes the field references harder to read; close the backticks consistently for each code span.
```diff
-| `.arms[] where .arm=="mcp_warm" | .latency.p50_ns` | MCP warm-path p50 latency (ns) |
-| `.arms[] where .arm=="mcp_cold" | .latency.p50_ns` | MCP cold-path p50 latency (ns) |
-| `.arms[] where .arm=="direct" | .latency.p50_ns` | direct function-call p50 |
-| `.arms[] where .arm=="bash" | .latency.p50_ns` | shell-subprocess p50 |
+| <code>.arms[] where .arm=="mcp_warm" | .latency.p50_ns</code> | MCP warm-path p50 latency (ns) |
+| <code>.arms[] where .arm=="mcp_cold" | .latency.p50_ns</code> | MCP cold-path p50 latency (ns) |
+| <code>.arms[] where .arm=="direct" | .latency.p50_ns</code> | direct function-call p50 |
+| <code>.arms[] where .arm=="bash" | .latency.p50_ns</code> | shell-subprocess p50 |
```
```markdown
## Step 4 — re-run with cache disabled (optional)

To isolate the ristretto cache benefit, re-run the Go suite with:

    KAS_MCP_NOCACHE=1 KAS_MCP_BENCH_REPORT=/tmp/report_nocache.json \
      go test ./internal/mcpserver/bench/... -run '^$' -bench . -benchtime=1x
```
Step 4 instructs re-running the suite with KAS_MCP_NOCACHE=1 to isolate cache benefit, but the benchmark code currently implements cache isolation via separate mcp_cold/mcp_warm arms and (for warm) forces KAS_MCP_NOCACHE=0 in the subprocess env. Either update these instructions to reference the cold/warm arms in the same report, or change the benchmark so setting KAS_MCP_NOCACHE=1 actually disables caching for the MCP subprocess.
… alignment
- Replace %q with single-quote escaping in bash bench arms to avoid double-escaping regex metacharacters (e.g. fmt\.Errorf)
- Use per-test skipIfNoBenchTools instead of os.Exit(0) in TestMain so report_test.go still runs when rg/fd are missing (CI)
- Remove KAS_MCP_NOCACHE=0 forcing — let parent env control caching as documented; the suite already has separate mcp_cold/mcp_warm arms
- Fix SESSION_BENCH.md: restructured the JSON field table to avoid pipe-in-backtick markdown breakage, updated Step 4 to reference built-in cold/warm arms instead of the obsolete KAS_MCP_NOCACHE env toggle
- Fix session_bench.sh message to match actual behavior
```go
// Create deterministic fixture tree inside the package directory so the
// kas mcp subprocess can access files through its sandbox rules.
fixtureRoot = filepath.Join(packageDir, "testdata", "fixtures")
if err := createFixtures(fixtureRoot); err != nil {
	fmt.Fprintf(os.Stderr, "bench: createFixtures: %v\n", err)
	os.Exit(1)
}

code := m.Run()
```
TestMain creates fixtures under internal/mcpserver/bench/testdata/fixtures (which are checked in by this PR). Because createFixtures unconditionally rewrites these files, running go test will dirty the working tree and can clobber local changes. Consider generating fixtures into a temp directory (e.g. os.MkdirTemp) and pointing fixtureRoot/cmd.Dir at that, or skipping writes when the expected fixture files already exist and match the generated content.
```diff
-// Create deterministic fixture tree inside the package directory so the
-// kas mcp subprocess can access files through its sandbox rules.
-fixtureRoot = filepath.Join(packageDir, "testdata", "fixtures")
-if err := createFixtures(fixtureRoot); err != nil {
-	fmt.Fprintf(os.Stderr, "bench: createFixtures: %v\n", err)
-	os.Exit(1)
-}
-
-code := m.Run()
+// Create a deterministic fixture tree in a temporary directory under
+// testdata so the kas mcp subprocess can access files through its sandbox
+// rules without rewriting checked-in fixtures.
+fixtureParent := filepath.Join(packageDir, "testdata")
+fixtureRoot, err = os.MkdirTemp(fixtureParent, "fixtures-")
+if err != nil {
+	fmt.Fprintf(os.Stderr, "bench: MkdirTemp fixtures: %v\n", err)
+	os.Exit(1)
+}
+if err := createFixtures(fixtureRoot); err != nil {
+	fmt.Fprintf(os.Stderr, "bench: createFixtures: %v\n", err)
+	_ = os.RemoveAll(fixtureRoot)
+	os.Exit(1)
+}
+code := m.Run()
+if err := os.RemoveAll(fixtureRoot); err != nil {
+	fmt.Fprintf(os.Stderr, "bench: cleanup fixtures: %v\n", err)
+}
```
```go
report := &BenchmarkReport{
	GoVersion:  runtime.Version(),
	GOOS:       runtime.GOOS,
	GOARCH:     runtime.GOARCH,
	NoCache:    os.Getenv("KAS_MCP_NOCACHE") == "1",
	Operations: ops,
}
```
The report-level NoCache flag is derived from the parent process environment (KAS_MCP_NOCACHE), but the cold-cache behavior here is controlled per-subprocess via newMCPStdioClient(nocache=true). As a result, nocache in the JSON report will be false even when mcp_cold samples are present, which can mislead consumers. Suggest deriving this field from the collected arms (e.g., set true only when the report contains mcp_cold samples and no mcp_warm samples), or remove the field now that cold/warm are represented explicitly as separate arms.
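Deriving the flag from the collected arms rather than the parent env could look like this; `noCacheFromArms` and the presence map are hypothetical, sketching the suggestion rather than merged code:

```go
package main

import "fmt"

// noCacheFromArms reports a cache-disabled run only when cold samples exist
// and no warm samples do, instead of trusting the parent environment.
func noCacheFromArms(armsPresent map[string]bool) bool {
	return armsPresent["mcp_cold"] && !armsPresent["mcp_warm"]
}

func main() {
	fmt.Println(noCacheFromArms(map[string]bool{"mcp_cold": true}))                   // cold-only run
	fmt.Println(noCacheFromArms(map[string]bool{"mcp_cold": true, "mcp_warm": true})) // normal suite
}
```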
```go
start := time.Now()
out, err := exec.Command("rg", args...).Output()
elapsed := time.Since(start)
// rg exit 1 = no matches; exit 2 = error with possible partial output.
// We still want to parse whatever came back.
if err != nil && len(out) == 0 {
	b.Fatalf("direct grep %s: no output (rg error: %v)", sc.key, err)
}
matches, parseErr := parseRgJSONBench(out)
if parseErr != nil {
	b.Fatalf("direct grep %s: parse: %v", sc.key, parseErr)
}
if len(matches) == 0 {
	b.Fatalf("direct grep %s: unexpectedly empty matches (fixture generation issue?)", sc.key)
}
addBenchSample(sc.key, "direct", elapsed.Nanoseconds())
```
The timed interval for the direct grep arm measures only the rg subprocess execution (exec.Command(...).Output()), but excludes parsing/processing (parseRgJSONBench) even though MCP timings include server-side parsing and JSON encoding. This makes the cross-arm ratios skewed. Consider moving the end timestamp to after parsing/validation so the direct arm reflects end-to-end caller-visible latency for a structured grep result.
```go
start := time.Now()
out, _ := exec.Command("fd", args...).Output() // errors treated as empty
elapsed := time.Since(start)
files := parseFDOutputBench(out)
if len(files) == 0 {
	b.Fatalf("direct find_files %s: unexpectedly empty output", sc.key)
}
addBenchSample(sc.key, "direct", elapsed.Nanoseconds())
}
```
Similar to direct grep, the direct find_files timing stops immediately after fd returns and excludes parsing/capping (parseFDOutputBench). MCP timings include this processing inside the tool handler, so the comparison is not apples-to-apples. Include the parse step in the measured interval (or explicitly document that direct timings are subprocess-only).
```go
// cat -n numbers every line; aligns with the numbered-line output that
// agents read from the built-in Read tool.
sq := func(s string) string { return "'" + strings.ReplaceAll(s, "'", "'\\''") + "'" }
shellCmd := fmt.Sprintf("cat -n %s", sq(path))
```
The bash read_file baseline uses cat -n <file> (entire file) while the MCP/direct read_file scenarios read a bounded window via from/lines (e.g. 50/200/500 lines). Because the amount of data read differs across arms, the resulting latencies and overhead multipliers aren’t directly comparable. Consider either (a) updating the bash command to read the same line window as the scenario, or (b) changing the read_file scenarios to read the full file to match cat -n.
```diff
-// cat -n numbers every line; aligns with the numbered-line output that
-// agents read from the built-in Read tool.
-sq := func(s string) string { return "'" + strings.ReplaceAll(s, "'", "'\\''") + "'" }
-shellCmd := fmt.Sprintf("cat -n %s", sq(path))
+// Keep numbered output aligned with the built-in Read tool. For scenarios
+// that specify a bounded line window, read the same slice instead of the
+// entire file so benchmark arms perform comparable work.
+sq := func(s string) string { return "'" + strings.ReplaceAll(s, "'", "'\\''") + "'" }
+shellCmd := fmt.Sprintf("cat -n %s", sq(path))
+if sc.lines > 0 {
+	from := sc.from
+	if from < 1 {
+		from = 1
+	}
+	shellCmd = fmt.Sprintf(
+		`awk 'NR >= %d && NR < %d { printf("%%6d\t%%s\n", NR, $0) }' %s`,
+		from,
+		from+sc.lines,
+		sq(path),
+	)
+}
```
```go
// Build the real kas binary once.
kasBinaryPath = filepath.Join(packageDir, "testdata", "kas_bench")
if err := os.MkdirAll(filepath.Dir(kasBinaryPath), 0o755); err != nil {
	fmt.Fprintf(os.Stderr, "bench: mkdir testdata: %v\n", err)
	os.Exit(1)
}
buildCmd := exec.Command("go", "build", "-o", kasBinaryPath, ".")
buildCmd.Dir = repoRoot
buildCmd.Stdout = os.Stderr
buildCmd.Stderr = os.Stderr
if err := buildCmd.Run(); err != nil {
	fmt.Fprintf(os.Stderr, "bench: go build kas: %v\n", err)
	os.Exit(1)
}
```
TestMain unconditionally builds the kas binary (and generates fixtures) for every go test invocation of this package, including unit-only runs like -run TestBuildReport. This can significantly slow down go test ./... and makes lightweight report/unit tests pay the subprocess setup cost. Consider gating the binary build/fixture generation behind an env var (e.g. only when running benchmarks/smoke tests), or splitting report-only tests into a separate package without TestMain side effects.
summary
rather than just assuming the mcp is going to be quicker than letting the agent use standard bash calls, yeah it's great for consistency, but I would like to actually benchmark how fast or slow mcp approach is for read file calls and grep calls. to make sure that while maybe a little bit of extra latency is okay, in the interest of hardening what agents do, as far as bash commands go specifically, and forcing them into newer models and having caching in front and code aware searching. that's worth a little latency, but I would like to actually do a full benchmark on both gpt and claude to see just how well or poorly the mcp server performs for these tasks compared to traditional tooling used by harnesses
Benchmarks the MCP tools (`read_file`, `grep`, `find_files`) against direct function calls (simulating Claude Code built-in tools) and raw Bash subprocess calls, to determine whether the MCP-first policy's latency cost is acceptable relative to its caching, consistency, and code-awareness benefits.

The benchmark suite (`internal/mcpserver/bench/`) uses `go test -bench` to start a real `kas mcp` subprocess via stdio transport, exercise the tools across file sizes and search patterns, and compare against direct function calls (representing built-in tool overhead) and shell-command baselines (representing Bash tool overhead). A `KAS_MCP_NOCACHE=1` toggle isolates the application-level cache benefit. Results are emitted as structured JSON with per-operation, per-arm latency percentiles and overhead multipliers. A supplementary shell script and markdown guide provide empirical validation guidance for a live Claude Code session.

Stack: `testing.B` benchmarks, `mcp-go` client with stdio transport, ristretto cache (toggled via `KAS_MCP_NOCACHE`), synthetic runtime-generated fixtures, `rg`/`fd` CLI tools.

tasks
reviewer notes
Approved. Round 2 — both Round 1 blocking findings resolved.
verified fixes
- `mcp_bench_test.go` — private `sampleCollector` removed; the MCP arm now calls `addBenchSample(sc.key, "mcp_cold"/"mcp_warm", ...)` into the shared package-level map. `flushBenchReport()` is called at the end of `BenchmarkMCP`. The cross-arm JSON report is now correct.
- `report.go:BuildReport` — overhead ratios now use the `mcp_warm` P50 as the MCP numerator (falling back to the legacy `"mcp"` arm for backward compatibility). `MCPvsDirect` and `MCPvsBash` produce meaningful values.
- `report_test.go` — test data updated to `mcp_cold`/`mcp_warm`; a `legacy_mcp_arm_fallback` table case added.

self-fixed (no coder action needed)
- `mcp_bench_test.go:107` — stale doc comment referenced `sampleCollector` after removal; updated to name `addBenchSample`/`benchSamples`.

acceptance criteria
verification
- `go test ./internal/mcpserver/bench/... -run 'TestSmoke_StdioClientListsRequiredTools|TestBuildReport' -v` — 16 tests PASS
- `gofmt -l ./internal/mcpserver/bench/` — no output
- `go vet ./internal/mcpserver/bench/...` — no output

changes
.agents/skills/kasmos-architect/SKILL.md
.agents/skills/kasmos-coder/SKILL.md
.agents/skills/kasmos-fixer/SKILL.md
.agents/skills/kasmos-lifecycle/SKILL.md
.agents/skills/kasmos-master/SKILL.md
.agents/skills/kasmos-planner/SKILL.md
.agents/skills/kasmos-reviewer/SKILL.md
.claude/agents/coder.md
.claude/agents/reviewer.md
internal/mcpserver/bench/SESSION_BENCH.md
internal/mcpserver/bench/baseline_bench_test.go
internal/mcpserver/bench/helpers_test.go
internal/mcpserver/bench/mcp_bench_test.go
internal/mcpserver/bench/report.go
internal/mcpserver/bench/report_test.go
internal/mcpserver/bench/session_bench.sh
internal/mcpserver/bench/smoke_test.go
commits
a2c22a7 fix: wire MCP arm to shared collector, update BuildReport to use mcp_warm for ratios (reviewer self-fix)
37cd0ce fix: gofmt alignment, jq field names, and doc key format (reviewer self-fix)
4393ec2 feat(task-4): live-session validation kit (session_bench.sh + SESSION_BENCH.md)
46d570e feat(task-3): direct and Bash baseline arms
1537ba9 feat(task-2): MCP stdio benchmark arm with cold/warm sub-benchmarks
bd4e4d6 feat(task-1): benchmark harness, runtime fixtures, and report plumbing
stats
.agents/skills/kasmos-architect/SKILL.md | 16 +-
.agents/skills/kasmos-coder/SKILL.md | 6 +-
.agents/skills/kasmos-fixer/SKILL.md | 8 +-
.agents/skills/kasmos-lifecycle/SKILL.md | 4 +-
.agents/skills/kasmos-master/SKILL.md | 4 +-
.agents/skills/kasmos-planner/SKILL.md | 15 +-
.agents/skills/kasmos-reviewer/SKILL.md | 6 +-
.claude/agents/coder.md | 2 +-
.claude/agents/reviewer.md | 2 +-
internal/mcpserver/bench/SESSION_BENCH.md | 161 ++++++++++
internal/mcpserver/bench/baseline_bench_test.go | 392 ++++++++++++++++++++++++
internal/mcpserver/bench/helpers_test.go | 261 ++++++++++++++++
internal/mcpserver/bench/mcp_bench_test.go | 174 +++++++++++
internal/mcpserver/bench/report.go | 161 ++++++++++
internal/mcpserver/bench/report_test.go | 205 +++++++++++++
internal/mcpserver/bench/session_bench.sh | 135 ++++++++
internal/mcpserver/bench/smoke_test.go | 34 ++
17 files changed, 1555 insertions(+), 31 deletions(-)