| 1 | +# XcodeBuildMCP Eval Deep Dive — Token Usage, Performance, and Next Steps |
| 2 | + |
| 3 | +Date: February 24, 2026 |
| 4 | +Author: Codex assistant (session with Cam) |
| 5 | +Run analyzed: `/Users/cameroncooke/Developer/mcp_evals/runs/20260120_225600` |
| 6 | + |
| 7 | +## Executive Summary |
| 8 | + |
| 9 | +The large token delta reported for XcodeBuildMCP vs `shell_primed` is real, but its root cause is often misinterpreted. |
| 10 | + |
| 11 | +- The largest deltas are dominated by **cached input tokens** (replayed context across turns), not uncached “new reasoning” tokens. |
| 12 | +- This behavior is **not unique to XcodeBuildMCP**. It is a general multi-turn/tool-use property. |
| 13 | +- In this eval setup, XcodeBuildMCP incurs extra overhead due to: |
| 14 | + 1. More setup/discovery turns, |
| 15 | + 2. Larger tool surface/context, |
| 16 | + 3. More replay cycles of prior context. |
| 17 | +- Therefore, the reported “+100k tokens” is mostly a **session-level usage accounting effect**, not direct evidence that the model exhausted context window capacity. |
| 18 | + |
| 19 | +That said, XcodeBuildMCP still underperformed `shell_primed` on wall time and error rates in this dataset, so there are real workflow optimization opportunities. |
| 20 | + |
| 21 | +--- |
| 22 | + |
| 23 | +## What We Verified in the Data |
| 24 | + |
| 25 | +### 1) Token delta exists across agents, not just one |
| 26 | + |
| 27 | +Comparing `mcp_unprimed_v2` vs `shell_primed` (mean per run, non-baseline): |
| 28 | + |
| 29 | +- **Codex**: total input `+120,793` tokens |
| 30 | + - uncached `-8,453` |
| 31 | + - cached `+129,246` |
| 32 | +- **Claude Opus**: total input `+84,040` |
| 33 | + - uncached `+111` |
| 34 | + - cached `+83,929` |
| 35 | +- **Claude Sonnet**: total input `+132,568` |
| 36 | + - uncached `+228` |
| 37 | + - cached `+132,340` |
| 38 | + |
| 39 | +Interpretation: the dominant effect is replayed cached context. |
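As a quick arithmetic check on the split above, the sketch below recomputes each agent's total delta from its uncached and cached parts and reports what fraction of the total is cached. The figures are copied from the list above; the `summarize` helper is an illustrative name, not part of the eval harness.

```python
# Mean per-run input-token deltas (mcp_unprimed_v2 minus shell_primed),
# copied from the list above.
deltas = {
    "Codex": {"uncached": -8_453, "cached": 129_246},
    "Claude Opus": {"uncached": 111, "cached": 83_929},
    "Claude Sonnet": {"uncached": 228, "cached": 132_340},
}


def summarize(split):
    """Return (total delta, cached delta as a fraction of the total delta)."""
    total = split["uncached"] + split["cached"]
    return total, split["cached"] / total


for agent, split in deltas.items():
    total, share = summarize(split)
    print(f"{agent}: total {total:+,} tokens, cached/total = {share:.1%}")
```

For Codex the ratio exceeds 100% because the uncached delta is negative: the MCP runs used fewer uncached tokens than `shell_primed`, and the cached increase more than offsets that.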
| 40 | + |
| 41 | +### 2) MCP path has more setup/tool turns |
| 42 | + |
| 43 | +For MCP scenarios, a large percentage of calls are setup/discovery: |
| 44 | +- `session-set-defaults`, `list_schemes`, `list_sims`, `discover_projs` account for most MCP calls. |
| 45 | +- This increases turn count before build/test execution starts. |
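One way to quantify this is to classify each tool call in a run as setup or execution. The following is a minimal sketch, assuming a run can be reduced to an ordered list of tool names; the function names and the example call log are hypothetical, and only the four setup tool names come from this writeup.

```python
from collections import Counter

# Setup/discovery tools named in this writeup.
SETUP_TOOLS = {"session-set-defaults", "list_schemes", "list_sims", "discover_projs"}


def call_mix(tool_calls):
    """Return (setup_count, execution_count, setup_share) for one run."""
    counts = Counter(
        "setup" if name in SETUP_TOOLS else "execution" for name in tool_calls
    )
    total = sum(counts.values())
    share = counts["setup"] / total if total else 0.0
    return counts["setup"], counts["execution"], share


def calls_before_first_execution(tool_calls):
    """Count the setup calls made before the first non-setup call."""
    for index, name in enumerate(tool_calls):
        if name not in SETUP_TOOLS:
            return index
    return len(tool_calls)


# Hypothetical call log; "build" and "test" are placeholders, not real tool names.
example_run = [
    "discover_projs", "list_schemes", "list_sims",
    "session-set-defaults", "build", "test",
]
print(call_mix(example_run))
print(calls_before_first_execution(example_run))
```

In this made-up run, four of six calls are setup and all four precede the first build, which is the pattern described above.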
| 46 | + |
| 47 | +### 3) More turns + bigger replayed context => large cumulative usage |
| 48 | + |
| 49 | +This is the key mechanics point: |
| 50 | +- Context window occupancy at a single turn is not “compounded”. |
| 51 | +- But **total token usage across the session is compounded**, because each turn reprocesses large prior prefixes (mostly cached). |
| 52 | + |
| 53 | +So a reusable prefix that is 4k–10k tokens larger can translate into a very large cumulative token delta over many turns. |
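The mechanics can be sketched with a toy model in which every turn re-reads a fixed prefix plus everything added by earlier turns. All numbers below are illustrative assumptions chosen to match the shape of the argument (a 6k larger prefix and a few extra turns); they are not measurements from this run.

```python
def cumulative_input_tokens(prefix_tokens, tokens_added_per_turn, turns):
    """Total input tokens processed over a session where each turn
    re-reads the prefix plus all content added by earlier turns."""
    total = 0
    for turn in range(turns):
        total += prefix_tokens + turn * tokens_added_per_turn
    return total


def peak_context_tokens(prefix_tokens, tokens_added_per_turn, turns):
    """Input size of the final (largest) turn, i.e. per-turn window occupancy."""
    return prefix_tokens + (turns - 1) * tokens_added_per_turn


# Illustrative assumptions only: prefix size, per-turn growth, and turn count.
shell = dict(prefix_tokens=8_000, tokens_added_per_turn=800, turns=10)
mcp = dict(prefix_tokens=14_000, tokens_added_per_turn=800, turns=14)

cumulative_delta = cumulative_input_tokens(**mcp) - cumulative_input_tokens(**shell)
peak_delta = peak_context_tokens(**mcp) - peak_context_tokens(**shell)
print(f"cumulative delta: {cumulative_delta:,} tokens")
print(f"peak per-turn delta: {peak_delta:,} tokens")
```

Under these assumptions the cumulative gap is roughly 150k tokens while the largest single turn differs by under 10k, which is why session totals and context-window occupancy need to be reported separately.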
| 54 | + |
| 55 | +### 4) Performance and reliability gaps vs `shell_primed` |
| 56 | + |
| 57 | +Across agents, `mcp_unprimed_v2` generally trails `shell_primed` on: |
| 58 | +- wall time, |
| 59 | +- tool errors, |
| 60 | +- time-to-first-build. |
| 61 | + |
| 62 | +Success rates are often comparable, but MCP gets there with more overhead in this run configuration. |
| 63 | + |
| 64 | +--- |
| 65 | + |
| 66 | +## Clarification: Token Usage vs Context Window Usage |
| 67 | + |
| 68 | +A key confusion to correct in reporting: |
| 69 | + |
| 70 | +- **High cached token usage does not automatically mean the context window was “wasted” or exhausted.** |
| 71 | +- Cached tokens are primarily an inference accounting/cost metric summed across turns. |
| 72 | +- Context pressure is a separate per-turn issue (it depends on what is currently in-window, truncation/summarization, etc.). |
| 73 | + |
| 74 | +This distinction should be explicit in future writeups. |
| 75 | + |
| 76 | +--- |
| 77 | + |
| 78 | +## Why the Current Comparison Is Not Fully Fair |
| 79 | + |
| 80 | +`shell_primed` is given deterministic build parameters in the prompt. |
| 81 | + |
| 82 | +`mcp_unprimed` / `mcp_unprimed_v2` are not equivalently pre-seeded; they discover and set defaults during the run. |
| 83 | + |
| 84 | +That asymmetry structurally biases MCP toward extra turns and replay overhead. |
| 85 | + |
| 86 | +--- |
| 87 | + |
| 88 | +## Planned Next Eval (Recommended) |
| 89 | + |
| 90 | +Add a new scenario: |
| 91 | + |
| 92 | +### `mcp_persisted_defaults` |
| 93 | + |
| 94 | +Use production XcodeBuildMCP session-default persistence and pre-seed defaults prior to trial start. |
| 95 | + |
| 96 | +Suggested scenario matrix: |
| 97 | +1. `shell_primed` |
| 98 | +2. `mcp_unprimed_v2` |
| 99 | +3. `mcp_persisted_defaults` (new) |
| 100 | + |
| 101 | +Report these metrics prominently: |
| 102 | +- success rate, |
| 103 | +- wall time, |
| 104 | +- tool errors, |
| 105 | +- time-to-first-build, |
| 106 | +- MCP call mix (setup vs execution), |
| 107 | +- uncached input tokens, |
| 108 | +- cached input tokens, |
| 109 | +- billed cost, |
| 110 | +- cold-equivalent cost. |
| 111 | + |
| 112 | +The primary efficiency metric for cross-agent/model comparisons should be **uncached input tokens** (plus wall time), with cached totals clearly labeled as replay- and accounting-heavy. |
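To keep billed and cold-equivalent cost apart in reporting, a small helper like the one below may be enough. This is a sketch under two stated assumptions: "cold-equivalent" is taken to mean the input cost if every input token were priced at the uncached rate, and the per-million-token rates are placeholders rather than any provider's actual pricing. Output tokens are ignored.

```python
def input_costs(uncached_tokens, cached_tokens, uncached_rate, cached_rate):
    """Return (billed, cold_equivalent) input cost.

    Rates are prices per million tokens. Assumption: cold-equivalent cost
    prices cached tokens as if they had not been cached.
    """
    billed = (uncached_tokens * uncached_rate + cached_tokens * cached_rate) / 1_000_000
    cold_equivalent = (uncached_tokens + cached_tokens) * uncached_rate / 1_000_000
    return billed, cold_equivalent


# Placeholder token counts and rates, for illustration only.
billed, cold = input_costs(
    uncached_tokens=20_000,
    cached_tokens=200_000,
    uncached_rate=3.0,
    cached_rate=0.3,
)
print(f"billed: {billed:.2f}, cold-equivalent: {cold:.2f}")
```

Reporting both figures next to the uncached token count makes it clear how much of a session's apparent size is cache replay rather than new input.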
| 113 | + |
| 114 | +--- |
| 115 | + |
| 116 | +## Opportunities to Improve XcodeBuildMCP |
| 117 | + |
| 118 | +1. **Collapse setup turns** |
| 119 | + - Provide/encourage a single bootstrap call (discover + defaults + selected target/sim) where possible. |
| 120 | + |
| 121 | +2. **Lean response mode for agent workflows** |
| 122 | + - Reduce verbose “next steps” / repeated boilerplate in tool outputs for eval/agent mode. |
| 123 | + |
| 124 | +3. **Task-scoped tool exposure** |
| 125 | + - Reduce visible tool surface when task scope is known (fewer irrelevant tools). |
| 126 | + |
| 127 | +4. **Persisted defaults first-class UX** |
| 128 | + - Make persisted defaults the default recommended flow for repeat sessions and eval harnesses. |
| 129 | + |
| 130 | +5. **Error payload quality** |
| 131 | + - Continue improving MCP error clarity to reduce retry churn and malformed follow-up calls. |
| 132 | + |
| 133 | +6. **Deterministic fast path docs/prompts** |
| 134 | + - Publish a concise “low-turn recipe” for common build/test/install/launch flows to reduce exploratory calls. |
| 135 | + |
| 136 | +--- |
| 137 | + |
| 138 | +## Shortcomings to Address in the Blog Post |
| 139 | + |
| 140 | +1. **Conflating cumulative token usage with context-window pressure** |
| 141 | + - Clarify that high cached totals are mostly replay accounting across turns. |
| 142 | + |
| 143 | +2. **Understating scenario asymmetry** |
| 144 | + - Explicitly call out that `shell_primed` had stronger deterministic priming than MCP scenarios. |
| 145 | + |
| 146 | +3. **Insufficient emphasis on uncached vs cached split** |
| 147 | + - Show uncached deltas separately; this changes interpretation of “waste”. |
| 148 | + |
| 149 | +4. **Need stronger caveat about provider/accounting differences** |
| 150 | + - Keep billed vs cold-equivalent vs uncached separated and explained. |
| 151 | + |
| 152 | +5. **Frame result as workflow/tooling optimization target, not MCP category verdict** |
| 153 | + - Current result is about this tool surface + run shape + harness design, not all MCP usage. |
| 154 | + |
| 155 | +--- |
| 156 | + |
| 157 | +## Final Position |
| 158 | + |
| 159 | +The current data does not support “MCP inherently wastes context window.” |
| 160 | + |
| 161 | +It does support: |
| 162 | +- MCP workflow shape in this eval caused more turns and larger replayed context, |
| 163 | +- this inflated cumulative token usage (mostly cached), |
| 164 | +- and increased wall-time/error overhead versus a strongly primed shell baseline. |
| 165 | + |
| 166 | +The next decisive test is `mcp_persisted_defaults` under matched priming conditions. |