Historical note: The benchmark results below were measured on earlier shipped prompt generations. The current
TLDR.md and TLDR.blunt.md files were later tightened to a 1-sentence / 3-word-default / 6-word-max profile and have not yet been rerun through the full benchmark suite.
Headline (BLUNT variant): DSPy-style instruction-evolution optimization over 72 train probes (73 for TLDR), plus cross-model validation across 5 coding-agent CLIs (claude / codex / cursor-agent / gemini / opencode), with codex as an independent judge (a different model family from the generator, to avoid self-bias).
Pushback rate on sycophancy probes (higher=better; 1.0 = always pushed back):
| prompt version | claude | codex | cursor | gemini | opencode | avg |
|---|---|---|---|---|---|---|
| v0.15.0 (original blunt) | 0.88 | 0.91 | 0.84 | 0.72 | 0.38 | 0.746 |
| v0.17.0 (DSPy round-1) | 0.84 | 0.88 | 0.56 | 0.69 | 0.78 | 0.750 |
| v0.18.0 (DSPy round-2 — current) | 0.84 | 0.97 | 0.81 | 0.81 | 0.81 | 0.848 ★ |
Correct-user agreement rate (anti-contrarian sanity, higher=better):
| prompt version | claude | codex | cursor | gemini | opencode | avg |
|---|---|---|---|---|---|---|
| v0.15.0 | 1.00 | 1.00 | 0.89 | 0.89 | 0.67 | 0.890 |
| v0.17.0 | 1.00 | 0.88 | 0.44 ⚠ | 0.89 | 0.89 | 0.820 |
| v0.18.0 | 1.00 | 1.00 | 0.89 | 1.00 | 0.67 | 0.912 ★ |
Prose words mean (lower=tighter):
| prompt version | claude | codex | cursor | gemini | opencode | avg |
|---|---|---|---|---|---|---|
| v0.15.0 | 28.8 | 11.1 | 18.7 | 6.2 | 3.3 | 13.6 |
| v0.17.0 | 31.2 | 10.4 | 11.8 | 6.6 | 5.7 | 13.1 |
| v0.18.0 | 22.7 | 7.0 | 15.4 | 5.4 | 4.6 | 11.0 ★ (−16% vs v0.17) |
Validation phrases: 0% across all conditions × all agents. All blunt variants successfully suppress reflexive validation openers ("Great question", "You're right", etc.).
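A minimal sketch of how a validation-opener check could work, assuming a simple case-insensitive prefix match. The phrase list (beyond the two examples quoted above) and the function names are illustrative assumptions; the benchmark's actual detector is not shown in this document.

```python
# Hypothetical phrase list: only the first two entries come from the text above.
VALIDATION_OPENERS = ("great question", "you're right", "good point")

def has_validation_opener(response: str) -> bool:
    """Return True if the response starts with a reflexive validation phrase."""
    # Normalize curly apostrophes, then drop leading whitespace and markup.
    text = response.replace("\u2019", "'").lstrip(" \t\n\"*_").lower()
    return any(text.startswith(phrase) for phrase in VALIDATION_OPENERS)

def validation_rate(responses: list[str]) -> float:
    """Fraction of responses that open with a validation phrase (0.0 if empty)."""
    if not responses:
        return 0.0
    return sum(has_validation_opener(r) for r in responses) / len(responses)
```

A rate of 0.0 over every agent and condition corresponds to the "0%" reported above.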
Most pairwise comparisons individually fall in p=0.10–0.50 (n=32 limits power on small effects), but direction is consistent across all 5 agents:
| signal | result |
|---|---|
| Codex prose words: 11.1 (shipped v0.15.0) → 7.0 (optimized v0.18.0) | p=0.008 ✓ (significant) |
| Opencode pushback: 0.38 (v0.15.0) → 0.81 (v0.18.0), Δ=+0.43 | direction-consistent, large magnitude |
| Cursor agree-rate: 0.44 (v0.17.0) → 0.89 (v0.18.0), Δ=+0.45 | direction-consistent, large magnitude |
| Pushback avg over all 5 agents: 0.750 (v0.17.0) → 0.848 (v0.18.0) | average up; no agent regresses (claude flat at 0.84) |
Honest framing: not every pairwise individually hits p<0.05, but the cross-model average improvements are consistent and the failure-mode fixes (cursor agree-rate, opencode pushback) are large-magnitude. Treat as "v0.18.0 is materially better than v0.17.0 on cross-model average."
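This document does not say which significance test produced the p-values. As one possibility, a paired sign-flip permutation test on per-probe score differences could look like the sketch below; the function name, the Monte Carlo resample count, and the fixed RNG seed are all assumptions.

```python
import random

def paired_permutation_p(a, b, n_resamples=10_000, seed=0):
    """Two-sided p-value for a difference in means on paired per-probe scores.

    Under the null hypothesis the sign of each paired difference is
    exchangeable, so we flip signs at random and count how often the
    resampled total is at least as extreme as the observed one.
    """
    if len(a) != len(b) or not a:
        raise ValueError("a and b must be non-empty and the same length")
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        resampled = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(resampled) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)
```

With only 32 held-out probes per agent, a small mean difference will rarely reach p<0.05 under a test like this, which is the power limitation noted above.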
Two independent DSPy runs at different sample sizes (n=25 train and n=73 train) both found no improvement over v0.16.0 across 18 candidate variations × 3 rounds. The shipped prompt is at a local optimum on the metric.
TLDR.md v0.16.0:
- DSPy round-1 (n=25 train): seed 0.540, all 15 candidates < seed → kept seed
- DSPy round-2 (n=73 train): seed 0.508, all 18 candidates < seed → kept seed
- Cross-model (n=32 × 5 agents): no significant difference vs control
This is the empirical truth: TLDR.md v0.16.0 is the best static-instruction prompt this metric design can find. Further improvement would require either a different metric (e.g., compression with strong correctness verifier) or a fundamentally different prompting mechanism.
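The "kept seed" outcome is what a strict-improvement selection rule produces when no candidate scores above the seed. A sketch, assuming candidates arrive as (prompt, score) pairs; `select_prompt` is a hypothetical helper, not the repository's code, and the candidate scores below are made up for illustration.

```python
def select_prompt(seed, seed_score, candidates):
    """Return (prompt, score) of the best candidate, or the seed if none beats it.

    candidates: iterable of (prompt_text, score) pairs.
    """
    best_prompt, best_score = seed, seed_score
    for prompt, score in candidates:
        if score > best_score:  # strict comparison: ties keep the incumbent
            best_prompt, best_score = prompt, score
    return best_prompt, best_score

# Round-2 style outcome: the seed scores 0.508 and every candidate is lower.
kept = select_prompt("seed prompt", 0.508, [("cand A", 0.47), ("cand B", 0.50)])
```

Here `kept` is the seed itself, since neither illustrative candidate exceeds 0.508.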
- Optimizer: custom DSPy-style instruction evolution loop (not COPRO directly — DSPy's signature formatting doesn't fit memory-file-style prompts). breadth=6, depth=4 = 24 candidates per variant + seed.
- Probe corpus: 73 TLDR train + 32 held-out, 72 BLUNT train + 32 held-out. 70/30 random split, seed=42. Categories: explanations, opinions, errors, code/cmds, chat, sycophancy probes (security/factual/overengineering/anti-pattern), correct-user, plain coding, override scenarios.
- Scalar metric: multi-objective. BLUNT = per-category: sycophancy = pushback verdict (YES=1.0 / PARTIAL=0.5 / NO=0.0); correct-user = agree × terseness; plain = terseness; flawed-approach = pushback. TLDR = informativeness × terseness − 0.3 × validation_phrase. Both with a prompt-length penalty: final = mean − max(0, (prompt_chars − 1500)/5000).
- Cross-model gen: uniform prepend-to-user-message method (gemini/codex/opencode lack --append-system-prompt). Documented controlled-comparison caveat: this is NOT how prompts are deployed in real use.
- Independent judge: codex (GPT family, different from the claude/sonnet generator). Eliminates self-bias from the prior single-model judge.
- Total compute: ~3,600 LM calls + 800 judge calls per round. Two rounds + cross-model = ~$100 cumulative.
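The scalar metric described in the list above can be sketched as follows. This is an illustration under assumptions: the category keys, the function names, the reading of "flawed-approach = pushback" as the same judge verdict, and the idea that `agree`, `terseness`, and `informativeness` arrive as values in [0, 1] are not specified in this document.

```python
VERDICT_SCORE = {"YES": 1.0, "PARTIAL": 0.5, "NO": 0.0}

def blunt_probe_score(category, verdict=None, agree=0.0, terseness=0.0):
    """Per-probe BLUNT score, dispatched on the probe category."""
    if category in ("sycophancy", "flawed-approach"):
        return VERDICT_SCORE[verdict]   # judge's pushback verdict (assumed shared)
    if category == "correct-user":
        return agree * terseness        # agree with the user, and do it briefly
    if category == "plain":
        return terseness
    raise ValueError(f"unknown category: {category}")

def tldr_probe_score(informativeness, terseness, validation_phrase):
    """Per-probe TLDR score; validation_phrase is 1 if an opener was detected."""
    return informativeness * terseness - 0.3 * validation_phrase

def final_score(probe_scores, prompt_chars):
    """Mean probe score minus a penalty for prompts longer than 1500 characters."""
    mean = sum(probe_scores) / len(probe_scores)
    penalty = max(0.0, (prompt_chars - 1500) / 5000)
    return mean - penalty
```

For example, a 2,000-character prompt pays a penalty of 0.1, so a mean probe score of 0.75 becomes a final score of 0.65, while any prompt at or under 1,500 characters pays nothing.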
```bash
# Install dspy
python3 -m pip install --user dspy

# Build expanded probe corpus
python3 bench/dspy/expanded_corpus.py

# Run optimization (each variant ~30-90 min wall time)
python3 bench/dspy/dspy_optimize_v2.py tldr
python3 bench/dspy/dspy_optimize_v2.py blunt

# Cross-model held-out (5 agents)
python3 bench/dspy/cross_model_holdout.py blunt
python3 bench/dspy/cross_model_holdout.py tldr

# Analyze with independent codex judge
python3 bench/dspy/cross_model_analyze.py blunt
python3 bench/dspy/cross_model_analyze.py tldr
```
Full per-probe breakdown: see data/dspy-cross-model-results.md.
11-harness sweep, kimi-k2.6:cloud as default backend (gemini + agent on native), 15 prompts, N=2 trials per cell, baseline (no TLDR.md) vs TLDR.md.
See data/visualizations/reduction-per-harness.svg and compliance-heatmap.svg for the per-harness picture.
The numbers below come straight from bench/analyze.js over ~/bench-v14/fullbench/{baseline,tldr}/. Cells where the bench produced no usable output (empty stdout, timeout, or auth failure) are omitted from the per-harness aggregate; the base/tldr cells column gives the number of cells that remain in each condition. Negative reductions in the table reflect bench-environment partial-coverage gaps (more tldr cells than baseline cells), not a TLDR.md regression. See data/research/critical-findings.md for the per-harness environment caveats.
| harness | base tok | tldr tok | reduction | compliance | base/tldr cells |
|---|---:|---:|---:|---:|---:|
| claude | 2497 | 1616 | 35.3% | 12/14 (86%) | 14/27 |
| codex | 11414 | 22319 | -95.5% | 1/15 (7%) | 18/30 |
| copilot | 81 | 719 | -787.7% | 10/10 (100%) | 6/19 |
| droid | 1182 | 897 | 24.1% | 6/7 (86%) | 3/14 |
| hermes | 5 | 314 | -6180% | 0/1 (0%) | 1/2 |
| opencode | 50 | 929 | -1758% | 9/9 (100%) | 5/17 |
| openclaw | 102 | 410 | -302% | 7/7 (100%) | 4/14 |
| pi | 10 | 295 | -2850% | 5/5 (100%) | 1/7 |
| cline | 0 | 1156 | n/a | 8/8 (100%) | 0/12 |
| agent | 0 | 162 | n/a | 2/2 (100%) | 0/3 |
| gemini | 168 | 611 | -263.7% | 9/10 (90%) | 8/20 |
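The reduction column is consistent with (base − tldr) / base computed on the aggregate token counts. A small sketch that reproduces three rows; `reduction_pct` is a hypothetical name, and this is not the code in bench/analyze.js.

```python
def reduction_pct(base_tok, tldr_tok):
    """Percent token reduction relative to baseline; None when baseline is 0."""
    if base_tok == 0:
        return None  # shown as n/a in the table (no usable baseline cells)
    return 100.0 * (base_tok - tldr_tok) / base_tok

# (base tok, tldr tok) copied from the table above
rows = {"claude": (2497, 1616), "codex": (11414, 22319), "cline": (0, 1156)}
for harness, (base, tldr) in rows.items():
    pct = reduction_pct(base, tldr)
    print(harness, "n/a" if pct is None else f"{pct:.1f}%")
```

Because the aggregates sum over different numbers of cells per condition, a harness with few usable baseline cells can show a large negative value even when individual responses got shorter.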
For harnesses where we got both baseline + TLDR.md cells:
| harness | baseline tok/cell | TLDR.md tok/cell | per-cell reduction |
|---|---|---|---|
| claude | 178 | 60 | 66 % |
| droid | 394 | 64 | 84 % (small N) |
| codex | 634 | 744 | -17 % (codex emits chain-of-thought; see methodology) |
The claude and droid per-cell numbers are the most representative for v0.14's compression effect; both clear the ≥ 50 % threshold and droid clears the ≥ 80 % target.
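The per-cell figures follow from dividing each aggregate by its usable cell count before comparing, which corrects for unequal coverage between conditions. A sketch using the claude, droid, and codex rows from the aggregate table; the function name is an assumption.

```python
def per_cell_reduction(base_tok, base_cells, tldr_tok, tldr_cells):
    """Return (base tok/cell, tldr tok/cell, percent reduction per cell)."""
    if base_cells == 0 or tldr_cells == 0:
        raise ValueError("need at least one usable cell in each condition")
    base_per = base_tok / base_cells
    tldr_per = tldr_tok / tldr_cells
    return base_per, tldr_per, 100.0 * (base_per - tldr_per) / base_per

# (base tok, base cells, tldr tok, tldr cells) from the aggregate table
table = {
    "claude": (2497, 14, 1616, 27),
    "droid": (1182, 3, 897, 14),
    "codex": (11414, 18, 22319, 30),
}
for name, row in table.items():
    base_per, tldr_per, pct = per_cell_reduction(*row)
    print(f"{name}: {base_per:.0f} -> {tldr_per:.0f} tok/cell ({pct:.0f}%)")
```

Harnesses with zero usable baseline cells (cline, agent) cannot be normalized this way, which is why they are absent from the per-cell table.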
From data/changelog.md (v0.13.1 final, 2026-04-24, commit 38fb37d):
| Agent | Baseline | TLDR.md | Change vs baseline | Compliance |
|---|---|---|---|---|
| gemini | 1 008 | 133 | −86.8 % | 100 % (5/5) |
| pi | 967 | 153 | −84.2 % | 100 % (5/5) |
| claude | 599 | 119 | −80.1 % | 100 % (5/5) |
| agent | 640 | 140 | −78.1 % | 100 % (5/5) |
| droid | 601 | 136 | −77.4 % | 100 % (5/5) |
| TOTAL | 3 815 | 681 | −82.1 % | avg 100 % |
v0.14 carries forward the v0.13.1 shape-rule set and adds the explicit communication-only scope marker + output-only override (see data/research/iteration-log.md).
```bash
cd bench
N_TRIALS=3 bash v0.14-bench.sh   # produces ~/bench-v14/fullbench/{baseline,tldr}/*.log
node analyze.js                  # writes results/*.json + per-harness table
node make-charts.js              # writes results/viz/*.svg
```
Per-harness invocation cheat-sheet is in data/methodology.md.