Generalized autoresearch for Claude Code, packaged as an installable plugin.
`/ratchet` wraps any command that emits a numeric metric in an autonomous optimization loop: Claude proposes a code change, Python runs the trials, averages the scores, and decides whether the change improved the metric, and Claude keeps or reverts it accordingly. Repeat until a threshold is reached, the iteration cap is hit, or improvement plateaus.

Karpathy's original autoresearch is hardcoded to ML training (`val_bpb` on `train.py`). This generalization works for any objective: model accuracy, test-suite wall time, p95 latency, code-golf line counts, anything the user can express as a single command that prints a metric.
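In outline, the loop looks something like this. This is a minimal Python sketch of the control flow only, with placeholder callables standing in for Claude's edits and the trial runner; it is not the plugin's actual code, and the stop rules beyond the hard cap are covered in the flags table below.

```python
from statistics import mean
from typing import Callable

def ratchet_loop(propose: Callable[[], None], run_trial: Callable[[], float],
                 keep: Callable[[], None], revert: Callable[[], None],
                 baseline: float, direction: str = "min",
                 trials: int = 1, iterations: int = 20) -> float:
    """Illustrative control flow only; helper callables are hypothetical."""
    best = baseline
    for _ in range(iterations):
        propose()                                  # Claude edits the code
        avg = mean(run_trial() for _ in range(trials))
        if (avg < best) if direction == "min" else (avg > best):
            best = avg
            keep()                                 # git commit
        else:
            revert()                               # git reset --hard
    return best
```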
```
/plugin marketplace add mweiden/ratchet
/plugin install ratchet@ratchet-marketplace
```

Or, locally during development:

```
/plugin marketplace add /path/to/ratchet
/plugin install ratchet@ratchet-marketplace
```

From inside any git repo with a clean working tree:
```
/ratchet --run-command "pytest -q tests/" \
  --direction min \
  --metric-name "wall_time" \
  --report-source stdout \
  --trials 3 \
  --iterations 30 \
  --patience 5 \
  --instructions "Do not weaken any assertion. Do not skip tests. Performance only."
```
| flag | required | default | description |
|---|---|---|---|
| `--run-command` | yes | – | shell command that produces the report |
| `--direction` | yes | – | `max` or `min` |
| `--metric-name` | yes | – | name of the metric inside the report |
| `--report-source` | yes | – | `stdout`, `stderr`, or a path read after each run |
| `--trials` | no | 1 | runs per iteration; averaged |
| `--iterations` | no | 20 | hard cap |
| `--threshold` | no | none | early stop when best score crosses this |
| `--patience` | no | 0 (off) | stop after K iterations without improvement |
| `--min-improvement` | no | 0 | required margin over current best to count as "improve" |
| `--max-consecutive-crashes` | no | 3 | hard stop on N crashes in a row |
| `--trial-timeout-seconds` | no | none | per-trial wall-clock cap |
| `--critic` | no | off | enable two-proposer + judge mode |
| `--instructions` / `--instructions-file` | no | `""` | run-specific integrity rules |
| Python (deterministic) | Claude (LLM) |
|---|---|
| Loop control & iteration counter | What code to change |
| Trial execution + report capture | Editing files |
| Averaging trial scores | Parsing the metric out of the report |
| Improve / regress / threshold logic | Describing the change |
| Stop conditions (cap, threshold, patience, crash budget) | Crash diagnosis |
| Persistence (`.ratchet/config.json`, `results.jsonl`) | |
| `git commit` / `git reset` | |
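On the Python side, a single trial is essentially a subprocess call plus a capture rule. A sketch under the semantics above; `run_one_trial` is an illustrative name, not the actual `runner.py` API.

```python
import subprocess
from pathlib import Path

def run_one_trial(run_command: str, report_source: str,
                  timeout: float | None = None) -> str:
    """Run the command once and return the raw report text.
    Illustrative only; the plugin's runner.py handles more, e.g.
    crash accounting and archiving reports under .ratchet/reports/."""
    proc = subprocess.run(run_command, shell=True, capture_output=True,
                          text=True, timeout=timeout)
    # raises TimeoutExpired on timeout (assumed to count against the crash budget)
    if report_source == "stdout":
        return proc.stdout
    if report_source == "stderr":
        return proc.stderr
    return Path(report_source).read_text()   # report written to a file
```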
The skill loads `instructions/integrity.md` at every run. Universal rule: edit the subject under optimization, never the evaluation harness (test cases, eval data, scoring code, the report source). Examples and forbidden moves are spelled out there.

`--instructions` (or `--instructions-file`) layers run-specific rules on top. They're echoed in every status response so the model is re-anchored each iteration. The model is forbidden from editing `.ratchet/`, `config.json`, or the path named in `--report-source`.
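An `--instructions-file` is just free-form text; a purely hypothetical example:

```
Do not weaken any assertion. Do not skip or delete tests.
Keep public function signatures unchanged.
Performance-only changes; no new dependencies.
```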
```
.claude-plugin/marketplace.json
plugins/ratchet/
├── .claude-plugin/plugin.json
├── skills/ratchet/
│   ├── SKILL.md
│   ├── instructions/           # lazy-loaded
│   │   ├── integrity.md
│   │   ├── critic-mode.md
│   │   ├── debugging.md
│   │   └── reverting.md
│   └── scripts/
│       ├── ratchet.py          # CLI entry (init/status/run-trial/...)
│       ├── config.py runner.py decision.py store.py
│       ├── test.sh
│       └── tests/              # stdlib unittest, no pip deps
└── agents/
    ├── ratchet-proposer-a.md
    ├── ratchet-proposer-b.md
    └── ratchet-critic.md
```
```
plugins/ratchet/skills/ratchet/scripts/test.sh
```

Stdlib only; needs Python ≥ 3.10 and git on PATH. Covers config validation + round-trip, runner (stdout/stderr/file capture, timeout, non-zero exit), decision logic (improve/regress/no-change with `min_improvement`, threshold, all stop conditions), store (init, archive on re-init, history, derived state), and end-to-end CLI flow against a temp git repo.
```
.ratchet/
├── config.json                      # current run config + run_id
├── results.jsonl                    # one record per finalized iteration
├── current.json                     # in-progress iteration buffer
├── reports/iter-NNNN-trial-NN.txt   # captured reports (preserved for debugging)
└── runs/<run_id>/                   # archive of prior runs (re-init bumps the current run here)
```
`results.jsonl` records: `iter`, `hypothesis`, `trial_scores`, `avg`, `decision`, `kept`, `description`, `sha`, `timestamp`. Treat it as a research log: predicted (`hypothesis`) vs. observed (`avg`, `decision`).
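A few lines of stdlib Python are enough to read the log back as predicted vs. observed; a sketch assuming the field names above (the exact record shape may vary):

```python
import json

with open(".ratchet/results.jsonl") as f:
    for line in f:
        r = json.loads(line)                       # one finalized iteration
        mark = "kept" if r["kept"] else "reverted"
        print(f"iter {r['iter']:>3}  predicted: {r['hypothesis']!r}  "
              f"observed: avg={r['avg']:.4f} -> {r['decision']} ({mark})")
```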
MIT.