Generalized autoresearch for Claude Code, packaged as an installable plugin.
`/ratchet` wraps any command that emits a numeric metric in an autonomous optimization loop: Claude proposes a code change, Python runs the trials, averages the scores, and decides whether the change improved the metric, and Claude keeps or reverts it accordingly. Repeat until a threshold is reached, the iteration cap is hit, or improvement plateaus.

Karpathy's original autoresearch is hardcoded to ML training (`val_bpb` on `train.py`). This generalization works for any objective: model accuracy, test-suite wall time, p95 latency, code-golf line counts, anything the user can express as a single command that prints a metric.
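In outline, the loop looks something like this. This is a minimal Python sketch of the control flow only, with placeholder callables standing in for Claude's edits and the trial runner; it is not the plugin's actual code, and the stop rules beyond the hard cap are covered in the flags table below.

```python
from statistics import mean
from typing import Callable

def ratchet_loop(propose: Callable[[], None], run_trial: Callable[[], float],
                 keep: Callable[[], None], revert: Callable[[], None],
                 baseline: float, direction: str = "min",
                 trials: int = 1, iterations: int = 20) -> float:
    """Illustrative control flow only; helper callables are hypothetical."""
    best = baseline
    for _ in range(iterations):
        propose()                                  # Claude edits the code
        avg = mean(run_trial() for _ in range(trials))
        if (avg < best) if direction == "min" else (avg > best):
            best = avg
            keep()                                 # git commit
        else:
            revert()                               # git reset --hard
    return best
```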
```
/plugin marketplace add mweiden/ratchet
/plugin install ratchet@ratchet-marketplace
```

Or, locally during development:

```
/plugin marketplace add /path/to/ratchet
/plugin install ratchet@ratchet-marketplace
```

From inside any git repo with a clean working tree:
```
/ratchet --run-command "pytest -q tests/" \
  --direction min \
  --metric-name "wall_time" \
  --report-source stdout \
  --trials 3 \
  --iterations 30 \
  --patience 5 \
  --instructions "Do not weaken any assertion. Do not skip tests. Performance only."
```
| flag | required | default | description |
|---|---|---|---|
| `--run-command` | yes | – | shell command that produces the report |
| `--direction` | yes | – | `max` or `min` |
| `--metric-name` | yes | – | name of the metric inside the report |
| `--report-source` | yes | – | `stdout`, `stderr`, or a path read after each run |
| `--trials` | no | 1 | runs per iteration; averaged |
| `--iterations` | no | 20 | hard cap |
| `--threshold` | no | none | early stop when best score crosses this |
| `--patience` | no | 0 (off) | stop after K iterations without improvement |
| `--min-improvement` | no | 0 | required margin over current best to count as "improve" |
| `--max-consecutive-crashes` | no | 3 | hard stop on N crashes in a row |
| `--trial-timeout-seconds` | no | none | per-trial wall-clock cap |
| `--critic` | no | off | enable two-proposer + judge mode |
| `--instructions` / `--instructions-file` | no | `""` | run-specific integrity rules |
| Python (deterministic) | Claude (LLM) |
|---|---|
| Loop control & iteration counter | What code to change |
| Trial execution + report capture | Editing files |
| Averaging trial scores | Parsing the metric out of the report |
| Improve / regress / threshold logic | Describing the change |
| Stop conditions (cap, threshold, patience, crash budget) | Crash diagnosis |
| Persistence (`.ratchet/config.json`, `results.jsonl`) | |
| `git commit` / `git reset` | |
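On the Python side, a single trial is essentially a subprocess call plus a capture rule. A sketch under the semantics above; `run_one_trial` is an illustrative name, not the actual `runner.py` API.

```python
import subprocess
from pathlib import Path

def run_one_trial(run_command: str, report_source: str,
                  timeout: float | None = None) -> str:
    """Run the command once and return the raw report text.
    Illustrative only; the plugin's runner.py handles more, e.g.
    crash accounting and archiving reports under .ratchet/reports/."""
    proc = subprocess.run(run_command, shell=True, capture_output=True,
                          text=True, timeout=timeout)
    # raises TimeoutExpired on timeout (assumed to count against the crash budget)
    if report_source == "stdout":
        return proc.stdout
    if report_source == "stderr":
        return proc.stderr
    return Path(report_source).read_text()   # report written to a file
```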
The skill loads `instructions/integrity.md` at every run. Universal rule: edit the subject under optimization, never the evaluation harness (test cases, eval data, scoring code, the report source). Examples and forbidden moves are spelled out there.

`--instructions` (or `--instructions-file`) layers run-specific rules on top. They're echoed in every status response so the model is re-anchored each iteration. The model is forbidden from editing `.ratchet/`, `config.json`, or the path named in `--report-source`.
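An `--instructions-file` is just free-form text; a purely hypothetical example:

```
Do not weaken any assertion. Do not skip or delete tests.
Keep public function signatures unchanged.
Performance-only changes; no new dependencies.
```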
```
.claude-plugin/marketplace.json
plugins/ratchet/
├── .claude-plugin/plugin.json
├── skills/ratchet/
│   ├── SKILL.md
│   ├── instructions/           # lazy-loaded
│   │   ├── integrity.md
│   │   ├── critic-mode.md
│   │   ├── debugging.md
│   │   └── reverting.md
│   └── scripts/
│       ├── ratchet.py          # CLI entry (init/status/run-trial/...)
│       ├── config.py runner.py decision.py store.py
│       ├── test.sh
│       └── tests/              # stdlib unittest, no pip deps
└── agents/
    ├── ratchet-proposer-a.md
    ├── ratchet-proposer-b.md
    └── ratchet-critic.md
```
```
plugins/ratchet/skills/ratchet/scripts/test.sh
```

Stdlib only; needs Python ≥ 3.10 and git on PATH. Covers config validation + round-trip, runner (stdout/stderr/file capture, timeout, non-zero exit), decision logic (improve/regress/no-change with `min_improvement`, threshold, all stop conditions), store (init, archive on re-init, history, derived state), and end-to-end CLI flow against a temp git repo.
```
.ratchet/
├── config.json                      # current run config + run_id
├── results.jsonl                    # one record per finalized iteration
├── current.json                     # in-progress iteration buffer
├── reports/iter-NNNN-trial-NN.txt   # captured reports (preserved for debugging)
└── runs/<run_id>/                   # archive of prior runs (re-init bumps the current run here)
```
`results.jsonl` records: `iter`, `hypothesis`, `trial_scores`, `avg`, `decision`, `kept`, `description`, `sha`, `timestamp`. Treat it as a research log: predicted (`hypothesis`) vs. observed (`avg`, `decision`).
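A few lines of stdlib Python are enough to read the log back as predicted vs. observed; a sketch assuming the field names above (the exact record shape may vary):

```python
import json

with open(".ratchet/results.jsonl") as f:
    for line in f:
        r = json.loads(line)                       # one finalized iteration
        mark = "kept" if r["kept"] else "reverted"
        print(f"iter {r['iter']:>3}  predicted: {r['hypothesis']!r}  "
              f"observed: avg={r['avg']:.4f} -> {r['decision']} ({mark})")
```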
MIT.