mweiden/ratchet

ratchet

Generalized autoresearch for Claude Code, packaged as an installable plugin.

/ratchet wraps any command that emits a numeric metric in an autonomous optimization loop: Claude proposes a code change; Python runs trials, averages the scores, and decides whether the change improved the metric; Claude then keeps or reverts the change. This repeats until a threshold is reached, the iteration cap is hit, or improvement plateaus.
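The loop can be sketched roughly as below. All names here (`propose`, `run_fn`, `revert`) are illustrative stand-ins, not the plugin's actual API: `propose`/`revert` stand in for Claude editing files and `git reset`, and `run_fn` stands in for running `--run-command` and extracting the metric.

```python
import statistics

def run_trials(run_fn, trials):
    """Run one iteration's trials and average the metric."""
    return statistics.mean(run_fn() for _ in range(trials))

def ratchet(propose, run_fn, revert, direction="min",
            trials=3, iterations=30, patience=5):
    """Keep/revert loop over proposed changes.

    Simplified sketch: the real plugin also supports --threshold,
    --min-improvement, and a crash budget, omitted here for brevity.
    """
    better = (lambda a, b: a < b) if direction == "min" else (lambda a, b: a > b)
    best = run_trials(run_fn, trials)      # baseline before any change
    stale = 0
    for _ in range(iterations):
        propose()                          # Claude edits the code
        avg = run_trials(run_fn, trials)
        if better(avg, best):
            best, stale = avg, 0           # improvement: keep the change
        else:
            revert()                       # regression: roll back
            stale += 1
            if stale >= patience:          # improvement plateaued
                break
    return best
```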

Karpathy's original autoresearch is hardcoded to ML training (val_bpb on train.py). This generalization works for any objective: model accuracy, test-suite wall time, p95 latency, code-golf line counts, anything the user can express as a single command that prints a metric.

Install

/plugin marketplace add mweiden/ratchet
/plugin install ratchet@ratchet-marketplace

Or, locally during development:

/plugin marketplace add /path/to/ratchet
/plugin install ratchet@ratchet-marketplace

Use

From inside any git repo with a clean working tree:

/ratchet --run-command "pytest -q tests/" \
         --direction min \
         --metric-name "wall_time" \
         --report-source stdout \
         --trials 3 \
         --iterations 30 \
         --patience 5 \
         --instructions "Do not weaken any assertion. Do not skip tests. Performance only."

| flag | required | default | description |
| --- | --- | --- | --- |
| `--run-command` | yes | — | shell command that produces the report |
| `--direction` | yes | — | `max` or `min` |
| `--metric-name` | yes | — | name of the metric inside the report |
| `--report-source` | yes | — | `stdout`, `stderr`, or a path read after each run |
| `--trials` | no | 1 | runs per iteration; averaged |
| `--iterations` | no | 20 | hard cap |
| `--threshold` | no | none | early stop when best score crosses this |
| `--patience` | no | 0 (off) | stop after K iterations without improvement |
| `--min-improvement` | no | 0 | required margin over current best to count as "improve" |
| `--max-consecutive-crashes` | no | 3 | hard stop on N crashes in a row |
| `--trial-timeout-seconds` | no | none | per-trial wall-clock cap |
| `--critic` | no | off | enable two-proposer + judge mode |
| `--instructions` / `--instructions-file` | no | `""` | run-specific integrity rules |
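A single trial amounts to running the command, picking the report source, and pulling the metric out of it. The sketch below is hypothetical, not the real `scripts/runner.py` API, and in the plugin itself Claude (not a regex) parses the metric out of the report; a regex stands in for that step here.

```python
import re
import subprocess

def run_trial(run_command, report_source="stdout",
              metric_name="wall_time", timeout=None):
    """Run the command once and extract `metric_name` from the report.

    Raises subprocess.TimeoutExpired if the per-trial wall-clock cap
    (cf. --trial-timeout-seconds) is exceeded.
    """
    proc = subprocess.run(
        run_command, shell=True, capture_output=True,
        text=True, timeout=timeout,
    )
    if report_source == "stdout":
        report = proc.stdout
    elif report_source == "stderr":
        report = proc.stderr
    else:                                   # otherwise: a file path read after the run
        with open(report_source) as f:
            report = f.read()
    # Look for lines like "wall_time: 12.34" in the report.
    m = re.search(rf"{re.escape(metric_name)}\s*[:=]\s*([-+\d.eE]+)", report)
    if m is None:
        raise ValueError(f"{metric_name} not found in report")
    return float(m.group(1))
```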

How it splits work

| Python (deterministic) | Claude (LLM) |
| --- | --- |
| Loop control & iteration counter | What code to change |
| Trial execution + report capture | Editing files |
| Averaging trial scores | Parsing the metric out of the report |
| Improve / regress / threshold logic | Describing the change |
| Stop conditions (cap, threshold, patience, crash budget) | Crash diagnosis |
| Persistence (`.ratchet/config.json`, `results.jsonl`) | `git commit` / `git reset` |

Integrity

The skill loads instructions/integrity.md at every run. Universal rule: edit the subject under optimization, never the evaluation harness (test cases, eval data, scoring code, the report source). Examples and forbidden moves are spelled out there.

--instructions (or --instructions-file) layers run-specific rules on top. They're echoed in every status response so the model is re-anchored each iteration. The model is forbidden from editing .ratchet/, config.json, or the path named in --report-source.

Layout

.claude-plugin/marketplace.json
plugins/ratchet/
├── .claude-plugin/plugin.json
├── skills/ratchet/
│   ├── SKILL.md
│   ├── instructions/                  # lazy-loaded
│   │   ├── integrity.md
│   │   ├── critic-mode.md
│   │   ├── debugging.md
│   │   └── reverting.md
│   └── scripts/
│       ├── ratchet.py                 # CLI entry (init/status/run-trial/...)
│       ├── config.py runner.py decision.py store.py
│       ├── test.sh
│       └── tests/                     # stdlib unittest, no pip deps
└── agents/
    ├── ratchet-proposer-a.md
    ├── ratchet-proposer-b.md
    └── ratchet-critic.md

Tests

plugins/ratchet/skills/ratchet/scripts/test.sh

Stdlib only; needs Python β‰₯ 3.10 and git on PATH. Covers config validation + round-trip, runner (stdout/stderr/file capture, timeout, non-zero exit), decision logic (improve/regress/no-change with min_improvement, threshold, all stop conditions), store (init, archive on re-init, history, derived state), and end-to-end CLI flow against a temp git repo.

State on disk

.ratchet/
├── config.json                    # current run config + run_id
├── results.jsonl                  # one record per finalized iteration
├── current.json                   # in-progress iteration buffer
├── reports/iter-NNNN-trial-NN.txt # captured reports (preserved for debugging)
└── runs/<run_id>/                 # archive of prior runs (re-init moves the current run here)

results.jsonl records: iter, hypothesis, trial_scores, avg, decision, kept, description, sha, timestamp. Treat it as a research log: predicted (hypothesis) vs. observed (avg, decision).
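The log is plain JSONL, so post-run analysis needs nothing beyond the stdlib. A hypothetical reader (not part of the plugin) that finds the best kept iteration:

```python
import json
from pathlib import Path

def load_results(path=".ratchet/results.jsonl"):
    """Yield one dict per finalized iteration."""
    with Path(path).open() as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

def best_kept(records, direction="min"):
    """Return the kept record with the best averaged score, or None."""
    kept = [r for r in records if r.get("kept")]
    if not kept:
        return None
    pick = min if direction == "min" else max
    return pick(kept, key=lambda r: r["avg"])
```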

License

MIT.
