An autonomous-researcher contract. Based on karpathy/autoresearch, adapted
for modgrad's Rust codebase. The driving agent (Claude Code with
--dangerously-skip-permissions) reads this file, edits code, runs one
5-minute experiment, decides keep or revert, and loops indefinitely.
Before starting the experiment loop, agree with the human on:
- Run tag: propose one based on today's date (e.g.
apr19). The branchautoresearch/<tag>must not already exist. Create it frommain:git checkout -b autoresearch/<tag>. - Validation file:
val.txtmust exist at repo root. If it doesn't, create a stable split once off the main training corpus:Do not re-createtail -c 262144 train_climbmix.txt > val.txtval.txton later runs — that changes the metric denominator and makes runs incomparable. - Baseline build works:
cargo build --release -p isiscompletes with no errors. - Baseline train + eval works:
cargo build --release -p isis && \ ./target/release/isis learn-ffn --budget 300 --val-data val.txt \ /tmp/autoresearch_ckpt.bin train_climbmix_5m.txt > run.log 2>&1 && \ grep "^val_bpb:\|^peak_vram_mb:" run.log results.tsvhas a header: if missing, create it (see below).- Confirm and go: briefly summarise what you see and wait for confirmation, then start the loop.
You edit these directories. Anything in them is fair game: architecture, hyperparameters, optimizer internals, training step body.
crates/modgrad-ffn/src/**— the FFN cerebellum model (architecture, forward/backward, optimizer state for the FFN).crates/modgrad-ctm/src/**— CTM regional graph (only relevant if you plan experiments withisis learn, notisis learn-ffn).isis/src/main.rs— training loop body, model-size selection, dream-phase cadence inside thelearn_ffnfunction. Don't touch themain()CLI parsing or theeval_ffnfunction — those are part of the evaluation harness.
Hands off. These define the ground truth; mutating them invalidates the val_bpb comparison.
crates/modgrad-training/src/metrics.rs—BitsPerByteand friends.crates/modgrad-training/src/autoresearch.rs— the summary format the agent grep-parses.crates/modgrad-training/src/checkpoint_bundle.rs— checkpoint format.crates/modgrad-device/src/**— GPU kernels. Fragile; one bad dispatch hangs gfx1102.isis/src/main.rs— only theeval_ffnfunction andcompute_ffn_val_bpb. Read-only.val.txt— the validation set. Re-creating this changes the metric.Cargo.tomldependencies — no adding crates. Work within what's already declared.
val_bpb (validation bits-per-byte). Lower is better. It's the only
number that decides keep vs. discard. Computed as
cross_entropy_in_nats / n_targets / ln(2) over the full val.txt
with non-overlapping context-length windows. Tokenizer-agnostic so
architectural changes are fairly compared.
VRAM is a soft constraint. Some growth is acceptable for real val_bpb gains. Don't blow up VRAM dramatically.
Simplicity breaks ties. A 0.001 bpb gain that adds 20 lines of hacky code is probably not worth keeping. A 0 bpb change that deletes code is a keep. Err toward simpler.
Two tasks share the same autoresearch contract (same val_bpb: grep,
same keep/discard decision rule). Pick one per branch — don't interleave
them in the same branch, because the two val_bpb numbers mean different
things and aren't directly comparable.
-
Language (
isis learn-ffn): FFN cerebellum trained on byte-level text.val_bpb= cross-entropy per byte onval.txt. Standard LM-style iteration: architecture, LR schedule, activation, width. Results go inresults-language.tsv. -
Mazes (
mazesbinary): CTM trained on route prediction from maze pixels.val_bpb=1 - first_step_accuracy(0 = perfect routing, 1 = random guess). Iteration focus: CTM ticks, d_model, exit strategy, loss mix. Results go inresults-mazes.tsv.
The "CAN change" / "CANNOT change" sets below depend on which task you pick; see the per-task subsections under "Running one experiment".
cargo build --release -p isis && \
./target/release/isis learn-ffn --budget 300 --val-data val.txt \
/tmp/autoresearch_ffn.bin train_climbmix_5m.txt > run.log 2>&1
Flags:
--budget 300— 5-minute wall-clock training cap.--val-data val.txt— triggers end-of-run val_bpb eval + summary./tmp/autoresearch_ffn.bin— throwaway path, fresh file per run (do not reuse a checkpoint across experiments — biases the start).
Task-specific mutable set (in addition to the global "CAN change" list):
crates/modgrad-ffn/src/**isis/src/main.rs→learn_ffnfunction body only- Do not touch
eval_ffn/compute_ffn_val_bpb— evaluation harness.
cargo build --release -p mazes && \
./target/release/mazes --size 21 --ticks 16 --budget 300 --batch 8 \
--autoresearch-summary > run.log 2>&1
Flags:
--budget 300— 5-minute wall-clock cap, same contract as learn-ffn.--steps Nstill works as an upper bound; whichever fires first wins.--autoresearch-summary— triggers the summary print after the 200-maze eval.--size/--ticks/--d-model— task knobs, fair game to mutate.- Add
--brainto iterate on the 8-region CTM instead of the single-CTM baseline. Keep runs of the two variants in separate branches; the per-run summary's trailer saystask=mazesvstask=mazes-brain.
Task-specific mutable set:
crates/modgrad-ctm/src/**— the CTM regional graph and core.examples/mazes/src/main.rs→ training body only.- Do not touch the
eval()function inexamples/mazes/src/main.rsorexamples/mazes/src/maze_gen.rs— evaluation harness.
grep "^val_bpb:\|^peak_vram_mb:" run.log
If the grep output is empty the run crashed. tail -n 80 run.log for
the compiler or runtime error. Dumb bugs (typo, missing import) — fix
and re-run. Fundamentally broken idea — log as crash and move on.
results-language.tsv or results-mazes.tsv at repo root, depending
on the task this branch is iterating on. Tab-separated (commas break
descriptions). Header row (identical for both files):
commit val_bpb memory_gb status description
Keep tasks in separate files because the val_bpb numbers aren't
directly comparable across tasks — mazes reports 1 - accuracy, language
reports actual cross-entropy per byte.
Five columns:
git rev-parse --short HEAD(7 chars).- val_bpb, six decimals, e.g.
1.234567. Use0.000000for crashes. - peak memory in GB,
.1f. Dividepeak_vram_mbby 1024. Use0.0for CPU-only or crashes. - Status:
keep,discard, orcrash. - Short human-readable description of the experimental change.
Example:
commit val_bpb memory_gb status description
a1b2c3d 1.234567 0.0 keep baseline
b2c3d4e 1.220100 0.0 keep widen hidden from 2048 to 3072
c3d4e5f 1.245900 0.0 discard replace SwiGLU with GeLU
d4e5f6g 0.000000 0.0 crash quadruple model width (OOM)
Do NOT commit results-*.tsv. Already in .gitignore. The file is
your scratch log, not repo history; git commit messages carry the keep
decisions.
On each iteration:
- Check git state: current branch/commit.
- Plan one change to a file in the "CAN change" list. State the hypothesis in one sentence before editing.
- Edit the code.
- Commit:
git add -A && git commit -m "<short description>". - Check you didn't touch read-only files:
If it reports a violation,
./scripts/autoresearch-check.sh HEAD~1git reset --hard HEAD~1and try a different mutation. The ground-truth evaluation harness is sacred; accidentally editing it invalidates every priorval_bpbrow inresults.tsv. - Delete the prior checkpoint so the new run starts fresh
(resuming a prior checkpoint biases the val_bpb). For the language
task:
Mazes trains from scratch every call and has no checkpoint to delete.
rm -f /tmp/autoresearch_ffn.bin - Run (pick the command for this branch's task — see "Running one
experiment" above for both). Default (language):
cargo build --release -p isis && \ ./target/release/isis learn-ffn --budget 300 --val-data val.txt \ /tmp/autoresearch_ffn.bin train_climbmix_5m.txt > run.log 2>&1 - Extract results:
grep "^val_bpb:\|^peak_vram_mb:" run.log. - Log to results.tsv.
- Decide:
- val_bpb strictly lower than the current branch tip → keep (do nothing, the commit stands).
- val_bpb equal or higher → discard:
git reset --hard HEAD~1. - crash → discard:
git reset --hard HEAD~1, logcrash.
Repeat. Forever.
- Never run
cargo cleanor deletetarget/. Incremental rebuilds depend on the warm compile cache; blowing it away costs ~5 minutes per loop iteration and burns the subscription quota. - Never skip git hooks (
--no-verify,--no-gpg-sign). If a hook fails, investigate and fix, don't bypass. - Never
git push. The experiment branch is local-only until the human decides what to upstream. - 10-minute hard timeout: if an experiment exceeds 10 minutes total, kill it and treat it as a crash. Budget is 5 min; 10 min is 2× as a safety margin for compile cold starts.
Once the experiment loop has begun, do not pause to ask "should I keep going?" or "is this a good stopping point?". The human might be asleep or away and expects the loop to run indefinitely until manually stopped (Ctrl+C / branch abandoned / human interrupts).
If you run out of ideas, think harder: re-read these files for new angles, combine near-misses, try more radical architectural changes (different normalization, different gating, different width/depth tradeoffs). The loop runs until the human stops it, period.
- Width vs depth: same param budget reshuffled.
- Tied vs untied input/output embeddings.
- SwiGLU vs GeGLU vs plain GeLU.
- Learning-rate schedule shape (warmup length, decay curve).
- Pre-norm vs post-norm placement.
- AdamW vs Muon on the linear layers.
- Batch size / gradient-accumulation at fixed token throughput.
Most ideas won't work. Discard cheerfully.