Inspired by karpathy/autoresearch — but for any codebase, not just ML training loops.
Karpathy's autoresearch lets an AI agent run ML experiments overnight: modify train.py → measure val_bpb → keep if better, discard if worse → repeat. You wake up to a log of experiments and a better model.
autoimprove does the same thing for your codebase.
Give Claude Code your project, run /autoimprove:improve, and let it iterate autonomously. It proposes a targeted change, scores your codebase before and after using your own tooling (TypeScript, cargo clippy, pytest, golangci-lint — whatever you already have), keeps the changes that improve the score, reverts the ones that don't, and logs everything. You wake up to a readable log of what worked, what didn't, and a cleaner codebase.
```
propose → measure BEFORE → implement → measure AFTER → keep ✅ or discard ❌ → log → repeat
```
```
# 1. Add the marketplace and install the plugin
/plugin marketplace add benmarte/autoimprove
/plugin install autoimprove@autoimprove

# 2. Auto-detect your stack and see your codebase report
/autoimprove:setup

# 3. The audit shows what's wrong and offers to start fixing
#    Or run the audit anytime for a fresh check
/autoimprove:audit

# 4. For unattended runs (e.g. overnight), use improve directly
/autoimprove:improve 20

# Or focus on a specific task
/autoimprove:improve 10 "Replace all any types with proper interfaces"

# 5. Review in the morning
cat .claude/autoimprove/log.md
git log --oneline   # one commit per winning experiment
git show HEAD       # inspect the latest win
```

That's it. No config required upfront — /autoimprove:setup fingerprints your project, writes .claude/autoimprove/config.md, and immediately runs an audit showing your codebase's deficiencies ranked by efficiency.
```
/autoimprove:upgrade
```

The plugin system caches marketplace clones locally. If your install predates the upgrade command, you need to update the marketplace clone first:

```
# 1. Update the marketplace clone
cd ~/.claude/plugins/marketplaces/autoimprove && git pull origin main

# 2. Reinstall the plugin
/plugin update autoimprove@autoimprove
```

If /plugin update still shows "already at the latest version", uninstall and reinstall:

```
/plugin uninstall autoimprove@autoimprove
/plugin install autoimprove@autoimprove
```

After this, /autoimprove:upgrade will be available for all future updates.
autoimprove checks for new releases once per day on session start. If an update is available, you'll see:
```
Update available: v1.2.0 → v1.3.0
Run /autoimprove:upgrade to update.
```
The check is lightweight (single GitHub API call, 3s timeout, cached for 24 hours) and never blocks startup.
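The once-per-day gate can be sketched in a few lines of shell. This is illustrative only: the stamp path, the `STAMP_FILE` override, and the `cached`/`checking` markers are assumptions, not the plugin's actual implementation.

```sh
# Skip the network call if we already checked within the last 24 hours.
STAMP="${STAMP_FILE:-$HOME/.claude/autoimprove/.last-update-check}"
NOW=$(date +%s)
if [ -f "$STAMP" ] && [ $((NOW - $(cat "$STAMP"))) -lt 86400 ]; then
  echo "cached"      # checked recently: do nothing, never block startup
else
  mkdir -p "$(dirname "$STAMP")"
  echo "$NOW" > "$STAMP"
  echo "checking"    # the real hook would call the GitHub API here,
                     # with a short timeout so startup is never blocked
fi
```

Because the stamp file is written before the network call, a slow or failed check still waits a full day before retrying.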
/autoimprove:setup scans your project root to detect:
- Language and framework
- Package manager (`npm`, `cargo`, `poetry`, `uv`, etc.)
- Test runner (`pytest`, `jest`, `go test`, `rspec`, etc.)
- Type checker (`tsc`, `mypy`, `pyright`, etc.)
- Linter (`eslint`, `ruff`, `golangci-lint`, `rubocop`, etc.)
It writes a `.claude/autoimprove/config.md` file in your project root — a plain Markdown config that maps your specific tools to a 0–100 composite quality score. You can edit this file to customise the loop for your project.
Every experiment runs in a separate git worktree — its own directory, its own branch, completely isolated from your main codebase:
```
your-project/                        ← main branch (never touched during experiments)
  .claude/autoimprove/worktrees/     ← gitignored, auto-created
    experiment-001/                  ← branch: autoimprove/experiment-001
    experiment-002/                  ← branch: autoimprove/experiment-002
    experiment-003/                  ← branch: autoimprove/experiment-003
```
- ✅ Winning experiments get squash-merged back to main as a clean commit
- ❌ Losing experiments have the worktree and branch deleted — nothing touches main
- 🔒 Your working directory is read-only for the entire session
- 🧹 All worktrees are cleaned up automatically at session end
No more `git checkout -- .` rollbacks. No risk of a broken experiment corrupting your codebase.
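Under the hood this is plain git. Here is a self-contained sketch of one winning experiment's lifecycle; the paths, branch names, and commit message format are illustrative, not the plugin's exact ones.

```sh
set -e
repo=$(mktemp -d); cd "$repo"
git init -q -b main
git -c user.email=a@b -c user.name=a commit -q --allow-empty -m "init"

# 1. Spin up an isolated worktree on its own branch.
git worktree add wt/experiment-001 -b autoimprove/experiment-001

# 2. The experiment edits and commits inside the worktree only.
echo "fixed" > wt/experiment-001/file.txt
git -C wt/experiment-001 add file.txt
git -C wt/experiment-001 -c user.email=a@b -c user.name=a \
    commit -q -m "experiment"

# 3. Winner: squash-merge back to main as one clean commit, then clean up.
#    (A loser would skip the merge: just remove the worktree and branch.)
git merge -q --squash autoimprove/experiment-001
git -c user.email=a@b -c user.name=a commit -q -m "autoimprove(001): fix"
git worktree remove wt/experiment-001
git branch -q -D autoimprove/experiment-001

cat file.txt
```

Note the ordering: the worktree must be removed before the branch can be deleted, because git refuses to delete a branch that is still checked out in a worktree.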
Before diving into fixes, /autoimprove:audit scans your codebase and shows exactly what needs work:
```
━━━ Codebase Audit ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

📊 Current Score: 61/100

   Type safety:  24/40  ██████░░░░  (16 pts to max)
   Build:        20/20  ██████████  ✓ maxed
   Tests:        10/30  ███░░░░░░░  (20 pts to max)
   Lint:          7/10  ███████░░░  (3 pts to max)

━━━ Fastest Path to 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

   #  Area         Gap    Issues       Est. iterations  Efficiency
   1  Type safety  16pts  8 errors     3 iterations     5.3 pts/iter  ← best
   2  Lint          3pts  2 warnings   1 iteration      3.0 pts/iter
   3  Tests        20pts  0/4 covered  7 iterations     2.9 pts/iter

   Total: ~11 iterations to reach 100/100
   ⚡ Estimated token usage: ~250K tokens (rough estimate, actual usage varies)

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
```
The audit ranks areas by efficiency — points gained per iteration — so you fix the highest-impact issues first. It then offers to start fixing interactively, area by area, or you can run /autoimprove:improve directly.
Setup auto-runs the audit after generating your config, so first-time users see this report immediately.
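The efficiency column is simply the points still available in an area divided by the estimated iterations to claim them. Sketched with the numbers from the sample audit above (the pipeline itself is illustrative, not the plugin's code):

```sh
# area, points still available, estimated iterations to claim them
printf '%s\n' \
  "Type-safety 16 3" \
  "Lint         3 1" \
  "Tests       20 7" |
LC_ALL=C awk '{ printf "%-12s %.1f pts/iter\n", $1, $2 / $3 }' |
sort -k2 -rn
```

This reproduces the ranking in the report: Type-safety first at 5.3 pts/iter, then Lint at 3.0, then Tests at 2.9.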
Every iteration, the loop measures your codebase on four axes:
| Metric | Weight | What it checks |
|---|---|---|
| Type / compile errors | 40 pts | tsc --noEmit, cargo check, go build, mypy, etc. |
| Build success | 20 pts | Does the project build without errors? |
| Test pass rate | 30 pts | (passing / total) × 30 |
| Lint errors | 10 pts | eslint, ruff, clippy, golangci-lint, etc. |
If a metric doesn't apply (no tests yet, no linter configured), its weight is redistributed across the others.
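One plausible reading of that redistribution (the exact scheme is the plugin's choice, so treat this as a sketch) is to rescale the applicable weights so they still sum to 100. Here a hypothetical project has no test runner, 80% type compliance, a passing build, and 70% lint compliance:

```sh
LC_ALL=C awk 'BEGIN {
  # fraction of each axis earned; -1 marks a metric that does not apply
  frac["type"]  = 0.80; w["type"]  = 40
  frac["build"] = 1.00; w["build"] = 20
  frac["tests"] = -1;   w["tests"] = 30   # no tests yet: weight redistributed
  frac["lint"]  = 0.70; w["lint"]  = 10

  for (k in w) if (frac[k] >= 0) active += w[k]                      # 70 applicable points
  for (k in w) if (frac[k] >= 0) score += frac[k] * w[k] * (100 / active)
  printf "%.0f/100\n", score
}'
```

The 59 raw points earned out of 70 applicable rescale to 84/100, so a missing metric never drags the score down on its own.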
Each iteration prints visible progress so you always know what's happening:
```
━━━ Iteration 1/5 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
🔬 PROPOSE:   Targeting error handling in src/api/client.ts
🔬 SNAPSHOT:  Measuring BEFORE score...
🔬 IMPLEMENT: Adding try/catch to unhandled async calls
🔬 MEASURE:   Measuring AFTER score...
🔬 DECIDE:    85 → 89 (+4 pts) — KEPT ✅
🔬 LOG:       Recorded to .claude/autoimprove/log.md
```
Steps per iteration:

- Creates a fresh git worktree + branch (`autoimprove/experiment-NNN`)
- Proposes one bounded improvement with an explicit hypothesis — "I will fix the three unhandled promise rejections in `api/invoices.ts` because I expect it to reduce TypeScript errors and improve the type score by ~8 points"
- Measures the score inside the worktree (BEFORE)
- Implements the change inside the worktree (surgical — 1–3 files at most)
- Measures again (AFTER)
- Keeps — squash-merges to main and deletes the worktree — if AFTER ≥ BEFORE
- Discards — deletes the worktree and branch, main untouched — if AFTER < BEFORE
- Logs the result to `.claude/autoimprove/log.md`
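The keep/discard step above reduces to a single comparison. Note that ties count as keeps, per "AFTER ≥ BEFORE"; the scores here are hard-coded for illustration:

```sh
before=85
after=89
if [ "$after" -ge "$before" ]; then
  echo "KEPT: squash-merge to main, delete worktree"
else
  echo "DISCARDED: delete worktree and branch, main untouched"
fi
```

Treating a tie as a win is a deliberate bias: a neutral refactor (same score, cleaner code) is kept rather than thrown away.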
After each iteration, .claude/autoimprove/log.md gets an entry like:
```markdown
## Iteration 4 — 2026-03-11 02:14
**Hypothesis:** Replace 3 `any` types in convex/invoices.ts with proper TypeScript interfaces
**Branch:** autoimprove/experiment-004
**Files changed:** convex/invoices.ts
**Before:** 74/100 — type: 28, build: 20, tests: 18, lint: 8
**After:** 82/100 — type: 36, build: 20, tests: 18, lint: 8
**Decision:** KEPT ✅ (squash-merged to main, worktree deleted)
**Reason:** Eliminated 2 TS errors by typing the invoice mutation arguments properly
```
| Command | Description |
|---|---|
| `/autoimprove:setup` | Detect stack, generate config, and run initial audit |
| `/autoimprove:audit` | Scan codebase for deficiencies and get a prioritized fix plan |
| `/autoimprove:improve [N] ["focus"]` | Run N iterations of the loop (default: 5), optionally focused on a specific task |
| `/autoimprove:continue [N] ["focus"]` | Resume an interrupted session — inherits remaining iterations and focus from the log |
| `/autoimprove:status` | Show a summary of all runs from `.claude/autoimprove/log.md` |
| `/autoimprove:upgrade` | Check for and install the latest version |
| Language | Type check | Build | Tests | Lint |
|---|---|---|---|---|
| TypeScript / JavaScript | `tsc --noEmit` | npm/pnpm/yarn/bun build | jest / vitest / mocha | eslint |
| Next.js / Nuxt / Remix / Astro | `tsc --noEmit` | framework build cmd | jest / vitest | eslint |
| Python | mypy / pyright | — | pytest | ruff / flake8 / pylint |
| Go | `go build ./...` | `go build` | `go test ./...` | golangci-lint / go vet |
| Rust | `cargo check` | `cargo build` | `cargo test` | `cargo clippy` |
| Ruby | sorbet (if configured) | — | rspec / minitest | rubocop |
| Java / Kotlin | `mvn compile` / `./gradlew build` | same | `mvn test` / `./gradlew test` | checkstyle / ktlint |
| C# / .NET | `dotnet build` | `dotnet build` | `dotnet test` | `dotnet format --verify-no-changes` |
| PHP | phpstan | — | phpunit | phpcs |
| Swift | `swift build` | `swift build` | `swift test` | swiftlint |
| Any Makefile project | `make check` / `make typecheck` | `make build` | `make test` | `make lint` |
Don't see your stack? Edit .claude/autoimprove/config.md after setup to add your own commands.
After running /autoimprove:setup, edit the generated .claude/autoimprove/config.md to tailor the loop to your project:
```markdown
## Improvement Areas
- Check all Convex mutations have auth guards
- Replace fetch() calls with our internal apiClient wrapper
- Ensure every page component has a loading.tsx sibling

## Files to Never Modify
- convex/schema.ts
- src/generated/
- migrations/
- .env.local
```

You can also override any auto-detected command, change scoring weights, or add custom shell commands as additional metrics.
You can focus the loop on a specific task directly from the command — no config editing needed. Just pass a quoted string:
```
# Focus on type safety
/autoimprove:improve 10 "Replace all any types with proper TypeScript interfaces"

# Focus on a specific directory
/autoimprove:improve 5 "Fix all lint warnings in src/components/dashboard/"

# Focus on tests
/autoimprove:improve 10 "Add unit tests for every exported function in lib/billing/"

# Focus on a migration
/autoimprove:improve 20 "Replace all raw fetch() calls with the apiClient wrapper from lib/api-client.ts"
```

When a focus string is provided, every iteration targets that task. The loop breaks it into file-by-file sub-tasks and chips away one per iteration until the focus is fully addressed or iterations run out.
Without a focus string, the loop rotates through all areas listed in your .claude/autoimprove/config.md as usual.
For recurring focus areas, you can also edit the Improvement Areas section in .claude/autoimprove/config.md directly:
```markdown
## Improvement Areas
- Replace every `any` type with a proper TypeScript interface or type alias
```

This is useful when you want the focus to persist across multiple sessions without re-typing it.
- Be specific. "Fix type errors" is vague. "Replace any with proper types in convex/ mutations" gives the loop a clear target.
- One concern at a time works best. The loop makes surgical 1–3 file changes per iteration — a narrow focus means every iteration chips away at the same problem.
- Match iteration count to scope. If you have ~20 files to fix, run `/autoimprove:improve 20 "..."` so each iteration can tackle one file.
- Use "Files to Never Modify" in the config to protect areas you don't want touched during a focused run.
If your session gets interrupted (Ctrl+C, context limit, crash), you can pick up where you left off:
```
# Resume with remaining iterations and same focus
/autoimprove:continue

# Resume but only run 3 more iterations
/autoimprove:continue 3

# Resume with a different focus
/autoimprove:continue "New focus area"

# Override both
/autoimprove:continue 5 "Fix error handling in api/"
```

The continue command reads .claude/autoimprove/log.md to find the interrupted session, inherits its settings, and picks up from the next iteration. Iteration numbering continues seamlessly (e.g., if you completed 4/10, it resumes at 5/10).
If the codebase has changed since the interrupted session (you made manual commits), autoimprove will warn you and re-measure the baseline.
Check /autoimprove:status to see if you have an interrupted session to resume.
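A sketch of how the resume point could be recovered from log.md. The log format matches the examples in this README, but the parsing itself is illustrative, not the plugin's actual code:

```sh
log=$(mktemp)
cat > "$log" <<'EOF'
## Iteration 3 — 2026-03-11 01:40
**Decision:** KEPT ✅
## Iteration 4 — 2026-03-11 02:14
**Decision:** KEPT ✅
EOF

# The last "## Iteration N" heading tells us where to pick up.
last=$(awk '/^## Iteration /{n = $3} END {print n + 0}' "$log")
echo "resuming at iteration $((last + 1))"
```

With the sample log above this prints that the run resumes at iteration 5, matching the "completed 4/10, resumes at 5/10" behaviour described earlier.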
The loop rotates through these universal improvement areas (and adds language-specific ones based on your stack):
- Type safety — fix type errors, replace `any` / `interface{}` / untyped constructs
- Error handling — unhandled promises, bare `catch {}`, swallowed errors
- Dead code — unused imports, variables, unreachable branches
- Code duplication — extract repeated logic (3+ occurrences) into shared utilities
- Naming & readability — cryptic names, functions over ~50 lines
- Performance — N+1 query patterns, missing memoization, unnecessary allocations
- Security — hardcoded secrets, missing input validation, unguarded auth routes
- Tests — add a test for the most critical untested function, fix flaky tests
The loop is designed to be safe to run unattended:
| Rule | Detail |
|---|---|
| 🔒 Never touches lock files | package-lock.json, Cargo.lock, go.sum, Gemfile.lock, etc. |
| 🔒 Never touches generated files | Migrations, protobuf output, OpenAPI generated code |
| 🔒 Never touches secrets | .env, .env.local, any secrets file |
| 🔒 Never deploys or publishes | No git push, npm publish, cargo publish, etc. |
| 🔒 Requires clean git state | Won't start if git status shows uncommitted changes |
| 🔒 Experiments in isolated worktrees | Each experiment is on its own branch — main is never modified mid-session |
| 🔒 Losers deleted, not rolled back | Failed experiments: worktree deleted, branch deleted, main untouched |
| 🔒 Winners squash-merged | One clean commit per winning experiment — easy to review with git log |
| 🔒 Pauses every 10 iterations | Cleans up worktrees, writes summary, waits for human review |
You always review and push — the loop commits winning experiments locally but never pushes or publishes on your behalf.
```
autoimprove/
├── .claude-plugin/
│   ├── plugin.json          # Plugin manifest
│   └── hooks/
│       └── hooks.json       # SessionStart hook registration
├── hooks/
│   └── sessionstart.sh      # Update check on startup (once per day)
├── skills/
│   ├── audit/
│   │   └── SKILL.md         # Codebase deficiency scan, prioritized report, interactive fix loop
│   ├── detect-stack/
│   │   └── SKILL.md         # Fingerprints project, writes .claude/autoimprove/config.md
│   ├── worktree/
│   │   └── SKILL.md         # Creates/manages/cleans up git worktrees per experiment
│   ├── improve-loop/
│   │   └── SKILL.md         # Core loop: worktree → propose → implement → measure → merge/delete
│   ├── measure/
│   │   └── SKILL.md         # Internal scoring utility (used by audit and improve-loop)
│   └── rollback/
│       └── SKILL.md         # Emergency cleanup of all experiment worktrees
└── commands/
    ├── audit.md             # /autoimprove:audit
    ├── continue.md          # /autoimprove:continue [N] ["focus"]
    ├── setup.md             # /autoimprove:setup
    ├── improve.md           # /autoimprove:improve [N] ["focus"]
    ├── status.md            # /autoimprove:status
    └── upgrade.md           # /autoimprove:upgrade (check for updates)
```
Here's what a real overnight session looks like. This is from a Next.js + Convex project starting at a score of 61/100:
```markdown
## Iteration 1 — 23:04
**Hypothesis:** Replace 4 implicit `any` types in `convex/invoices.ts` with proper interfaces
**Files changed:** convex/invoices.ts
**Before:** 61/100 — type: 24, build: 20, tests: 10, lint: 7
**After:** 69/100 — type: 32, build: 20, tests: 10, lint: 7
**Decision:** KEPT ✅
**Reason:** Removed 4 TS7006 implicit-any errors by typing mutation arguments

## Iteration 5 — 23:37
**Hypothesis:** Move ExpenseList to a server component — it only reads data, no interactivity
**Branch:** autoimprove/experiment-005
**Files changed:** components/ExpenseList.tsx
**Before:** 71/100 — type: 32, build: 20, tests: 10, lint: 9
**After:** 68/100 — type: 26, build: 20, tests: 10, lint: 12
**Decision:** DISCARDED ❌ (worktree deleted, main untouched)
**Reason:** Removing "use client" broke useQuery hook — must stay client component.

## Iteration 8 — 00:02
**Hypothesis:** Add unit tests for calculateTaxEstimate() — most complex function, zero coverage
**Files changed:** lib/tax.test.ts (new)
**Before:** 78/100 — type: 36, build: 20, tests: 10, lint: 10
**After:** 84/100 — type: 36, build: 20, tests: 16, lint: 10
**Decision:** KEPT ✅
**Reason:** 2 new tests passing, covers basic and edge-case tax bracket logic
```
```
━━━ Session Complete ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📊 Score: 61 → 84 (+23 pts)
🔁 Iterations: 10 total — 9 kept ✅, 1 discarded ❌
📝 Merged commits:
   • abc1234 autoimprove(001): Replace 4 implicit any types
   • def5678 autoimprove(002): Add error boundaries
   • ...
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
```
See autoimprove-log.example.md for the full 10-iteration session with summary table.
PRs welcome! Especially:
- New language profiles in `detect-stack/SKILL.md`
- Better improvement area prompts for specific frameworks
- Example `.claude/autoimprove/config.md` files for common stacks
MIT
