Version: 1.0 | Source: Karpathy's autoresearch, adapted for production Claude Code workflows
An iterate-and-measure pattern for autonomous quality improvement. The agent modifies one artifact, scores it against binary criteria, keeps improvements, reverts regressions, and repeats until the target score is reached or the iteration or time budget is exhausted.
When to use: Any step that produces a scorable artifact. NOT for exploratory or creative work where "better" can't be measured.
Every refinement loop MUST define all 7 before starting iteration:
What "better" means, expressed as a number. Example: "keyword score >= 85%" or "checklist completeness = 6/6".
Exactly ONE file or section being modified. The evaluator, criteria, and all other files are READ-ONLY during the loop. This prevents scope creep and makes every iteration comparable.
Binary criteria (Y/N) are strongly preferred over subjective ratings. A checklist of 5-6 yes/no questions produces a score out of N that is deterministic and reproducible. Subjective 1-10 ratings drift across iterations.
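As a minimal sketch of how a yes/no checklist becomes a score out of N, assume each criterion can be written as a boolean check on the artifact text. The criteria names and checks below are hypothetical illustrations. In practice the evaluator may be a separate model rather than deterministic functions.

```python
import re
from typing import Callable

# Hypothetical binary criteria: each maps a name to a yes/no check on the text.
Criterion = Callable[[str], bool]

CRITERIA: dict[str, Criterion] = {
    "contains a number": lambda text: bool(re.search(r"\d", text)),
    "under 40 words": lambda text: len(text.split()) < 40,
    "avoids 'was responsible for'": lambda text: "was responsible for" not in text.lower(),
}


def score_artifact(text: str, criteria: dict[str, Criterion]) -> tuple[int, int]:
    """Return (passed, total): how many criteria answered 'yes', and N."""
    passed = sum(1 for check in criteria.values() if check(text))
    return passed, len(criteria)


passed, total = score_artifact("Cut build time by 35% across 4 services.", CRITERIA)
print(f"{passed}/{total}")
```

Because every check returns only True or False, scoring the same text twice gives the same result, which is what makes iterations comparable.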
- Keep if total score improves (even by 1 point)
- Revert if score stays the same or drops
- Crash (skip iteration) if the modification breaks structure or introduces errors
- Plateau breaker: After 3 consecutive reverts, discard current approach and regenerate from scratch using only the criteria + failure history (not the current artifact text)
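The rules above can be sketched as a small state machine. The `LoopState` class and function names are hypothetical. The source does not say whether a crash counts toward the three consecutive reverts, so this sketch assumes it does not.

```python
from dataclasses import dataclass


@dataclass
class LoopState:
    best_score: int
    consecutive_reverts: int = 0


def decide(state: LoopState, new_score: int, is_valid: bool = True) -> str:
    """Apply the keep/revert/crash rules and update the state in place."""
    if not is_valid:
        # Broken modification: skip the iteration (assumed not to count as a revert).
        return "CRASH"
    if new_score > state.best_score:
        state.best_score = new_score
        state.consecutive_reverts = 0
        return "KEEP"
    # Same or lower score: revert and count toward the plateau.
    state.consecutive_reverts += 1
    return "REVERT"


def plateau_reached(state: LoopState, limit: int = 3) -> bool:
    """True once `limit` reverts have happened in a row."""
    return state.consecutive_reverts >= limit


state = LoopState(best_score=3)
verdicts = [decide(state, s) for s in (3, 4, 4, 2, 4)]
print(verdicts, plateau_reached(state))
```

When `plateau_reached` returns True, the next candidate would be regenerated from the criteria and failure history rather than from the current artifact text.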
Each iteration logs a structured record (append-only):
Iteration | Score | Delta | Verdict | Description
1 | 3/5 | +3 | KEEP | Added quantified metric
2 | 3/5 | 0 | REVERT | Vocabulary swap, no score change
3 | 4/5 | +1 | KEEP | Led with outcome number
Log is written to refinement_log.md in the relevant project folder.
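One way to keep the log append-only is to open the file in append mode and write the header only when the file is first created. This is a sketch under that assumption. The `append_log` helper and its parameters are hypothetical, and the demo writes to a temporary directory rather than a real project folder.

```python
import tempfile
from pathlib import Path

HEADER = "Iteration | Score | Delta | Verdict | Description\n"


def append_log(log_path: Path, iteration: int, score: int, total: int,
               delta: int, verdict: str, description: str) -> None:
    """Append one record; existing lines are never rewritten."""
    is_new = not log_path.exists()
    delta_text = f"{delta:+d}" if delta else "0"  # "+1", "-2", or plain "0"
    with log_path.open("a", encoding="utf-8") as f:
        if is_new:
            f.write(HEADER)
        f.write(f"{iteration} | {score}/{total} | {delta_text} | {verdict} | {description}\n")


with tempfile.TemporaryDirectory() as tmp:
    demo_log = Path(tmp) / "refinement_log.md"
    append_log(demo_log, 1, 3, 5, 3, "KEEP", "Added quantified metric")
    append_log(demo_log, 2, 3, 5, 0, "REVERT", "Vocabulary swap, no score change")
    demo_lines = demo_log.read_text(encoding="utf-8").splitlines()
    print("\n".join(demo_lines))
```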
- Time-box: Maximum duration before presenting results (default: 10 minutes)
- Iteration cap: Maximum iterations (default: 8)
- Never pause to ask during the loop -- finish iterations, then present results with the log
- Re-read from disk each cycle -- the file on disk is truth, not conversation memory
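The two hard caps can be enforced together by a small generator that stops at whichever limit comes first. This is a sketch only. `bounded_iterations` is a hypothetical helper, and the injectable `clock` parameter exists purely so the time-box can be tested without waiting.

```python
import time
from typing import Callable, Iterator


def bounded_iterations(max_iterations: int = 8,
                       max_seconds: float = 600.0,
                       clock: Callable[[], float] = time.monotonic) -> Iterator[int]:
    """Yield 1, 2, ... until the iteration cap or the time-box is reached."""
    deadline = clock() + max_seconds
    for i in range(1, max_iterations + 1):
        if clock() >= deadline:
            return  # time-box exhausted: stop and present results
        yield i


for i in bounded_iterations(max_iterations=3):
    # Each cycle would re-read the artifact from disk here instead of
    # reusing text held in memory from a previous cycle.
    print("iteration", i)
```

The deadline is checked at the start of each cycle, so an iteration already in progress is allowed to finish before results are presented.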
If two versions score the same: keep the shorter/simpler one. A +0 change that adds complexity is a regression.
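A sketch of the tie-breaker, assuming character length is an acceptable proxy for "shorter/simpler". The source does not define how simplicity is measured, so treat that choice and the `pick_version` name as assumptions.

```python
def pick_version(current: str, candidate: str,
                 current_score: int, candidate_score: int) -> str:
    """Return the version to keep, preferring the shorter one on equal scores."""
    if candidate_score != current_score:
        return candidate if candidate_score > current_score else current
    # Equal scores: a candidate that is not strictly shorter is treated as a regression.
    return candidate if len(candidate) < len(current) else current


print(pick_version("Reduced latency by 40 percent overall", "Cut latency 40%", 4, 4))
```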
Non-negotiable for any refinement loop:
- One file scope -- bounded blast radius
- One metric -- deterministic decisions, no "feels better"
- Git checkpoint -- every iteration starts from a known state
- Separate generator and evaluator -- the model writing the modification must NOT score its own work
- Time-boxed -- hard cap on iterations AND wall-clock time
- Audit trail -- the log captures every decision for user review
What this is NOT:
- NOT a new skill (it's a sub-step pattern inside existing skills)
- NOT for first-draft generation (the artifact must already exist before the loop starts)
- NOT for subjective quality (if you can't define binary criteria, don't use the loop)
- NOT a replacement for human review (the loop improves the draft; the user still approves)
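# Pseudo-code showing the loop structure
# (read, write, log, generate_improvement, regenerate_from_scratch, and
# ollama_evaluate are placeholder helpers; criteria is the read-only checklist)
target = 5                      # goal: checklist_score >= 5/6
iteration_cap = 8
artifact_path = "output/current_draft.md"

# Baseline score of the existing artifact, as N/6
current_score = ollama_evaluate(read(artifact_path), criteria)
consecutive_reverts = 0
failure_history = []

for i in range(1, iteration_cap + 1):
    if current_score >= target:
        break

    draft = read(artifact_path)  # re-read from disk each cycle

    if consecutive_reverts >= 3:
        # Plateau breaker: regenerate from criteria + failure history only
        modified = regenerate_from_scratch(criteria, failure_history)
        consecutive_reverts = 0
    else:
        # Generator: Claude modifies the artifact
        modified = generate_improvement(draft, criteria, failure_history)

    # Evaluator: Ollama (deepseek-r1:14b) scores against binary criteria
    score = ollama_evaluate(modified, criteria)  # returns N/6
    delta = score - current_score

    if delta > 0:
        write(artifact_path, modified)  # KEEP
        log(i, score, delta, "KEEP", "description")
        current_score = score
        consecutive_reverts = 0
    else:
        # REVERT: the file on disk is left untouched
        log(i, score, delta, "REVERT", "no improvement")
        failure_history.append((i, score))
        consecutive_reverts += 1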
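To run the control flow without Claude or Ollama, the toy version below substitutes deterministic stand-ins. Everything in it is hypothetical: the `toy_generator` and `toy_evaluator` functions, the three keyword criteria, and the use of in-memory text instead of a file on disk. The plateau breaker is omitted for brevity.

```python
from typing import Callable

KEYWORDS = ("metric", "outcome", "owner")  # hypothetical binary criteria


def toy_evaluator(text: str) -> int:
    """Score = number of keywords present (each one is a yes/no check)."""
    return sum(1 for word in KEYWORDS if word in text)


def toy_generator(text: str, iteration: int) -> str:
    """Stand-in for the generating model: appends one word per iteration."""
    additions = ["metric", "filler", "outcome", "padding", "owner"]
    return text + " " + additions[(iteration - 1) % len(additions)]


def refine(text: str,
           generate: Callable[[str, int], str],
           evaluate: Callable[[str], int],
           target: int,
           cap: int = 8) -> tuple[str, int, list[tuple[int, int, int, str]]]:
    """Keep strict improvements, revert everything else, stop at target or cap."""
    best_score = evaluate(text)
    log: list[tuple[int, int, int, str]] = []
    for i in range(1, cap + 1):
        if best_score >= target:
            break
        candidate = generate(text, i)
        score = evaluate(candidate)      # evaluator is separate from generator
        delta = score - best_score
        if delta > 0:
            text, best_score = candidate, score
            log.append((i, score, delta, "KEEP"))
        else:
            log.append((i, score, delta, "REVERT"))  # text stays unchanged
    return text, best_score, log


final_text, final_score, history = refine("draft", toy_generator, toy_evaluator, target=3)
for row in history:
    print(row)
```

Passing the generator and evaluator as separate callables mirrors the rule that the component writing the modification does not score its own work.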