A Claude Code plugin for iterative optimization through automated evaluation.
AutoResearch evaluates an artifact (prompt, code, config, or anything else) against a test suite, analyzes failures, generates targeted variants, and promotes winners — repeating until it hits your target pass rate.
Each optimization cycle:
- Assess — Run the current artifact against all test cases using binary assertions
- Analyze — Identify which assertions fail most and what patterns cause failures
- Generate — Create 3 candidate variants, each changing exactly ONE thing
- Compare — Assess all candidates against the full test suite
- Promote — If a candidate beats the current best, it becomes the new baseline
- Repeat — Continue until pass rate exceeds 90% or 15 cycles are exhausted
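The cycle above can be sketched as a small greedy loop. This is only an illustration of the control flow, not the plugin's code: `optimize`, `assess`, and `generate` are hypothetical names, the Analyze step is folded into the `generate` callback, and the 90% / 15-cycle limits are taken from the list above.

```python
from typing import Callable, Sequence


def optimize(
    baseline: str,
    assess: Callable[[str], float],
    generate: Callable[[str], Sequence[str]],
    target: float = 0.90,
    max_cycles: int = 15,
) -> tuple[str, float]:
    """Promote the best-scoring candidate each cycle until the target is exceeded."""
    best, best_score = baseline, assess(baseline)
    for _ in range(max_cycles):
        if best_score > target:
            break
        # Analyze + Generate: hypothetical callback proposing variants of `best`.
        candidates = generate(best)
        if not candidates:
            break
        # Compare: score every candidate against the full suite.
        top_score, top = max(
            ((assess(c), c) for c in candidates), key=lambda pair: pair[0]
        )
        # Promote: only a strict improvement replaces the baseline.
        if top_score > best_score:
            best, best_score = top, top_score
    return best, best_score


# Toy stand-ins: the "artifact" is a string, the score is keyword coverage.
REQUIRED = ["summary", "risks", "steps", "owner"]


def toy_assess(text: str) -> float:
    return sum(word in text for word in REQUIRED) / len(REQUIRED)


def toy_generate(text: str) -> list[str]:
    # Up to 3 variants, each adding exactly one missing keyword.
    missing = [word for word in REQUIRED if word not in text]
    return [f"{text} {word}" for word in missing[:3]]


best, score = optimize("summary", toy_assess, toy_generate)
print(best, score)
```

In the toy run each cycle adds one keyword, so the loop stops after three promotions, once the score passes the target.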
In Claude Code, add this repo as a plugin marketplace, then install:
/plugin marketplace add sighup/autoresearch
/plugin install autoresearch@autoresearch
For local development, point Claude Code at your clone:
/plugin marketplace add /path/to/autoresearch
/plugin install autoresearch@autoresearch
- Python 3.10+
- uv (for automatic dependency management)
- An ANTHROPIC_API_KEY environment variable (only required for prompt mode; not needed when using a custom runner)
The Agent SDK is installed automatically into .autoresearch/.venv on first run when using prompt mode.
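If it helps to confirm these requirements before a first run, a small check along the following lines could be used. It is a sketch, not part of the plugin: `preflight` is a hypothetical helper, and it only looks for the three items listed above.

```python
import os
import shutil
import sys
from typing import Mapping


def preflight(prompt_mode: bool = True, env: Mapping[str, str] = os.environ) -> list[str]:
    """Return human-readable problems; an empty list means the basics are in place."""
    problems = []
    if sys.version_info < (3, 10):
        problems.append("Python 3.10+ is required")
    if shutil.which("uv") is None:
        problems.append("uv was not found on PATH")
    # The API key only matters in prompt mode, per the requirements above.
    if prompt_mode and not env.get("ANTHROPIC_API_KEY"):
        problems.append("ANTHROPIC_API_KEY is not set")
    return problems


for problem in preflight():
    print("missing:", problem)
```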
You need three things (and optionally a fourth):
The artifact is the thing you want to optimize: a prompt file, source code, a config file, or any other file. It can live anywhere in your project.
Test cases live in a JSONL file, one test case per line. Each line is a JSON object with id, input, and category:
{"id": "api-health", "input": "Add a /health endpoint to our Express.js API that returns server status and uptime.", "category": "api"}
{"id": "cli-export", "input": "Add a --format flag to our CLI tool for JSON and CSV export.", "category": "cli"}
Assertions live in a Python file that defines binary assertion functions. Each function takes the runner's output as a string and returns True or False. Register them in an ASSERTIONS list:
import re


def assert_has_summary(response: str) -> bool:
    """Response contains a Summary section."""
    return bool(re.search(r"## Summary", response, re.IGNORECASE))


def assert_min_length(response: str) -> bool:
    """Response is at least 500 characters."""
    return len(response.strip()) >= 500


ASSERTIONS = [
    assert_has_summary,
    assert_min_length,
]
For non-prompt artifacts, provide a runner: a shell command that assesses your artifact. It receives context via environment variables:
- AUTORESEARCH_ARTIFACT — path to the artifact being optimized
- AUTORESEARCH_TEST_ID — test case ID
- AUTORESEARCH_TEST_INPUT — test case input text
Its stdout becomes the response text that assertions grade. Exit 0 on success; non-zero is treated as an error.
Concurrency requirement: When test cases run in parallel, your runner is invoked concurrently for different test cases (one subprocess per test case, running simultaneously). Use the AUTORESEARCH_TEST_ID environment variable to isolate per-run state: write to test-specific temp directories, use separate database transactions, and so on. If your runner cannot handle concurrent invocation, set "parallel": false in your config (see below).
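A runner that follows this contract might look like the sketch below. The three environment variable names come from the list above; everything else (`run_case`, `main`, the placeholder "assessment" that just reports the artifact size) is hypothetical and stands in for whatever your real runner does.

```python
import os
import re
import sys
import tempfile
from pathlib import Path
from typing import Mapping


def run_case(env: Mapping[str, str]) -> tuple[int, str]:
    """Assess one test case; return (exit_code, response_text)."""
    try:
        artifact = Path(env["AUTORESEARCH_ARTIFACT"])
        test_id = env["AUTORESEARCH_TEST_ID"]
        test_input = env["AUTORESEARCH_TEST_INPUT"]
    except KeyError as missing:
        return 1, f"missing environment variable: {missing.args[0]}"

    # Test IDs may contain characters that are not valid in file names.
    safe_id = re.sub(r"[^A-Za-z0-9_.-]", "_", test_id)
    try:
        # A scratch directory named after the test ID keeps concurrent runs apart.
        with tempfile.TemporaryDirectory(prefix=f"autoresearch-{safe_id}-") as scratch:
            work_copy = Path(scratch) / artifact.name
            work_copy.write_text(artifact.read_text(encoding="utf-8"), encoding="utf-8")
            # Placeholder assessment: report the artifact size and echo the input.
            size = len(work_copy.read_text(encoding="utf-8"))
            return 0, f"test={test_id} artifact_chars={size} input={test_input}"
    except OSError as err:
        return 1, f"runner failed: {err}"


def main() -> int:
    code, text = run_case(os.environ)
    # Stdout carries the response that assertions grade; errors go to stderr.
    print(text, file=sys.stdout if code == 0 else sys.stderr)
    return code

# When saved as a script, end the file with: sys.exit(main())
```

Keeping the logic in `run_case` and passing the environment in as a mapping makes the runner easy to try by hand with a plain dictionary before wiring it into the config.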
These files can live anywhere in your project. Point to them from .autoresearch/config.json:
Prompt mode (default):
{
  "artifact": "src/prompts/summarizer.txt",
  "assertions": "tests/summarizer_assertions.py",
  "test_cases": "tests/summarizer_cases.jsonl"
}
Custom runner mode:
{
  "artifact": "pytest.ini",
  "runner": "bash ./run_tests_timed.sh",
  "assertions": "tests/perf_assertions.py",
  "test_cases": "tests/perf_cases.jsonl"
}
Parallelism: Test cases within a variant are assessed concurrently by default in prompt mode, and sequentially in custom runner mode. Override this with the "parallel" config field:
{
  "artifact": "pytest.ini",
  "runner": "bash ./run_tests_timed.sh",
  "parallel": true,
  "assertions": "tests/perf_assertions.py",
  "test_cases": "tests/perf_cases.jsonl"
}
/autoresearch # asks for artifact path
/autoresearch find # scan repo for candidates
/autoresearch src/prompts/summarizer.txt # optimize this prompt
/autoresearch src/prompts/summarizer.txt target 95% # with a goal
/autoresearch pytest.ini # optimize non-prompt (will ask for runner)
/autoresearch clean # clean up .autoresearch/
Claude will set up .autoresearch/, establish a baseline, then iterate through cycles of analysis, variant generation, and evaluation. All working state stays inside .autoresearch/.
The plugin keeps all its working state in a single directory:
your-project/
  .autoresearch/
    config.json              # Points to artifact, assertions, test cases, optional runner
    assertions.py            # Assertions (if not located elsewhere)
    test_cases.jsonl         # Test cases (if not located elsewhere)
    prompts/                 # (prompt mode only)
      current.txt            # Working copy of the prompt
      candidates/            # Variant prompts generated each cycle
      history/               # Archived previous versions with scores
    history/                 # (custom runner mode) archived artifact snapshots
    results/
      current/               # Per-test-case results for current artifact
      v1a/                   # Per-test-case results for candidate v1a
      summary_current.json   # Summarized results for current artifact
    scores.json              # Historical score tracking (baseline + promoted winners)
    failure_analysis.txt     # Failure analysis written each cycle
- Test one thing — Each assertion checks a single, concrete property
- Stay binary — No partial credit; pass or fail
- Prefer structure over quality — Check that sections exist, formats match, and constraints hold rather than judging subjective quality
- Use regex — Pattern matching handles formatting variation well
- Name clearly — assert_has_proof_artifacts is better than assert_check_3
- Cover your categories — Include test cases from each domain the prompt should handle
- Vary complexity — Mix simple and complex inputs
- Include edge cases — Add cases where the prompt is likely to fail
- Use realistic inputs — The closer to real usage, the more useful the optimization
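To see whether a test suite actually covers its categories, a quick check over the JSONL file can help. The sketch below assumes the id / input / category format described earlier; `check_suite` is a hypothetical helper, and treating duplicate IDs as a problem is an assumption made here, not a documented rule of the plugin.

```python
import json
from collections import Counter
from typing import Iterable

REQUIRED_KEYS = {"id", "input", "category"}


def check_suite(lines: Iterable[str]) -> tuple[Counter, list[str]]:
    """Validate JSONL test cases; return (cases per category, problems found)."""
    counts: Counter = Counter()
    problems: list[str] = []
    seen: set[str] = set()
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            case = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append(f"line {number}: invalid JSON ({err.msg})")
            continue
        if not isinstance(case, dict):
            problems.append(f"line {number}: expected a JSON object")
            continue
        missing = REQUIRED_KEYS - case.keys()
        if missing:
            problems.append(f"line {number}: missing {sorted(missing)}")
            continue
        case_id = str(case["id"])
        if case_id in seen:
            # Assumption: IDs should be unique so per-test state can be isolated.
            problems.append(f"line {number}: duplicate id {case_id!r}")
        seen.add(case_id)
        counts[str(case["category"])] += 1
    return counts, problems


sample = [
    '{"id": "api-health", "input": "Add a /health endpoint.", "category": "api"}',
    '{"id": "cli-export", "input": "Add a --format flag.", "category": "cli"}',
]
counts, problems = check_suite(sample)
print(dict(counts), problems)
```

For a real suite, pass the open file instead of `sample`, for example `check_suite(open("tests/summarizer_cases.jsonl", encoding="utf-8"))`. A category with only one or two cases is a hint that it may be under-tested.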
In prompt mode, Claude's responses vary between runs, even with identical prompts and inputs. This means pass rates will fluctuate — a prompt scoring 80% on one run might score 70% or 90% on the next.
With a sufficiently large test suite (15+ cases), individual variance tends to average out across the suite, making overall pass rates relatively stable. However, small differences between candidates (e.g., 75% vs 78%) may not be meaningful.
Tips for working with this:
- Don't over-index on small margins — A 2-3% difference could be noise
- Use more test cases — Larger suites produce more stable results
- Look at assertion patterns, not just totals — If a candidate consistently fixes a specific assertion across multiple test cases, that's a real signal even if the overall pass rate is close
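As a rough way to judge whether a margin is noise, the gap between two pass rates can be compared with its standard error. The sketch below treats every assertion check as an independent coin flip, which is a simplifying assumption (checks on the same response are usually correlated), so read the result as a rule of thumb. Both function names are hypothetical.

```python
import math


def pass_rate_stderr(pass_rate: float, checks: int) -> float:
    """Binomial standard error of a pass rate measured over `checks` assertion checks."""
    if checks <= 0:
        raise ValueError("checks must be positive")
    return math.sqrt(pass_rate * (1.0 - pass_rate) / checks)


def within_noise(rate_a: float, rate_b: float, checks: int, z: float = 1.96) -> bool:
    """True when the gap is smaller than z standard errors of the difference."""
    se_diff = math.sqrt(
        pass_rate_stderr(rate_a, checks) ** 2 + pass_rate_stderr(rate_b, checks) ** 2
    )
    return abs(rate_a - rate_b) < z * se_diff


# 15 test cases x 2 assertions = 30 checks per variant.
print(within_noise(0.75, 0.78, checks=30))
```

With 30 checks per variant, the standard error of a 75% pass rate is about 8 percentage points, so a 75% vs 78% comparison falls well inside the noise band under this model.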
In custom runner mode, results may be more deterministic (e.g., test execution time is measurable), but external factors (system load, caching) can still introduce variance.