AutoResearch

A Claude Code plugin for iterative optimization through automated evaluation.

AutoResearch evaluates an artifact (prompt, code, config, or anything else) against a test suite, analyzes failures, generates targeted variants, and promotes winners — repeating until it hits your target pass rate.

How It Works

Each optimization cycle:

Assess — Run the current artifact against all test cases using binary assertions
Analyze — Identify which assertions fail most and what patterns cause failures
Generate — Create 3 candidate variants, each changing exactly ONE thing
Compare — Assess all candidates against the full test suite
Promote — If a candidate beats the current best, it becomes the new baseline
Repeat — Continue until pass rate exceeds 90% or 15 cycles are exhausted
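
The cycle above can be sketched as a plain loop. This is a minimal illustration, not the plugin's actual implementation: `score` and `make_variants` are hypothetical callables standing in for the Assess and Generate steps, and the 90% target and 15-cycle cap are taken from the description above.

```python
from typing import Callable, List, Tuple

def optimize(
    artifact: str,
    score: Callable[[str], float],              # hypothetical: pass rate in [0, 1]
    make_variants: Callable[[str], List[str]],  # hypothetical: candidate variants
    target: float = 0.90,
    max_cycles: int = 15,
) -> Tuple[str, float, int]:
    """Return (best artifact, its pass rate, cycles used)."""
    best, best_score = artifact, score(artifact)  # Assess the baseline
    cycles = 0
    # Stop once the pass rate exceeds the target or the cycle budget is spent.
    while best_score <= target and cycles < max_cycles:
        cycles += 1
        for candidate in make_variants(best):     # Generate
            candidate_score = score(candidate)    # Compare
            if candidate_score > best_score:      # Promote only on a strict win
                best, best_score = candidate, candidate_score
    return best, best_score, cycles
```

A candidate that merely ties the baseline is not promoted in this sketch, which keeps the baseline stable when scores are noisy.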

Setup

Install the plugin

In Claude Code, add this repo as a plugin marketplace, then install:

/plugin marketplace add sighup/autoresearch
/plugin install autoresearch@autoresearch

For local development, point Claude Code at your clone:

/plugin marketplace add /path/to/autoresearch
/plugin install autoresearch@autoresearch

Prerequisites

Python 3.10+
uv (for automatic dependency management)
An ANTHROPIC_API_KEY environment variable (only required for prompt mode — not needed when using a custom runner)

The Agent SDK is installed automatically into .autoresearch/.venv on first run when using prompt mode.

Configure your optimization target

You need three things (and optionally a fourth):

1. An artifact

The thing you want to optimize — a prompt file, source code, config, or any file. It can live anywhere in your project.

2. Test cases

A JSONL file with one test case per line. Each line is a JSON object with id, input, and category:

{"id": "api-health", "input": "Add a /health endpoint to our Express.js API that returns server status and uptime.", "category": "api"}
{"id": "cli-export", "input": "Add a --format flag to our CLI tool for JSON and CSV export.", "category": "cli"}
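
A small loader can catch malformed lines before a run. The sketch below is illustrative only: `load_test_cases` is a hypothetical helper, not part of the plugin, and the check for unique IDs is an assumption (the format description above does not state it).

```python
import json
from typing import Dict, List

REQUIRED_FIELDS = ("id", "input", "category")

def load_test_cases(text: str) -> List[Dict[str, str]]:
    """Parse JSONL text, checking required fields and (by assumption) unique ids."""
    cases: List[Dict[str, str]] = []
    seen = set()
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        case = json.loads(line)  # raises json.JSONDecodeError on invalid JSON
        missing = [field for field in REQUIRED_FIELDS if field not in case]
        if missing:
            raise ValueError(f"line {lineno}: missing fields {missing}")
        if case["id"] in seen:
            raise ValueError(f"line {lineno}: duplicate id {case['id']!r}")
        seen.add(case["id"])
        cases.append(case)
    return cases
```

To use it on a file, pass the file's contents, for example `load_test_cases(open("tests/summarizer_cases.jsonl", encoding="utf-8").read())`.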

3. Assertions

A Python file defining binary assertion functions. Each function takes the response text as a string (the model's output in prompt mode, or the runner's stdout in custom runner mode) and returns True or False. Register them in an ASSERTIONS list:

import re

def assert_has_summary(response: str) -> bool:
    """Response contains a Summary section."""
    return bool(re.search(r"## Summary", response, re.IGNORECASE))

def assert_min_length(response: str) -> bool:
    """Response is at least 500 characters."""
    return len(response.strip()) >= 500

ASSERTIONS = [
    assert_has_summary,
    assert_min_length,
]
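
To see how binary assertions turn into a pass rate, the sketch below applies each registered function to one response and averages the results. `grade` and `pass_rate` are hypothetical helpers for illustration; how the plugin aggregates results internally is not described here.

```python
import re
from typing import Callable, Dict, List

def assert_has_summary(response: str) -> bool:
    """Response contains a Summary section."""
    return bool(re.search(r"## Summary", response, re.IGNORECASE))

def assert_min_length(response: str) -> bool:
    """Response is at least 500 characters."""
    return len(response.strip()) >= 500

ASSERTIONS: List[Callable[[str], bool]] = [assert_has_summary, assert_min_length]

def grade(response: str) -> Dict[str, bool]:
    """Map each assertion's name to its pass/fail result (hypothetical helper)."""
    return {fn.__name__: bool(fn(response)) for fn in ASSERTIONS}

def pass_rate(results: Dict[str, bool]) -> float:
    """Fraction of assertions that passed; 0.0 when there are none."""
    return sum(results.values()) / len(results) if results else 0.0

results = grade("## Summary\nToo short.")
print(results)             # {'assert_has_summary': True, 'assert_min_length': False}
print(pass_rate(results))  # 0.5
```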

4. A custom runner (optional)

For non-prompt artifacts, provide a shell command that assesses your artifact. It receives context via environment variables:

AUTORESEARCH_ARTIFACT — path to the artifact being optimized
AUTORESEARCH_TEST_ID — test case ID
AUTORESEARCH_TEST_INPUT — test case input text

Its stdout becomes the response text that assertions grade. Exit 0 on success; non-zero is treated as an error.

Concurrency requirement: When parallel execution is enabled, your runner may be invoked concurrently for different test cases (one subprocess per test case, running simultaneously). Use the AUTORESEARCH_TEST_ID environment variable to isolate per-run state: write to test-specific temp directories, use separate database transactions, and so on. If your runner cannot handle concurrent invocation, set "parallel": false in your config (see below).
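
As an illustration of the runner contract above, here is a minimal runner written in Python (it could be invoked as `python runner.py`; that file name is hypothetical). It reads the three environment variables, keeps scratch files in a temp directory named after the test ID, and returns the text that would go to stdout. The "work" it does (counting lines in the artifact) is only a placeholder.

```python
import os
import re
import sys
import tempfile
from pathlib import Path
from typing import Mapping

def run(env: Mapping[str, str]) -> str:
    """Produce the response text for one test case."""
    artifact = Path(env["AUTORESEARCH_ARTIFACT"])
    test_id = env["AUTORESEARCH_TEST_ID"]
    test_input = env["AUTORESEARCH_TEST_INPUT"]
    # Make the ID safe to embed in a directory name.
    safe_id = re.sub(r"[^A-Za-z0-9_.-]", "_", test_id)
    # Per-test scratch directory, so concurrent invocations do not collide.
    with tempfile.TemporaryDirectory(prefix=f"autoresearch-{safe_id}-") as workdir:
        scratch = Path(workdir) / "input.txt"
        scratch.write_text(test_input, encoding="utf-8")
        # Placeholder work: report the artifact's line count and the input size.
        n_lines = len(artifact.read_text(encoding="utf-8").splitlines())
        n_chars = len(scratch.read_text(encoding="utf-8"))
    return f"test={test_id} artifact_lines={n_lines} input_chars={n_chars}"

def main() -> int:
    """Print the response to stdout; return 0 on success, 1 on error."""
    try:
        print(run(os.environ))
        return 0
    except Exception as exc:
        print(f"runner error: {exc}", file=sys.stderr)
        return 1

# When saved as a script, end the file with:
#     if __name__ == "__main__":
#         sys.exit(main())
```

Returning a non-zero status from `main` on any exception matches the rule that a non-zero exit is treated as an error.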

These files can live anywhere in your project. Point to them from .autoresearch/config.json:

Prompt mode (default):

{
  "artifact": "src/prompts/summarizer.txt",
  "assertions": "tests/summarizer_assertions.py",
  "test_cases": "tests/summarizer_cases.jsonl"
}

Custom runner mode:

{
  "artifact": "pytest.ini",
  "runner": "bash ./run_tests_timed.sh",
  "assertions": "tests/perf_assertions.py",
  "test_cases": "tests/perf_cases.jsonl"
}

Parallelism: Test cases within a variant are assessed concurrently by default in prompt mode, and sequentially in custom runner mode. Override this with the "parallel" config field:

{
  "artifact": "pytest.ini",
  "runner": "bash ./run_tests_timed.sh",
  "parallel": true,
  "assertions": "tests/perf_assertions.py",
  "test_cases": "tests/perf_cases.jsonl"
}
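
To make the effect of "parallel" concrete, the sketch below runs one subprocess per test case, either one after another or through a thread pool. It is an assumption about how such a harness could be built, not the plugin's code; `run_case` and `run_suite` are hypothetical names, and the runner is passed as an argument list rather than a shell string.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

Result = Tuple[str, int, str]  # (test id, exit code, stdout)

def run_case(argv: List[str], artifact: str, case: Dict[str, str]) -> Result:
    """Run the runner once, passing context through environment variables."""
    env = {
        **os.environ,
        "AUTORESEARCH_ARTIFACT": artifact,
        "AUTORESEARCH_TEST_ID": case["id"],
        "AUTORESEARCH_TEST_INPUT": case["input"],
    }
    proc = subprocess.run(argv, env=env, capture_output=True, text=True)
    return case["id"], proc.returncode, proc.stdout

def run_suite(argv: List[str], artifact: str,
              cases: List[Dict[str, str]], parallel: bool) -> List[Result]:
    """Run every case; results come back in the same order as `cases`."""
    if not parallel:
        return [run_case(argv, artifact, case) for case in cases]
    with ThreadPoolExecutor(max_workers=max(1, len(cases))) as pool:
        # pool.map preserves input order even though the cases run concurrently.
        return list(pool.map(lambda case: run_case(argv, artifact, case), cases))
```

Threads are sufficient here because each worker spends its time waiting on a child process rather than running Python code.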

Usage

/autoresearch                                        # asks for artifact path
/autoresearch find                                   # scan repo for candidates
/autoresearch src/prompts/summarizer.txt             # optimize this prompt
/autoresearch src/prompts/summarizer.txt target 95%  # with a goal
/autoresearch pytest.ini                             # optimize a non-prompt artifact (asks for a runner)
/autoresearch clean                                  # clean up .autoresearch/

Claude will set up .autoresearch/, establish a baseline, then iterate through cycles of analysis, variant generation, and evaluation. All working state stays inside .autoresearch/.

Project Layout

The plugin keeps all its working state in a single directory:

your-project/
  .autoresearch/
    config.json              # Points to artifact, assertions, test cases, optional runner
    assertions.py            # Assertions (if not located elsewhere)
    test_cases.jsonl         # Test cases (if not located elsewhere)
    prompts/                 # (prompt mode only)
      current.txt            # Working copy of the prompt
      candidates/            # Variant prompts generated each cycle
      history/               # Archived previous versions with scores
    history/                 # (custom runner mode) archived artifact snapshots
    results/
      current/               # Per-test-case results for current artifact
      v1a/                   # Per-test-case results for candidate v1a
      summary_current.json   # Summarized results for current artifact
      scores.json            # Historical score tracking (baseline + promoted winners)
      failure_analysis.txt   # Failure analysis written each cycle

Writing Good Assertions

Test one thing — Each assertion checks a single, concrete property
Stay binary — No partial credit; pass or fail
Prefer structure over quality — Check that sections exist, formats match, and constraints hold rather than judging subjective quality
Use regex — Pattern matching handles formatting variation well
Name clearly — assert_has_proof_artifacts is better than assert_check_3
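
As one way to apply these guidelines, the sketch below checks a single structural property with a regex that tolerates formatting variation in the heading (level, case, trailing colon). The pattern and the name `assert_has_summary_heading` are illustrative choices, not taken from the plugin.

```python
import re

# A markdown-style heading whose whole text is "Summary", at any heading level.
SUMMARY_HEADING = re.compile(
    r"^\s{0,3}#{1,6}\s*summary\s*:?\s*$",
    re.IGNORECASE | re.MULTILINE,
)

def assert_has_summary_heading(response: str) -> bool:
    """Response contains a heading line that reads 'Summary'."""
    return SUMMARY_HEADING.search(response) is not None
```

Because the pattern is anchored to a whole line, a sentence such as "In summary, it works." does not pass, which keeps the check binary and structural.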

Writing Good Test Cases

Cover your categories — Include test cases from each domain the prompt should handle
Vary complexity — Mix simple and complex inputs
Include edge cases — Add cases where the prompt is likely to fail
Use realistic inputs — The closer to real usage, the more useful the optimization

Non-Determinism

In prompt mode, Claude's responses vary between runs, even with identical prompts and inputs. This means pass rates will fluctuate — a prompt scoring 80% on one run might score 70% or 90% on the next.

With a sufficiently large test suite (15+ cases), individual variance tends to average out across the suite, making overall pass rates relatively stable. However, small differences between candidates (e.g., 75% vs 78%) may not be meaningful.

Tips for working with this:

Don't over-index on small margins — A 2-3% difference could be noise
Use more test cases — Larger suites produce more stable results
Look at assertion patterns, not just totals — If a candidate consistently fixes a specific assertion across multiple test cases, that's a real signal even if the overall pass rate is close
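
A rough way to judge whether a margin is meaningful is the binomial standard error of a pass rate. The sketch below treats every assertion check as an independent pass/fail trial, which is a simplifying assumption: checks on the same response are usually correlated, so real noise can be larger. Both function names are hypothetical.

```python
import math

def pass_rate_stderr(pass_rate: float, n_checks: int) -> float:
    """Standard error of a proportion: sqrt(p * (1 - p) / n)."""
    return math.sqrt(pass_rate * (1.0 - pass_rate) / n_checks)

def margin_is_suspect(rate_a: float, rate_b: float, n_checks: int) -> bool:
    """True when the gap is within about two standard errors of the difference."""
    se_diff = math.sqrt(
        pass_rate_stderr(rate_a, n_checks) ** 2
        + pass_rate_stderr(rate_b, n_checks) ** 2
    )
    return abs(rate_a - rate_b) < 2 * se_diff

# 15 test cases x 4 assertions = 60 checks per candidate (illustrative numbers).
print(margin_is_suspect(0.75, 0.78, 60))  # True: a 3-point gap is within the noise
print(margin_is_suspect(0.50, 0.95, 60))  # False: a 45-point gap is not
```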

In custom runner mode, results may be more deterministic (e.g., test execution time is measurable), but external factors (system load, caching) can still introduce variance.
