sighup/autoresearch

AutoResearch

A Claude Code plugin for iterative optimization through automated evaluation.

AutoResearch evaluates an artifact (prompt, code, config, or anything else) against a test suite, analyzes failures, generates targeted variants, and promotes winners — repeating until it hits your target pass rate.

How It Works

Each optimization cycle:

  1. Assess — Run the current artifact against all test cases using binary assertions
  2. Analyze — Identify which assertions fail most and what patterns cause failures
  3. Generate — Create 3 candidate variants, each changing exactly ONE thing
  4. Compare — Assess all candidates against the full test suite
  5. Promote — If a candidate beats the current best, it becomes the new baseline
  6. Repeat — Continue until pass rate exceeds 90% or 15 cycles are exhausted
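The cycle above can be pictured as a small loop. The following is a minimal sketch, not the plugin's actual implementation: `optimize`, `evaluate`, and `make_variants` are hypothetical names, and the Analyze step is folded into `make_variants`. The 90% target and the 15-cycle cap come from the list above.

```python
from typing import Callable, List, Tuple

def optimize(
    artifact: str,
    evaluate: Callable[[str], float],
    make_variants: Callable[[str], List[str]],
    target: float = 0.90,
    max_cycles: int = 15,
) -> Tuple[str, float, int]:
    """Hypothetical sketch of the assess/generate/compare/promote loop."""
    best, best_score = artifact, evaluate(artifact)      # 1. Assess the baseline
    cycles = 0
    while best_score <= target and cycles < max_cycles:  # 6. Repeat
        cycles += 1
        candidates = make_variants(best)                 # 2-3. Analyze and generate
        if not candidates:
            break                                        # nothing left to try
        scored = [(evaluate(c), c) for c in candidates]  # 4. Compare on the full suite
        top_score, top = max(scored, key=lambda pair: pair[0])
        if top_score > best_score:                       # 5. Promote the winner
            best, best_score = top, top_score
    return best, best_score, cycles

# Toy stand-ins: the "artifact" is a string scored by how many keywords it contains.
REQUIRED = ["summary", "risks", "tests", "rollout"]

def toy_evaluate(text: str) -> float:
    return sum(word in text for word in REQUIRED) / len(REQUIRED)

def toy_variants(text: str) -> List[str]:
    missing = [word for word in REQUIRED if word not in text]
    return [f"{text} {word}" for word in missing[:3]]    # each variant changes one thing

best, score, cycles = optimize("summary", toy_evaluate, toy_variants)
print(best, score, cycles)
```

In this toy run each cycle adds one missing keyword, so the loop stops as soon as the score passes the target rather than using all 15 cycles.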

Setup

Install the plugin

In Claude Code, add this repo as a plugin marketplace, then install:

/plugin marketplace add sighup/autoresearch
/plugin install autoresearch@autoresearch

For local development, point Claude Code at your clone:

/plugin marketplace add /path/to/autoresearch
/plugin install autoresearch@autoresearch

Prerequisites

  • Python 3.10+
  • uv (for automatic dependency management)
  • An ANTHROPIC_API_KEY environment variable (only required for prompt mode — not needed when using a custom runner)

The Agent SDK is installed automatically into .autoresearch/.venv on first run when using prompt mode.

Configure your optimization target

You need three things (and optionally a fourth):

1. An artifact

The thing you want to optimize — a prompt file, source code, config, or any file. It can live anywhere in your project.

2. Test cases

A JSONL file with one test case per line. Each line is a JSON object with id, input, and category:

{"id": "api-health", "input": "Add a /health endpoint to our Express.js API that returns server status and uptime.", "category": "api"}
{"id": "cli-export", "input": "Add a --format flag to our CLI tool for JSON and CSV export.", "category": "cli"}
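Before starting a run, it can help to check the file for malformed lines. The sketch below is a hypothetical helper (`load_test_cases` is not part of the plugin). It assumes that blank lines can be skipped and that ids should be unique, neither of which is stated above.

```python
import json
from typing import Dict, List

REQUIRED_KEYS = {"id", "input", "category"}

def load_test_cases(text: str) -> List[Dict[str, str]]:
    """Parse JSONL text and check each case for the three documented keys."""
    cases: List[Dict[str, str]] = []
    seen = set()
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # assumption: blank lines are harmless
        case = json.loads(line)
        if not isinstance(case, dict):
            raise ValueError(f"line {lineno}: expected a JSON object")
        missing = REQUIRED_KEYS - case.keys()
        if missing:
            raise ValueError(f"line {lineno}: missing keys {sorted(missing)}")
        if case["id"] in seen:
            # assumption: ids are meant to be unique
            raise ValueError(f"line {lineno}: duplicate id {case['id']!r}")
        seen.add(case["id"])
        cases.append(case)
    return cases

sample = (
    '{"id": "api-health", "input": "Add a /health endpoint.", "category": "api"}\n'
    '{"id": "cli-export", "input": "Add a --format flag.", "category": "cli"}\n'
)
print([case["id"] for case in load_test_cases(sample)])
```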

3. Assertions

A Python file defining binary assertion functions. Each function takes the runner's output as a string and returns True or False. Register them in an ASSERTIONS list:

import re

def assert_has_summary(response: str) -> bool:
    """Response contains a Summary section."""
    return bool(re.search(r"## Summary", response, re.IGNORECASE))

def assert_min_length(response: str) -> bool:
    """Response is at least 500 characters."""
    return len(response.strip()) >= 500

ASSERTIONS = [
    assert_has_summary,
    assert_min_length,
]
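To see how binary assertions might turn into a score, here is a rough sketch that applies the list above to one response. The `grade` helper and the way the pass rate is computed are assumptions for illustration; the plugin's own scoring may differ.

```python
import re
from typing import Callable, Dict, List

def assert_has_summary(response: str) -> bool:
    """Response contains a Summary section."""
    return bool(re.search(r"## Summary", response, re.IGNORECASE))

def assert_min_length(response: str) -> bool:
    """Response is at least 500 characters."""
    return len(response.strip()) >= 500

ASSERTIONS = [assert_has_summary, assert_min_length]

def grade(response: str, assertions: List[Callable[[str], bool]]) -> Dict[str, bool]:
    """Hypothetical helper: run every assertion and record pass/fail by name."""
    return {fn.__name__: bool(fn(response)) for fn in assertions}

results = grade("## Summary\nShort answer.", ASSERTIONS)
pass_rate = sum(results.values()) / len(results)  # assumed definition of pass rate
print(results, pass_rate)
```

Keying results by function name is one reason descriptive assertion names pay off: the failure report reads directly as a list of unmet properties.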

4. A custom runner (optional)

For non-prompt artifacts, provide a shell command that assesses your artifact. It receives context via environment variables:

  • AUTORESEARCH_ARTIFACT — path to the artifact being optimized
  • AUTORESEARCH_TEST_ID — test case ID
  • AUTORESEARCH_TEST_INPUT — test case input text

Its stdout becomes the response text that assertions grade. Exit 0 on success; non-zero is treated as an error.

Concurrency requirement: When parallel evaluation is enabled, your runner is invoked concurrently for different test cases (one subprocess per test case, running simultaneously). Use the AUTORESEARCH_TEST_ID environment variable to isolate per-run state: write to test-specific temp directories, use separate database transactions, etc. If your runner cannot handle concurrent invocation, keep "parallel": false in your config (see below).
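A runner can be any command, including a Python script. The sketch below reads the three environment variables, keeps its scratch files in a directory named after the test ID, and writes its result to stdout. The file name `runner.py` and the placeholder "work" it does are assumptions. It also assumes test IDs are safe to use in a directory name.

```python
import os
import sys
import tempfile
from pathlib import Path
from typing import Mapping

def run_once(env: Mapping[str, str]) -> str:
    """Process one test case and return the text that assertions will grade."""
    artifact = Path(env["AUTORESEARCH_ARTIFACT"])
    test_id = env["AUTORESEARCH_TEST_ID"]
    test_input = env["AUTORESEARCH_TEST_INPUT"]
    # A scratch directory per test ID keeps concurrent invocations apart.
    with tempfile.TemporaryDirectory(prefix=f"autoresearch-{test_id}-") as scratch:
        work_file = Path(scratch) / "input.txt"
        work_file.write_text(test_input, encoding="utf-8")
        # Placeholder for real work: replace with whatever exercises the artifact.
        artifact_text = artifact.read_text(encoding="utf-8")
        return f"artifact_chars={len(artifact_text)} input_chars={len(test_input)}"

def main(env: Mapping[str, str] = os.environ) -> int:
    """Return 0 on success, non-zero on error, matching the exit-code contract."""
    try:
        print(run_once(env))
    except (KeyError, OSError) as exc:
        print(f"runner error: {exc!r}", file=sys.stderr)
        return 1
    return 0

# When saved as a script, end the file with:  raise SystemExit(main())
```

With that final line added, a config entry such as "runner": "python runner.py" would invoke it once per test case.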

These files can live anywhere in your project. Point to them from .autoresearch/config.json:

Prompt mode (default):

{
  "artifact": "src/prompts/summarizer.txt",
  "assertions": "tests/summarizer_assertions.py",
  "test_cases": "tests/summarizer_cases.jsonl"
}

Custom runner mode:

{
  "artifact": "pytest.ini",
  "runner": "bash ./run_tests_timed.sh",
  "assertions": "tests/perf_assertions.py",
  "test_cases": "tests/perf_cases.jsonl"
}

Parallelism: Test cases within a variant are assessed concurrently by default in prompt mode, and sequentially in custom runner mode. Override this with the "parallel" config field:

{
  "artifact": "pytest.ini",
  "runner": "bash ./run_tests_timed.sh",
  "parallel": true,
  "assertions": "tests/perf_assertions.py",
  "test_cases": "tests/perf_cases.jsonl"
}
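The mode and parallelism rules described above can be summarized in a few lines. This is a sketch of the documented behavior, not the plugin's code: `resolve_config` and the derived "mode" field are hypothetical, and it assumes the presence of "runner" is what selects custom runner mode.

```python
import json
from typing import Any, Dict

def resolve_config(raw: str) -> Dict[str, Any]:
    """Hypothetical helper: fill in the mode and the effective "parallel" value."""
    cfg = json.loads(raw)
    # Assumption: a "runner" field is what switches to custom runner mode.
    mode = "custom_runner" if "runner" in cfg else "prompt"
    # Documented defaults: concurrent in prompt mode, sequential with a runner.
    default_parallel = mode == "prompt"
    return {**cfg, "mode": mode, "parallel": cfg.get("parallel", default_parallel)}

prompt_cfg = resolve_config(
    '{"artifact": "src/prompts/summarizer.txt",'
    ' "assertions": "tests/summarizer_assertions.py",'
    ' "test_cases": "tests/summarizer_cases.jsonl"}'
)
runner_cfg = resolve_config(
    '{"artifact": "pytest.ini", "runner": "bash ./run_tests_timed.sh",'
    ' "assertions": "tests/perf_assertions.py",'
    ' "test_cases": "tests/perf_cases.jsonl"}'
)
print(prompt_cfg["mode"], prompt_cfg["parallel"])
print(runner_cfg["mode"], runner_cfg["parallel"])
```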

Usage

/autoresearch                                        # asks for artifact path
/autoresearch find                                   # scan repo for candidates
/autoresearch src/prompts/summarizer.txt             # optimize this prompt
/autoresearch src/prompts/summarizer.txt target 95%  # with a goal
/autoresearch pytest.ini                             # optimize non-prompt (will ask for runner)
/autoresearch clean                                  # clean up .autoresearch/

Claude will set up .autoresearch/, establish a baseline, then iterate through cycles of analysis, variant generation, and evaluation. All working state stays inside .autoresearch/.

Project Layout

The plugin keeps all its working state in a single directory:

your-project/
  .autoresearch/
    config.json              # Points to artifact, assertions, test cases, optional runner
    assertions.py            # Assertions (if not located elsewhere)
    test_cases.jsonl         # Test cases (if not located elsewhere)
    prompts/                 # (prompt mode only)
      current.txt            # Working copy of the prompt
      candidates/            # Variant prompts generated each cycle
      history/               # Archived previous versions with scores
    history/                 # (custom runner mode) archived artifact snapshots
    results/
      current/               # Per-test-case results for current artifact
      v1a/                   # Per-test-case results for candidate v1a
      summary_current.json   # Summarized results for current artifact
      scores.json            # Historical score tracking (baseline + promoted winners)
      failure_analysis.txt   # Failure analysis written each cycle

Writing Good Assertions

  • Test one thing — Each assertion checks a single, concrete property
  • Stay binary — No partial credit; pass or fail
  • Prefer structure over quality — Check that sections exist, formats match, and constraints hold rather than judging subjective quality
  • Use regex — Pattern matching handles formatting variation well
  • Name clearly — assert_has_proof_artifacts is better than assert_check_3
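As an illustration of these guidelines, the two assertions below each check one structural property and use regular expressions that tolerate common formatting variation. The function names and the specific properties are examples only, not assertions shipped with the plugin.

```python
import re

# A Markdown heading of any level whose whole text is "Summary" (optional colon).
_SUMMARY_HEADING = re.compile(
    r"^#{1,6}[ \t]*summary[ \t]*:?[ \t]*$", re.IGNORECASE | re.MULTILINE
)

# A list item: "-", "*", "+", "1." or "1)" followed by some content.
_LIST_ITEM = re.compile(r"^[ \t]*(?:[-*+]|\d+[.)])[ \t]+\S", re.MULTILINE)

def assert_has_summary_heading(response: str) -> bool:
    """Response contains a heading titled Summary."""
    return bool(_SUMMARY_HEADING.search(response))

def assert_has_three_list_items(response: str) -> bool:
    """Response contains at least three list items."""
    return len(_LIST_ITEM.findall(response)) >= 3

ASSERTIONS = [assert_has_summary_heading, assert_has_three_list_items]
```

Both checks stay binary and avoid judging quality: a response either has the heading and the list items or it does not.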

Writing Good Test Cases

  • Cover your categories — Include test cases from each domain the prompt should handle
  • Vary complexity — Mix simple and complex inputs
  • Include edge cases — Add cases where the prompt is likely to fail
  • Use realistic inputs — The closer to real usage, the more useful the optimization
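One quick way to check category coverage is to count test cases per category and flag thin ones. The helper names and the minimum of two cases per category are arbitrary choices for this sketch.

```python
from collections import Counter
from typing import Dict, List

def category_counts(cases: List[Dict[str, str]]) -> Counter:
    """Count how many test cases fall into each category."""
    return Counter(case["category"] for case in cases)

def thin_categories(cases: List[Dict[str, str]], minimum: int = 2) -> List[str]:
    """Return the categories with fewer than `minimum` test cases."""
    counts = category_counts(cases)
    return sorted(name for name, n in counts.items() if n < minimum)

cases = [
    {"id": "api-health", "input": "Add a /health endpoint.", "category": "api"},
    {"id": "api-auth", "input": "Add token auth to the API.", "category": "api"},
    {"id": "cli-export", "input": "Add a --format flag.", "category": "cli"},
]
print(dict(category_counts(cases)))
print(thin_categories(cases))
```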

Non-Determinism

In prompt mode, Claude's responses vary between runs, even with identical prompts and inputs. This means pass rates will fluctuate — a prompt scoring 80% on one run might score 70% or 90% on the next.

With a sufficiently large test suite (15+ cases), individual variance tends to average out across the suite, making overall pass rates relatively stable. However, small differences between candidates (e.g., 75% vs 78%) may not be meaningful.

Tips for working with this:

  • Don't over-index on small margins — A 2-3% difference could be noise
  • Use more test cases — Larger suites produce more stable results
  • Look at assertion patterns, not just totals — If a candidate consistently fixes a specific assertion across multiple test cases, that's a real signal even if the overall pass rate is close
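A rough way to judge whether a margin is noise is the binomial standard error of a pass rate. The sketch below treats every (test case, assertion) check as an independent trial, which is only an approximation here because both candidates are scored on the same test cases. The function names and the 1.96 threshold are assumptions for illustration.

```python
import math

def pass_rate_stderr(pass_rate: float, n_checks: int) -> float:
    """Binomial standard error of a pass rate measured over n_checks checks."""
    return math.sqrt(pass_rate * (1.0 - pass_rate) / n_checks)

def is_within_noise(rate_a: float, rate_b: float, n_checks: int, z: float = 1.96) -> bool:
    """True if the gap between two pass rates is smaller than about z standard errors."""
    se_diff = math.sqrt(
        pass_rate_stderr(rate_a, n_checks) ** 2 + pass_rate_stderr(rate_b, n_checks) ** 2
    )
    return abs(rate_a - rate_b) <= z * se_diff

# 75% vs 78% over 100 checks each: the gap is well inside the noise band.
print(is_within_noise(0.75, 0.78, n_checks=100))
# 60% vs 90% over 100 checks each: the gap is far outside it.
print(is_within_noise(0.60, 0.90, n_checks=100))
```

Because the standard error shrinks with the square root of the number of checks, quadrupling the suite only halves the noise band.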

In custom runner mode, results may be more deterministic because the runner reports concrete measurements (e.g., test execution time), but external factors (system load, caching) can still introduce variance.
