SWE-bench Evaluation with eval-protocol (Example)

This example shows how to run SWE-bench evaluations using eval-protocol, either against a local server or a remote server at 35.209.134.123.

Overview

This repository provides a complete setup for running SWE-bench evaluations with full observability through eval-protocol's UI. Each evaluation:

  • Runs the agent in isolated Docker containers
  • Captures all LLM calls with tracing
  • Evaluates patches with the official SWE-bench test harness
  • Displays results in a unified dashboard
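The four steps above can be sketched as a simple loop. Everything in this sketch is an illustrative placeholder — none of these helper names come from this repo or from eval-protocol:

```python
# Illustrative sketch of the evaluation flow described above.
# These helpers are stubs that only mirror the four steps; the real
# implementation lives in server.py and test_swebench.py.

def run_agent_in_container(instance_id: str) -> str:
    """Steps 1-2: run the agent in an isolated container, capture its patch."""
    return f"diff --git ...  (placeholder patch for {instance_id})"

def run_swebench_harness(instance_id: str, patch: str) -> dict:
    """Step 3: evaluate the patch with the official test harness (stubbed)."""
    return {"instance_id": instance_id, "resolved": bool(patch)}

def evaluate(instance_ids: list[str]) -> list[dict]:
    """Step 4: collect per-instance results for the dashboard."""
    results = []
    for iid in instance_ids:
        patch = run_agent_in_container(iid)
        results.append(run_swebench_harness(iid, patch))
    return results
```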

Quickstart

1. Clone the Repository

git clone https://github.com/your-org/swebench-eval-protocol
cd swebench-eval-protocol
uv sync && source .venv/bin/activate

2. Set API Keys

export FIREWORKS_API_KEY="<your_fireworks_key>"
export FIREWORKS_ACCOUNT_ID="<your_fireworks_account_id>"
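Evaluations can run for hours, so it is worth confirming both keys are visible to the test process before starting. A small sanity-check sketch (not part of the repo):

```python
import os

# Sanity-check sketch (not from this repo): list any Fireworks
# credentials that are missing before kicking off a long run.
REQUIRED = ("FIREWORKS_API_KEY", "FIREWORKS_ACCOUNT_ID")

def missing_env_vars(required=REQUIRED):
    """Return the names of required variables that are unset or empty."""
    return [name for name in required if not os.environ.get(name)]

# Example: abort early instead of failing mid-evaluation.
# if missing_env_vars():
#     raise SystemExit(f"Set first: {', '.join(missing_env_vars())}")
```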

Option A: Run locally (server and test on your machine)

  1. Start the server (FastAPI) in one terminal:
uv run python server.py
  2. Run the evaluation test in another terminal:
uv run pytest -q test_swebench.py -s

When running locally, make sure you have Docker installed: SWE-bench tests run inside Docker containers.
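A minimal pre-flight check for Docker could look like the following (an illustrative sketch; the repo itself may not do this):

```python
import shutil
import subprocess

def docker_ready() -> bool:
    """True if the docker CLI is on PATH and the daemon answers `docker info`."""
    if shutil.which("docker") is None:
        return False
    try:
        # `docker info` fails when the daemon is not running.
        subprocess.run(["docker", "info"], check=True,
                       capture_output=True, timeout=10)
        return True
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
        return False
```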

Option B: Run against the remote server (35.209.134.123)

On the remote VM (first time):

On your local machine, point the test to the remote server. Edit test_swebench.py and set:

rollout_processor=RemoteRolloutProcessor(
    remote_base_url="http://35.209.134.123:3000",
    model_base_url="https://tracing.fireworks.ai",
    timeout_seconds=1800,
),

Then run:

cd swebench-eval-protocol
uv run pytest -q test_swebench.py -s
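Before launching a long run against the remote server, a quick reachability probe can save time. This is a sketch: the plain TCP check is an assumption, and the host/port come from the `remote_base_url` above:

```python
import socket

def server_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Attempt a TCP connection to the rollout server; True on success."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Host and port taken from remote_base_url in test_swebench.py:
# server_reachable("35.209.134.123", 3000)
```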

Configuration

Edit test_swebench.py to configure model and concurrency:

completion_params=[{
    "model": "fireworks_ai/accounts/fireworks/models/gpt-oss-120b",
}],
max_concurrent_rollouts=3,  # How many instances in parallel

To test more instances, change:

def rows():
    return create_rows_from_indices(10)  # Change from 2 to 10 (max 500)

Outputs and logs

Each evaluation creates an isolated directory:

invocation-abc-123/
  row_0/
    preds.json                             # Generated patch
    astropy__astropy-12907/
      astropy__astropy-12907.traj.json     # Agent execution trace
    logs/run_evaluation/.../report.json    # Test results (pass/fail)
    agent_0.log                            # Agent console output (captured by server)
  row_1/
    ...
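Given this layout, a short script can tally how many instances were solved by reading each `report.json` (a sketch; the report schema beyond the `resolved` flag is an assumption):

```python
import json
from pathlib import Path

def count_resolved(invocation_dir: str) -> tuple:
    """Return (resolved, total) across all report.json files under row_*/."""
    resolved = total = 0
    for report in Path(invocation_dir).glob("row_*/**/report.json"):
        data = json.loads(report.read_text())
        total += 1
        # report.json marks solved instances with "resolved": true
        if data.get("resolved") is True:
            resolved += 1
    return resolved, total
```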

Key files:

  • preds.json - Patch generated by the model
  • report.json - Test results (resolved: true = instance solved)
  • *.traj.json - Complete agent trajectory
  • exit_statuses_*.yaml - Why runs failed (if applicable)

Performance

  • 2 instances: ~10-30 minutes
  • 10 instances: ~2-5 hours
  • 500 instances (the full SWE-bench Verified set): 24-48 hours on a 16-core machine
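A rough way to project wall-clock time from these figures (a simplification I am assuming here: equal-length instances running in batches of `max_concurrent_rollouts`, ignoring setup overhead):

```python
import math

def estimated_minutes(instances: int, per_instance_min: float,
                      concurrency: int) -> float:
    """Rough wall-clock estimate: batches of `concurrency` run back to back."""
    return math.ceil(instances / concurrency) * per_instance_min

# e.g. 10 instances, ~15 min each, 3 concurrent rollouts
print(estimated_minutes(10, 15, 3))  # → 60
```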

License

MIT
