This example shows how to run SWE-bench evaluations using eval-protocol, either against a local server or a remote server at 35.209.134.123.
This repository provides a complete setup for running SWE-bench evaluations with full observability through eval-protocol's UI. Each evaluation:
- Runs the agent in isolated Docker containers
- Captures all LLM calls with tracing
- Evaluates patches with the official SWE-bench test harness
- Displays results in a unified dashboard
```bash
git clone https://github.com/your-org/swebench-eval-protocol
cd swebench-eval-protocol
uv sync && source .venv/bin/activate
export FIREWORKS_API_KEY="<your_fireworks_key>"
export FIREWORKS_ACCOUNT_ID="<your_fireworks_account_id>"
```
- Start the server (FastAPI) in one terminal:
```bash
uv run python server.py
```
- Run the evaluation test in another terminal:
```bash
uv run pytest -q test_swebench.py -s
```
When running locally, make sure Docker is installed: SWE-bench tests run inside Docker containers.
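Before kicking off a local run, it can save time to verify both prerequisites up front. The sketch below is not part of the repo; it assumes the FastAPI server listens on port 3000 (matching the remote setup described later), so adjust `SERVER_URL` if `server.py` binds elsewhere.

```python
import shutil
import urllib.error
import urllib.request

# Assumed local server address; change if server.py uses a different port.
SERVER_URL = "http://localhost:3000"

def preflight(server_url: str = SERVER_URL) -> list[str]:
    """Return a list of problems that would prevent a local evaluation run."""
    problems = []
    # SWE-bench tests run inside Docker containers, so the CLI must be present.
    if shutil.which("docker") is None:
        problems.append("docker not found on PATH")
    try:
        urllib.request.urlopen(server_url, timeout=5)
    except urllib.error.HTTPError:
        pass  # server responded, even if with an error status code
    except (urllib.error.URLError, OSError):
        problems.append(f"server not reachable at {server_url}")
    return problems

if __name__ == "__main__":
    for problem in preflight():
        print("WARNING:", problem)
```

An empty list means both the Docker CLI and the server are available.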
On the remote VM (first time):
On your local machine, point the test to the remote server. Edit test_swebench.py and set:
```python
rollout_processor=RemoteRolloutProcessor(
    remote_base_url="http://35.209.134.123:3000",
    model_base_url="https://tracing.fireworks.ai",
    timeout_seconds=1800,
),
```
Then run:
```bash
cd swebench-eval-protocol
uv run pytest -q test_swebench.py -s
```
Edit test_swebench.py to configure model and concurrency:
```python
completion_params=[{
    "model": "fireworks_ai/accounts/fireworks/models/gpt-oss-120b",
}],
max_concurrent_rollouts=3,  # How many instances run in parallel
```
To test more instances, change:
```python
def rows():
    return create_rows_from_indices(10)  # Change from 2 to 10 (max 500)
```
Each evaluation creates an isolated directory:
```
invocation-abc-123/
  row_0/
    preds.json                            # Generated patch
    astropy__astropy-12907/
      astropy__astropy-12907.traj.json    # Agent execution trace
    logs/run_evaluation/.../report.json   # Test results (pass/fail)
    agent_0.log                           # Agent console output (captured by server)
  row_1/
    ...
```
Key files:
- `preds.json` - Patch generated by the model
- `report.json` - Test results (`resolved: true` = instance solved)
- `*.traj.json` - Complete agent trajectory
- `exit_statuses_*.yaml` - Why runs failed (if applicable)
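A small helper (not part of the repo) can aggregate these per-row reports into a single pass/fail tally. This sketch assumes the directory layout shown above, with each `report.json` mapping an instance ID to a result dict containing a `resolved` boolean, which is the shape the official SWE-bench harness emits.

```python
import json
from pathlib import Path

def summarize(invocation_dir: str) -> tuple[int, int]:
    """Count (resolved, total) instances across all row_*/.../report.json files."""
    resolved, total = 0, 0
    # Recursively find every report.json under each row directory.
    for report_path in Path(invocation_dir).glob("row_*/**/report.json"):
        report = json.loads(report_path.read_text())
        for result in report.values():
            total += 1
            if result.get("resolved"):
                resolved += 1
    return resolved, total
```

For example, `summarize("invocation-abc-123")` returns the number of solved instances alongside the total evaluated.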
- 2 instances: ~10-30 minutes
- 10 instances: ~2-5 hours
- 500 instances (full verified): 24-48 hours on 16-core machine
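For other instance counts, a back-of-the-envelope estimate follows from the figures above. The sketch below assumes a rough 20 minutes per instance (the midpoint implied by the 2-instance range) and the `max_concurrent_rollouts` setting; it is purely illustrative, since real runtimes vary widely by instance and hardware.

```python
def estimate_hours(n_instances: int, concurrency: int = 3,
                   minutes_per_instance: float = 20.0) -> float:
    """Rough wall-clock estimate: instances run in waves of `concurrency`."""
    waves = -(-n_instances // concurrency)  # ceiling division
    return waves * minutes_per_instance / 60.0
```

For instance, 10 instances at the default concurrency of 3 works out to four waves, or roughly 1.3 hours, within the 2-5 hour range above once slower instances are accounted for.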
MIT