This example shows how to run SWE-bench evaluations using eval-protocol, either against a local server or a remote server at 35.209.134.123.
This repository provides a complete setup for running SWE-bench evaluations with full observability through eval-protocol's UI. Each evaluation:
- Runs the agent in isolated Docker containers
- Captures all LLM calls with tracing
- Evaluates patches with the official SWE-bench test harness
- Displays results in a unified dashboard
```bash
git clone https://github.com/your-org/swebench-eval-protocol
cd swebench-eval-protocol
uv sync && source .venv/bin/activate
export FIREWORKS_API_KEY="<your_fireworks_key>"
export FIREWORKS_ACCOUNT_ID="<your_fireworks_account_id>"
```
- Start the server (FastAPI) in one terminal:
```bash
uv run python server.py
```
- Run the evaluation test in another terminal:
```bash
uv run pytest -q test_swebench.py -s
```
When running locally, make sure Docker is installed: SWE-bench tests run inside Docker containers.
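Before kicking off a local run, it can save time to verify both prerequisites up front. The sketch below is not part of the repo; it assumes the FastAPI server listens on port 3000 (matching the remote setup described later), so adjust `SERVER_URL` if `server.py` binds elsewhere.

```python
import shutil
import urllib.error
import urllib.request

# Assumed local server address; change if server.py uses a different port.
SERVER_URL = "http://localhost:3000"

def preflight(server_url: str = SERVER_URL) -> list[str]:
    """Return a list of problems that would prevent a local evaluation run."""
    problems = []
    # SWE-bench tests run inside Docker containers, so the CLI must be present.
    if shutil.which("docker") is None:
        problems.append("docker not found on PATH")
    try:
        urllib.request.urlopen(server_url, timeout=5)
    except urllib.error.HTTPError:
        pass  # server responded, even if with an error status code
    except (urllib.error.URLError, OSError):
        problems.append(f"server not reachable at {server_url}")
    return problems

if __name__ == "__main__":
    for problem in preflight():
        print("WARNING:", problem)
```

An empty list means both the Docker CLI and the server are available.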
On the remote VM (first time):
On your local machine, point the test to the remote server. Edit test_swebench.py and set:
```python
rollout_processor=RemoteRolloutProcessor(
    remote_base_url="http://35.209.134.123:3000",
    model_base_url="https://tracing.fireworks.ai",
    timeout_seconds=1800,
),
```
Then run:
```bash
cd swebench-eval-protocol
uv run pytest -q test_swebench.py -s
```
Edit test_swebench.py to configure model and concurrency:
```python
completion_params=[{
    "model": "fireworks_ai/accounts/fireworks/models/gpt-oss-120b",
}],
max_concurrent_rollouts=3,  # How many instances run in parallel
```
To test more instances, change:
```python
def rows():
    return create_rows_from_indices(10)  # Change from 2 to 10 (max 500)
```
Each evaluation creates an isolated directory:
```
invocation-abc-123/
  row_0/
    preds.json                            # Generated patch
    astropy__astropy-12907/
      astropy__astropy-12907.traj.json    # Agent execution trace
    logs/run_evaluation/.../report.json   # Test results (pass/fail)
    agent_0.log                           # Agent console output (captured by server)
  row_1/
    ...
```
Key files:
- `preds.json` - Patch generated by the model
- `report.json` - Test results (`resolved: true` = instance solved)
- `*.traj.json` - Complete agent trajectory
- `exit_statuses_*.yaml` - Why runs failed (if applicable)
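A small helper (not part of the repo) can aggregate these per-row reports into a single pass/fail tally. This sketch assumes the directory layout shown above, with each `report.json` mapping an instance ID to a result dict containing a `resolved` boolean, which is the shape the official SWE-bench harness emits.

```python
import json
from pathlib import Path

def summarize(invocation_dir: str) -> tuple[int, int]:
    """Count (resolved, total) instances across all row_*/.../report.json files."""
    resolved, total = 0, 0
    # Recursively find every report.json under each row directory.
    for report_path in Path(invocation_dir).glob("row_*/**/report.json"):
        report = json.loads(report_path.read_text())
        for result in report.values():
            total += 1
            if result.get("resolved"):
                resolved += 1
    return resolved, total
```

For example, `summarize("invocation-abc-123")` returns the number of solved instances alongside the total evaluated.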
- 2 instances: ~10-30 minutes
- 10 instances: ~2-5 hours
- 500 instances (full verified): 24-48 hours on 16-core machine
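For other instance counts, a back-of-the-envelope estimate follows from the figures above. The sketch below assumes a rough 20 minutes per instance (the midpoint implied by the 2-instance range) and the `max_concurrent_rollouts` setting; it is purely illustrative, since real runtimes vary widely by instance and hardware.

```python
def estimate_hours(n_instances: int, concurrency: int = 3,
                   minutes_per_instance: float = 20.0) -> float:
    """Rough wall-clock estimate: instances run in waves of `concurrency`."""
    waves = -(-n_instances // concurrency)  # ceiling division
    return waves * minutes_per_instance / 60.0
```

For instance, 10 instances at the default concurrency of 3 works out to four waves, or roughly 1.3 hours, within the 2-5 hour range above once slower instances are accounted for.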
MIT