blocksecteam/ReEVMBench
Re-Evaluating EVMBench

This repo contains the dataset and code for the paper "Re-Evaluating EVMBench: Are AI Agents Ready for Smart Contract Security?".

Parts of the evaluation data and supporting scripts are adapted from OpenAI Frontier Evals.

Setup

All commands in this README should be run from the root of the project.

Install the required dependencies:

uv sync

Base Images

The evmbench base image depends on the local ploit-builder:latest image. Build it once before running any audit builds:

docker build -f ploit/Dockerfile -t ploit-builder:latest --target ploit-builder .
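As a quick sanity check (assuming `docker` is on your PATH), you can confirm the image exists locally before kicking off audit builds:

```shell
# Optional check: report whether ploit-builder:latest is available locally.
if docker image inspect ploit-builder:latest >/dev/null 2>&1; then
  echo "ploit-builder:latest present"
else
  echo "ploit-builder:latest missing; build it with the command above"
fi
```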

Audit Docker Images

Each audit uses its own Docker image. To build these, run:

uv run docker_build.py --split all

To tag images for a remote registry (useful for running on a compute fleet), pass a fully qualified repo name:

uv run docker_build.py --tag-prefix registry.example.com/evmbench/audit --split all

This builds the images for all tasks. The patch-tasks and exploit-tasks splits, used by the patch and exploit evaluation modes, are subsets of detect-tasks.

If a build fails, or if you only need a single audit image, you can run:

uv run docker_build.py --no-build-base --audit <audit_id>

The build script assumes ploit-builder:latest already exists locally.

Environment variables

Non-human agents require API credentials.

In this repo, model calls are routed through OpenRouter by default rather than through the individual provider-hosted APIs. In practice, this means you usually only need to provide OPENROUTER_API_KEY.

You can set it in either of these ways:

  • export OPENROUTER_API_KEY in your shell before running commands
  • edit run.sh locally and set your own OPENROUTER_API_KEY

Environment variables from your shell always override the defaults in run.sh.

Notes by agent family:

  • OPENROUTER_API_KEY is the main key you should configure for this repo
  • OPENAI_API_KEY, ANTHROPIC_API_KEY, and GEMINI_API_KEY are only needed if you intentionally switch to the providers' official endpoints instead of OpenRouter
  • OPENAI_API_KEY may also be populated indirectly from OPENROUTER_API_KEY by the local wrapper scripts for OpenRouter-backed Codex flows
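A minimal shell setup looks like this (the key value below is a placeholder, not a real credential):

```shell
# Substitute your real OpenRouter key for the placeholder.
export OPENROUTER_API_KEY="sk-or-your-key-here"

# Confirm it is set before launching a run.
test -n "$OPENROUTER_API_KEY" && echo "OPENROUTER_API_KEY is set"
```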

Running the Eval

run.sh wrapper

run.sh is a thin convenience wrapper around evmbench.nano.entrypoint.

Common examples:

./run.sh
./run.sh 2024-06-thorchain
MODE=patch ./run.sh 2024-06-thorchain
MODE=detect-incident ./run.sh
MODE=detect-incident ./run.sh ch-0025
EVMBENCH_ETH_RPC_URL=... MODE=exploit-incident ./run.sh ch-0020
EVMBENCH_BASE_RPC_URL=... MODE=exploit-incident ./run.sh ch-0025

Exploit incidents resolve RPC endpoints by chain from EVMBENCH_ETH_RPC_URL, EVMBENCH_BSC_RPC_URL, or EVMBENCH_BASE_RPC_URL (and the corresponding fallback env vars used by the templates).

For ClaraHacks image builds, use:

EVMBENCH_AUDITS_DIR=$PWD/clarahacks/clarahacks \
EVMBENCH_SPLITS_DIR=$PWD/clarahacks \
uv run python docker_build.py --split clarahacks-detect

To see all configurable options, run:

uv run python -m evmbench.nano.entrypoint --help

Quickstart

The human agent is a no-op and is useful for verifying the harness and Docker setup end-to-end.

For a fast first run, use the debug split (1 audit):

uv run python -m evmbench.nano.entrypoint \
    evmbench.audit_split=debug \
    evmbench.mode=detect \
    evmbench.apply_gold_solution=False \
    evmbench.log_to_run_dir=True \
    evmbench.solver=evmbench.nano.solver.EVMbenchSolver \
    evmbench.solver.agent_id=human

Useful configuration settings

  • Modes: evmbench.mode=detect|patch|exploit
  • Select a single audit: evmbench.audit=<audit_id>
  • Select a split: evmbench.audit_split=debug|detect-tasks|patch-tasks|exploit-tasks (see splits/)
  • Hints: evmbench.hint_level=none|low|med|high|max
  • Gold solution: evmbench.apply_gold_solution=True should score full points in each mode.
  • Concurrency: runner.concurrency=<int> (default 5)

What gets graded

At the end of the agent rollout, the harness extracts submission/ from the agent container and grades via:

  • detect: submission/audit.md (audit report)
  • patch: submission/agent.diff (unified diff against the base commit)
  • exploit: submission/txs.json (transactions to execute)
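The list above can be turned into a hypothetical post-run check (this helper is not part of the harness; it only verifies the graded artifact exists in the extracted submission/ directory):

```shell
# Hypothetical helper: confirm the artifact the grader expects for a mode.
check_submission() {
  case "$1" in
    detect)  test -f submission/audit.md ;;
    patch)   test -f submission/agent.diff ;;
    exploit) test -f submission/txs.json ;;
    *)       echo "unknown mode: $1" >&2; return 1 ;;
  esac
}
```

For example, `check_submission detect` returns non-zero if submission/audit.md is missing.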

Choosing an agent

Set evmbench.solver.agent_id to one of the IDs defined in evmbench/agents/**/config.yaml (the YAML top-level keys).

Examples: human, codex-default, claude-default, gemini-default, opencode-default.

Example: run a real agent on one audit

After setting the relevant API key(s) (see Environment variables above), you can run e.g. Codex in patch mode:

uv run python -m evmbench.nano.entrypoint \
    evmbench.audit=2024-04-noya \
    evmbench.mode=patch \
    evmbench.hint_level=none \
    evmbench.log_to_run_dir=True \
    evmbench.solver=evmbench.nano.solver.EVMbenchSolver \
    evmbench.solver.agent_id=codex-default \
    runner.concurrency=1

To add a new agent, see evmbench/agents (each agent has a config.yaml + a start.sh).

Results and logs

By default, outputs go under runs/. Each invocation creates a run-group directory containing group.log, and per-audit run log files.
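A small helper like the following (hypothetical; directory names are whatever the harness creates) can locate the most recent run group and list its logs:

```shell
# Print the newest run-group directory under runs/ and its contents.
latest=$(ls -dt runs/*/ 2>/dev/null | head -n 1)
if [ -n "$latest" ]; then
  echo "latest run group: $latest"
  ls "$latest"
else
  echo "no runs found under runs/"
fi
```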

License

Licensed under the Apache License, Version 2.0.
