blocksecteam/ReEVMBench
Re-Evaluating EVMBench

This repo contains the dataset and code for the paper "Re-Evaluating EVMBench: Are AI Agents Ready for Smart Contract Security?".

Parts of the evaluation data and supporting scripts are adapted from OpenAI Frontier Evals.

Setup

All commands in this README should be run from the root of the project.

Install the required dependencies:

uv sync

Base Images

The evmbench base image depends on the local ploit-builder:latest image. Build it once before running any audit builds:

docker build -f ploit/Dockerfile -t ploit-builder:latest --target ploit-builder .
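As a quick sanity check (assuming `docker` is on your PATH), you can confirm the image exists locally before kicking off audit builds:

```shell
# Optional check: report whether ploit-builder:latest is available locally.
if docker image inspect ploit-builder:latest >/dev/null 2>&1; then
  echo "ploit-builder:latest present"
else
  echo "ploit-builder:latest missing; build it with the command above"
fi
```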

Audit Docker Images

Each audit uses its own Docker image. To build these, run:

uv run docker_build.py --split all

To tag images for a remote registry (useful for running on a compute fleet), pass a fully qualified repo name:

uv run docker_build.py --tag-prefix registry.example.com/evmbench/audit --split all

This builds the images for all tasks. The patch-tasks and exploit-tasks splits, used by the patch and exploit evaluation modes, are subsets of detect-tasks.

If a build fails, or if you only need a single audit image, you can run:

uv run docker_build.py --no-build-base --audit <audit_id>

The build script assumes ploit-builder:latest already exists locally.

Environment variables

Non-human agents require API credentials.

In this repo, model calls are routed through OpenRouter by default rather than through the individual provider-hosted APIs. In practice, this means you usually only need to provide OPENROUTER_API_KEY.

You can set it in either of these ways:

  • export OPENROUTER_API_KEY in your shell before running commands
  • edit run.sh locally and set your own OPENROUTER_API_KEY

Environment variables from your shell always override the defaults in run.sh.

Notes by agent family:

  • OPENROUTER_API_KEY is the main key you should configure for this repo
  • OPENAI_API_KEY, ANTHROPIC_API_KEY, and GEMINI_API_KEY are only needed if you intentionally switch to the providers' official endpoints instead of OpenRouter
  • OPENAI_API_KEY may also be populated indirectly from OPENROUTER_API_KEY by the local wrapper scripts for OpenRouter-backed Codex flows
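A minimal shell setup looks like this (the key value below is a placeholder, not a real credential):

```shell
# Substitute your real OpenRouter key for the placeholder.
export OPENROUTER_API_KEY="sk-or-your-key-here"

# Confirm it is set before launching a run.
test -n "$OPENROUTER_API_KEY" && echo "OPENROUTER_API_KEY is set"
```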

Running the Eval

run.sh wrapper

run.sh is a thin convenience wrapper around evmbench.nano.entrypoint.

Common examples:

./run.sh
./run.sh 2024-06-thorchain
MODE=patch ./run.sh 2024-06-thorchain
MODE=detect-incident ./run.sh
MODE=detect-incident ./run.sh ch-0025
EVMBENCH_ETH_RPC_URL=... MODE=exploit-incident ./run.sh ch-0020
EVMBENCH_BASE_RPC_URL=... MODE=exploit-incident ./run.sh ch-0025

Exploit incidents resolve RPC endpoints by chain from EVMBENCH_ETH_RPC_URL, EVMBENCH_BSC_RPC_URL, or EVMBENCH_BASE_RPC_URL (and the corresponding fallback env vars used by the templates).

For ClaraHacks image builds, use:

EVMBENCH_AUDITS_DIR=$PWD/clarahacks/clarahacks \
EVMBENCH_SPLITS_DIR=$PWD/clarahacks \
uv run python docker_build.py --split clarahacks-detect

To see all configurable options, run:

uv run python -m evmbench.nano.entrypoint --help

Quickstart

The human agent is a no-op and is useful for verifying the harness and Docker setup end-to-end.

For a fast first run, use the debug split (1 audit):

uv run python -m evmbench.nano.entrypoint \
    evmbench.audit_split=debug \
    evmbench.mode=detect \
    evmbench.apply_gold_solution=False \
    evmbench.log_to_run_dir=True \
    evmbench.solver=evmbench.nano.solver.EVMbenchSolver \
    evmbench.solver.agent_id=human

Useful configuration settings

  • Modes: evmbench.mode=detect|patch|exploit
  • Select a single audit: evmbench.audit=<audit_id>
  • Select a split: evmbench.audit_split=debug|detect-tasks|patch-tasks|exploit-tasks (see splits/)
  • Hints: evmbench.hint_level=none|low|med|high|max
  • Gold solution: evmbench.apply_gold_solution=True should score full points in each mode.
  • Concurrency: runner.concurrency=<int> (default 5)

What gets graded

At the end of the agent rollout, the harness extracts submission/ from the agent container and grades via:

  • detect: submission/audit.md (audit report)
  • patch: submission/agent.diff (unified diff against the base commit)
  • exploit: submission/txs.json (transactions to execute)
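The list above can be turned into a hypothetical post-run check (this helper is not part of the harness; it only verifies the graded artifact exists in the extracted submission/ directory):

```shell
# Hypothetical helper: confirm the artifact the grader expects for a mode.
check_submission() {
  case "$1" in
    detect)  test -f submission/audit.md ;;
    patch)   test -f submission/agent.diff ;;
    exploit) test -f submission/txs.json ;;
    *)       echo "unknown mode: $1" >&2; return 1 ;;
  esac
}
```

For example, `check_submission detect` returns non-zero if submission/audit.md is missing.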

Choosing an agent

Set evmbench.solver.agent_id to one of the IDs defined in evmbench/agents/**/config.yaml (the YAML top-level keys).

Examples: human, codex-default, claude-default, gemini-default, opencode-default.

Example: run a real agent on one audit

After setting the relevant API key(s) (see Environment variables above), you can run e.g. Codex in patch mode:

uv run python -m evmbench.nano.entrypoint \
    evmbench.audit=2024-04-noya \
    evmbench.mode=patch \
    evmbench.hint_level=none \
    evmbench.log_to_run_dir=True \
    evmbench.solver=evmbench.nano.solver.EVMbenchSolver \
    evmbench.solver.agent_id=codex-default \
    runner.concurrency=1

To add a new agent, see evmbench/agents (each agent has a config.yaml + a start.sh).

Results and logs

By default, outputs go under runs/. Each invocation creates a run-group directory containing group.log, and per-audit run log files.
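A small helper like the following (hypothetical; directory names are whatever the harness creates) can locate the most recent run group and list its logs:

```shell
# Print the newest run-group directory under runs/ and its contents.
latest=$(ls -dt runs/*/ 2>/dev/null | head -n 1)
if [ -n "$latest" ]; then
  echo "latest run group: $latest"
  ls "$latest"
else
  echo "no runs found under runs/"
fi
```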

License

Licensed under the Apache License, Version 2.0.
