Local benchmarking tool for AI coding agents. Run your agent against real GitHub bug-fix challenges, get a Bradley-Terry score, see where it would slot into the snapshot of 148 baseline agents I ran across 6 harnesses.
Public leaderboard is closed. I'm not running a hosted submission server anymore — Stanford / Laude Institute's Terminal-Bench 2.0 + Harbor cover the public-leaderboard problem at a scale a solo student can't match. What's left is still useful: the harness adapters, the challenge corpus, and the baseline snapshot. The CLI now runs everything locally — register, run challenges, score, and rank your agent against the bundled baseline — with no network calls.
- Runs your agent (any harness/model combo supported by `@twaldin/harness-ts`) on real merged-PR bug fixes from `click`, `fastify`, `flask`, `jinja`, `koa`, `marshmallow`, `qs`
- Scores each run with the original PR's test suite — pass/fail per test, no rubric judgment
- Compares your scores pairwise against the 148 bundled baseline agents using Bradley-Terry MLE, giving you an inferred ELO and the list of baselines your agent would beat
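The Bradley-Terry fit itself is a short fixed-point computation. Here is a minimal sketch (my own, not the CLI's actual internals) of fitting strengths from pairwise win counts with the classic MM iteration, then mapping them onto an Elo-like scale:

```typescript
// Hypothetical data shape; the real CLI's match records may differ.
type Match = { a: number; b: number; winsA: number; winsB: number };

// Bradley-Terry MLE via the MM (Zermelo) iteration:
//   p_i <- W_i / sum_j ( n_ij / (p_i + p_j) )
function bradleyTerry(n: number, matches: Match[], iters = 200): number[] {
  let p: number[] = new Array(n).fill(1); // strength parameters
  for (let it = 0; it < iters; it++) {
    const wins = new Array(n).fill(0);
    const denom = new Array(n).fill(0);
    for (const m of matches) {
      const games = m.winsA + m.winsB;
      const d = games / (p[m.a] + p[m.b]);
      wins[m.a] += m.winsA;
      wins[m.b] += m.winsB;
      denom[m.a] += d;
      denom[m.b] += d;
    }
    // Note: an agent with zero matches would divide by zero here;
    // a real implementation would regularize or drop it.
    for (let i = 0; i < n; i++) p[i] = wins[i] / denom[i];
    const mean = p.reduce((s, x) => s + x, 0) / n;
    p = p.map((x) => x / mean); // renormalize to keep the scale stable
  }
  // Map strengths onto an Elo-like scale (400 * log10, centered at 1500).
  return p.map((x) => 1500 + 400 * Math.log10(x));
}
```

With a single 3:1 head-to-head record, the fitted Elo gap comes out to 400·log10(3), about 191 points.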
Use it to A/B your own prompt changes, your own harness configs, or a model you suspect is under- or over-rated by the baseline.
```shell
npm i -g @twaldin/agentelo

# register a local agent (no network call — just saves identity to ~/.agentelo)
agentelo register --name my-agent --harness opencode --model gpt-5.4

# run a ranked match against a randomly picked challenge from the bundled corpus
agentelo play

# show your local results + inferred ranking against the baseline snapshot
agentelo leaderboard
```

The first `play` clones the challenge repo into `~/.agentelo/challenges/`. After that, runs are offline.
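The identity that `register` saves is a small local file. Its exact schema isn't documented here, so the following is only a guess at the shape, inferred from the flags above:

```json
{
  "name": "my-agent",
  "harness": "opencode",
  "model": "gpt-5.4"
}
```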
These rankings ship with the CLI and are what your local runs are scored against.
- 148 agents ranked
- 41 challenges across 7 repos
- 6 harnesses: `claude-code`, `codex`, `aider`, `swe-agent`, `opencode`, `gemini`
- Bradley-Terry ELO over all pairwise outcomes from ~3.5K verified runs
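How a single pairwise outcome is derived from two runs isn't spelled out above. One plausible rule (an assumption of mine, not necessarily the CLI's) is to compare two agents' test-pass counts on the same challenge:

```typescript
// Hypothetical outcome rule: on a shared challenge, the agent passing
// more of the original PR's tests wins; equal counts is a tie.
type RunResult = { agent: string; challenge: string; passed: number; total: number };

function pairwiseOutcome(a: RunResult, b: RunResult): "A" | "B" | "tie" {
  if (a.challenge !== b.challenge) throw new Error("runs must share a challenge");
  if (a.passed > b.passed) return "A";
  if (b.passed > a.passed) return "B";
  return "tie";
}
```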
| Rank | Agent | ELO | Win Rate |
|---|---|---|---|
| 1 | swe-agent-glm-5 | 1887 | 85% |
| 2 | opencode-glm-5 | 1882 | 85% |
| 3 | opencode-gpt-5.4 | 1873 | 85% |
| 4 | opencode-gpt-5.3-codex | 1861 | 84% |
| 5 | gemini-gemini-3-flash-preview | 1856 | 84% |
Full rankings, match logs, and the SQLite database are in this repo. Browse the snapshot at tim.waldin.net/agentelo — read-only, no submission.
- Multi-CLI harness abstraction → `harness` (Python + TypeScript libraries, 13 adapters)
- Fleet orchestration → `flt` (multi-agent, multi-CLI orchestrator)
- Prompt/agent optimization → `hone` (uses harness as mutator backend)
- Harness benchmarking → `harness-bench` (hold the model fixed, vary the scaffold)
MIT