# Awesome LLM Bench

The Top 10 entries from each of the most reliable LLM leaderboards, auto-synced daily.


Last sync: **2026-05-01** (UTC, daily auto-update)

Data source: benchlm.ai. For the full leaderboards (up to 44 models per benchmark), pricing dashboards, and methodology, please visit the canonical site. This repository is a Top-10 mirror with attribution, not a replacement.


## About

The LLM evaluation landscape is noisy. LMArena measures preference, not capability; vendor-published numbers are cherry-picked; most aggregators lag months behind frontier model releases. benchlm.ai is the most honest, frequently-updated aggregator I have found. This repository distills the Top 10 of each high-signal benchmark for fast scanning, paired with a curated AI coding-tools landscape that benchlm.ai does not cover.



## Contents

| Category | Sections |
|----------|----------|
| Coding | SWE-bench Verified · LiveCodeBench |
| Agentic | Terminal-Bench 2.0 · OSWorld-Verified · BrowseComp |
| Reasoning | ARC-AGI-2 |
| Knowledge | Humanity's Last Exam |
| Tools | AI Coding Tools Landscape |
| Reference | How to read · Caveats · Attribution |



## Coding

### SWE-bench Verified

Real GitHub issues from popular Python repositories (Django, Flask, scikit-learn). Human-verified subset of SWE-bench. The gold standard for AI coding agents.

| Rank | Model | Provider | License | Score |
|------|-------|----------|---------|-------|
| 1 | Claude Mythos Preview | Anthropic | Closed | 93.9% |
| 2 | Claude Opus 4.7 (Adaptive) | Anthropic | Closed | 87.6% |
| 3 | GPT-5.3 Codex | OpenAI | Closed | 85.0% |
| 4 | Claude Opus 4.5 | Anthropic | Closed | 80.9% |
| 5 | Claude Opus 4.6 | Anthropic | Closed | 80.8% |
| 6 | DeepSeek V4 Pro (Max) | DeepSeek | Open | 80.6% |
| 7 | Kimi K2.6 | Moonshot AI | Open | 80.2% |
| 8 | GPT-5.2 | OpenAI | Closed | 80.0% |
| 9 | Claude Sonnet 4.6 | Anthropic | Closed | 79.6% |
| 10 | DeepSeek V4 Pro (High) | DeepSeek | Open | 79.4% |

Source: https://benchlm.ai/benchmarks/sweVerified · Updated 2026-04-30 · Total models: 44

### LiveCodeBench

Contamination-free code generation. Fresh problems are sampled continuously, mitigating training-data leakage.

| Rank | Model | Provider | License | Score |
|------|-------|----------|---------|-------|
| 1 | DeepSeek V4 Pro (Max) | DeepSeek | Open | 93.5% |
| 2 | DeepSeek V4 Flash (Max) | DeepSeek | Open | 91.6% |
| 3 | DeepSeek V4 Pro (High) | DeepSeek | Open | 89.8% |
| 4 | Kimi K2.6 | Moonshot AI | Open | 89.6% |
| 5 | DeepSeek V4 Flash (High) | DeepSeek | Open | 88.4% |
| 6 | Kimi K2.5 | Moonshot AI | Open | 85.0% |
| 7 | GLM-4.7 | Z.AI | Open | 84.9% |
| 8 | Qwen3.6-27B | Alibaba | Open | 83.9% |
| 9 | Qwen3.6-35B-A3B | Alibaba | Open | 80.4% |
| 10 | Nemotron 3 Nano Omni 30B A3B | NVIDIA | Open | 63.2% |

Source: https://benchlm.ai/benchmarks/liveCodeBench · Updated 2026-04-30 · Total models: 13



## Agentic

### Terminal-Bench 2.0

Multi-step terminal and CLI workflows. Models inspect files, run commands, edit code, and recover from errors over interactive sessions.

| Rank | Model | Provider | License | Score |
|------|-------|----------|---------|-------|
| 1 | GPT-5.5 | OpenAI | Closed | 82.0% |
| 2 | Claude Opus 4.7 (Adaptive) | Anthropic | Closed | 69.4% |
| 3 | MiMo-V2.5-Pro | Xiaomi | Closed | 68.4% |
| 4 | DeepSeek V4 Pro (Max) | DeepSeek | Open | 67.9% |
| 5 | Kimi K2.6 | Moonshot AI | Open | 66.7% |
| 6 | MiMo-V2.5 | Xiaomi | Closed | 65.8% |
| 7 | Qwen 3.6 Max (preview) | Alibaba | Closed | 65.4% |
| 8 | DeepSeek V4 Pro (High) | DeepSeek | Open | 63.3% |
| 9 | Composer 2 | Cursor | Closed | 61.7% |
| 10 | Qwen3.6-27B | Alibaba | Open | 59.3% |

Source: https://benchlm.ai/benchmarks/terminalBench2 · Updated 2026-04-30 · Total models: 17

### OSWorld-Verified

Computer-use tasks in desktop GUIs. Navigation, editing, and complex multi-step workflows.

| Rank | Model | Provider | License | Score |
|------|-------|----------|---------|-------|
| 1 | Holo3-35B-A3B | H Company | Open | 82.6% |
| 2 | Claude Mythos Preview | Anthropic | Closed | 79.6% |
| 3 | Holo3-122B-A10B | H Company | Closed | 78.8% |
| 4 | GPT-5.5 | OpenAI | Closed | 78.7% |
| 5 | Claude Opus 4.7 (Adaptive) | Anthropic | Closed | 78.0% |
| 6 | GPT-5.4 | OpenAI | Closed | 75.0% |
| 7 | Kimi K2.6 | Moonshot AI | Open | 73.1% |
| 8 | Claude Opus 4.6 | Anthropic | Closed | 72.7% |
| 9 | Claude Sonnet 4.6 | Anthropic | Closed | 72.1% |
| 10 | GPT-5.4 mini | OpenAI | Closed | 72.1% |

Source: https://benchlm.ai/benchmarks/osWorldVerified · Updated 2026-04-30 · Total models: 18

### BrowseComp

Web-research agents. Models search, inspect sources, gather evidence, and return correct answers to research-oriented questions.

| Rank | Model | Provider | License | Score |
|------|-------|----------|---------|-------|
| 1 | GPT-5.5 Pro | OpenAI | Closed | 90.1% |
| 2 | GPT-5.4 Pro | OpenAI | Closed | 89.3% |
| 3 | Claude Mythos Preview | Anthropic | Closed | 86.9% |
| 4 | GPT-5.5 | OpenAI | Closed | 84.4% |
| 5 | Claude Opus 4.6 | Anthropic | Closed | 83.7% |
| 6 | DeepSeek V4 Pro (Max) | DeepSeek | Open | 83.4% |
| 7 | Kimi K2.6 | Moonshot AI | Open | 83.2% |
| 8 | GPT-5.4 | OpenAI | Closed | 82.7% |
| 9 | DeepSeek V4 Pro (High) | DeepSeek | Open | 80.4% |
| 10 | Claude Opus 4.7 (Adaptive) | Anthropic | Closed | 79.3% |

Source: https://benchlm.ai/benchmarks/browseComp · Updated 2026-04-30 · Total models: 21



## Reasoning

### ARC-AGI-2

Abstraction and reasoning grid puzzles. A frontier general-intelligence test where humans solve nearly all tasks but models struggle.

| Rank | Model | Provider | License | Score |
|------|-------|----------|---------|-------|
| 1 | GPT-5.5 | OpenAI | Closed | 85.0% |
| 2 | GPT-5.4 Pro | OpenAI | Closed | 83.3% |
| 3 | Gemini 3.1 Pro | Google | Closed | 77.1% |
| 4 | Claude Opus 4.7 (Adaptive) | Anthropic | Closed | 75.8% |
| 5 | Grok 4.20 | xAI | Closed | 53.3% |
| 6 | GPT-5.2 | OpenAI | Closed | 52.9% |
| 7 | Gemini 3 Pro Deep Think | Google | Closed | 45.1% |
| 8 | Muse Spark | Meta | Closed | 42.5% |
| 9 | Gemini 3 Pro | Google | Closed | 31.1% |
| 10 | Claude Sonnet 4.5 | Anthropic | Closed | 13.6% |

Source: https://benchlm.ai/benchmarks/arcAgi2 · Updated 2026-04-30 · Total models: 10



## Knowledge

### Humanity's Last Exam

Expert-level questions across all academic domains. Designed to be hard for frontier models.

| Rank | Model | Provider | License | Score |
|------|-------|----------|---------|-------|
| 1 | Claude Mythos Preview | Anthropic | Closed | 64.7% |
| 2 | GPT-5.4 Pro | OpenAI | Closed | 58.7% |
| 3 | GPT-5.5 Pro | OpenAI | Closed | 57.2% |
| 4 | Claude Opus 4.7 (Adaptive) | Anthropic | Closed | 54.7% |
| 5 | Claude Opus 4.6 | Anthropic | Closed | 53.0% |
| 6 | GLM-5.1 | Z.AI | Open | 52.3% |
| 7 | GPT-5.5 | OpenAI | Closed | 52.2% |
| 8 | GPT-5.4 | OpenAI | Closed | 52.1% |
| 9 | GLM-5 | Z.AI | Open | 50.4% |
| 10 | Muse Spark | Meta | Closed | 50.4% |

Source: https://benchlm.ai/benchmarks/hle · Updated 2026-04-30 · Total models: 31



## AI Coding Tools Landscape

The tools practitioners actually ship code with. The selection bar is high: only tools with verifiable adoption and active maintenance. Full table with criteria, pricing, and update cadence: tools/ai-coding-tools.md.

### CLI agents

| Tool | Provider | Distinguishing capability |
|------|----------|---------------------------|
| Claude Code | Anthropic | Sub-agents, hooks, MCP, slash commands, skills |
| Codex CLI | OpenAI | Official agent CLI with sandboxed execution |
| Gemini CLI | Google | Native Search grounding, generous free tier |
| Aider | Open source | Git-native diffs, repo-map, model-agnostic |

### IDE-native

| Tool | Provider | Distinguishing capability |
|------|----------|---------------------------|
| Cursor | Anysphere | Composer multi-file edit, fastest Tab completion |
| Windsurf | Codeium / OpenAI | Cascade flow, supercomplete |
| Zed AI | Zed Industries | Built into the fastest editor (Rust) |
| GitHub Copilot | GitHub | Largest deployment, broadest IDE coverage |

### VS Code extensions (open source, BYOK)

| Tool | Provider | Distinguishing capability |
|------|----------|---------------------------|
| Cline | Open source | Plan/Act modes, MCP, browser use |
| Roo Code | Open source | Cline fork with custom agent modes |
| Continue | Open source | Customizable assistants and slash commands |

### Cloud agents and codebase Q&A

| Tool | Provider | Distinguishing capability |
|------|----------|---------------------------|
| Devin | Cognition | Long-running autonomous SWE agent |
| Replit Agent | Replit | End-to-end app generation in browser |
| Sourcegraph Cody | Sourcegraph | Code-graph context, repo-scale awareness |


## How to read these numbers

- Do not compare across benchmarks. Different scales, different ceilings.
- Look at the spread. A Top 10 packed within 2–3 points means saturation; differences are noise. A 10+ point lead means the leader is genuinely ahead. (A small sketch after this list makes the rule concrete.)
- Check the date. Each table links back to the source page; benchmarks refresh asynchronously.
- For your own use case, run your own evaluation. Public benchmarks measure averages on someone else's tasks.
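A minimal Python sketch of that spread rule; the thresholds are just the rule of thumb from this list, the sample scores are the SWE-bench Verified Top 10 above, and nothing here reflects benchlm.ai's own methodology:

```python
# Rule-of-thumb reading of a Top-10 column: a tight spread means the
# benchmark is saturating; a large gap between rank 1 and rank 2 means a
# genuine leader. Thresholds mirror the bullets above, nothing more.
def read_top10(scores: list[float]) -> str:
    """scores: the ten scores in descending order, as percentages."""
    spread = scores[0] - scores[-1]   # rank 1 vs rank 10
    lead = scores[0] - scores[1]      # rank 1 vs rank 2
    if spread <= 3:
        return "saturated: rank differences are mostly noise"
    if lead >= 10:
        return "clear leader: rank 1 is genuinely ahead"
    return "meaningful spread, but no runaway leader"

# SWE-bench Verified Top 10 from the table above.
print(read_top10([93.9, 87.6, 85.0, 80.9, 80.8, 80.6, 80.2, 80.0, 79.6, 79.4]))
# -> meaningful spread, but no runaway leader
```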

## Caveats

- benchlm.ai is also an aggregator with judgment calls (category weights, inclusion criteria). I mirror their judgment because it is the best I have found, not because it is objective truth.
- Benchmark contamination is real and growing. Treat any single benchmark with skepticism — consensus across multiple is the signal.
- Model identity drift: vendors silently update models behind the same name. Scores from different dates are not strictly comparable.

## Data source and attribution

All leaderboard data is mirrored from benchlm.ai with full attribution. Each table links back to the canonical page. Excluded by design: benchmarks tagged "Display only" on benchlm.ai itself (GAIA, BFCL v4, FrontierMath, …) — they have incomplete public snapshots and including them would mislead.

For full leaderboards, pricing, methodology, dashboards, and category weights, please visit benchlm.ai.



## Update cadence

A GitHub Actions workflow runs daily at 02:00 UTC, fetches the source pages, parses the leaderboards, and commits to data/ and the README sections only when something has changed. The commit message names what changed. See .github/workflows/sync.yml.
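For illustration, a minimal sketch of the "commit only when something changed" step, assuming a hypothetical `write_if_changed` helper and a `data/<slug>.json` layout; the repository's actual sync scripts may structure this differently:

```python
# Illustrative only: serialize one benchmark's Top-10 rows and report whether
# the snapshot actually changed, so the workflow commits only real updates.
# The data/ layout, slug, and row shape are assumptions for this sketch.
import json
from pathlib import Path

def write_if_changed(slug: str, rows: list[dict], data_dir: Path = Path("data")) -> bool:
    """Write data/<slug>.json and return True only if its content changed."""
    data_dir.mkdir(exist_ok=True)
    path = data_dir / f"{slug}.json"
    new = json.dumps(rows, indent=2, ensure_ascii=False) + "\n"
    if path.exists() and path.read_text(encoding="utf-8") == new:
        return False                      # unchanged: nothing to commit
    path.write_text(new, encoding="utf-8")
    return True                           # changed: the workflow commits it

# Hypothetical usage with one parsed row.
rows = [{"rank": 1, "model": "Claude Mythos Preview", "provider": "Anthropic",
         "license": "Closed", "score": 93.9}]
if write_if_changed("sweVerified", rows):
    print("sweVerified changed; the commit message will name it")
else:
    print("no change; nothing to commit")
```

Comparing serialized content rather than fetch timestamps is one way to keep the daily run idempotent, so quiet days produce no commits.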

## Contributing

PRs welcome — see CONTRIBUTING.md. Add a benchmark by editing scripts/benchmarks.yaml; add a tool by editing tools/ai-coding-tools.md. Keep the bar high: only Current or Refreshing benchmarks on benchlm.ai, only tools with real adoption.

## Related

## License

MIT for the curation, code, and original commentary. Leaderboard data is mirrored from benchlm.ai — see their terms for data use.


Maintained by @leoncuhk · Sister project: awesome-quant-ai
