# Awesome LLM Bench

The Top 10 entries from each of the most reliable LLM leaderboards, auto-synced daily.


Last sync: **2026-05-01** (UTC, daily auto-update)

Data source: benchlm.ai. For the full leaderboards (up to 44 models per benchmark), pricing dashboards, and methodology, please visit the canonical site. This repository is a Top-10 mirror with attribution, not a replacement.


## About

The LLM evaluation landscape is noisy. LMArena measures preference, not capability; vendor-published numbers are cherry-picked; most aggregators lag months behind frontier model releases. benchlm.ai is the most honest, frequently-updated aggregator I have found. This repository distills the Top 10 of each high-signal benchmark for fast scanning, paired with a curated AI coding-tools landscape that benchlm.ai does not cover.



## Contents

| Category | Sections |
|----------|----------|
| Coding | SWE-bench Verified · LiveCodeBench |
| Agentic | Terminal-Bench 2.0 · OSWorld-Verified · BrowseComp |
| Reasoning | ARC-AGI-2 |
| Knowledge | Humanity's Last Exam |
| Tools | AI Coding Tools Landscape |
| Reference | How to read · Caveats · Attribution |



## Coding

### SWE-bench Verified

Real GitHub issues from popular Python repositories (Django, Flask, scikit-learn). Human-verified subset of SWE-bench. The gold standard for AI coding agents.

| Rank | Model | Provider | License | Score |
|------|-------|----------|---------|-------|
| 1 | Claude Mythos Preview | Anthropic | Closed | 93.9% |
| 2 | Claude Opus 4.7 (Adaptive) | Anthropic | Closed | 87.6% |
| 3 | GPT-5.3 Codex | OpenAI | Closed | 85.0% |
| 4 | Claude Opus 4.5 | Anthropic | Closed | 80.9% |
| 5 | Claude Opus 4.6 | Anthropic | Closed | 80.8% |
| 6 | DeepSeek V4 Pro (Max) | DeepSeek | Open | 80.6% |
| 7 | Kimi K2.6 | Moonshot AI | Open | 80.2% |
| 8 | GPT-5.2 | OpenAI | Closed | 80.0% |
| 9 | Claude Sonnet 4.6 | Anthropic | Closed | 79.6% |
| 10 | DeepSeek V4 Pro (High) | DeepSeek | Open | 79.4% |

Source: https://benchlm.ai/benchmarks/sweVerified · Updated 2026-04-30 · Total models: 44

### LiveCodeBench

Contamination-free code generation. Fresh problems are sampled continuously, mitigating training-data leakage.

| Rank | Model | Provider | License | Score |
|------|-------|----------|---------|-------|
| 1 | DeepSeek V4 Pro (Max) | DeepSeek | Open | 93.5% |
| 2 | DeepSeek V4 Flash (Max) | DeepSeek | Open | 91.6% |
| 3 | DeepSeek V4 Pro (High) | DeepSeek | Open | 89.8% |
| 4 | Kimi K2.6 | Moonshot AI | Open | 89.6% |
| 5 | DeepSeek V4 Flash (High) | DeepSeek | Open | 88.4% |
| 6 | Kimi K2.5 | Moonshot AI | Open | 85.0% |
| 7 | GLM-4.7 | Z.AI | Open | 84.9% |
| 8 | Qwen3.6-27B | Alibaba | Open | 83.9% |
| 9 | Qwen3.6-35B-A3B | Alibaba | Open | 80.4% |
| 10 | Nemotron 3 Nano Omni 30B A3B | NVIDIA | Open | 63.2% |

Source: https://benchlm.ai/benchmarks/liveCodeBench · Updated 2026-04-30 · Total models: 13



## Agentic

### Terminal-Bench 2.0

Multi-step terminal and CLI workflows. Models inspect files, run commands, edit code, and recover from errors over interactive sessions.

| Rank | Model | Provider | License | Score |
|------|-------|----------|---------|-------|
| 1 | GPT-5.5 | OpenAI | Closed | 82.0% |
| 2 | Claude Opus 4.7 (Adaptive) | Anthropic | Closed | 69.4% |
| 3 | MiMo-V2.5-Pro | Xiaomi | Closed | 68.4% |
| 4 | DeepSeek V4 Pro (Max) | DeepSeek | Open | 67.9% |
| 5 | Kimi K2.6 | Moonshot AI | Open | 66.7% |
| 6 | MiMo-V2.5 | Xiaomi | Closed | 65.8% |
| 7 | Qwen 3.6 Max (preview) | Alibaba | Closed | 65.4% |
| 8 | DeepSeek V4 Pro (High) | DeepSeek | Open | 63.3% |
| 9 | Composer 2 | Cursor | Closed | 61.7% |
| 10 | Qwen3.6-27B | Alibaba | Open | 59.3% |

Source: https://benchlm.ai/benchmarks/terminalBench2 · Updated 2026-04-30 · Total models: 17

### OSWorld-Verified

Computer-use tasks in desktop GUIs. Navigation, editing, and complex multi-step workflows.

| Rank | Model | Provider | License | Score |
|------|-------|----------|---------|-------|
| 1 | Holo3-35B-A3B | H Company | Open | 82.6% |
| 2 | Claude Mythos Preview | Anthropic | Closed | 79.6% |
| 3 | Holo3-122B-A10B | H Company | Closed | 78.8% |
| 4 | GPT-5.5 | OpenAI | Closed | 78.7% |
| 5 | Claude Opus 4.7 (Adaptive) | Anthropic | Closed | 78.0% |
| 6 | GPT-5.4 | OpenAI | Closed | 75.0% |
| 7 | Kimi K2.6 | Moonshot AI | Open | 73.1% |
| 8 | Claude Opus 4.6 | Anthropic | Closed | 72.7% |
| 9 | Claude Sonnet 4.6 | Anthropic | Closed | 72.1% |
| 10 | GPT-5.4 mini | OpenAI | Closed | 72.1% |

Source: https://benchlm.ai/benchmarks/osWorldVerified · Updated 2026-04-30 · Total models: 18

### BrowseComp

Web-research agents. Models search, inspect sources, gather evidence, and return correct answers to research-oriented questions.

| Rank | Model | Provider | License | Score |
|------|-------|----------|---------|-------|
| 1 | GPT-5.5 Pro | OpenAI | Closed | 90.1% |
| 2 | GPT-5.4 Pro | OpenAI | Closed | 89.3% |
| 3 | Claude Mythos Preview | Anthropic | Closed | 86.9% |
| 4 | GPT-5.5 | OpenAI | Closed | 84.4% |
| 5 | Claude Opus 4.6 | Anthropic | Closed | 83.7% |
| 6 | DeepSeek V4 Pro (Max) | DeepSeek | Open | 83.4% |
| 7 | Kimi K2.6 | Moonshot AI | Open | 83.2% |
| 8 | GPT-5.4 | OpenAI | Closed | 82.7% |
| 9 | DeepSeek V4 Pro (High) | DeepSeek | Open | 80.4% |
| 10 | Claude Opus 4.7 (Adaptive) | Anthropic | Closed | 79.3% |

Source: https://benchlm.ai/benchmarks/browseComp · Updated 2026-04-30 · Total models: 21



## Reasoning

### ARC-AGI-2

Abstraction and reasoning grid puzzles. A frontier general-intelligence test where humans solve nearly all tasks but models struggle.

| Rank | Model | Provider | License | Score |
|------|-------|----------|---------|-------|
| 1 | GPT-5.5 | OpenAI | Closed | 85.0% |
| 2 | GPT-5.4 Pro | OpenAI | Closed | 83.3% |
| 3 | Gemini 3.1 Pro | Google | Closed | 77.1% |
| 4 | Claude Opus 4.7 (Adaptive) | Anthropic | Closed | 75.8% |
| 5 | Grok 4.20 | xAI | Closed | 53.3% |
| 6 | GPT-5.2 | OpenAI | Closed | 52.9% |
| 7 | Gemini 3 Pro Deep Think | Google | Closed | 45.1% |
| 8 | Muse Spark | Meta | Closed | 42.5% |
| 9 | Gemini 3 Pro | Google | Closed | 31.1% |
| 10 | Claude Sonnet 4.5 | Anthropic | Closed | 13.6% |

Source: https://benchlm.ai/benchmarks/arcAgi2 · Updated 2026-04-30 · Total models: 10



## Knowledge

### Humanity's Last Exam

Expert-level questions across all academic domains. Designed to be hard for frontier models.

| Rank | Model | Provider | License | Score |
|------|-------|----------|---------|-------|
| 1 | Claude Mythos Preview | Anthropic | Closed | 64.7% |
| 2 | GPT-5.4 Pro | OpenAI | Closed | 58.7% |
| 3 | GPT-5.5 Pro | OpenAI | Closed | 57.2% |
| 4 | Claude Opus 4.7 (Adaptive) | Anthropic | Closed | 54.7% |
| 5 | Claude Opus 4.6 | Anthropic | Closed | 53.0% |
| 6 | GLM-5.1 | Z.AI | Open | 52.3% |
| 7 | GPT-5.5 | OpenAI | Closed | 52.2% |
| 8 | GPT-5.4 | OpenAI | Closed | 52.1% |
| 9 | GLM-5 | Z.AI | Open | 50.4% |
| 10 | Muse Spark | Meta | Closed | 50.4% |

Source: https://benchlm.ai/benchmarks/hle · Updated 2026-04-30 · Total models: 31



## AI Coding Tools Landscape

The tools practitioners actually ship code with. The selection bar is high: only tools with verifiable adoption and active maintenance. Full table with criteria, pricing, and update cadence: tools/ai-coding-tools.md.

### CLI agents

| Tool | Provider | Distinguishing capability |
|------|----------|---------------------------|
| Claude Code | Anthropic | Sub-agents, hooks, MCP, slash commands, skills |
| Codex CLI | OpenAI | Official agent CLI with sandboxed execution |
| Gemini CLI | Google | Native Search grounding, generous free tier |
| Aider | Open source | Git-native diffs, repo-map, model-agnostic |

### IDE-native

| Tool | Provider | Distinguishing capability |
|------|----------|---------------------------|
| Cursor | Anysphere | Composer multi-file edit, fastest Tab completion |
| Windsurf | Codeium / OpenAI | Cascade flow, supercomplete |
| Zed AI | Zed Industries | Built into the fastest editor (Rust) |
| GitHub Copilot | GitHub | Largest deployment, broadest IDE coverage |

### VS Code extensions (open source, BYOK)

| Tool | Provider | Distinguishing capability |
|------|----------|---------------------------|
| Cline | Open source | Plan/Act modes, MCP, browser use |
| Roo Code | Open source | Cline fork with custom agent modes |
| Continue | Open source | Customizable assistants and slash commands |

### Cloud agents and codebase Q&A

| Tool | Provider | Distinguishing capability |
|------|----------|---------------------------|
| Devin | Cognition | Long-running autonomous SWE agent |
| Replit Agent | Replit | End-to-end app generation in browser |
| Sourcegraph Cody | Sourcegraph | Code-graph context, repo-scale awareness |


## How to read these numbers

- Do not compare across benchmarks. Different scales, different ceilings.
- Look at the spread. A Top 10 packed within 2–3 points means saturation; differences are noise. A 10+ point lead means the leader is genuinely ahead. (A small sketch after this list makes the rule concrete.)
- Check the date. Each table links back to the source page; benchmarks refresh asynchronously.
- For your own use case, run your own evaluation. Public benchmarks measure averages on someone else's tasks.
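A minimal Python sketch of that spread rule; the thresholds are just the rule of thumb from this list, the sample scores are the SWE-bench Verified Top 10 above, and nothing here reflects benchlm.ai's own methodology:

```python
# Rule-of-thumb reading of a Top-10 column: a tight spread means the
# benchmark is saturating; a large gap between rank 1 and rank 2 means a
# genuine leader. Thresholds mirror the bullets above, nothing more.
def read_top10(scores: list[float]) -> str:
    """scores: the ten scores in descending order, as percentages."""
    spread = scores[0] - scores[-1]   # rank 1 vs rank 10
    lead = scores[0] - scores[1]      # rank 1 vs rank 2
    if spread <= 3:
        return "saturated: rank differences are mostly noise"
    if lead >= 10:
        return "clear leader: rank 1 is genuinely ahead"
    return "meaningful spread, but no runaway leader"

# SWE-bench Verified Top 10 from the table above.
print(read_top10([93.9, 87.6, 85.0, 80.9, 80.8, 80.6, 80.2, 80.0, 79.6, 79.4]))
# -> meaningful spread, but no runaway leader
```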

## Caveats

- benchlm.ai is also an aggregator with judgment calls (category weights, inclusion criteria). I mirror their judgment because it is the best I have found, not because it is objective truth.
- Benchmark contamination is real and growing. Treat any single benchmark with skepticism — consensus across multiple is the signal.
- Model identity drift: vendors silently update models behind the same name. Scores from different dates are not strictly comparable.

## Data source and attribution

All leaderboard data is mirrored from benchlm.ai with full attribution. Each table links back to the canonical page. Excluded by design: benchmarks tagged "Display only" on benchlm.ai itself (GAIA, BFCL v4, FrontierMath, …) — they have incomplete public snapshots and including them would mislead.

For full leaderboards, pricing, methodology, dashboards, and category weights, please visit benchlm.ai.



## Update cadence

A GitHub Actions workflow runs daily at 02:00 UTC, fetches the source pages, parses the leaderboards, and commits to data/ and the README sections only when something has changed. The commit message names what changed. See .github/workflows/sync.yml.
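For illustration, a minimal sketch of the "commit only when something changed" step, assuming a hypothetical `write_if_changed` helper and a `data/<slug>.json` layout; the repository's actual sync scripts may structure this differently:

```python
# Illustrative only: serialize one benchmark's Top-10 rows and report whether
# the snapshot actually changed, so the workflow commits only real updates.
# The data/ layout, slug, and row shape are assumptions for this sketch.
import json
from pathlib import Path

def write_if_changed(slug: str, rows: list[dict], data_dir: Path = Path("data")) -> bool:
    """Write data/<slug>.json and return True only if its content changed."""
    data_dir.mkdir(exist_ok=True)
    path = data_dir / f"{slug}.json"
    new = json.dumps(rows, indent=2, ensure_ascii=False) + "\n"
    if path.exists() and path.read_text(encoding="utf-8") == new:
        return False                      # unchanged: nothing to commit
    path.write_text(new, encoding="utf-8")
    return True                           # changed: the workflow commits it

# Hypothetical usage with one parsed row.
rows = [{"rank": 1, "model": "Claude Mythos Preview", "provider": "Anthropic",
         "license": "Closed", "score": 93.9}]
if write_if_changed("sweVerified", rows):
    print("sweVerified changed; the commit message will name it")
else:
    print("no change; nothing to commit")
```

Comparing serialized content rather than fetch timestamps is one way to keep the daily run idempotent, so quiet days produce no commits.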

## Contributing

PRs welcome — see CONTRIBUTING.md. Add a benchmark by editing scripts/benchmarks.yaml; add a tool by editing tools/ai-coding-tools.md. Keep the bar high: only Current or Refreshing benchmarks on benchlm.ai, only tools with real adoption.

## Related

## License

MIT for the curation, code, and original commentary. Leaderboard data is mirrored from benchlm.ai — see their terms for data use.


Maintained by @leoncuhk · Sister project: awesome-quant-ai
