Top 10 from each of the most reliable LLM leaderboards, auto-synced daily.
Last sync: **2026-05-01** (UTC, daily auto-update) · Data source: benchlm.ai. For the full leaderboards (43+ models per benchmark), pricing dashboards, and methodology, please visit the canonical site. This repository is a Top-10 mirror with attribution, not a replacement.
The LLM evaluation landscape is noisy. LMArena measures preference, not capability; vendor-published numbers are cherry-picked; most aggregators lag months behind frontier model releases. benchlm.ai is the most honest, frequently updated aggregator I have found. This repository distills the Top 10 of each high-signal benchmark for fast scanning, paired with a curated AI coding-tools landscape that benchlm.ai does not cover.
- Coding — SWE-bench Verified · LiveCodeBench
- Agentic — Terminal-Bench 2.0 · OSWorld-Verified · BrowseComp
- Reasoning — ARC-AGI-2
- Knowledge — Humanity's Last Exam
- Tools — AI Coding Tools Landscape
- Reference — How to read · Caveats · Attribution
**SWE-bench Verified**
Real GitHub issues from popular Python repositories (Django, Flask, scikit-learn). Human-verified subset of SWE-bench. The gold standard for AI coding agents.
| Rank | Model | Provider | License | Score |
|---|---|---|---|---|
| 1 | Claude Mythos Preview | Anthropic | Closed | 93.9% |
| 2 | Claude Opus 4.7 (Adaptive) | Anthropic | Closed | 87.6% |
| 3 | GPT-5.3 Codex | OpenAI | Closed | 85.0% |
| 4 | Claude Opus 4.5 | Anthropic | Closed | 80.9% |
| 5 | Claude Opus 4.6 | Anthropic | Closed | 80.8% |
| 6 | DeepSeek V4 Pro (Max) | DeepSeek | Open | 80.6% |
| 7 | Kimi K2.6 | Moonshot AI | Open | 80.2% |
| 8 | GPT-5.2 | OpenAI | Closed | 80.0% |
| 9 | Claude Sonnet 4.6 | Anthropic | Closed | 79.6% |
| 10 | DeepSeek V4 Pro (High) | DeepSeek | Open | 79.4% |
Source: https://benchlm.ai/benchmarks/sweVerified · Updated 2026-04-30 · Total models: 44
**LiveCodeBench**
Contamination-free code generation. Fresh problems are sampled continuously, mitigating training-data leakage.
| Rank | Model | Provider | License | Score |
|---|---|---|---|---|
| 1 | DeepSeek V4 Pro (Max) | DeepSeek | Open | 93.5% |
| 2 | DeepSeek V4 Flash (Max) | DeepSeek | Open | 91.6% |
| 3 | DeepSeek V4 Pro (High) | DeepSeek | Open | 89.8% |
| 4 | Kimi K2.6 | Moonshot AI | Open | 89.6% |
| 5 | DeepSeek V4 Flash (High) | DeepSeek | Open | 88.4% |
| 6 | Kimi K2.5 | Moonshot AI | Open | 85.0% |
| 7 | GLM-4.7 | Z.AI | Open | 84.9% |
| 8 | Qwen3.6-27B | Alibaba | Open | 83.9% |
| 9 | Qwen3.6-35B-A3B | Alibaba | Open | 80.4% |
| 10 | Nemotron 3 Nano Omni 30B A3B | NVIDIA | Open | 63.2% |
Source: https://benchlm.ai/benchmarks/liveCodeBench · Updated 2026-04-30 · Total models: 13
**Terminal-Bench 2.0**
Multi-step terminal and CLI workflows. Models inspect files, run commands, edit code, and recover from errors over interactive sessions.
| Rank | Model | Provider | License | Score |
|---|---|---|---|---|
| 1 | GPT-5.5 | OpenAI | Closed | 82.0% |
| 2 | Claude Opus 4.7 (Adaptive) | Anthropic | Closed | 69.4% |
| 3 | MiMo-V2.5-Pro | Xiaomi | Closed | 68.4% |
| 4 | DeepSeek V4 Pro (Max) | DeepSeek | Open | 67.9% |
| 5 | Kimi K2.6 | Moonshot AI | Open | 66.7% |
| 6 | MiMo-V2.5 | Xiaomi | Closed | 65.8% |
| 7 | Qwen 3.6 Max (preview) | Alibaba | Closed | 65.4% |
| 8 | DeepSeek V4 Pro (High) | DeepSeek | Open | 63.3% |
| 9 | Composer 2 | Cursor | Closed | 61.7% |
| 10 | Qwen3.6-27B | Alibaba | Open | 59.3% |
Source: https://benchlm.ai/benchmarks/terminalBench2 · Updated 2026-04-30 · Total models: 17
**OSWorld-Verified**
Computer-use tasks in desktop GUIs. Navigation, editing, and complex multi-step workflows.
| Rank | Model | Provider | License | Score |
|---|---|---|---|---|
| 1 | Holo3-35B-A3B | H Company | Open | 82.6% |
| 2 | Claude Mythos Preview | Anthropic | Closed | 79.6% |
| 3 | Holo3-122B-A10B | H Company | Closed | 78.8% |
| 4 | GPT-5.5 | OpenAI | Closed | 78.7% |
| 5 | Claude Opus 4.7 (Adaptive) | Anthropic | Closed | 78.0% |
| 6 | GPT-5.4 | OpenAI | Closed | 75.0% |
| 7 | Kimi K2.6 | Moonshot AI | Open | 73.1% |
| 8 | Claude Opus 4.6 | Anthropic | Closed | 72.7% |
| 9 | Claude Sonnet 4.6 | Anthropic | Closed | 72.1% |
| 10 | GPT-5.4 mini | OpenAI | Closed | 72.1% |
Source: https://benchlm.ai/benchmarks/osWorldVerified · Updated 2026-04-30 · Total models: 18
**BrowseComp**
Web-research agents. Models search, inspect sources, gather evidence, and return correct answers to research-oriented questions.
| Rank | Model | Provider | License | Score |
|---|---|---|---|---|
| 1 | GPT-5.5 Pro | OpenAI | Closed | 90.1% |
| 2 | GPT-5.4 Pro | OpenAI | Closed | 89.3% |
| 3 | Claude Mythos Preview | Anthropic | Closed | 86.9% |
| 4 | GPT-5.5 | OpenAI | Closed | 84.4% |
| 5 | Claude Opus 4.6 | Anthropic | Closed | 83.7% |
| 6 | DeepSeek V4 Pro (Max) | DeepSeek | Open | 83.4% |
| 7 | Kimi K2.6 | Moonshot AI | Open | 83.2% |
| 8 | GPT-5.4 | OpenAI | Closed | 82.7% |
| 9 | DeepSeek V4 Pro (High) | DeepSeek | Open | 80.4% |
| 10 | Claude Opus 4.7 (Adaptive) | Anthropic | Closed | 79.3% |
Source: https://benchlm.ai/benchmarks/browseComp · Updated 2026-04-30 · Total models: 21
**ARC-AGI-2**
Abstraction and reasoning grid puzzles. A frontier general-intelligence test where humans solve nearly all tasks but models struggle.
| Rank | Model | Provider | License | Score |
|---|---|---|---|---|
| 1 | GPT-5.5 | OpenAI | Closed | 85.0% |
| 2 | GPT-5.4 Pro | OpenAI | Closed | 83.3% |
| 3 | Gemini 3.1 Pro | Google | Closed | 77.1% |
| 4 | Claude Opus 4.7 (Adaptive) | Anthropic | Closed | 75.8% |
| 5 | Grok 4.20 | xAI | Closed | 53.3% |
| 6 | GPT-5.2 | OpenAI | Closed | 52.9% |
| 7 | Gemini 3 Pro Deep Think | Google | Closed | 45.1% |
| 8 | Muse Spark | Meta | Closed | 42.5% |
| 9 | Gemini 3 Pro | Google | Closed | 31.1% |
| 10 | Claude Sonnet 4.5 | Anthropic | Closed | 13.6% |
Source: https://benchlm.ai/benchmarks/arcAgi2 · Updated 2026-04-30 · Total models: 10
**Humanity's Last Exam**
Expert-level questions across all academic domains. Designed to be hard for frontier models.
| Rank | Model | Provider | License | Score |
|---|---|---|---|---|
| 1 | Claude Mythos Preview | Anthropic | Closed | 64.7% |
| 2 | GPT-5.4 Pro | OpenAI | Closed | 58.7% |
| 3 | GPT-5.5 Pro | OpenAI | Closed | 57.2% |
| 4 | Claude Opus 4.7 (Adaptive) | Anthropic | Closed | 54.7% |
| 5 | Claude Opus 4.6 | Anthropic | Closed | 53.0% |
| 6 | GLM-5.1 | Z.AI | Open | 52.3% |
| 7 | GPT-5.5 | OpenAI | Closed | 52.2% |
| 8 | GPT-5.4 | OpenAI | Closed | 52.1% |
| 9 | GLM-5 | Z.AI | Open | 50.4% |
| 10 | Muse Spark | Meta | Closed | 50.4% |
Source: https://benchlm.ai/benchmarks/hle · Updated 2026-04-30 · Total models: 31
**AI Coding Tools Landscape**
The tools practitioners actually ship code with. The selection bar is high: only tools with verifiable adoption and active maintenance. Full table with criteria, pricing, and update cadence: tools/ai-coding-tools.md.
**CLI agents**
| Tool | Provider | Distinguishing capability |
|---|---|---|
| Claude Code | Anthropic | Sub-agents, hooks, MCP, slash commands, skills |
| Codex CLI | OpenAI | Official agent CLI with sandboxed execution |
| Gemini CLI | Google | Native Search grounding, generous free tier |
| Aider | Open source | Git-native diffs, repo-map, model-agnostic |
**IDE-native tools**
| Tool | Provider | Distinguishing capability |
|---|---|---|
| Cursor | Anysphere | Composer multi-file edit, fastest Tab completion |
| Windsurf | Cognition | Cascade flow, Supercomplete |
| Zed AI | Zed Industries | Built into the fastest editor (Rust) |
| GitHub Copilot | GitHub | Largest deployment, broadest IDE coverage |
**Editor extensions (open source)**
| Tool | Provider | Distinguishing capability |
|---|---|---|
| Cline | Open source | Plan/Act modes, MCP, browser use |
| Roo Code | Open source | Cline fork with custom agent modes |
| Continue | Open source | Customizable assistants and slash commands |
**Autonomous agents & platforms**
| Tool | Provider | Distinguishing capability |
|---|---|---|
| Devin | Cognition | Long-running autonomous SWE agent |
| Replit Agent | Replit | End-to-end app generation in browser |
| Sourcegraph Cody | Sourcegraph | Code-graph context, repo-scale awareness |
- Do not compare across benchmarks. Different scales, different ceilings.
- Look at the spread. A top 10 within 2–3 points means saturation; differences are noise. A 10+ point lead means the leader is genuinely ahead. The sketch after this list makes the heuristic concrete.
- Check the date. Each table links back to the source page; benchmarks refresh asynchronously.
- For your own use case, run your own evaluation. Public benchmarks measure averages on someone else's tasks.
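A minimal Python sketch of the spread heuristic. The thresholds mirror the rule of thumb above; the function name and verdict strings are illustrative, not part of this repo's tooling.

```python
def spread_verdict(scores: list[float]) -> str:
    """Classify a top-10 score column (percentage points)."""
    top10 = sorted(scores, reverse=True)[:10]
    lead = top10[0] - top10[1]     # gap between #1 and #2
    spread = top10[0] - top10[-1]  # gap across the whole top 10
    if lead >= 10:
        return f"leader genuinely ahead (lead {lead:.1f} pts)"
    if spread <= 3:
        return f"saturated; differences are noise (spread {spread:.1f} pts)"
    return f"mixed signal (lead {lead:.1f} pts, spread {spread:.1f} pts)"

# SWE-bench Verified top 10 from the table above:
swe = [93.9, 87.6, 85.0, 80.9, 80.8, 80.6, 80.2, 80.0, 79.6, 79.4]
print(spread_verdict(swe))  # mixed signal (lead 6.3 pts, spread 14.5 pts)
```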
- benchlm.ai is also an aggregator with judgment calls (category weights, inclusion criteria). I mirror their judgment because it is the best I have found, not because it is objective truth.
- Benchmark contamination is real and growing. Treat any single benchmark with skepticism; consensus across multiple is the signal (see the sketch after this list).
- Model identity drift: vendors silently update models behind the same name. Scores from different dates are not strictly comparable.
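One way to operationalize the consensus point, as a hedged Python sketch: average each model's rank across the benchmarks it appears in. The rank lists below are read off the tables above; note that absence from a top 10 is itself signal, which a plain mean ignores.

```python
# Ranks read off the tables above (SWE-bench Verified, LiveCodeBench,
# Terminal-Bench 2.0, OSWorld-Verified, BrowseComp, ARC-AGI-2, HLE).
# A model is listed only for benchmarks where it made the top 10.
ranks = {
    "Claude Opus 4.7 (Adaptive)": [2, 2, 5, 10, 4, 4],
    "Kimi K2.6": [7, 4, 5, 7, 7],
    "DeepSeek V4 Pro (Max)": [6, 1, 4, 6],
}

consensus = {model: sum(r) / len(r) for model, r in ranks.items()}
for model, mean_rank in sorted(consensus.items(), key=lambda kv: kv[1]):
    # Lower mean rank = stronger consensus showing across benchmarks.
    print(f"{model:28s} mean rank {mean_rank:.2f} over {len(ranks[model])} boards")
```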
All leaderboard data is mirrored from benchlm.ai with full attribution. Each table links back to the canonical page. Excluded by design: benchmarks tagged "Display only" on benchlm.ai itself (GAIA, BFCL v4, FrontierMath, …) — they have incomplete public snapshots and including them would mislead.
For full leaderboards, pricing, methodology, dashboards, and category weights, please visit benchlm.ai.
A GitHub Actions workflow runs daily at 02:00 UTC, fetches the source pages, parses the leaderboards, and commits to data/ and the README sections only when something has changed. The commit message names what changed. See .github/workflows/sync.yml.
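A minimal sketch of the change-detection step, assuming a data/<slug>.json layout and a row schema shaped like the tables above (both are assumptions; the authoritative logic lives in sync.yml and the script it calls):

```python
import json
import pathlib

def write_if_changed(slug: str, rows: list[dict]) -> bool:
    """Serialize one leaderboard deterministically and write it only when
    the snapshot differs, so the daily run commits only on real changes."""
    path = pathlib.Path("data") / f"{slug}.json"
    new = json.dumps(rows, indent=2, sort_keys=True) + "\n"
    if path.exists() and path.read_text() == new:
        return False   # identical snapshot: nothing to commit
    path.parent.mkdir(exist_ok=True)
    path.write_text(new)
    return True        # changed: the workflow commits data/ and the README

# Hypothetical row shape, mirroring the table columns above:
rows = [{"rank": 1, "model": "GPT-5.5", "provider": "OpenAI",
         "license": "Closed", "score": 82.0}]
if write_if_changed("terminalBench2", rows):
    print("terminalBench2 changed")
```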
PRs welcome — see CONTRIBUTING.md. Add a benchmark by editing scripts/benchmarks.yaml; add a tool by editing tools/ai-coding-tools.md. Keep the bar high: only Current or Refreshing benchmarks on benchlm.ai, only tools with real adoption.
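For benchmark PRs, here is the entry shape I would expect in scripts/benchmarks.yaml, embedded as a Python-validated sketch. The slug/name/url schema is an assumption for illustration; check the actual file before opening a PR.

```python
import yaml  # PyYAML

REQUIRED = {"slug", "name", "url"}  # assumed schema, not confirmed

entries = yaml.safe_load("""
- slug: sweVerified
  name: SWE-bench Verified
  url: https://benchlm.ai/benchmarks/sweVerified
""")

for entry in entries:
    missing = REQUIRED - entry.keys()
    assert not missing, f"{entry.get('slug', '?')}: missing keys {missing}"
print(f"{len(entries)} benchmark entries OK")
```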
- benchlm.ai — canonical source
- Awesome Quant AI — sister list
- Artificial Analysis — alternative aggregator (price/perf focus)
- LMArena — pairwise human preference
MIT for the curation, code, and original commentary. Leaderboard data is mirrored from benchlm.ai — see their terms for data use.