LLM Benchmark Suite


Local-first, reproducible benchmarking for LLMs you run yourself.

llm-benchmark runs deterministic task suites against local and OpenAI-compatible inference servers, then writes comparison-friendly JSON, JSONL, CSV, and optional HTML reports. It supports LM Studio, Ollama, llama.cpp, vLLM, SGLang, TensorRT-LLM, TGI, KTransformers, and generic OpenAI-compatible backends with automatic model discovery.

Current state: 104 tasks, 8 scored categories, 19 scoring modes, crash-safe resume, pass@k, deterministic workflow-trace checks, LLM/rubric judging, result diffs, and pairwise arena mode with persisted ELO artifacts.


Why This Exists

Most LLM leaderboards measure proprietary models on curated benchmarks you can't reproduce. This tool runs deterministic, open tasks against models you already have running on your own machine or on an OpenAI-compatible server you control. Results are built for auditability: temperature=0 by default, tolerance-based numeric scoring, strict schema validation, task hashes, versioned JSONL records, execution traces, release/signal metadata, and run-to-run comparison.


At a Glance

Area Current state
Tasks 104 tasks across math, knowledge, coding, agentic, reasoning, writing, summarization, and instruction-following
Backends LM Studio, Ollama, llama.cpp, vLLM, SGLang, TensorRT-LLM, TGI, KTransformers, generic OpenAI-compatible
Scoring 19 scoring modes, including exact/numeric/regex/JSON checks, code execution, workflow traces, pass@k, logprob choice, LLM judge, and rubric judge
Reproducibility Task version/hash tracking, release/signal metadata, opt-in Hugging Face auto-config, dataset dry-run safety, resumable JSONL logs
Outputs Rich console tables, JSON, CSV, crash-safe JSONL, optional HTML reports, result comparisons, arena ELO JSON

Features

Feature Details
Multi-backend LM Studio, Ollama, llama.cpp, vLLM, TGI, SGLang, TensorRT, KTransformers
Auto-discovery Probes enabled backends, enumerates all available models
HF Auto-Config Opt-in fetch of Hugging Face generation_config.json parameters for tested models
Thinking model support Qwen3, DeepSeek-R1 — reasoning tokens captured, clean text scored
104 tasks, 8 categories Math, reasoning, coding, agentic, knowledge, writing, summarization, instruction-following, including vision-language tasks folded into reasoning/writing
19 scoring types numeric, exact, contains, multi_contains, fuzzy_match, regex, json_keys, json_schema, line_count, code_exec, word_count, contains_n, not_contains, ends_with, logprob_choice, workflow_trace, pass_at_k, llm_judge, rubric_judge
Workflow trace scoring workflow_trace grades ordered tool-call traces, required args, and optional replayed mock state
Few-shot examples Add few_shot: to any task YAML to inject conversation history before the prompt
pass@k coding scoring.type: pass_at_k can run n samples and estimate pass@k with the unbiased Chen et al. (2021) estimator
LLM-as-judge CoT-then-score protocol — enable with judge.enabled: true in config
Run resumption --resume continues from an interrupted run, skipping task-version/content-matched results
Result comparison --compare diffs saved JSON/JSONL runs, including model-level and task-level score movement
Arena artifacts --arena runs pairwise ELO judging and persists leaderboard + match history JSON
Task versioning metadata.version in task YAML propagates to JSONL for audit trails
Execution surfaces Optional execution_surface and source_signal tags produce Claw-style surface breakdowns in reports
Composite score Weighted cross-category score in summary table (coding/math/reasoning weighted higher)
Crash-safe results Incremental JSONL written after every task — restart safely
Latency histograms Per-category min/median/p95/max latency surfaced in summary table
CI integration --ci-threshold flag returns exit code 1 when score drops below target
Rich console output Live per-task scores + summary tables + CSV/JSON export

Requirements

  • Python 3.11+
  • At least one inference backend running locally (LM Studio / Ollama / llama.cpp) or an OpenAI-compatible endpoint you control

Setup

Windows:

cd C:\path\to\llm-benchmark
.\setup.ps1
.\.venv\Scripts\Activate.ps1
llm-bench --dry-run

macOS / Linux:

cd ~/path/to/llm-benchmark
chmod +x setup.sh && ./setup.sh
source .venv/bin/activate
llm-bench --dry-run

Manual editable install, if you created the virtual environment yourself:

pip install -e .
llm-bench --discover

On PowerShell, run.py by itself will not execute from the current directory. Prefer the installed llm-bench command. If you want to run the script file directly, use python .\run.py --dry-run.


Quick Start

  1. Start your backend:

    • LM Studio — open app → enable Local Server (Settings → Local Server)
    • Ollama — ollama serve (starts automatically on most installs)
    • llama.cpp — ./llama-server -m model.gguf --port 8080
  2. Enable the backend in config.yaml:

    backends:
      lm_studio:
        enabled: true
        base_url: "http://localhost:1234"
        api_key: "lm-studio"          # placeholder — LM Studio ignores this
      ollama:
        enabled: false
        base_url: "http://localhost:11434"
      llamacpp:
        enabled: false
        base_url: "http://localhost:8080"

    Security note: API keys in config.yaml are only placeholders for local servers. For real keys, use environment variables: LLM_BENCH_<BACKEND_NAME>_API_KEY.

  3. Discover available models:

    llm-bench --discover
    
  4. Run everything:

    llm-bench
    

All Commands

llm-bench                           # auto-discover + run all 104 tasks
llm-bench --discover                # probe backends, list models, exit
llm-bench --dry-run                 # validate task files + check backends, no inference
llm-bench --model "qwen3:8b"        # single model (all categories)
llm-bench --backend ollama          # restrict to one backend type
llm-bench --category math           # single category
llm-bench --task capital_france     # single task by ID
llm-bench --limit 2                 # first 2 tasks per category (quick smoke test)
llm-bench --resume                  # skip tasks already in the most recent results JSONL
llm-bench --compare old.jsonl new.jsonl # compare two saved result files
llm-bench --compare old.json new.json --compare-top 20 # show more task deltas
llm-bench --no-autoload             # skip LM Studio model-load attempt
llm-bench --allow-code-exec         # enable code_exec scoring (runs generated Python locally)
llm-bench --ci-threshold 0.80       # exit 1 if overall score < 80%  (CI integration)
llm-bench --html-report             # write an interactive HTML report
llm-bench --arena                   # pairwise ELO arena using an LLM judge
llm-bench --output my_results       # custom output directory

Script fallback from the repo root: replace llm-bench with python .\run.py on Windows PowerShell or python run.py on macOS/Linux.
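
Because --ci-threshold exits with code 1 when the overall score falls below the target, llm-bench can gate a pipeline directly. A minimal GitHub Actions sketch, assuming a self-hosted runner that can already reach an inference backend (the workflow file and steps are illustrative, not shipped with the repo):

# .github/workflows/benchmark.yml (illustrative sketch)
name: llm-benchmark
on: [push]
jobs:
  benchmark:
    runs-on: self-hosted              # assumes the runner can reach your backend
    steps:
      - uses: actions/checkout@v4
      - run: pip install -e .
      - run: llm-bench --limit 2 --ci-threshold 0.80   # fail the job if overall score < 80%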

Windows troubleshooting

If PowerShell says run.py exists but is not recognized, that is normal PowerShell command precedence. Use llm-bench --dry-run or python .\run.py --dry-run.

If python .\run.py --dry-run fails with ModuleNotFoundError: No module named 'yaml', your active python is not the repo environment or the package was not installed. From the repo root:

deactivate  # ignore if no environment is active
.\.venv\Scripts\Activate.ps1
python -c "import sys; print(sys.executable)"
python -m pip install -e .
llm-bench --dry-run

Task Categories

Category Tasks Scoring Method
math 10 numeric — extract first number, tolerance-based comparison
knowledge 18 exact / contains / numeric
coding 15 code_exec / pass_at_k — runs generated Python, looks for PASS in stdout
agentic 11 code_exec / json_schema / rubric_judge / workflow_trace
reasoning 12 contains / exact / fuzzy_match / word_count / llm_judge
writing 13 line_count / regex / word_count
summarization 10 contains / line_count / regex
instruction_following 15 exact / contains / word_count / contains_n / not_contains / ends_with
Total 104

Scoring Types Reference

Type Passes when
numeric Extracted number is within tolerance of answer/value
exact Stripped, lowercased response equals expected
contains Response contains expected substring (case-insensitive)
multi_contains Response satisfies all required substring groups
fuzzy_match Bidirectional: answer ⊆ response OR response ⊆ answer — handles valid paraphrases
contains_n Expected substring appears at least min_count times
not_contains None of the forbidden strings appear in the response
ends_with Last non-empty line's final word matches answer
word_count Response word count falls within [min, max]
regex Response matches pattern
json_keys Response parses as JSON and contains all keys
json_schema Response JSON satisfies lightweight object/array schema checks
line_count Non-empty line count equals count
code_exec Generated code block executes and prints PASS (needs --allow-code-exec)
logprob_choice Highest-probability one-token choice matches answer/value
workflow_trace JSON tool_calls, required args, and final or replayed state satisfy deterministic checks
pass_at_k Estimates pass@k over n/samples independent attempts using inner_type scoring
llm_judge Judge LLM scores the response and returns SCORE: N (needs judge.enabled: true)
rubric_judge Judge LLM scores weighted criteria and returns RUBRIC_SCORE: N
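
Most of these need only a single YAML key, but workflow_trace is worth a concrete illustration. The sketch below shows the general shape only; the field names under scoring (expected_calls, tool, args) are assumptions, so check tasks/README.md for the actual schema:

- id: lookup_weather_trace
  category: agentic
  prompt: "Use the available tools to find today's weather in Lisbon, then report it. Reply with a JSON tool_calls list."
  scoring:
    type: workflow_trace
    # illustrative field names, not the documented schema
    expected_calls:
      - tool: get_weather
        args:
          city: Lisbon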

Results

Results are saved to the results/ directory after every run:

results/
  results_20250118_143022.json   # full structured results
  results_20250118_143022.csv    # spreadsheet-friendly export
  results_20250118_143022.jsonl  # incremental crash-safe log (one line per task)
  arena_20250118_143022.json     # arena leaderboard + match history when --arena is used

Example console output:

 ✓  math             arithmetic_addition          score=1.0  1823 t/s  TTFT 42ms
 ~  math             percent_change               score=0.0   Got 12.5, expected 15.0 ±0
 ✓  coding           fizzbuzz_function            score=1.0  1651 t/s
 ✗  reasoning        syllogism_barbara            score=0.0  Missing 'all mortals'

Compare result files

Compare two completed or in-progress runs without starting a backend:

llm-bench --compare results/results_20250118_143022.jsonl results/results_20250119_091500.jsonl

--compare accepts the structured JSON files and the incremental JSONL files. It reports per-model score movement, composite-score movement, unmatched task counts, and the largest task-level deltas. Use --compare-top N to change how many task changes are shown.


Backend Comparison

Backend Discovery endpoint Chat endpoint Thinking tokens
LM Studio GET /v1/models POST /v1/chat/completions delta.reasoning_content
Ollama GET /api/tags POST /v1/chat/completions <think> tags / think: true
llama.cpp GET /v1/models POST /v1/chat/completions <think> tag stripping
vLLM GET /v1/models POST /v1/chat/completions OpenAI compatible
TGI GET /v1/models POST /v1/chat/completions OpenAI compatible
SGLang GET /v1/models POST /v1/chat/completions OpenAI compatible
TensorRT-LLM GET /v1/models POST /v1/chat/completions OpenAI compatible
KTransformers GET /v1/models POST /v1/chat/completions OpenAI compatible

Adding Tasks

Tasks are YAML files in tasks/. See tasks/README.md for the full schema.

Quick example:

- id: capital_germany
  category: knowledge
  prompt: "What is the capital of Germany? Answer with one word."
  scoring:
    type: exact
    answer: berlin

Few-shot examples

Add a few_shot: block to any task to inject conversation history before the model sees the actual prompt:

- id: my_task
  category: coding
  prompt: "Write a Python function called `square(n)` that returns n squared."
  few_shot:
    - user: "Write a Python function called `cube(n)` that returns n cubed."
      assistant: |
        def cube(n):
            return n ** 3
  scoring:
    type: code_exec
    test_code: |
      assert square(3) == 9
      print('PASS')

Task versioning

Add a metadata: block to any dict-format task file to version-stamp all its tasks:

metadata:
  version: 2

tasks:
  - id: my_task
    ...

The task_version and task_hash fields appear in every JSONL result record for audit trails. --resume only reuses a cached row when the model id, task id, task version, and task hash all match the current task definition.
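
As a rough sketch of what that looks like in the JSONL log (only model_id, task_version, and task_hash are field names confirmed above; the other keys, and the overall shape, are illustrative):

{"model_id": "qwen3:8b", "task_id": "my_task", "task_version": 2, "task_hash": "…", "score": 1.0}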

pass@k scoring

Run a task multiple times and estimate pass@k:

- id: code_challenge
  category: coding
  prompt: "Write a function called `solve(n)` ..."
  scoring:
    type: pass_at_k
    k: 3        # report pass@3
    n: 5        # optional: draw 5 samples for a better estimator (alias: samples)
    inner_type: code_exec
    test_code: |
      assert solve(5) == 42
      print('PASS')

If n/samples is omitted, the runner uses benchmark.runs_per_task, but never fewer than k.
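
For reference, the Chen et al. (2021) estimator is pass@k = 1 − C(n−c, k) / C(n, k), where n is the number of samples drawn, c is the number of samples that pass the inner_type check, and C is the binomial coefficient. Drawing n > k samples tightens the estimate without changing what is being measured.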

HF auto-config

By default, runs do not contact Hugging Face. To opt into fetching generation_config.json / config.json for model IDs that contain /, enable:

benchmark:
  hf_auto_config: true

Fetched settings are recorded in result JSON/JSONL as hf_generation_config.

Dataset-driven tasks

Dataset tasks can expand Hugging Face datasets into concrete prompts when the optional datasets extra is installed. Simple --dry-run validation does not expand datasets, avoiding network side effects during schema checks.

Remote dataset code is not trusted by default. If a dataset truly requires it, opt in per task:

dataset:
  name: "example/dataset"
  trust_remote_code: true
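
For orientation, a dataset-driven task otherwise looks like a normal task with a dataset: block in place of a fixed prompt. The sketch below is illustrative only; the dataset id and the expansion-control fields (split, limit) are assumptions, so consult tasks/README.md for the keys the loader actually supports:

- id: dataset_math_sample
  category: math
  dataset:
    name: "example/math-dataset"   # hypothetical dataset id
    split: test                    # assumed field name
    limit: 20                      # assumed field name
  scoring:
    type: numeric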

LLM-as-judge

For subjective tasks, enable the LLM judge in config.yaml:

judge:
  enabled: true
  model: null   # null = use first discovered model

Then add tasks using llm_judge scoring:

- id: explain_quantum
  category: writing
  prompt: "Explain quantum entanglement in two sentences for a teenager."
  scoring:
    type: llm_judge
    criteria: "Is the explanation accurate, concise, and appropriate for a teenager?"
    reference: "Quantum entanglement means two particles stay linked so measuring one instantly affects the other, no matter how far apart they are."

The judge uses a CoT-then-score protocol: it reasons step by step and then outputs SCORE: N (0–10), which is normalised to 0.0–1.0.
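
rubric_judge follows the same pattern but scores several weighted criteria and expects RUBRIC_SCORE: N from the judge. A sketch, with the caveat that the exact criteria list shape (name/weight keys) is an assumption rather than the documented schema:

- id: cover_letter
  category: writing
  prompt: "Write a three-paragraph cover letter for a junior data analyst role."
  scoring:
    type: rubric_judge
    criteria:            # illustrative shape; see tasks/README.md for the real schema
      - name: structure
        weight: 0.4
      - name: tone
        weight: 0.3
      - name: relevance
        weight: 0.3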

Run resumption

Resume an interrupted benchmark without re-running completed tasks:

llm-bench --resume

Or enable it permanently in config.yaml:

benchmark:
  resume: true

Composite score

The summary table now includes a Composite ★ row — a weighted average across categories where harder categories carry more weight:

Category Weight
coding 1.5
math 1.2
reasoning 1.2
knowledge 1.0
instruction_following 1.0
summarization 0.8
writing 0.8
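
Assuming a straightforward weighted mean over the categories listed above, a model scoring 0.80 on coding, 0.90 on math, 0.70 on reasoning, and 1.00 everywhere else would get (1.5·0.80 + 1.2·0.90 + 1.2·0.70 + 1.0 + 1.0 + 0.8 + 0.8) / 7.5 ≈ 0.90, a touch below its unweighted mean of about 0.91 because the harder categories pull it down.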

Changelog

Accuracy improvements (agent review)

Area Change
TTFT (reasoning models) Previously used the first content token; now uses min(t_first_content, t_first_reasoning) so models that emit reasoning tokens before content tokens report correct TTFT
TPS accuracy Returns None instead of a word-count proxy when the API does not report completion_tokens; prevents misleading cross-model comparisons
Reasoning token counting Reads completion_tokens_details.reasoning_tokens from the API response instead of counting stream chunks
Multi-run all-errors Metrics are None (not 0.0) when every run in a multi-run task fails
pass@k estimator Uses the unbiased Chen et al. (2021) estimator instead of a binary any-pass check; allows n > k sampling for better statistical estimates
Empty code block code_exec returns score 0.0 with an explicit message when no Python block is found, instead of executing an empty string
LLM judge prompt injection Response is wrapped in <response> XML delimiters and truncated to 4,000 chars; reference answer truncated to 2,000 chars
JSONL model field Now records model_id (the actual model name) rather than the backend name
Composite score categories Lookup is now case-insensitive, preventing uncategorised tasks from being silently weighted 1.0
line_count key Scorer now accepts both value and count YAML keys (some tasks used count)

Bug fixes (live-run validation)

Area Change
Accent-insensitive contains _score_contains now normalises both needle and haystack with NFKD Unicode decomposition before comparison, so e.g. "brasilia" matches "Brasília"
Sentence-based writing tasks Tasks that ask for "N sentences" now instruct the model to output one sentence per line, making line_count scoring reliable
Accent-insensitive contains_n / ends_with / fuzzy_match All remaining string-comparison scorers updated to use NFKD normalisation, consistent with contains; e.g. "cafe" matches "café" in all four scorer types
pass_at_k n > k sampling score_pass_at_k now reads k from task["scoring"]["k"] (falling back to len(run_results)), enabling over-sampling (n > k) with the Chen et al. 2021 unbiased estimator
Version-safe resume --resume now hydrates cached rows into reports and only skips results with matching model id, task id, task version, and task hash
HF auto-config opt-in Hugging Face generation-config fetches are disabled by default and recorded in result metadata when enabled
Safer dataset expansion --dry-run validates dataset task schema without expansion; Hugging Face trust_remote_code defaults to false
Robust json_keys parsing Scoring now scans for valid JSON objects instead of using a greedy brace regex

Contributing

See CONTRIBUTING.md for how to add tasks, backends, and scoring types.


License

MIT
