ReasonErrorBench

A benchmark and taxonomy for analyzing reasoning errors in Large Language Models.

arXiv License: MIT

Key Findings

  1. Incorrect responses are 53% longer than correct ones (Cohen's d=0.91, p<0.001)
  2. Error types are domain-specific: Math→Computational, Commonsense→Knowledge, Code→Strategy
  3. Scaling helps but doesn't solve everything: 70B models eliminate some error types but increase others

Quick Stats

| Metric | Value |
| --- | --- |
| Total Problems | 200 |
| Model Responses | 400 |
| Annotated Errors | 52 |
| Error Categories | 6 |
| Error Types | 15 |

Installation

```bash
pip install -r requirements.txt
```

Usage

Run Full Pipeline

```bash
# 1. Generate problems
python src/generate_problems.py --output data/problems.json

# 2. Collect responses (requires API key)
export GROQ_API_KEY=your_key_here
python src/collect_data.py --problems data/problems.json --output data/traces_raw.json

# 3. Annotate errors
python src/annotate.py --input data/traces_raw.json --output data/annotations_human.json --only-incorrect

# 4. Analyze results
python src/analyze.py --traces data/traces_raw.json --annotations data/annotations_human.json --output results/

# 5. Generate figures
python src/make_figures.py --results results/analysis_results.json --output figures/
```
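The five steps can also be chained from a single script. The following is a minimal sketch, not part of the repository: the helper names `build_commands` and `run_pipeline` are hypothetical, and the commands and paths are copied from the steps above. It defaults to a dry run that only prints each command.

```python
import os
import subprocess
import sys

# Script paths and flags copied from the README steps above.
STEPS = [
    ["src/generate_problems.py", "--output", "data/problems.json"],
    ["src/collect_data.py", "--problems", "data/problems.json",
     "--output", "data/traces_raw.json"],
    ["src/annotate.py", "--input", "data/traces_raw.json",
     "--output", "data/annotations_human.json", "--only-incorrect"],
    ["src/analyze.py", "--traces", "data/traces_raw.json",
     "--annotations", "data/annotations_human.json", "--output", "results/"],
    ["src/make_figures.py", "--results", "results/analysis_results.json",
     "--output", "figures/"],
]


def build_commands(python=sys.executable):
    """Return the full command line for each step, in pipeline order."""
    return [[python, *step] for step in STEPS]


def run_pipeline(dry_run=True):
    """Print each command; when dry_run is False, also execute it."""
    if not dry_run and "GROQ_API_KEY" not in os.environ:
        # Step 2 needs the API key, so fail before any step runs.
        raise RuntimeError("GROQ_API_KEY is not set")
    for cmd in build_commands():
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_pipeline(dry_run=True)
```

Calling `run_pipeline(dry_run=False)` from the repository root would execute the steps in order and stop at the first one that exits with an error.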

Error Taxonomy

| Category | Code | Description |
| --- | --- | --- |
| Computational | COMP | Math/calculation errors |
| Knowledge | KNOW | Wrong facts or formulas |
| Logical | LOGIC | Invalid reasoning |
| Comprehension | COMPR | Misunderstanding problems |
| Strategy | STRAT | Wrong approach |
| Output | OUT | Formatting issues |
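When working with the category codes in your own scripts, it can help to validate them against the taxonomy. Below is a minimal sketch using the six codes listed above. The names `ErrorCategory`, `DESCRIPTIONS`, and `parse_category` are hypothetical, and the layout of the annotation files is not documented here, so this does not assume any particular file schema.

```python
from enum import Enum


class ErrorCategory(Enum):
    """The six top-level categories, keyed by their short codes."""
    COMP = "Computational"
    KNOW = "Knowledge"
    LOGIC = "Logical"
    COMPR = "Comprehension"
    STRAT = "Strategy"
    OUT = "Output"


# Descriptions copied from the taxonomy above.
DESCRIPTIONS = {
    ErrorCategory.COMP: "Math/calculation errors",
    ErrorCategory.KNOW: "Wrong facts or formulas",
    ErrorCategory.LOGIC: "Invalid reasoning",
    ErrorCategory.COMPR: "Misunderstanding problems",
    ErrorCategory.STRAT: "Wrong approach",
    ErrorCategory.OUT: "Formatting issues",
}


def parse_category(code: str) -> ErrorCategory:
    """Map a code such as 'comp' or ' STRAT ' to its category."""
    try:
        return ErrorCategory[code.strip().upper()]
    except KeyError:
        raise ValueError(f"Unknown error category code: {code!r}") from None


if __name__ == "__main__":
    category = parse_category("strat")
    print(category.value, "-", DESCRIPTIONS[category])
```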

Results

Model Accuracy by Domain

| Model | Math | Logic | Commonsense | Code | Overall |
| --- | --- | --- | --- | --- | --- |
| LLaMA-8B | 50% | 30% | 22% | 2% | 26% |
| LLaMA-70B | 68% | 42% | 22% | 6% | 34.5% |
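The Overall column matches the unweighted mean of the four domain accuracies. The sketch below reproduces it under the assumption that each domain contributes the same number of problems; that split is not stated here, but the reported figures are consistent with it. The names `ACCURACY` and `overall_accuracy` are illustrative.

```python
# Per-domain accuracies (percent), copied from the table above.
ACCURACY = {
    "LLaMA-8B": {"Math": 50, "Logic": 30, "Commonsense": 22, "Code": 2},
    "LLaMA-70B": {"Math": 68, "Logic": 42, "Commonsense": 22, "Code": 6},
}


def overall_accuracy(per_domain):
    """Unweighted mean across domains (assumes equal-sized domains)."""
    return sum(per_domain.values()) / len(per_domain)


if __name__ == "__main__":
    for model, scores in ACCURACY.items():
        print(f"{model}: {overall_accuracy(scores):.1f}%")
```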

Response Length Effect

Incorrect responses average 353 words versus 231 for correct ones, about 53% longer (p<0.001).
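The two length statistics can be recomputed from per-response word counts. The sketch below shows the relative increase for the reported means, plus one common form of Cohen's d based on the pooled standard deviation. Which variant the analysis script uses is not stated here, so treat the pooled form as an assumption. The function names are illustrative and the sample lists in the usage lines are made-up toy values, not benchmark data.

```python
import math
from statistics import mean, stdev


def relative_increase(incorrect_mean, correct_mean):
    """Fractional increase of the incorrect mean over the correct mean."""
    return (incorrect_mean - correct_mean) / correct_mean


def cohens_d(x, y):
    """Cohen's d with a pooled standard deviation (assumed variant)."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * stdev(x) ** 2 + (ny - 1) * stdev(y) ** 2) / (nx + ny - 2)
    return (mean(x) - mean(y)) / math.sqrt(pooled_var)


if __name__ == "__main__":
    # Reported means: 353 words (incorrect) vs 231 words (correct).
    print(f"{relative_increase(353, 231):.0%} longer")

    # Toy word counts for illustration only.
    print(cohens_d([2, 4, 6], [1, 3, 5]))
```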

Citation

@article{reasonerrorbench2026,
  title={ReasonErrorBench: A Taxonomy-Driven Analysis of Reasoning Errors in Large Language Models},
  author={Tate Lyman},
  journal={arXiv preprint arXiv:2601.XXXXX},
  year={2026}
}

License

MIT License - see LICENSE file.