This is the official repo for GraphRAG-Bench: Challenging Domain-Specific Reasoning for Evaluating Graph Retrieval-Augmented Generation
- [2025-06-03] The official leaderboard is released here: GraphRAG-Bench leaderboard.
- [2025-06-03] We have released the paper GraphRAG-Bench.
- [2025-06-03] We have released the dataset GraphRAG-Bench.

The official leaderboard can be found here: GraphRAG-Bench leaderboard
GraphRAG-Bench contains five question types spanning 16 disciplines and a corpus of 7 million words drawn from 20 computer science textbooks. The structure of the dataset is shown below:
We define five question types:
Question/
├── FB.jsonl # Fill-in-blank
├── MC.jsonl # Multiple-choice
├── MS.jsonl # Multi-select
├── OE.jsonl # Open-ended
└── TF.jsonl # True-or-false
Question example (Open-ended)
{
"Question": "Why is it necessary for the server to use a special initial sequence number in the SYNACK?",
"Level-1 Topic": "Computer networks",
"Level-2 Topic": "Network protocols and architectures",
"Rationale": "The server uses a special initial sequence number (ISN) in the SYN-ACK to ensure unique connection identification and proper packet sequencing. This also mitigates SYN flood attacks by making it harder for attackers to predict ISNs and hijack sessions.",
"Answer": "In order to defend itself against SYN FLOOD attack."
}
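Each question file is in JSON Lines format (one JSON object per line, with fields as in the example above). A minimal loading sketch; the path in the usage comment is illustrative:

```python
import json

def load_questions(path):
    """Load a GraphRAG-Bench question file (one JSON object per line)."""
    questions = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                questions.append(json.loads(line))
    return questions

# Illustrative usage (the path is an assumption, not a fixed repo layout):
# questions = load_questions("Question/OE.jsonl")
# print(questions[0]["Question"])
```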
We parsed the content of each textbook. If you only need the text content, use the .md files (recommended). If you need metadata and structure, use the _structured.json files.
Corpus/
├── Algorithms/ #Textbook name
│ ├── Algorithms.md
│ └── Algorithms_structured.json
├── ...
├── Database system concepts/...
└── Speech and Language Processing/...
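If you only need the plain text, the .md file in each textbook directory is enough. A small sketch that collects one markdown file per textbook directory (the _structured.json schema is not specified here, so this reads only the .md files):

```python
from pathlib import Path

def load_corpus_text(corpus_dir):
    """Map each textbook directory name to the text of its .md file."""
    texts = {}
    for book_dir in Path(corpus_dir).iterdir():
        if book_dir.is_dir():
            md_files = sorted(book_dir.glob("*.md"))
            if md_files:  # take the textbook's markdown file
                texts[book_dir.name] = md_files[0].read_text(encoding="utf-8")
    return texts
```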
We provide evaluator.py for evaluation. Place your output files in the following structure:
data_name/
├── question/
│ ├── FB.jsonl
│ ├── MC.jsonl
│ ├── MS.jsonl
│ ├── OE.jsonl
│ └── TF.jsonl
├── output/
│ ├── GraphRAG-Bench_FB
│ ├── GraphRAG-Bench_MC
│ ├── GraphRAG-Bench_MS
│ ├── GraphRAG-Bench_OE
│ └── GraphRAG-Bench_TF
└── results_tmp.json (generated after running evaluator.py)
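A quick sanity check before running evaluator.py can save a failed run. This sketch only verifies the directory layout above; it makes no assumptions about evaluator.py's interface:

```python
from pathlib import Path

QUESTION_TYPES = ["FB", "MC", "MS", "OE", "TF"]

def check_layout(data_name):
    """Return a list of expected files that are missing from data_name/."""
    root = Path(data_name)
    missing = []
    for qt in QUESTION_TYPES:
        # question files must exist as regular files
        if not (root / "question" / f"{qt}.jsonl").is_file():
            missing.append(f"question/{qt}.jsonl")
        # model outputs (file or directory, depending on your pipeline)
        if not (root / "output" / f"GraphRAG-Bench_{qt}").exists():
            missing.append(f"output/GraphRAG-Bench_{qt}")
    return missing
```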
The reasoning score (R) evaluates the semantic correspondence and reasoning consistency between a model's generated rationale and the gold rationale. The AR metric determines whether the model provides correct reasoning when it answers the question accurately.

Accuracy evaluates whether the generated answers are consistent with the ground truth.
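For intuition, a simplified exact-match accuracy over predicted and reference answers might look like the following; the official evaluator.py may apply type-specific matching (e.g. for multi-select), so this is an illustration, not the benchmark's scoring code:

```python
def accuracy(predictions, references):
    """Exact-match accuracy after light normalization (simplified;
    the official evaluator may use type-specific matching rules)."""
    assert len(predictions) == len(references)
    correct = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return correct / len(references) if references else 0.0

# accuracy(["B", "true"], ["b", "False"]) -> 0.5
```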

If you find this repository helpful, please consider citing our paper:
@article{xiao2025graphrag,
  title={GraphRAG-Bench: Challenging Domain-Specific Reasoning for Evaluating Graph Retrieval-Augmented Generation},
  author={Xiao, Yilin and Dong, Junnan and Zhou, Chuang and Dong, Su and Zhang, Qianwen and Yin, Di and Sun, Xing and Huang, Xiao},
  journal={arXiv preprint arXiv:2506.02404},
  year={2025}
}



