Benchmarking Chinese LLMs with Localized Professional Qualifications
To access QualBench, copy and run the following code:

    from datasets import load_dataset
    dataset = load_dataset("mengze-hong/QualBench")

Alternatively, you can download the data directly from `./data`.
Qualification examinations in China are rigorous, standardized tests that certify professionals across diverse fields, ensuring they meet both industry and regulatory standards. Serving as critical gateways to professional practice, they provide a trusted measure of domain expertise in real-world contexts. We introduce QualBench, the first multi-domain Chinese QA benchmark built to evaluate LLM performance in localized, professional settings. Featuring 17,316 expert-verified questions from 26 national qualification exams, QualBench bridges the gap in current benchmarks by offering broad domain coverage and capturing the unique knowledge demands of China’s professional landscape.
📅 August 21, 2025: QualBench has been accepted to the EMNLP 2025 Main Conference!
| Dataset | Source Qualification Exam | Size | Best Model | Vertical Domain | Localization | Explainable |
|---|---|---|---|---|---|---|
| GAOKAO-Bench | Chinese College Entrance Examination (Gaokao) | 2,811 | GPT-4 | ❌ | ✅ | ❌ |
| CFLUE | Finance Qualification Exams | 38,636 | Qwen-72B | Finance | ❌ | ✅ |
| M3KE | Entrance Exams of Different Education Levels | 20,477 | GPT-3.5 | ❌ | ✅ | ❌ |
| FinEval | Finance Qualification Exams | 8,351 | GPT-4o | Finance | ❌ | ❌ |
| CMExam | Chinese National Medical Licensing Exam | 68,119 | GPT-4 | Medical | ❌ | ❌ |
| LogiQA | Civil Servants Exams of China | 8,678 | RoBERTa | ❌ | ✅ | ✅ |
| QualBench (ours) | Multiple Sources | 17,316 | Qwen-7B | Multiple | ✅ | ✅ |
Evaluate with batch inference on QualBench with the following command:

    python ./src/test_QualBench.py \
        --model baichuan-inc/Baichuan-13B-Chat \
        --batch_size 32 \
        --output_path res_baichuan13b.jsonl
    # use --model to specify the model path or name (Hugging Face repo or local path)
    # use --batch_size to control the number of samples processed per inference batch
    # use --output_path to set the output JSONL file path

Warning: Batch inference can be highly resource-intensive. For optimal performance, we recommend using an H20 GPU and keeping the batch size at 64 or lower.
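Once a run has finished, the JSONL output can be scored with a few lines of Python. The following is only a minimal sketch: it assumes each line of the output file is a JSON object with hypothetical `prediction` and `answer` fields. The actual schema written by `./src/test_QualBench.py` may differ, so adjust the key names to match your output.

```python
import json
from pathlib import Path


def jsonl_accuracy(path, pred_key="prediction", gold_key="answer"):
    """Return exact-match accuracy over a JSONL results file.

    pred_key and gold_key are hypothetical field names (assumptions),
    not confirmed by the QualBench scripts.
    """
    total = 0
    correct = 0
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            total += 1
            # Normalize whitespace and case before comparing labels.
            pred = str(record.get(pred_key, "")).strip().upper()
            gold = str(record.get(gold_key, "")).strip().upper()
            # A missing or empty prediction always counts as wrong.
            if pred and pred == gold:
                correct += 1
    return correct / total if total else 0.0
```

For example, `jsonl_accuracy("res_baichuan13b.jsonl")` would return the fraction of records whose predicted label matches the reference label, provided the file uses those field names.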
Additionally, you can:
- Fine-tune your own models on our pre-processed datasets. See the example in `./src/finetune_FinLLM.py`.
- Run evaluations on existing models (both local and API-based). Examples are available in `./src/example`.
- Conduct ablation studies on key LLM concerns, such as:
  - Detecting data contamination (`./src/example/test_shuffled.py`)
  - Evaluating prompt engineering strategies (`./src/example/test_prompt.py`)
  - Experimenting with LLM crowdsourcing techniques (`./src/example/aggregation`)
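To give a flavor of two of the ablations listed above, here is a small illustrative sketch. It is not the code in `./src/example`; it assumes that a contamination probe can be built by shuffling the option order of a single-answer question, and that crowdsourcing-style aggregation can be approximated by a majority vote over several model predictions. The function names and the `{"A": ..., "B": ...}` option format are hypothetical, and multiple-answer questions are not handled.

```python
import random
from collections import Counter


def shuffle_options(options, answer, seed=None):
    """Shuffle option texts across the same labels and track the answer.

    options: dict mapping label -> option text (hypothetical format)
    answer:  label of the correct option before shuffling
    Returns (new_options, new_answer).
    """
    rng = random.Random(seed)
    labels = sorted(options)
    order = list(range(len(labels)))
    rng.shuffle(order)
    # The option at new position i is the one that was at old position order[i].
    new_options = {labels[i]: options[labels[j]] for i, j in enumerate(order)}
    new_answer = labels[order.index(labels.index(answer))]
    return new_options, new_answer


def majority_vote(predictions):
    """Return the most frequent prediction; ties go to the earliest one."""
    counts = Counter(predictions)
    top = max(counts.values())  # raises ValueError on an empty list
    for p in predictions:
        if counts[p] == top:
            return p


opts = {"A": "option one", "B": "option two", "C": "option three"}
shuffled, new_answer = shuffle_options(opts, "B", seed=0)
print(shuffled[new_answer])  # still the originally correct option text
print(majority_vote(["A", "B", "A", "C"]))
```

If a model's accuracy drops sharply once options are shuffled, that may hint that it memorized answer positions rather than content, although other explanations are possible.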
We warmly welcome collaboration from the broader NLP, Machine Learning, and Education communities. Whether it’s improving our methods, expanding the dataset, or exploring new evaluation directions, we’re eager to work together to push this project further.
For any discussions or inquiries, please reach out to Mengze Hong at mengze.hong@connect.polyu.hk.
If you find our work helpful, please cite it as follows:
@inproceedings{hong2025qualbench,
title={QualBench: Benchmarking Chinese {LLM}s with Localized Professional Qualifications for Vertical Domain Evaluation},
author={Mengze Hong and Wailing Ng and Chen Jason Zhang and Di Jiang},
booktitle={The 2025 Conference on Empirical Methods in Natural Language Processing},
year={2025},
}