- We propose AirQA, a human-annotated multi-modal, multi-task, multi-paper QA dataset with function-based, instance-specific evaluation. To the best of our knowledge, AirQA is the first dataset to encompass multiple question types and the first to bring function-based evaluation into the QA domain, enabling convenient and systematic assessment of research capabilities.
- We introduce ExTrActor, a document-based framework that synthesizes QA examples, interaction trajectories, and instruction data, providing an empirical method for improving an agent's multi-turn tool-use ability without manual annotation.
- We evaluate various LLMs and different QA baselines on our AirQA dataset, demonstrating the quality of our dataset and revealing the insufficiency of current methods. Extensive instruction-tuning experiments show that small models benefit significantly from our synthetic instruction data, validating the effectiveness of our proposed ExTrActor framework.
- Create the conda environment and install dependencies:

```bash
conda create -n airqa python=3.10
conda activate airqa
pip install -r requirements.txt
```
- (Optional) Download related files:

We have included the full QA data in our repository at `data/test_data.jsonl`, so it is sufficient to download only the paper-related metadata, processed data, and PDFs as required. Note that the evaluation itself does not require these paper-related files, but you can use them to help answer the questions.

```bash
python utils/download_utils.py --datatype metadata processed_data
# Add 'papers' if you want to run ExTrActor
# Note that this requires additional disk space (~60G)
```
You can also download these files manually from our official Hugging Face repository and organize them into the following folder structure:
```
AirQA
|── data/
|   |── metadata/
|   |   |── .gitkeep
|   |   |── 000ab6db-4b65-5dc0-8393-fbc2c05843c8.json
|   |   └── ...                                        # more metadata dicts
|   |── papers/
|   |   |── .gitkeep
|   |   |── acl2016/
|   |   |   └── 16c3a7ad-d638-5ebf-a72a-bd58f06c16d7.pdf
|   |   |── acl2019/
|   |   |   └── c7563d97-695f-5c77-8021-334bf2ff9ddb.pdf
|   |   |── acl2023/
|   |   |   |── 001ab93b-7665-5d56-a28e-eac95d2a9d7e.pdf
|   |   |   └── ...                                    # more .pdf published in ACL 2023
|   |   └── ...                                        # other sub-folders of paper collections
|   └── processed_data/
|       |── .gitkeep
|       |── 000ab6db-4b65-5dc0-8393-fbc2c05843c8.json  # cached data for PDF parsing
|       └── ...                                        # more cached data for PDFs
└── ...                                                # other folders and files
```
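Since the `metadata/` and `processed_data/` folders pair files by UUID filename, you can quickly check which metadata entries still lack a parsing cache by comparing filename stems. A minimal sketch (the helper name `missing_processed` is ours, not part of the repo):

```python
from pathlib import Path

def missing_processed(root: str = "data") -> list[str]:
    """Return UUIDs that have a metadata JSON but no processed_data cache yet."""
    base = Path(root)
    meta = {p.stem for p in (base / "metadata").glob("*.json")}
    cached = {p.stem for p in (base / "processed_data").glob("*.json")}
    return sorted(meta - cached)

# Usage: missing_processed("data") -> list of UUIDs still to be processed
```

Globbing a non-existent directory simply yields nothing, so the check is safe to run before any download has completed.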
- Evaluate your answers on AirQA:

```bash
export OPENAI_API_KEY="sk-xxxxxxxxxxxxxxxxxxxx"
export OPENAI_BASE_URL="https://api.openai.com/v1"
python utils/eval_utils.py \
    --gold data/test_data.jsonl \
    --dir results/example
```
See the Evaluation document for more details.
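The gold file `data/test_data.jsonl` is in JSON Lines format, one QA example per line, so your own answering pipeline can load it with a few lines of Python. A minimal sketch (the function name `load_examples` is ours; field names inside each example are documented in `documents/data_format.md`):

```python
import json

def load_examples(path: str) -> list[dict]:
    """Read a JSON Lines file: one QA example (a JSON object) per line."""
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                examples.append(json.loads(line))
    return examples

# Usage: examples = load_examples("data/test_data.jsonl")
```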
- (Optional) Use ExTrActor to automatically generate examples. Running ExTrActor requires further configuration and preparation; please refer to the ExTrActor document for more details.
Fine-grained documents for this project are located in the documents/ folder. Here is the checklist:
| Documents | Description |
|---|---|
| 📗 documents/data_format.md | Example and paper data format for AirQA. |
| 📕 documents/evaluation.md | Evaluation functions, arguments and scripts for AirQA. |
| 📘 documents/extractor.md | Details on automatically generating examples with ExTrActor. |
If you find this dataset useful, please cite our work:
@misc{huang2025airqacomprehensiveqadataset,
title={AirQA: A Comprehensive QA Dataset for AI Research with Instance-Level Evaluation},
author={Tiancheng Huang and Ruisheng Cao and Yuxin Zhang and Zhangyi Kang and Zijian Wang and Chenrun Wang and Yijie Luo and Hang Zheng and Lirong Qian and Lu Chen and Kai Yu},
year={2025},
eprint={2509.16952},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2509.16952},
}