- We propose AirQA, a human-annotated multi-modal, multi-task, multi-paper QA dataset with function-based, instance-specific evaluation. To the best of our knowledge, AirQA is the first dataset to encompass multiple question types and the first to bring function-based evaluation into the QA domain, enabling convenient and systematic assessment of research capabilities.
- We introduce ExTrActor, a document-based framework that synthesizes QA examples, interaction trajectories, and instruction data, providing an empirical method for improving an agent's multi-turn tool-use ability without manual annotation.
- We evaluate various LLMs and different QA baselines on our AirQA dataset, demonstrating the quality of our dataset and revealing the insufficiency of current methods. Extensive instruction-tuning experiments show that small models benefit significantly from our synthetic instruction data, validating the effectiveness of our proposed ExTrActor framework.
- Create the conda environment and install dependencies:

```bash
conda create -n airqa python=3.10
conda activate airqa
pip install -r requirements.txt
```
- (Optional) Download related files:

We have included the full QA data in our repository at `data/test_data.jsonl`, so it is sufficient to download only the paper-related metadata, processed data, and PDFs as required. Note that the evaluation itself does not require these paper-related files, but you can use them to help answer the questions.

```bash
python utils/download_utils.py --datatype metadata processed_data
# Add 'papers' if you want to run ExTrActor
# Note that this requires additional disk space (~60G)
```
You can also download these files manually from our official Hugging Face repository and organize them into the following folder structure:
```
AirQA
|── data/
|   |── metadata/
|   |   |── .gitkeep
|   |   |── 000ab6db-4b65-5dc0-8393-fbc2c05843c8.json
|   |   └── ...                                        # more metadata dicts
|   |── papers/
|   |   |── .gitkeep
|   |   |── acl2016/
|   |   |   └── 16c3a7ad-d638-5ebf-a72a-bd58f06c16d7.pdf
|   |   |── acl2019/
|   |   |   └── c7563d97-695f-5c77-8021-334bf2ff9ddb.pdf
|   |   |── acl2023/
|   |   |   |── 001ab93b-7665-5d56-a28e-eac95d2a9d7e.pdf
|   |   |   └── ...                                    # more .pdf published in ACL 2023
|   |   └── ...                                        # other sub-folders of paper collections
|   └── processed_data/
|       |── .gitkeep
|       |── 000ab6db-4b65-5dc0-8393-fbc2c05843c8.json  # cached data for PDF parsing
|       └── ...                                        # more cached data for PDFs
└── ...                                                # other folders and files
```
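Since the `metadata/` and `processed_data/` folders pair files by UUID filename, you can quickly check which metadata entries still lack a parsing cache by comparing filename stems. A minimal sketch (the helper name `missing_processed` is ours, not part of the repo):

```python
from pathlib import Path

def missing_processed(root: str = "data") -> list[str]:
    """Return UUIDs that have a metadata JSON but no processed_data cache yet."""
    base = Path(root)
    meta = {p.stem for p in (base / "metadata").glob("*.json")}
    cached = {p.stem for p in (base / "processed_data").glob("*.json")}
    return sorted(meta - cached)

# Usage: missing_processed("data") -> list of UUIDs still to be processed
```

Globbing a non-existent directory simply yields nothing, so the check is safe to run before any download has completed.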
- Evaluate your answers on AirQA:

```bash
export OPENAI_API_KEY="sk-xxxxxxxxxxxxxxxxxxxx"
export OPENAI_BASE_URL="https://api.openai.com/v1"
python utils/eval_utils.py \
    --gold data/test_data.jsonl \
    --dir results/example
```
See the Evaluation document for more details.
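The gold file `data/test_data.jsonl` is in JSON Lines format, one QA example per line, so your own answering pipeline can load it with a few lines of Python. A minimal sketch (the function name `load_examples` is ours; field names inside each example are documented in `documents/data_format.md`):

```python
import json

def load_examples(path: str) -> list[dict]:
    """Read a JSON Lines file: one QA example (a JSON object) per line."""
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                examples.append(json.loads(line))
    return examples

# Usage: examples = load_examples("data/test_data.jsonl")
```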
- (Optional) Use ExTrActor to automatically generate examples. Running ExTrActor requires further configuration and preparation; please refer to the ExTrActor document for more details.
Fine-grained documents for this project are located in the documents/ folder. Here is the checklist:
| Documents | Description |
|---|---|
| 📗 documents/data_format.md | Example and paper data format for AirQA. |
| 📕 documents/evaluation.md | Evaluation functions, arguments and scripts for AirQA. |
| 📘 documents/extractor.md | Details on automatically generating examples with ExTrActor. |
If you find this dataset useful, please cite our work:
@misc{huang2025airqacomprehensiveqadataset,
title={AirQA: A Comprehensive QA Dataset for AI Research with Instance-Level Evaluation},
author={Tiancheng Huang and Ruisheng Cao and Yuxin Zhang and Zhangyi Kang and Zijian Wang and Chenrun Wang and Yijie Luo and Hang Zheng and Lirong Qian and Lu Chen and Kai Yu},
year={2025},
eprint={2509.16952},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2509.16952},
}