Code and data for the EMNLP 2022 paper *Towards Robust Numerical Question Answering: Diagnosing Numerical Capabilities of NLP Systems*.
The DNC (Diagnosing Numerical Capability) framework is proposed to probe the robustness of systems on Question Answering datasets that require numerical reasoning capabilities.
Four numerical capabilities, stemming from the two solving stages of numerical QA questions, are highlighted. Accordingly, eight perturbations are designed to probe these capabilities. Since these perturbations are trivial to humans, they are not expected to affect system performance significantly.
Empirical results show that current systems are error-prone on the perturbed test sets ("Attack") and still suffer non-trivial performance drops even when trained on the perturbed training sets ("Defense").
- For all datasets, we refer users to the original sources for the dataset files.
| Dataset | Source | Original Paper | Comment |
|---|---|---|---|
| ASDiv-a | https://github.com/LYH-YF/MWPToolkit | (Miao et al., 2020) | |
| DROP | https://allenai.org/data/drop | (Dua et al., 2019) | DROP-num filtered by us |
| TATQA | https://nextplusplus.github.io/TAT-QA/ | (Zhu et al., 2022) | TATQA-a filtered by us |
- We provide the filtered `DROP-num`, `TATQA-a`, and the manually curated `LogicAttack` challenge set for `ASDiv-a` in `data/dataset/`.
```
data
│
├─dataset
│  ├─asdiv-a-manual
│  │      testset.json
│  │      trainset.json
│  │      validset.json
│  │
│  ├─drop-num
│  │      testset.json
│  │      trainset.json
│  │      validset.json
│  │
│  └─tatqa-a
│          testset.json
│          trainset.json
│          validset.json
│
├─number_tokenizer
└─perturbation
```
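Each split is a plain JSON file, so it can be loaded with standard tooling. The sketch below assumes only the file layout shown above; `load_split` is a hypothetical helper, and the fields inside each file are dataset-specific and not shown here.

```python
import json
from pathlib import Path


def load_split(root, dataset, split):
    """Load one split ("train", "valid", or "test") of a dataset.

    `root` points at the `data` directory; the file layout follows the
    tree above, e.g. data/dataset/drop-num/testset.json.
    """
    path = Path(root) / "dataset" / dataset / f"{split}set.json"
    with path.open(encoding="utf-8") as f:
        return json.load(f)


# e.g. load_split("data", "asdiv-a-manual", "test")
```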
- We also provide the perturbing scripts we used to create the Attack and Defense datasets in `data/perturbation/`, which can be called with

```shell
python -m data.perturbation.asdiv_a.asdiv_a_auto
python -m data.perturbation.drop_num.drop_num_auto
python -m data.perturbation.tatqa.tatqa_auto
```

- For `T5` and `BART`, we provide the scripts for experiments with pytorch-lightning.
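To give a flavour of what a numerical perturbation looks like, here is a toy sketch; it is an illustration only, not the repository's actual scripts, which implement the paper's eight perturbations.

```python
import random
import re


def perturb_numbers(question, seed=0):
    """Toy perturbation: swap every number in the question for a fresh
    random integer, leaving the surrounding text intact. A real
    perturbation must also update the gold answer accordingly.
    """
    rng = random.Random(seed)
    return re.sub(r"\d+", lambda m: str(rng.randint(1, 100)), question)


perturbed = perturb_numbers("Tom has 3 apples and buys 2 more. How many does he have?")
```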
```
code
│
├─Gen_DROP
│      config.py
│      drop_dataset.py
│      drop_model.py
│      exp.py
│
└─Gen_MWP
        asdiv_dataset.py
        asdiv_model.py
        config.py
        exp.py
```
- They can be called with

```shell
export CUDA_VISIBLE_DEVICES=0  # or more GPUs; lightning handles them automatically
export SEED=<your_seed>
export MODEL_NAME=bart  # or t5
export DATASET=asdiv-a  # or drop-num / tatqa-a
export SETTING=atk  # or def
export EPOCH=<your_epoch_number>
python -m Gen_MWP.exp \
    --seed ${SEED} \
    --model_name ${MODEL_NAME} \
    --root_dataset_name ${DATASET} \
    --setting_name ${SETTING} \
    --max_epoch ${EPOCH}
```
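For multi-seed runs, a small wrapper loop can be used. The sketch below is a dry run that only prints the commands (drop `echo` to execute them); the seed values and epoch count are illustrative, not values from the paper.

```shell
# Print one training command per seed (dry run; remove `echo` to execute).
CMDS=""
for SEED in 42 43 44; do
  CMD="python -m Gen_MWP.exp --seed ${SEED} --model_name bart --root_dataset_name asdiv-a --setting_name atk --max_epoch 50"
  echo "${CMD}"
  CMDS="${CMDS} ${CMD}"
done
```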
- For `GPT2`, `Graph2Tree`, and `TagOps`, please refer to their existing implementations.
| Model | Source | Original Paper |
|---|---|---|
| GPT2 | https://github.com/LYH-YF/MWPToolkit | (Radford et al., 2019) |
| Graph2Tree | https://github.com/LYH-YF/MWPToolkit | (Zhang et al., 2020) |
| TagOps | https://github.com/NExTplusplus/TAT-QA | (Zhu et al., 2022) |
We assume conda is used as the environment management tool.

```shell
conda create -n dnc python=3.9
conda activate dnc
pip install torch --extra-index-url https://download.pytorch.org/whl/cu116
pip install -U mwptoolkit pytorch-lightning
pip install -U -r requirements.txt
```
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.
If you find our work useful for your research, please consider citing:

```bibtex
@inproceedings{xu-etal-2022-towards-robust,
    title = "Towards Robust Numerical Question Answering: Diagnosing Numerical Capabilities of {NLP} Systems",
    author = "Xu, Jialiang and
      Zhou, Mengyu and
      He, Xinyi and
      Han, Shi and
      Zhang, Dongmei",
    booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.emnlp-main.542",
    pages = "7950--7966",
    abstract = "Numerical Question Answering is the task of answering questions that require numerical capabilities. Previous works introduce general adversarial attacks to Numerical Question Answering, while not systematically exploring numerical capabilities specific to the topic. In this paper, we propose to conduct numerical capability diagnosis on a series of Numerical Question Answering systems and datasets. A series of numerical capabilities are highlighted, and corresponding dataset perturbations are designed. Empirical results indicate that existing systems are severely challenged by these perturbations. E.g., Graph2Tree experienced a 53.83{\%} absolute accuracy drop against the {``}Extra{''} perturbation on ASDiv-a, and BART experienced 13.80{\%} accuracy drop against the {``}Language{''} perturbation on the numerical subset of DROP. As a counteracting approach, we also investigate the effectiveness of applying perturbations as data augmentation to relieve systems{'} lack of robust numerical capabilities. With experiment analysis and empirical studies, it is demonstrated that Numerical Question Answering with robust numerical capabilities is still to a large extent an open question. We discuss future directions of Numerical Question Answering and summarize guidelines on future dataset collection and system design.",
}
```

