
SIGIR'25: An Empirical Study of Evaluating Long-form Question Answering

Repository of code and data for the SIGIR 2025 full paper "An Empirical Study of Evaluating Long-form Question Answering".

Abstract: Long-form question answering (LFQA) aims to generate lengthy answers to complex questions. This scenario offers great flexibility but also poses significant challenges for evaluation. Most evaluations rely on deterministic metrics based on string or n-gram matching, while the reliability of large language model-based evaluation for long-form answers remains relatively unexplored. We address this gap by conducting an in-depth study of long-form answer evaluation guided by the following research questions: (i) To what extent do existing automatic evaluation metrics serve as a substitute for human evaluation? (ii) What are the limitations of existing evaluation metrics compared to human evaluation? (iii) How can the effectiveness and robustness of existing evaluation methods be improved? We collect 5,236 factoid and non-factoid long-form answers generated by different large language models and conduct a human evaluation on 2,079 of them, focusing on correctness and informativeness. We then investigate the performance of automatic evaluation metrics on these answers and analyze their consistency with human evaluations. We find that the style and length of the answers, as well as the category of the questions, can bias the automatic evaluation metrics. However, fine-grained evaluation helps mitigate this issue for some metrics. Our findings have important implications for using large language models to evaluate long-form question answering. All code and datasets are available at https://github.com/bugtig6351/lfqa_evaluation.

Data

All data is stored in the data folder as JSON files; each subfolder is named after the dataset it contains.
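
As a minimal sketch (the exact file names depend on the datasets shipped in the repository), the data can be loaded with the standard `json` module following the layout described above:

```python
import json
from pathlib import Path

# Minimal sketch: load every JSON file under data/, grouped by dataset name.
# Assumes the layout data/<dataset_name>/*.json described above; actual file
# names depend on the datasets included in the repository.
datasets = {}
for json_file in Path("data").glob("*/*.json"):
    dataset_name = json_file.parent.name
    with json_file.open(encoding="utf-8") as f:
        datasets.setdefault(dataset_name, []).append(json.load(f))

print({name: len(files) for name, files in datasets.items()})
```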

src

This folder contains the model-calling interfaces, prompts, and utility functions used by the experiments.
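
For illustration only (the actual interfaces and prompts in src may differ), an LLM-based evaluation call typically wraps a judging prompt around a chat-completion request. The sketch below assumes an OpenAI-compatible client and a hypothetical `judge_answer` helper:

```python
from openai import OpenAI  # assumption: an OpenAI-compatible client is used

client = OpenAI()

def judge_answer(question: str, answer: str, model: str = "gpt-4o-mini") -> str:
    """Hypothetical LLM-as-judge sketch; the prompts shipped in src/ may differ."""
    prompt = (
        "Rate the following long-form answer for correctness and informativeness "
        "on a 1-5 scale and briefly justify the score.\n\n"
        f"Question: {question}\nAnswer: {answer}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content
```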

run

The main experiments, with code provided as Jupyter notebooks.
