Lightweight tri-fusion retrieval with prompt-engineered faithful generation for multi-turn RAG.
This repository contains the code, paper, and reproducibility utilities for our SemEval-2026 Task 8 (MTRAGEval) system. The public GitHub version is intentionally focused on the material needed to understand, reproduce, and extend the paper. It excludes benchmark data, built indices, cached embeddings, generated submissions, logs, and local research artifacts.
- Paper: `paper/main.pdf`
- Paper source: `paper/main.tex`
- Documentation index: `docs/README.md`
- Citation metadata: `CITATION.cff`
| Task | Metric | Score | Rank |
|---|---|---|---|
| A | nDCG@5 | 0.433 | 20/38 |
| B | H-mean (RB_agg, RL_F, RB_llm) | 0.756 | 6/26 |
| C | H-mean (RB_agg, RL_F, RB_llm) | 0.533 | 14/29 |
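Tasks B and C are scored by the harmonic mean (H-mean) of the three generation metrics RB_agg, RL_F, and RB_llm. For reference, a minimal harmonic-mean sketch — the values below are illustrative, not the actual per-metric scores:

```python
def harmonic_mean(values):
    """Harmonic mean of positive values: k / sum(1/v)."""
    if any(v <= 0 for v in values):
        raise ValueError("harmonic mean requires positive values")
    return len(values) / sum(1.0 / v for v in values)

# Illustrative per-metric values (not the actual scores):
print(round(harmonic_mean([0.8, 0.7, 0.75]), 3))  # → 0.748
```

The harmonic mean is dominated by the weakest of the three metrics, so a single poor component (e.g. faithfulness) drags the aggregate down more than an arithmetic mean would.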
Included:
- source code for retrieval, generation, and submission pipelines
- camera-ready analysis scripts used for the final paper revision
- SLURM job scripts used on Purdue Gilbreth
- LaTeX source and compiled camera-ready PDF
- lightweight documentation for setup and reproduction
Intentionally excluded:
- benchmark corpora and task files
- generated indices, embeddings, and cached model outputs
- evaluation logs, intermediate results, and local notebooks
- organizer-provided private analytics files and confidential evaluation artifacts
```text
.
├── src/mtrageval/     # Python package: retrieval, generation, analysis helpers
├── scripts/
│   ├── setup/         # Index and cache construction
│   ├── submission/    # Official Task A/B/C generation pipelines
│   ├── analysis/      # Retrieval, prompt, and camera-ready analyses
│   ├── evaluation/    # Local evaluation utilities
│   └── validation/    # Format and sanity checks
├── tests/             # Unit tests for camera-ready analysis code
├── slurm/             # Gilbreth cluster job scripts
├── paper/             # ACL/SemEval paper source and PDF
├── docs/              # Reproducibility and repository documentation
├── requirements.txt
└── README.md
```
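The tri-fusion retriever in `src/mtrageval` combines BM25, SPLADE, and dense (Jina v4) rankings. As an illustrative sketch only — reciprocal rank fusion (RRF) is one standard way to merge such ranked lists; the fusion actually used in the paper may differ:

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank(d)).

    `rankings` is a list of ranked doc-id lists (best first); `k` damps the
    influence of top ranks (60 is the value from the original RRF paper).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical ranked lists from the three retrievers:
bm25   = ["d1", "d2", "d3"]
splade = ["d2", "d1", "d4"]
dense  = ["d2", "d3", "d1"]
print(rrf_fuse([bm25, splade, dense]))  # → ['d2', 'd1', 'd3', 'd4']
```

RRF needs only ranks, not calibrated scores, which is why it is a common baseline for fusing lexical, learned-sparse, and dense retrievers.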
Set up a Python environment and install dependencies:

```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

Download the MTRAGEval benchmark from the official repository:
The public repo assumes the following local layout after download:
```text
data/
├── corpora/
│   ├── clapnq.jsonl
│   ├── cloud.jsonl
│   ├── fiqa.jsonl
│   └── govt.jsonl
├── rag_taskAC.jsonl
└── reference_taskB.jsonl
```
Some camera-ready analyses also expect a local benchmark checkout under `external/mt-rag-benchmark/`.
Task B and Task C generation require an OpenAI API key:
```bash
export OPENAI_API_KEY="your-key"
```

Build the BM25, SPLADE, and Jina v4 indices:

```bash
python scripts/setup/build_bm25_indices.py
python scripts/setup/build_splade_indices.py
python scripts/setup/build_jina_v4_index.py
```

Task A:
```bash
python scripts/submission/generate_taska_submission.py \
  --input-file data/rag_taskAC.jsonl \
  --output-file "PFW Task 8_taskA.jsonl" \
  --team-name "PFW Task 8"
```

Task B:
```bash
python scripts/submission/generate_taskb_v2.py \
  --input-file data/reference_taskB.jsonl \
  --output-file "PFW Task 8_taskB.jsonl" \
  --model gpt-4o
```

Task C:
```bash
python scripts/submission/generate_taskc_submission.py \
  --input-file data/rag_taskAC.jsonl \
  --output-file "PFW Task 8_taskC.jsonl" \
  --team-name "PFW Task 8"
```

The final paper uses the following analysis scripts:
```bash
PYTHONPATH=src python scripts/analysis/camera_ready_retrieval.py --repo-root .
PYTHONPATH=src python scripts/analysis/camera_ready_prompt_ablation.py --repo-root .
PYTHONPATH=src python scripts/analysis/camera_ready_taskc_control.py --repo-root .
PYTHONPATH=src python scripts/analysis/camera_ready_summary.py --repo-root .
```

Detailed notes are in `docs/reproducibility.md`.
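The format checks live under `scripts/validation/`. As an illustrative sketch only — the required field names below are assumptions, not the official submission schema — a JSONL sanity check has this shape:

```python
import json

def check_jsonl(lines, required_fields=("task_id", "response")):
    """Yield (line_number, error) pairs for malformed submission lines.

    NOTE: the required field names here are illustrative assumptions;
    the official schema is defined by the task organizers.
    """
    for n, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            yield n, f"invalid JSON: {exc}"
            continue
        for field in required_fields:
            if field not in record:
                yield n, f"missing field: {field!r}"

# Usage on in-memory lines; for a file, pass open(path) instead.
sample = ['{"task_id": "1", "response": "ok"}', '{"task_id": "2"}', "oops"]
errors = list(check_jsonl(sample))
```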
Run the analysis test suite with:
```bash
PYTHONPATH=src python -m unittest discover -s tests -p 'test_camera_ready*.py'
```

- Retrieval indexing and large-scale analyses were run on Purdue Gilbreth.
- A100-80GB GPUs were used for dense/sparse indexing.
- A30 GPUs were used for later camera-ready analyses.
- Task B and Task C generation use the OpenAI API.
```bibtex
@inproceedings{tamsal2026pfw,
  title        = {{PFW Task 8} at {SemEval}-2026 Task 8: Lightweight Tri-Fusion Retrieval with Prompt-Engineered Faithful Generation for Multi-Turn {RAG}},
  author       = {Tamsal, Taleef and Rusert, Jonathan},
  booktitle    = {Proceedings of the 20th International Workshop on Semantic Evaluation (SemEval-2026)},
  year         = {2026},
  address      = {San Diego, California},
  organization = {Association for Computational Linguistics}
}
```

- Benchmark homepage: https://ibm.github.io/mt-rag-benchmark/MTRAGEval/
- Benchmark repository: https://github.com/IBM/mt-rag-benchmark
- SemEval-2026: https://semeval.github.io/
- Taleef Tamsal: tamst01@pfw.edu
- Jonathan Rusert: jrusert@pfw.edu