bigbases/FAIR-RAG

FAIR-RAG: An End-to-End Framework for Mitigating Political Bias through Fair Retrieval-Augmented Generation

Accepted at SIGIR 2026 (The 49th International ACM SIGIR Conference on Research and Development in Information Retrieval), July 20-24, 2026, Melbourne, Australia.

Abstract

Retrieval-Augmented Generation (RAG) systems can amplify political bias from underlying web corpora. To empirically demonstrate this amplification, we first analyze 16,254 documents from the C4 dataset and 24,300 LLM-generated responses, revealing significant left-leaning and supportive stance bias that can propagate strongly from retrieval to generation. To mitigate this amplification of political bias, we propose FAIR-RAG, an end-to-end framework integrating (1) multi-LLM persona-based annotation, (2) a vector database with political-stance metadata, and (3) a multi-stage fairness engine designed for each of the three stages in RAG systems.

Key Results

Metric                                     FAIR-RAG        Improvement
AWRF (Attention Weighted Rank Fairness)    97.51           82.1% over SOTA
Perspective Balance (GPT / Gemini)         51.01 / 82.37   Avg. 5.6% over SOTA
Context Precision                          0.972           Maintained high quality
Faithfulness                               0.995           Maintained high quality
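
The table does not define AWRF. As a rough illustration only, the sketch below follows one common formulation of attention-weighted rank fairness: each rank gets a log-discounted attention weight, the per-group share of attention is compared with a target distribution using Jensen-Shannon divergence, and the score is one minus that divergence. The discount, the target distribution, and the function names `jensen_shannon` and `awrf` are assumptions, not taken from this repository; the reported value may also be on a different scale (for example, multiplied by 100).

```python
import math


def _kl(p, q):
    """KL divergence in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon(p, q):
    """Jensen-Shannon divergence (base 2), which lies in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)


def awrf(ranked_groups, target):
    """Score a ranking given the group label of each ranked document.

    ranked_groups: group labels, top-ranked document first.
    target: dict mapping group label -> desired share of attention.
    """
    if not ranked_groups:
        raise ValueError("ranking must contain at least one document")
    groups = sorted(target)
    # Assumed attention model: DCG-style discount 1 / log2(rank + 1).
    weights = [1 / math.log2(rank + 1) for rank in range(1, len(ranked_groups) + 1)]
    total = sum(weights)
    exposure = {g: 0.0 for g in groups}
    for group, weight in zip(ranked_groups, weights):
        exposure[group] += weight / total
    p = [exposure[g] for g in groups]
    q = [target[g] for g in groups]
    return 1.0 - jensen_shannon(p, q)
```

Under this formulation, a ranking whose attention is spread across groups in line with the target scores close to 1, while a ranking dominated by a single group scores lower.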

Architecture

FAIR-RAG comprises three components that operate collaboratively across the entire RAG pipeline:

                        FAIR-RAG Pipeline
 ┌──────────────────┐  ┌─────────────┐  ┌──────────────────────────────┐
 │ FAIR-Annotation  │→ │ FAIR-KB     │→ │        FAIR-Engine           │
 │                  │  │             │  │                              │
 │ Multi-LLM        │  │ ChromaDB +  │  │  R-stage   A-stage  G-stage  │
 │ Persona-based    │  │ Political-  │  │ Balanced → Context → Fair    │
 │ Annotation       │  │ Stance      │  │ Retrieval  Augment  Response │
 │ (4 personas x    │  │ Metadata    │  │  (Quota)   (Meta)   (Guide)  │
 │  2 LLMs = 8)     │  │             │  │                              │
 └──────────────────┘  └─────────────┘  └──────────────────────────────┘

1. FAIR-Annotation (0_annotation/)

Multi-LLM persona-based annotation using 8 annotators (4 personas x 2 LLMs: Claude Sonnet 4, GPT-4.1). Each annotator independently assigns Political Orientation and Topic-specific Stance scores (-1.0 to +1.0) with categorical labels, aggregated via majority voting.
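
A minimal sketch of how eight annotator outputs for one document might be aggregated by majority vote. The record keys (`label`, `score`), the function name `aggregate_annotations`, and the tie handling are illustrative assumptions; the actual logic lives in `annotation_process.py` and may differ. With an even number of annotators a 4-4 split is possible, so this sketch reports ties explicitly instead of picking a winner.

```python
from collections import Counter
from statistics import mean


def aggregate_annotations(annotations):
    """Aggregate per-annotator results for a single document.

    annotations: list of dicts with hypothetical keys
      'label' (categorical label) and 'score' (float in [-1.0, 1.0]).
    """
    if not annotations:
        raise ValueError("need at least one annotation")
    for ann in annotations:
        if not -1.0 <= ann["score"] <= 1.0:
            raise ValueError(f"score out of range: {ann['score']}")

    votes = Counter(ann["label"] for ann in annotations)
    top_count = max(votes.values())
    leaders = sorted(label for label, n in votes.items() if n == top_count)

    return {
        # Assumed policy: no label is assigned when the vote is tied.
        "label": leaders[0] if len(leaders) == 1 else None,
        "tied": leaders if len(leaders) > 1 else [],
        "votes": dict(votes),
        "mean_score": mean(ann["score"] for ann in annotations),
    }
```

For example, five `left` votes against three `center` votes yield the label `left`, while a 4-4 split returns `None` together with the tied labels so the document can be reviewed separately.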

2. FAIR-KB (1_doc2vec/)

A metadata-augmented vector database built on ChromaDB with HNSW indexing. Documents are chunked (512 tokens, 50-token overlap), embedded via multilingual-e5-large-instruct (1024-dim), and indexed with political-stance metadata for balanced retrieval.
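
The chunking step can be pictured as a sliding window with a stride of 512 - 50 = 462 tokens. The sketch below is an assumption-laden illustration: it uses whitespace-separated words as a stand-in for model tokens, and the names `chunk_tokens`, `make_records`, and the metadata keys are hypothetical rather than taken from `ingest_csv_to_chroma.py`.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=50):
    """Split a token list into windows that share `overlap` tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


def make_records(doc_id, text, metadata, chunk_size=512, overlap=50):
    """Build one record per chunk, copying the document-level metadata."""
    # Assumption: whitespace tokens approximate the embedding model's tokens.
    tokens = text.split()
    return [
        {"id": f"{doc_id}-{i}", "text": " ".join(chunk), "metadata": dict(metadata)}
        for i, chunk in enumerate(chunk_tokens(tokens, chunk_size, overlap))
    ]
```

Each record carries the same political-stance metadata as its parent document, so every chunk can later be selected or filtered by perspective.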

3. FAIR-Engine (2_rag/)

A three-stage fairness engine:

  • R-stage (Balanced Retrieval): Quota-based retrieval across 9 perspective categories (3 orientations x 3 stances), ensuring equal representation.
  • A-stage (Awareness Context Augmentation): Re-ranks documents by relevance and constructs politically aware context with metadata summaries and annotated documents.
  • G-stage (Fair Response Generation): Injects fairness guidelines into the system prompt, directing the LLM to produce balanced multi-perspective responses.
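
The R-stage and A-stage described above can be sketched as follows, assuming each candidate already carries a relevance score plus orientation and stance labels. The label names, the per-category quota of one document, and the functions `quota_retrieve` and `build_context` are illustrative assumptions, not the code in `OUR_rag_system.py`.

```python
from collections import Counter, defaultdict
from itertools import product

ORIENTATIONS = ("left", "center", "right")   # assumed label names
STANCES = ("support", "neutral", "oppose")   # assumed label names


def quota_retrieve(candidates, k_per_category=1):
    """Keep the top-k most relevant documents in each of the 9 categories."""
    buckets = defaultdict(list)
    for doc in sorted(candidates, key=lambda d: d["score"], reverse=True):
        key = (doc["orientation"], doc["stance"])
        if len(buckets[key]) < k_per_category:
            buckets[key].append(doc)
    selected = [
        doc
        for key in product(ORIENTATIONS, STANCES)
        for doc in buckets.get(key, [])
    ]
    # Re-rank the balanced set by relevance before building the context.
    selected.sort(key=lambda d: d["score"], reverse=True)
    return selected


def build_context(docs):
    """Prefix a coverage summary, then list documents with their labels."""
    counts = Counter((d["orientation"], d["stance"]) for d in docs)
    header = "Perspective coverage: " + ", ".join(
        f"{o}/{s}={n}" for (o, s), n in sorted(counts.items())
    )
    body = [
        f"[{i}] ({d['orientation']}, {d['stance']}) {d['text']}"
        for i, d in enumerate(docs, start=1)
    ]
    return "\n".join([header] + body)
```

Categories with no matching candidates simply contribute nothing in this sketch; how the real system handles empty categories is not stated here.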

Project Structure

FAIR-RAG/
├── 0_annotation/                # FAIR-Annotation pipeline
│   ├── annotation_process.py    # Multi-persona annotation with LLMs
│   ├── c4_collection.py         # C4 dataset collection and filtering
│   ├── chatgpt/                 # ChatGPT API integration
│   ├── claude/                  # Claude API integration
│   └── prompt/                  # Persona-based prompt templates
├── 1_doc2vec/                   # FAIR-KB construction
│   ├── ingest_csv_to_chroma.py  # CSV to ChromaDB ingestion
│   └── check_query2db.py        # Database query validation
├── 2_rag/                       # FAIR-Engine implementation
│   └── OUR_rag_system.py        # Core RAG system with R-A-G fairness
├── querylist/                   # Query datasets for evaluation
│   ├── debate_questions_general.csv
│   ├── debate_questions_oppose.csv
│   └── debate_questions_support.csv
├── topics-questions.csv         # 15 politically sensitive topics and keywords
├── c4_analysis_summary.csv      # C4 corpus bias analysis results
├── metric_base_test_with_gpt.py     # Baseline evaluation (GPT judge)
├── metric_base_test_with_gemini.py  # Baseline evaluation (Gemini judge)
├── metric_fairness_test_with_gpt.py # Fairness evaluation (GPT judge)
├── metric_fairness_test_with_gemini.py  # Fairness evaluation (Gemini judge)
└── test_ablation_study.py       # Ablation study framework

Getting Started

Prerequisites

  • Python 3.8+
  • API keys for OpenAI (GPT-4.1) and Anthropic (Claude Sonnet 4)
  • Ollama for local LLM inference (llama3.1, qwen3, gemma3, gpt-oss)
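
Before running the pipeline, it can help to confirm that the API keys are visible to Python. The sketch below checks `OPENAI_API_KEY` and `ANTHROPIC_API_KEY`, which are the environment variables the official `openai` and `anthropic` Python SDKs read by default. Whether this repository's scripts load keys that way is an assumption, and `missing_api_keys` is a hypothetical helper.

```python
import os

# Default variable names read by the official openai and anthropic SDKs.
REQUIRED_ENV_VARS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY")


def missing_api_keys(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_api_keys()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All required API keys are set.")
```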

Installation

pip install chromadb sentence-transformers openai anthropic

Usage

Step 1: Annotate documents with political-stance metadata

cd 0_annotation
python annotation_process.py

Step 2: Build the FAIR-KB vector database

cd 1_doc2vec
python ingest_csv_to_chroma.py

Step 3: Run FAIR-RAG

cd 2_rag
python OUR_rag_system.py

Step 4: Evaluate (run from the repository root, where the evaluation scripts live)

# Baseline evaluation
python metric_base_test_with_gpt.py
python metric_base_test_with_gemini.py

# Fairness evaluation
python metric_fairness_test_with_gpt.py
python metric_fairness_test_with_gemini.py

# Ablation study
python test_ablation_study.py

Citation

@inproceedings{you2026fairrag,
  title={FAIR-RAG: An End-to-End Framework for Mitigating Political Bias through Fair Retrieval-Augmented Generation},
  author={You, Jaebeom and Lee, Kisung and Kwon, Hyuk-Yoon},
  booktitle={Proceedings of the 49th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '26)},
  year={2026},
  address={Melbourne, Australia},
  publisher={ACM}
}

Authors

  • Jaebeom You - Graduate School of Data Science, Seoul National University of Science and Technology
  • Kisung Lee - Division of Computer Science, Louisiana State University
  • Hyuk-Yoon Kwon* - Graduate School of Data Science, Seoul National University of Science and Technology; College of Computing, Georgia Institute of Technology

*Corresponding Author

License

This project is licensed under the MIT License - see the LICENSE file for details.
