# medhallu-eval

An Inspect AI implementation of the MedHallu benchmark for detecting hallucinations in LLM-generated medical answers.

- **Dataset:** UTAustin-AIHealth/MedHallu (`pqa_labeled` config)
- **Task:** Classify whether a candidate answer to a PubMed-derived question is factual or hallucinated (see the sketch below).
- **Metric:** Accuracy, overall and by difficulty (easy / medium / hard)
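
For orientation, here is a minimal sketch of how the task wiring might look. The dataset column names, prompt, split, and scorer below are assumptions for illustration only; the actual implementation lives in `src/medhallu/medhallu.py`.

```python
# Hypothetical sketch; not the actual src/medhallu/medhallu.py.
from typing import Any

from inspect_ai import Task, task
from inspect_ai.dataset import Sample, hf_dataset
from inspect_ai.scorer import match
from inspect_ai.solver import generate

PROMPT = """Decide whether the answer to the question below is factual or
hallucinated. Reply with exactly one word: "factual" or "hallucinated".

Question: {question}
Answer: {answer}"""


def record_to_sample(record: dict[str, Any]) -> list[Sample]:
    # Each dataset row yields two samples (hence the ~2x dataset size):
    # one for the ground-truth answer and one for the hallucinated answer.
    # The column names here are assumptions about the MedHallu schema.
    difficulty = record.get("Difficulty Level")
    return [
        Sample(
            input=PROMPT.format(
                question=record["Question"], answer=record["Ground Truth"]
            ),
            target="factual",
            metadata={"difficulty": difficulty},
        ),
        Sample(
            input=PROMPT.format(
                question=record["Question"], answer=record["Hallucinated Answer"]
            ),
            target="hallucinated",
            metadata={"difficulty": difficulty},
        ),
    ]


@task
def medhallu() -> Task:
    return Task(
        dataset=hf_dataset(
            path="UTAustin-AIHealth/MedHallu",
            name="pqa_labeled",
            split="train",  # assumed split name
            sample_fields=record_to_sample,
        ),
        solver=generate(),
        # match() checks the one-word verdict against the target; the
        # "difficulty" metadata supports the per-difficulty breakdown.
        scorer=match(),
    )
```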

## Setup

```bash
pip install -e ".[dev]"
```

## Run

```bash
# Full eval (~2× the dataset size: two samples per row)
inspect eval src/medhallu/medhallu.py@medhallu

# Quick smoke test (first 20 rows → 40 samples)
inspect eval src/medhallu/medhallu.py@medhallu --limit 40

# Against a specific model
inspect eval src/medhallu/medhallu.py@medhallu --model openai/gpt-4o
```

## Test

```bash
pytest tests/
```
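
As one example of what the tests might cover, here is a sketch of a unit test for the two-samples-per-row conversion. It assumes the hypothetical `record_to_sample` sketched above and that it is importable as `medhallu.medhallu`.

```python
# Hypothetical test, assuming the record_to_sample sketched above.
from medhallu.medhallu import record_to_sample


def test_record_to_sample_yields_two_samples():
    # Assumed MedHallu column names, as in the sketch above.
    record = {
        "Question": "Does drug X treat condition Y?",
        "Ground Truth": "Yes, in randomized trials it reduced symptoms.",
        "Hallucinated Answer": "No, it is contraindicated for condition Y.",
        "Difficulty Level": "easy",
    }
    samples = record_to_sample(record)
    assert len(samples) == 2
    assert {s.target for s in samples} == {"factual", "hallucinated"}
    assert all(s.metadata["difficulty"] == "easy" for s in samples)
```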

## Citation

```bibtex
@article{pandit2025medhallu,
  title={MedHallu: A Comprehensive Benchmark for Detecting Medical Hallucinations in Large Language Models},
  author={Pandit, Shrey and Xu, Jiawei and Hong, Junyuan and Wang, Zhangyang and Chen, Tianlong and Xu, Kaidi and Ding, Ying},
  journal={arXiv preprint arXiv:2502.14302},
  year={2025}
}
```
