Inspect AI implementation of the MedHallu benchmark — detecting hallucinations in LLM-generated medical answers.
- **Dataset:** UTAustin-AIHealth/MedHallu (`pqa_labeled` config)
- **Task:** classify whether a candidate answer to a PubMed-derived question is factual or hallucinated
- **Metric:** accuracy, overall and by difficulty (easy / medium / hard)
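The wiring below is a minimal sketch of how such a task could be defined, not the actual contents of `src/medhallu/medhallu.py`. The `record_to_sample` helper, the system prompt, and the dataset field names (`Question`, `Ground Truth`, `Hallucinated Answer`, `Difficulty Level`) are assumptions about the MedHallu schema; the sketch mainly illustrates the one-row-to-two-samples expansion that makes a full eval roughly twice the dataset size.

```python
# Hypothetical sketch of the task definition; the real src/medhallu/medhallu.py
# may differ. Assumes Inspect AI's hf_dataset accepts a record-to-sample
# function that can return multiple samples per record.
from inspect_ai import Task, task
from inspect_ai.dataset import Sample, hf_dataset
from inspect_ai.scorer import match
from inspect_ai.solver import generate, system_message

SYSTEM = (
    "You are given a medical question and a candidate answer. "
    "Reply with exactly '0' if the answer is factual or '1' if it is hallucinated."
)

def record_to_sample(record: dict) -> list[Sample]:
    # One dataset row yields two samples -- the ground-truth answer (target "0")
    # and the hallucinated answer (target "1") -- hence ~2x the dataset size.
    # Difficulty is kept in metadata so accuracy can later be broken out by group.
    question = record["Question"]
    return [
        Sample(
            input=f"Question: {question}\nAnswer: {record['Ground Truth']}",
            target="0",
            metadata={"difficulty": record.get("Difficulty Level")},
        ),
        Sample(
            input=f"Question: {question}\nAnswer: {record['Hallucinated Answer']}",
            target="1",
            metadata={"difficulty": record.get("Difficulty Level")},
        ),
    ]

@task
def medhallu() -> Task:
    return Task(
        dataset=hf_dataset(
            path="UTAustin-AIHealth/MedHallu",
            name="pqa_labeled",
            split="train",
            sample_fields=record_to_sample,
        ),
        solver=[system_message(SYSTEM), generate()],
        scorer=match(),
    )
```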
## Installation

```bash
pip install -e ".[dev]"
```

## Usage

```bash
# Full eval (~2× the dataset size: two samples per row)
inspect eval src/medhallu/medhallu.py@medhallu

# Quick smoke test (first 20 rows → 40 samples)
inspect eval src/medhallu/medhallu.py@medhallu --limit 40

# Against a specific model
inspect eval src/medhallu/medhallu.py@medhallu --model openai/gpt-4o
```

## Tests

```bash
pytest tests/
```
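The CLI commands above also have a Python equivalent through Inspect's `eval()` API. A minimal sketch, reusing the task path and model from the examples (the metrics layout on the returned log varies across Inspect versions, so only the run status is read here):

```python
# Programmatic version of the smoke test above, using Inspect AI's eval() API.
from inspect_ai import eval

logs = eval(
    "src/medhallu/medhallu.py@medhallu",  # same task reference as the CLI
    model="openai/gpt-4o",
    limit=40,  # first 20 rows → 40 samples
)
print(logs[0].status)  # "success" once the run completes; scores live on logs[0].results
```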
## Citation

```bibtex
@article{pandit2025medhallu,
  title={MedHallu: A Comprehensive Benchmark for Detecting Medical Hallucinations in Large Language Models},
  author={Pandit, Shrey and Makkena, Ashok Vardhan and Nambi, Akshay and others},
  journal={arXiv preprint arXiv:2502.14302},
  year={2025}
}
```