AeroGraph is a benchmark-pathology study of graph-augmented RAG over aviation
safety narratives. We build a hybrid retrieval system over 2,000 NASA ASRS
incident reports and evaluate six retrieval configurations on a 50-query
benchmark with paired Wilcoxon significance tests and dual LLM judges. Within
the primary corpus, graph-augmented systems significantly beat a vector
baseline (p < 0.05, paired Wilcoxon); the advantage does not survive
replication on a held-out corpus.
The Space runs in cached mode: click any of the 10 preset showcase queries
and see a grounded answer with ACN citations. No API key required. For live
queries against the full hybrid retriever, clone the repo and set
ANTHROPIC_API_KEY.
```
ASRS reports (2,000 real narratives)
          │
          ▼
┌───────────────────┐
│ LLM extraction    │  Claude Sonnet, structured JSON
│ 10 entity types,  │  HFACS / ICAO taxonomy normalization
│ 8 edge types      │
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│ Knowledge graph   │  NetworkX / Neo4j
│ 23,948 nodes      │  Leiden communities
│ 48,479 edges      │
└─────────┬─────────┘
          │
   ┌──────┴──────┬──────────┬───────────┐
   ▼             ▼          ▼           ▼
ChromaDB       BM25   Personalized  Community
4,710 chunks inverted    PageRank    summaries
                idx       α=0.15      top 100
   │             │          │           │
   └────────RRF fusion (k=45, tuned)────┘
                     │
                     ▼
   Claude generation with ACN citations
```
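The four channel rankings are merged with reciprocal-rank fusion (RRF, k=45). Below is a minimal sketch of that fusion step; the ranked lists and ACN document IDs are invented for illustration, and the real `retrieve.py` fuses all four channels:

```python
def rrf_fuse(ranked_lists, k=45):
    """Reciprocal-rank fusion: each document scores 1/(k + rank),
    summed over every ranked list it appears in."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical top-3 lists from three of the channels.
vector = ["ACN-101", "ACN-202", "ACN-303"]
bm25 = ["ACN-202", "ACN-101", "ACN-404"]
ppr = ["ACN-202", "ACN-303", "ACN-101"]

fused = rrf_fuse([vector, bm25, ppr], k=45)
print(fused[0])  # ACN-202: near the top of all three lists, so it wins
```

A larger k flattens the score differences between ranks, which is why the tuned k=45 matters: it controls how much a single channel's top hit can dominate the fused list.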
| Metric | Primary corpus | Held-out corpus |
|---|---|---|
| Hybrid vs Vector P@10 (paired Wilcoxon $p$) | 0.047 | 0.14 |
| GraphRAG vs Vector P@10 (paired Wilcoxon $p$) | 0.018 | 0.08 |
| Bootstrap 95% CI on $\Delta$P@10 (Hybrid − Vector) | | |
| Kendall $\tau$ | — | |
| Spearman $\rho$ | — | |
| Multi-hop $\Delta$P@10 (Hybrid − Vector) | | |
| Ollama faithfulness (Hybrid) | 0.685 | 0.722 |
| Oracle-dead queries (all systems P@10 = 0) | 12/50 | 24/50 |
Full numbers: `paper/results/eval_results_claude_judge.json` (primary) and
`paper/results/eval_results_heldout.json` (held-out).
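The bootstrap CI row is a percentile bootstrap over per-query score differences. A stdlib-only sketch of that computation; the per-query $\Delta$P@10 values below are invented, not taken from the benchmark:

```python
import random

def bootstrap_ci(deltas, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of paired per-query deltas."""
    rng = random.Random(seed)
    n = len(deltas)
    means = sorted(
        sum(rng.choice(deltas) for _ in range(n)) / n
        for _ in range(n_boot)
    )
    lo = means[int(n_boot * alpha / 2)]
    hi = means[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi

# Invented per-query P@10 deltas (Hybrid - Vector) for a 10-query slice.
deltas = [0.1, 0.0, 0.2, -0.1, 0.1, 0.0, 0.3, 0.1, 0.0, 0.2]
lo, hi = bootstrap_ci(deltas)
print(f"95% CI on mean dP@10: [{lo:.3f}, {hi:.3f}]")
```

If the resulting interval excludes zero, the bootstrap agrees with the paired Wilcoxon verdict; a CI that straddles zero on the held-out corpus is exactly the non-replication pattern the table reports.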
Cached demo (no API key required):

```bash
git clone https://github.com/Aryan95614/AeroGraph.git
cd AeroGraph
make repro
```

`make repro` installs the package, verifies data artifacts, runs the
integration test, and launches the Gradio demo on localhost:7860 in
cached mode with 10 showcase queries pre-answered.
Live mode (requires ANTHROPIC_API_KEY):

```bash
export ANTHROPIC_API_KEY=sk-ant-...
python app.py
```

Reproduce the benchmark from scratch (requires API credits, ~2h):

```bash
make ingest    # parse 2,000 reports from HF
make extract   # Claude entity/relation extraction (~$25)
make build     # build canonicalized graph
make embed     # ChromaDB index
make eval      # 50-query × 6-system benchmark (~$15)
make paper     # regenerate figures + populate LaTeX tables
```

Repository layout:

```
src/aerograph/
  ingest.py      parse ASRS narratives + aircraft normalization
  extract.py     LLM-based entity/relation extraction
  taxonomy.py    HFACS / ICAO / phase / weather canonicalization
  graph.py       NetworkX + Neo4j backends, 2-hop BFS, PageRank
  embed.py       ChromaDB chunking and indexing
  community.py   Leiden detection + LLM summarization
  retrieve.py    6 retrievers + RRF fusion
  generate.py    Claude answer generation with ACN citations
  eval.py        50-query benchmark, dual-judge, checkpointing
  api.py         FastAPI server (/query, /stats, /graph/entity)
  dashboard.py   Streamlit interactive dashboard
tests/           integration and unit tests
scripts/         pipeline runners, HF upload, figure generation
paper/           LaTeX source, figures, results JSONs
```
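The personalized PageRank channel that `graph.py` implements can be sketched with a stdlib power iteration. This toy version reads the diagram's α=0.15 as the restart (teleport) probability back to the query's seed entities; the graph, node names, and that reading of α are assumptions for illustration:

```python
def personalized_pagerank(adj, seeds, alpha=0.15, iters=50):
    """Power iteration for personalized PageRank.
    alpha is the probability of restarting at a seed node each step."""
    nodes = list(adj)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    for _ in range(iters):
        nxt = {n: alpha * restart[n] for n in nodes}
        for n, neighbors in adj.items():
            if neighbors:
                share = (1 - alpha) * rank[n] / len(neighbors)
                for m in neighbors:
                    nxt[m] += share
            else:  # dangling node: push its mass back to the seeds
                for m in nodes:
                    nxt[m] += (1 - alpha) * rank[n] * restart[m]
        rank = nxt
    return rank

# Toy entity graph, seeded on a hypothetical query entity.
adj = {
    "runway_incursion": ["atc_miscommunication", "taxi_phase"],
    "atc_miscommunication": ["fatigue"],
    "taxi_phase": ["runway_incursion"],
    "fatigue": [],
}
scores = personalized_pagerank(adj, seeds={"runway_incursion"})
top = max(scores, key=scores.get)
print(top)  # the seed entity dominates the stationary distribution
```

Entities ranked this way feed the top-100 list shown in the architecture diagram, biasing retrieval toward nodes reachable from the query's extracted entities rather than toward globally central ones.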
```bibtex
@misc{dhawan2026aerograph,
  title         = {Extractive Oracle Circularity: Why Graph-RAG Benchmarks
                   Fail Cross-Corpus Replication},
  author        = {Dhawan, Aryan},
  year          = {2026},
  eprint        = {TBA},
  archivePrefix = {arXiv},
  url           = {https://github.com/Aryan95614/AeroGraph}
}
```

MIT — see LICENSE. ASRS narratives are NASA public-domain data; our extracted knowledge graph and benchmark queries are released under CC-BY-4.0.