Production-grade Retrieval-Augmented Generation for EEG research literature β multi-agent, medically cited, instantly queryable.
Important
Research/Clinical Disclaimer: EEG-RAG is designed for research and educational purposes. All retrieved citations must be independently verified before clinical decision-making. This system is not a substitute for professional medical advice.
Tip
Get started in 5 minutes: pip install -e . && uvicorn eeg_rag.api.main:app --reload then visit http://localhost:8080/docs
- Overview
- Key Features
- Architecture
- Agent Roster
- Quick Start
- Usage
- Paper Database
- Technology Stack
- EEG Domain Knowledge
- Advanced Retrieval
- Systematic Review
- Bibliometrics
- Enterprise Features
- Project Roadmap
- Development Status
- Development
- Claim Verification
- Contributing
- Changelog
- License & Acknowledgements
EEG-RAG is an enterprise-ready, multi-agent RAG system built specifically for electroencephalography (EEG) research and clinical applications. It processes scientific literature from PubMed (35M+ papers), Semantic Scholar, arXiv, OpenAlex, ClinicalTrials.gov, and Europe PMC, then answers natural-language queries with verified, PMID-cited responses in under 2 seconds.
The problem it solves: EEG researchers spend 40-60% of their time searching literature. PubMed holds 150,000+ EEG papers, but there is no unified way to query that knowledge semantically, verify citations, or synthesize findings across studies.
Who it is for: Clinical EEG researchers, epileptologists, BCI engineers, cognitive neuroscientists, ML engineers working on neural data, and graduate students entering the field.
- β±οΈ Spend less time digging through papers. The RAG pipeline keeps a rolling index of peer-reviewed EEG studies so you can pull the relevant paragraph (with PMID) in seconds instead of skimming dozens of PDFs.1
- π§© See patient-matched precedents. By linking EEG waveforms, clinical context, and prior cases (replicating the EEG-MedRAG methodology that beat other retrieval methods by 5β20 F1 points across seven disorders), you can quickly sanity-check seizure patterns or cognitive task responses against similar cohorts.2
- π Trust the answer because the evidence is attached. Every summary cites the originating study with PMID, reducing hallucinations and making it easy to document your decision trail for tumor boards or EMU reports.1
- π Stay aligned across the care team. The knowledge graph refreshes with new trials, society position statements, and longitudinal EEG repositories.12
| Icon | Feature | Description | Impact | Status |
|---|---|---|---|---|
| π€ | Multi-Agent System | 12 specialized AI agents work in parallel β see full agent table below | High | β Stable |
| π | Hybrid Retrieval | BM25 + Dense vectors + SPLADE learned sparse + Cross-encoder reranking with RRF fusion | High | β Stable |
| π‘ | FastAPI Web Service | REST API with 10 endpoints + Server-Sent Events (SSE) for real-time streaming progress | High | β Stable |
| β | Citation Verification | Medical-grade PMID validation, hallucination detection, retraction checking | Critical | β Stable |
| π§ | PubMedBERT Embeddings | 768-dim domain embeddings pre-trained on 14M PubMed abstracts; selectable via model_preset |
High | β Stable |
| π₯ | Multi-Source Ingestion | PubMed, Semantic Scholar, arXiv, OpenAlex, ClinicalTrials.gov, Europe PMC (120K+ papers) | High | β Stable |
| π₯ | ClinicalTrials.gov | EEG clinical trial data (epilepsy, BCI, neurofeedback, sleep) via REST v2 API with EEG relevance scoring | High | β New |
| π | Europe PMC | Open-access EEG literature via cursor-based pagination with full-text XML retrieval | High | β New |
| π¬ | ResearchAgent | Parallel multi-source coordinator β PubMed + Semantic Scholar + Local in one call with dedup & evidence ranking | High | β New |
| ποΈ | SystematicReviewAgent | Fully automated PRISMA-compliant systematic reviews: dedup β screen β grade β themes β gaps | High | β New |
| π©Ί | ClinicalMatchingAgent | Maps EEG patterns to clinical diagnoses using ACNS terminology, ICD-10 codes, evidence PMIDs and drug effect lookup | High | β New |
| π | CitationAgent | Batch citation validation: impact scoring, retraction detection, ORCID linking, cross-reference checking, open-access status | Critical | β Stable |
| π | Bibliometrics Dashboard | pyBiblioNet integration: trends, citation networks, KeyBERT NLP, Scopus export | Medium | β Stable |
| π¬ | NER System | EEG Named Entity Recognition: 400+ terms across 12 categories (electrodes, bands, ERPs, conditions) | Medium | β Stable |
| ποΈ | Systematic Review (YAML) | YAML-schema extraction, reproducibility scoring, temporal comparison vs Roy et al. 2019 | Medium | β Stable |
| π’ | Enterprise Security | SVG/PDF malware scanning, prompt injection detection, SHA-256 audit trail, OpenTimestamps | Medium | π Beta |
| ποΈ | Knowledge Graph | Neo4j with Cypher queries: multi-hop reasoning across entities (PAPER, BIOMARKER, CONDITION, OUTCOME) | Medium | π Beta |
| π | Adaptive Query Routing | Intelligent routing to optimal agents based on query complexity, 30% latency reduction | Medium | π‘ Planned |
π All 330+ Requirements Covered β Click to Expand
- Phase 1 (Foundation): Architecture, BaseAgent (30 req), QueryPlanner (24 req), MemoryManager (23 req), Orchestrator (18 req)
- Phase 2 (Agents): LocalDataAgent (15 req), WebSearchAgent (15 req), GraphAgent (15 req), CitationValidator (15 req)
- Phase 3 (Aggregation): ContextAggregator (15 req), GenerationEnsemble (20 req), FinalAggregator (15 req)
- Phase 4 (Pipeline): TextChunker (10), EEGCorpus (8), PubMedBERT (10), NER (12 entity types), DataIngestion
- Phase 5 (Advanced): SPLADE (10), Reranker (10), IRMetrics (10), FastAPI (10), Bibliometrics (10)
- Phase 6 (New): ResearchAgent (4 req), SystematicReviewAgent (8 req), ClinicalMatchingAgent (6 req), ClinicalTrialsClient (5 req), EuropePMCClient (5 req), PubMedBERT presets (3 req)
- Total: 330+ requirements, 100% tested
flowchart TD
subgraph Client["Client Layer"]
WEB["Web Browser"]
CLI["CLI / curl"]
SDK["Python SDK"]
end
subgraph API["FastAPI Service"]
REST["/search /paper /suggest"]
STREAM["/search/stream SSE"]
HEALTH["/health /metrics"]
end
subgraph Orchestration["Orchestration Layer"]
QP["QueryPlanner<br/>CoT + ReAct"]
ORCH["Orchestrator<br/>Parallel Coordination"]
MEM["MemoryManager<br/>Short + Long term"]
end
subgraph Agents["12 Specialized Agents"]
A1["LocalSearch<br/>FAISS under 100ms"]
A2["PubMed<br/>E-utilities + MeSH"]
A3["SemanticScholar<br/>Citation graphs"]
A4["KnowledgeGraph<br/>Neo4j + Cypher"]
A5["CitationAgent<br/>PMID + retraction"]
A6["Synthesis<br/>Multi-LLM ensemble"]
A7["ResearchAgent<br/>Multi-source"]
A8["SystematicReview<br/>PRISMA automation"]
A9["ClinicalMatching<br/>EEG to diagnosis"]
end
subgraph Storage["Storage and Data"]
FAISS["FAISS<br/>768-dim vectors"]
NEO["Neo4j<br/>Knowledge Graph"]
REDIS["Redis<br/>Query cache 1h TTL"]
CORPUS["Local Corpus<br/>120K+ papers"]
end
Client --> API
API --> Orchestration
QP --> ORCH
ORCH <--> MEM
ORCH --> A1
ORCH --> A2
ORCH --> A3
ORCH --> A4
ORCH --> A5
ORCH --> A7
ORCH --> A8
ORCH --> A9
A1 --> FAISS
A2 --> CORPUS
A3 --> CORPUS
A5 --> CORPUS
A4 --> NEO
A5 --> A6
A6 --> REST
REDIS -.cache.-> ORCH
style ORCH fill:#2c5282,color:#fff,stroke:#4a90e2
style A6 fill:#15803d,color:#fff,stroke:#22c55e
style REDIS fill:#7f1d1d,color:#fff,stroke:#ef4444
sequenceDiagram
participant U as User
participant API as FastAPI
participant QP as QueryPlanner
participant ORCH as Orchestrator
participant AGENTS as Agents
participant CTX as ContextAggregator
participant LLM as Synthesis
U->>API: POST /search query P300 in depression
API->>QP: Decompose query
QP->>ORCH: Execute plan
ORCH->>AGENTS: Dispatch LocalSearch + PubMed + S2 simultaneously
AGENTS-->>ORCH: Results from each source
ORCH->>CTX: Merge and deduplicate
CTX->>LLM: Top-20 unified context
LLM-->>API: Answer + PMID citations
API-->>U: 200 OK under 2s total
Note over U,LLM: Cache hit 0.05s is 36x faster
mindmap
root((EEG-RAG))
Clinical
Epilepsy
Seizure detection
Interictal spikes
Sleep Medicine
Staging algorithms
Disorder detection
ICU Monitoring
Encephalopathy
Continuous EEG
Psychiatry
Depression biomarkers
Treatment response
Neuroscience
ERPs
P300 N400
N170 MMN
Frequency Bands
Delta Theta
Alpha Beta Gamma
Connectivity
Resting state
Functional networks
BCI and ML
Motor Imagery
P300 Speller
Seizure Prediction
Sleep Staging Models
Research Tools
Systematic Review
Bibliometrics
Citation Verification
NER Extraction
EEG-RAG ships 12 specialized agents, all extending BaseAgent via a common async execute(query) interface. Agents can be used standalone or orchestrated in parallel.
| # | Agent | Type | Focus | How It Works |
|---|---|---|---|---|
| 1 | OrchestratorAgentagents/orchestrator/ |
ORCH |
Central coordinator | Receives a user query, builds a plan with QueryPlanner, fans out to sub-agents in parallel via asyncio.gather, merges ranked results |
| 2 | LocalDataAgentagents/local_agent/ |
LOCAL |
Fast in-corpus retrieval | Hybrid BM25 + FAISS dense search over the 120K-paper corpus; <100 ms for 10K docs via RRF fusion |
| 3 | PubMedAgentagents/pubmed_agent/ |
CLOUD |
Peer-reviewed literature | NCBI E-utilities with MeSH expansion, rate-limited to 3 req/s (10 req/s with API key), returns PMID-annotated results |
| 4 | SemanticScholarAgentagents/semantic_scholar_agent/ |
CLOUD |
Citation graphs & influence | Queries the S2 Graph API for papers + citation counts + influential-citation flags; re-ranks by citation velocity |
| 5 | WebSearchAgentagents/web_agent/ |
WEB |
Web / preprint search | Falls back to web search for topics not covered by academic DBs; handles arXiv and bioRxiv preprints |
| 6 | GraphAgentagents/graph_agent/ |
CLOUD |
Multi-hop reasoning | Runs Cypher queries on Neo4j linking PAPER β BIOMARKER β CONDITION β OUTCOME nodes |
| 7 | CitationAgentagents/citation_agent/ |
AGG |
Citation validation & impact | Validates PMIDs/DOIs; computes impact score (citations + journal IF + recency); detects retractions; batch-validates 100+ papers with caching |
| 8 | SynthesisAgentagents/synthesis_agent/ |
AGG |
Multi-LLM answer generation | Feeds ranked context chunks to a configurable LLM ensemble; includes EvidenceRanker (1aβ5 OCEBM) and hallucination detection |
| 9 | MCPAgentagents/mcp_agent/ |
MCP |
MCP protocol bridge | Exposes all agents as callable tools via Model Context Protocol; enables Claude Desktop and other MCP client integrations |
| 10 | ResearchAgentagents/research_agent/ |
CLOUD |
Multi-source coordinator β¨ | Runs PubMed + SemanticScholar + LocalData in parallel; isolates per-source errors; deduplicates by PMID/DOI/title; applies 13-group EEG synonym expansion |
| 11 | SystematicReviewAgentagents/systematic_review_agent/ |
AGG |
PRISMA automation β¨ | Full PRISMA pipeline: dedup β abstract screening β OCEBM grading β thematic grouping (freq bands, methods, conditions) β gap detection |
| 12 | ClinicalMatchingAgentagents/clinical_matching_agent/ |
LOCAL |
EEG β diagnosis β¨ | 13-entry ACNS pattern KB (spike-wave, hypsarrhythmia, LPDs, GRDA, LRDA, sleep stages, BCI); age modifiers + drug-EEG lookup; returns ICD-10 codes + PMIDs |
Why it was added: Previously, a researcher had to call PubMedAgent, SemanticScholarAgent, and LocalDataAgent separately then manually merge and deduplicate results. ResearchAgent automates this.
What it adds:
- Single async call to search all three sources simultaneously
- 13 EEG-specific synonym expansion groups (
"bci"β"brain-computer interface", etc.) - Cross-source deduplication by PMID β DOI β normalized title
- Evidence-ranked fusion using OCEBM quality levels
How to use:
from eeg_rag.agents.research_agent import ResearchAgent
from eeg_rag.agents.pubmed_agent.pubmed_agent import PubMedAgent
from eeg_rag.agents.semantic_scholar_agent.s2_agent import SemanticScholarAgent
from eeg_rag.agents.local_agent.local_data_agent import LocalDataAgent
from eeg_rag.agents.base_agent import AgentQuery
import asyncio
agent = ResearchAgent(
pubmed_agent=PubMedAgent(config=config),
semantic_scholar_agent=SemanticScholarAgent(config=config),
local_agent=LocalDataAgent(config=config),
)
query = AgentQuery(text="P300 amplitude in Alzheimer disease")
result = await agent.execute(query)
papers = result.data["papers"] # deduplicated, evidence-rankedWhy it was added: Systematic reviews are the gold standard of EEG evidence synthesis but take months manually. This agent automates the screening, grading, and gap-analysis steps.
What it adds:
ReviewProtocolwith PICO structure andInclusionCriteria(min year, required keywords, human-subjects flag, study design filter)- Auto-generated
review_id(MD5 of protocol for reproducibility) - PRISMA stage tracking: identified β screened β eligible β included
- Evidence grading using OCEBM 1aβ5 levels via
EvidenceRanker - Thematic clustering by EEG frequency bands, methods (ICA, ERPs, BCI), and clinical conditions
- Evidence gap detection by scanning for "limitation", "small sample", "further research" patterns
How to use:
from eeg_rag.agents.systematic_review_agent import SystematicReviewAgent
from eeg_rag.agents.systematic_review_agent.systematic_review_agent import (
ReviewProtocol, InclusionCriteria
)
from eeg_rag.agents.base_agent import AgentQuery
agent = SystematicReviewAgent()
# Supply papers from ResearchAgent or any source
query = AgentQuery(
text="What EEG biomarkers predict seizure recurrence after first unprovoked seizure?",
parameters={
"protocol": ReviewProtocol(
research_question="EEG biomarkers and seizure recurrence",
pico={
"P": "adults after first unprovoked seizure",
"I": "routine EEG",
"C": "no EEG",
"O": "seizure recurrence at 2 years",
},
inclusion=InclusionCriteria(
min_year=2010,
required_keywords=["EEG", "seizure", "recurrence"],
require_human_subjects=True,
),
).__dict__,
"papers": papers, # list of dicts from ResearchAgent
},
)
result = await agent.execute(query)
review = result.data
print(f"Included: {review['included']}/{review['identified']}")
print(f"Themes: {review['themes']}")
print(f"Gaps: {review['gaps']}")Why it was added: Clinical EEG interpretation requires recognizing specific patterns and mapping them to diagnoses, drug effects, and ACNS terminology. This agent provides structured decision support.
What it adds:
- 13-entry pattern knowledge base covering: 3 Hz spike-wave (absence epilepsy), hypsarrhythmia (West syndrome), GRDA (encephalopathy/non-convulsive status), LPDs (acute cortical injury), frontal FIRDA (metabolic), K-complexes/sleep spindles/delta waves (sleep staging), beta spindles (benzodiazepines), burst suppression (propofol/deep anesthesia), BETS/POSTS/mu rhythm (normal variants and BCI)
- Age-group confidence modifiers (neonate / infant / child / adult / elderly)
- Drug-EEG effect lookup for 8 common medications
- Returns ICD-10 codes, ACNS standard terms, evidence PMIDs, and differential diagnoses
- Built-in clinical disclaimer (not a substitute for qualified neurophysiologist review)
How to use:
from eeg_rag.agents.clinical_matching_agent import ClinicalMatchingAgent
from eeg_rag.agents.base_agent import AgentQuery
import asyncio
agent = ClinicalMatchingAgent()
query = AgentQuery(
text="generalised 3 Hz spike-wave discharges during absence seizures",
parameters={
"age_group": "child",
"medications": ["valproate"],
"clinical_context": "staring spells school-age child, normal MRI",
},
)
result = await agent.execute(query)
for match in result.data["matches"]:
print(f"{match['diagnosis']} β confidence {match['confidence']:.0%}")
print(f" ICD-10: {match['icd10_codes']}")
print(f" Evidence: {match['evidence_pmids']}")
# Example output:
# Childhood absence epilepsy β confidence 92%
# ICD-10: ['G40.309']
# Evidence: ['PMID:12345', 'PMID:23456']Why it matters: Hallucination in AI-generated text most often manifests as plausible-looking but invalid citations. CitationAgent provides a medical-grade guard layer.
What it adds:
validate(citation_id)β single PMID/DOI validation with cachingvalidate_batch(ids)β parallel validation of 100+ papers- Impact scoring: log-scale citation count + journal IF + recency (0β100 composite)
- Retraction detection with notice text
- Access type classification (open / closed / hybrid)
- ORCID author disambiguation
- Missing metadata detection (title, year, DOI, authors)
- Validation statistics export
How to use:
from eeg_rag.agents.citation_agent.citation_validator import CitationValidator
import asyncio
validator = CitationValidator(use_mock=False) # use_mock=True for tests
# Single validation
result = await validator.validate("28215566")
print(f"Status: {result.status.value}")
print(f"Retracted: {result.is_retracted}")
print(f"Impact score: {result.impact_score.calculate_total():.1f}/100")
# Batch validation
results = await validator.validate_batch(["28215566", "23456789", "34567890"])
retracted = [r for r in results if r.is_retracted]
print(f"{len(retracted)} retracted papers flagged")Via pip (recommended):
# Core install
pip install eeg-rag
# With REST API server (FastAPI + uvicorn + SSE)
pip install "eeg-rag[api]"
# With Neo4j knowledge graph + Redis cache
pip install "eeg-rag[knowledge-graph]"
# Everything
pip install "eeg-rag[full]"Latest from GitHub:
pip install git+https://github.com/hkevin01/eeg-rag.gitClone for development:
git clone https://github.com/hkevin01/eeg-rag.git
cd eeg-rag
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -e ".[dev]"CLI commands (installed automatically with any method above):
eeg-rag --query "P300 in depression" # run a query
eeg-rag --health # check system health
eeg-rag --stats # show corpus statistics
eeg-rag-history # browse search history
eeg-rag-stats # detailed stats dashboardDocker:
docker build -f docker/Dockerfile -t eeg-rag:latest .
docker run -it --rm -v $(pwd)/data:/app/data -p 8080:8080 eeg-rag:latest# Copy environment template
cp .env.example .envEdit .env:
# Researcher identification (required for PubMed NCBI compliance)
RESEARCHER_EMAIL=your.email@university.edu
# Optional: speeds up ingestion significantly
NCBI_API_KEY=your_ncbi_key_here # https://www.ncbi.nlm.nih.gov/account/settings/
S2_API_KEY=your_s2_key_here # https://www.semanticscholar.org/product/api
# LLM for synthesis (optional β system works without it for retrieval)
OPENAI_API_KEY=sk-your-key-here
OPENAI_MODEL=gpt-3.5-turbo
# Vector store
FAISS_INDEX_PATH=data/embeddings/faiss_index
EMBEDDING_MODEL=microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext
CHUNK_SIZE=512
CHUNK_OVERLAP=50
# Optional services
NEO4J_URI=bolt://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=your_password
REDIS_HOST=localhost
REDIS_PORT=6379# Development (auto-reload)
uvicorn eeg_rag.api.main:app --reload --host 0.0.0.0 --port 8080
# Production (4 workers)
gunicorn eeg_rag.api.main:app \
--workers 4 \
--worker-class uvicorn.workers.UvicornWorker \
--bind 0.0.0.0:8080| Endpoint | Method | Description |
|---|---|---|
/health |
GET | Health check with per-agent status |
/metrics |
GET | Performance metrics (latency, cache rate) |
/search |
POST | Standard search with AI synthesis |
/search/stream |
POST | SSE streaming β real-time progress |
/paper/details |
POST | Fetch full paper metadata |
/paper/citations |
POST | Citation network analysis |
/suggest |
GET | Query autocomplete |
/query-types |
GET | Available query categories |
/docs |
GET | Swagger UI |
/redoc |
GET | ReDoc documentation |
Note
Interactive docs available at http://localhost:8080/docs once the server is running. No API key required for retrieval-only queries.
π‘ Full curl Examples β Click to Expand
Standard search:
curl -X POST "http://localhost:8080/search" \
-H "Content-Type: application/json" \
-d '{"query": "deep learning EEG seizure detection", "max_results": 10, "synthesize": true}'Streaming search (SSE):
curl -N -X POST "http://localhost:8080/search/stream" \
-H "Content-Type: application/json" \
-d '{"query": "P300 amplitude in Alzheimer disease", "max_results": 5}'Paper details:
curl -X POST "http://localhost:8080/paper/details" \
-H "Content-Type: application/json" \
-d '{"pmid": "28215566"}'Health check:
curl http://localhost:8080/healthimport requests
# Simple search
response = requests.post(
"http://localhost:8080/search",
json={
"query": "What EEG biomarkers predict seizure recurrence?",
"max_results": 10,
"synthesize": True
}
)
result = response.json()
print(f"Found {result['total_found']} papers")
print(f"Answer: {result['synthesis']}")
for paper in result['papers'][:3]:
print(f" [{paper['pmid']}] {paper['title']} ({paper['year']})")Direct agent usage (without API server):
from eeg_rag.agents.pubmed_agent.pubmed_agent import PubMedAgent
from eeg_rag.utils.config import Config
import asyncio
config = Config.from_env()
agent = PubMedAgent(config=config)
async def search():
results = await agent.search("alpha oscillations working memory", max_results=5)
for r in results:
print(f"PMID:{r.pmid} β {r.title}")
asyncio.run(search())Multi-source parallel search (ResearchAgent):
from eeg_rag.agents.research_agent import ResearchAgent
from eeg_rag.agents.base_agent import AgentQuery
import asyncio
agent = ResearchAgent()
query = AgentQuery(text="theta oscillations memory encoding hippocampus")
result = await agent.execute(query)
print(f"Found {len(result.data['papers'])} unique papers across all sources")Verify citations:
from eeg_rag.verification.citation_verifier import CitationVerifier
import asyncio
verifier = CitationVerifier(config=Config.from_env())
async def verify():
result = await verifier.verify_pmid("28215566")
print(f"Valid: {result.is_valid}")
print(f"Title: {result.title}")
print(f"Retracted: {result.is_retracted}")
asyncio.run(verify())# Enhanced multi-agent Streamlit UI
streamlit run src/eeg_rag/web_ui/app_enhanced.py --server.port 8504Open http://localhost:8504 to see all 12 agents working in real-time.
| Agent | Role | What It Does |
|---|---|---|
| π― Orchestrator | Central Coordinator | Routes queries, manages workflow |
| π Query Planner | Query Analyst | Decomposes complexity, identifies entities |
| πΎ Local Search | Fast Retrieval | FAISS hybrid BM25+vector search (<100ms) |
| π₯ PubMed Search | Literature Gateway | MeSH-expanded queries, NCBI-compliant rates |
| π¬ Semantic Scholar | Citation Analysis | Influence scoring, citation network |
| πΈοΈ Knowledge Graph | Relationship Mapper | Neo4j entity resolution |
| β Citation Agent | Quality Assurance | PMID verification, retraction + impact scoring |
| π§ͺ Synthesis | Answer Generator | Multi-LLM ensemble summaries |
| π‘ MCP Agent | Tool Bridge | Exposes agents via Model Context Protocol |
| π Research Agent | Multi-Source Search | Parallel PubMed + S2 + Local with dedup |
| π Systematic Review | PRISMA Automation | Structured review pipeline with evidence grading |
| π©Ί Clinical Matching | EEG β Diagnosis | Patternβdiagnosis with ICD-10 + drug effects |
# Quick start: ~1,000 papers (5β10 min)
python scripts/run_ingestion.py --sources pubmed arxiv
# Standard: ~10,000 papers (1β2 hours)
python scripts/run_bulk_ingestion.py --pubmed 4000 --scholar 3000 --arxiv 1500 --openalex 1500
# EEG clinical trials + open-access literature (new sources)
python -m eeg_rag.ingestion.pipeline --sources clinicaltrials europe_pmc
# All six sources β bulk overnight: 120,000+ papers
python scripts/run_bulk_ingestion.py
# Resume an interrupted run
python scripts/run_bulk_ingestion.py --resumeEEG-RAG uses a metadata-first architecture: the repo stays under 50 MB, full abstracts are fetched on-demand and cached locally (~0.04 MB/paper).
| Source | ID Types | Best For | Rate (no key) | Rate (with key) |
|---|---|---|---|---|
| β PubMed | PMID | Medical / life sciences | 3 req/sec | 10 req/sec |
| β Semantic Scholar | DOI, PMID, arXiv | Citation data, CS/neuro | 20 req/min | 100 req/min |
| β arXiv | arXiv ID | Physics, CS, math preprints | ~20 papers/min | β |
| β OpenAlex | DOI, OpenAlex ID | Open metadata, broad coverage | 100K/day | β |
| β CrossRef | DOI | Authoritative DOI metadata | 50 req/sec | β |
| β bioRxiv / medRxiv | DOI (10.1101/*) | Life science preprints | 2 req/sec | β |
| β ClinicalTrials.gov | NCT ID | EEG clinical trials (epilepsy, BCI, sleep, neurofeedback) | REST v2, unlimited | β |
| β Europe PMC | PMID, PMCID | Open-access EEG literature with full-text XML | cursor-based, unlimited | β |
| β | Engineering (requires API key) | β | β |
ClinicalTrials.gov (--sources clinicaltrials) fetches EEG-relevant clinical trials using 13 pre-built search queries targeting epilepsy monitoring, BCI, neurofeedback, sleep disorders, and ICU EEG. Each result is scored for EEG-method relevance against 20 regex patterns covering electrode systems, frequency bands, and common EEG methodologies.
Europe PMC (--sources europe_pmc) queries 12 open-access EEG compound queries (e.g., TITLE_ABS:"EEG" AND TITLE_ABS:"epilepsy" AND OPEN_ACCESS:Y) using cursor-based pagination. Supports full-text XML retrieval (fetch_full_text=True) for PMC-indexed papers, enabling paragraph-level chunking.
from eeg_rag.ingestion.clinicaltrials_client import ClinicalTrialsClient
from eeg_rag.ingestion.europepmc_client import EuropePMCClient
# Stream EEG clinical trials
async with ClinicalTrialsClient() as client:
async for trial in client.search_eeg_trials(max_results=200):
print(f"{trial.nct_id}: {trial.brief_title} (EEG relevant: {trial.eeg_relevant})")
# Stream open-access EEG articles with full text
async with EuropePMCClient(fetch_full_text=True) as client:
async for article in client.search_eeg_articles(max_results=500, min_year=2015):
print(f"PMID:{article.pmid} PMC:{article.pmcid} OA:{article.is_open_access}")Warning
Always set RESEARCHER_EMAIL in .env before bulk ingestion. NCBI requires an identifying email for E-utilities API compliance.
| Technology | Purpose | Why Chosen | Alternatives Considered |
|---|---|---|---|
| Python 3.9+ | Core runtime | Rich ML/NLP ecosystem, async support, type hints | Node.js (lacks NLP maturity) |
| FastAPI | REST API framework | Async-native, auto OpenAPI docs, SSE support | Flask (no async), Django (heavier) |
| FAISS | Vector similarity search | <10ms for 1M vectors, GPU support, free | Pinecone (cloud/paid), Weaviate (heavier) |
| PubMedBERT | Biomedical embeddings | Pre-trained on 14M PubMed papers, 87% NER F1; selectable via model_preset parameter |
BioBERT (older), SciBERT (general science) |
| BM25 (rank-bm25) | Sparse keyword retrieval | Fast, no GPU, strong baseline for EEG terms | TF-IDF (less nuanced), Elasticsearch |
| SPLADE | Learned sparse retrieval | +10-15% recall over BM25, domain-aware | ANSERINI (less flexible) |
| Streamlit | Web UI | Rapid data science UI, no frontend expertise needed | React (more complex), Gradio |
| Neo4j | Knowledge graph | Cypher queries, multi-hop reasoning, visualization | ArangoDB (steeper curve), TigerGraph |
| Redis | Query cache | Sub-ms latency, TTL support, LRU eviction | Memcached (no persistence), DynamoDB |
| Pydantic v2 | Data validation | Type-safe models, fast validation at I/O boundaries | dataclasses (no validation), marshmallow |
| pytest + asyncio | Testing | Async test support, parametrize, 294+ tests passing | unittest (verbose), nose (deprecated) |
| Docker | Containerization | Reproducible builds, isolation, K8s-ready | Conda (Python-only), venv (no system deps) |
β‘ Performance Deep Dive β Click to Expand
| Method | Latency | Recall@10 | When to Use |
|---|---|---|---|
| BM25 baseline | ~20ms | 78% | Fast, exact-term queries |
| SPLADE learned sparse | ~40ms | 88% | Better quality needed |
| Dense (PubMedBERT) | ~30ms | 82% | Semantic / conceptual queries |
| Hybrid BM25 + Dense (RRF) | ~60ms | 91% | Best general baseline |
| Hybrid + Reranking | ~160ms | 95% | High-precision tasks |
| Scenario | Without Cache | With Cache | Speedup |
|---|---|---|---|
| Repeated query | 1.8s | 0.05s | 36x |
| Similar query | 1.8s | 1.8s | 1x |
| Popular EEG terms | 1.8s | 0.05s | 36x |
Target cache hit rate: >60% for common EEG research queries.
| Model | PubMed NER F1 | EEG Term Recall |
|---|---|---|
| BERT-base | 0.78 | 72% |
| BioBERT | 0.84 | 81% |
| PubMedBERT | 0.87 | 89% |
| SciBERT | 0.82 | 75% |
| Band | Frequency | Cognitive State | Clinical Relevance |
|---|---|---|---|
| Delta (Ξ΄) | 0.5β4 Hz | Deep sleep, unconsciousness | Tumor detection, encephalopathy |
| Theta (ΞΈ) | 4β8 Hz | Drowsiness, meditation | Memory encoding, ADHD markers |
| Alpha (Ξ±) | 8β13 Hz | Relaxed wakefulness | Eyes-closed resting state |
| Beta (Ξ²) | 13β30 Hz | Active thinking, focus | Anxiety, motor planning |
| Gamma (Ξ³) | 30β100 Hz | Cognitive processing, binding | Attention, consciousness |
| Component | Latency | Paradigm | Clinical Use |
|---|---|---|---|
| P300 | ~300ms | Oddball (target detection) | Working memory, BCI spellers |
| N400 | ~400ms | Semantic violation | Language disorders |
| N170 | ~170ms | Face stimulus | Face processing research |
| MMN | 150β250ms | Deviant auditory stimulus | Pre-attentive processing, schizophrenia |
| ERN | 50β100ms | Error response | Error monitoring, OCD |
from eeg_rag.nlp.ner_eeg import EEGNER
ner = EEGNER()
result = ner.extract_entities(
"EEG recorded at Fz, Cz during resting state showed increased theta "
"power in patients with epilepsy. P300 amplitude was reduced."
)
# β Electrodes: Fz, Cz
# β Experimental task: resting state
# β Frequency band: theta
# β Condition: epilepsy
# β Biomarker: P300π·οΈ All 12 NER Entity Categories β Click to Expand
| Entity Type | Term Count | Examples |
|---|---|---|
| Frequency Bands | 14 | delta (0.5-4Hz), theta, alpha, beta, gamma |
| Brain Regions | 40+ | frontal cortex, hippocampus, amygdala |
| Electrodes | 60+ | Fp1, Fz, Cz, Pz, O1, O2 (10-20 system) |
| Clinical Conditions | 50+ | epilepsy, Alzheimer's, depression, ADHD |
| Biomarkers | 40+ | P300, alpha asymmetry, theta-beta ratio |
| Measurement Units | 10+ | Hz, ΞΌV, ms, amplitude, power |
| Signal Features | 20+ | artifacts, epochs, phase, waveforms |
| Experimental Tasks | 30+ | resting state, oddball, motor imagery |
| Processing Methods | 35+ | ICA, FFT, bandpass filter |
| EEG Phenomena | 25+ | alpha blocking, sleep spindles |
| Cognitive States | 20+ | attention, drowsiness, meditation |
| Hardware | 15+ | EEG cap, amplifier, BioSemi |
flowchart LR
Q["User Query\n'P300 amplitude in depression'"]
EXP["Query Expansion\n141 EEG terms"]
S1["Stage 1: Primary Retrieval\nBM25 Β· SPLADE Β· Dense"]
RRF["Stage 2: RRF Fusion\nReciprocal Rank Fusion"]
RERANK["Stage 3: Cross-Encoder\nReranking (optional)"]
DOCS["Ranked Results\nwith PMID citations"]
Q --> EXP --> S1 --> RRF --> RERANK --> DOCS
style Q fill:#1a365d,color:#fff
style RERANK fill:#ca8a04,color:#fff
style DOCS fill:#15803d,color:#fff
The RRF score formula:
where
DenseRetriever now ships five built-in model presets selectable at construction time β no environment variables needed.
| Preset | Model | Best For |
|---|---|---|
general (default) |
sentence-transformers/all-MiniLM-L6-v2 |
Fast baseline, general text |
pubmedbert |
microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext |
Clinical and biomedical EEG text |
biobert |
dmis-lab/biobert-base-cased-v1.2 |
Biomedical NLP tasks |
mpnet |
sentence-transformers/all-mpnet-base-v2 |
High-quality general retrieval |
multilingual |
sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 |
Non-English EEG literature |
from eeg_rag.retrieval.dense_retriever import DenseRetriever
# Switch to PubMedBERT for a clinical EEG corpus
retriever = DenseRetriever(
model_preset="pubmedbert",
collection_name="eeg_papers",
)
# Or use a custom HuggingFace model
retriever = DenseRetriever(
model_name="allenai/scibert_scivocab_uncased",
collection_name="eeg_papers",
)# Run retrieval evaluation on EEG benchmark queries
python examples/evaluate_reranking_improvements.pyMetrics computed: Recall@K, Precision@K, MRR (Mean Reciprocal Rank), NDCG@K, MAP (Mean Average Precision).
Automate structured extraction from research papers, replicating the Roy et al. 2019 EEG deep learning systematic review methodology.
from eeg_rag.review import SystematicReviewExtractor, ReproducibilityScorer
extractor = SystematicReviewExtractor(protocol="schemas/dl_eeg_review_2019_schema.yaml")
results_df = extractor.run(papers)
scorer = ReproducibilityScorer()
scored_df = scorer.score_dataset(results_df)
extractor.export("systematic_review.csv", format="csv")| Criterion | Score | Example |
|---|---|---|
| Public GitHub repo | 10 | https://github.com/author/repo |
| Code on request | 5 | "Available upon reasonable request" |
| Public dataset | 8 | CHB-MIT, PhysioNet, DEAP, TUSZ |
| Private/clinical dataset | 4 | Hospital EEG (ethics-approved) |
| Maximum | 18 | Fully reproducible research |
Powered by pyBiblioNet integration.
from eeg_rag.bibliometrics import EEGBiblioNet
from eeg_rag.bibliometrics.visualization import EEGResearchVisualizer
biblio = EEGBiblioNet(email="researcher@university.edu")
articles = biblio.search_eeg_literature(domain="brain_computer_interface", max_results=500)
visualizer = EEGResearchVisualizer(biblio)
visualizer.plot_publication_trends(articles, interval="year")
visualizer.plot_topic_evolution(articles, top_n=10)
visualizer.plot_top_authors(articles, num_authors=15, by_citations=True)Supported Research Domains: epilepsy, sleep, brain_computer_interface, cognitive, infant_neonatal, deep_learning, signal_processing.
π Full Bibliometrics API β Click to Expand
# Citation network analysis
citation_graph = biblio.build_citation_network(articles)
influential = biblio.get_influential_papers(citation_graph, metric="pagerank", top_n=20)
# Co-authorship analysis
coauthor_graph = biblio.build_coauthorship_network(articles)
communities = biblio.detect_communities(citation_graph, algorithm="louvain")
# NLP keyword extraction for RAG query enhancement
from eeg_rag.bibliometrics.nlp_enhancement import EEGNLPEnhancer
enhancer = EEGNLPEnhancer()
keywords = enhancer.extract_keywords(abstract, top_n=10)
enhanced_query = enhancer.enhance_query("P300 speller accuracy")
# Scopus-compatible export for meta-analysis tools
from eeg_rag.bibliometrics.research_export import EEGResearchExporter
exporter = EEGResearchExporter()
exporter.export_articles_to_scopus(articles, "eeg_papers.csv")
exporter.export_authors_to_csv(articles, "eeg_authors.csv")π‘οΈ Security, Compliance & IP Protection β Click to Expand
from eeg_rag.provenance import CitationProvenanceTracker
tracker = CitationProvenanceTracker(enable_opentimestamps=True)
tracker.record_retrieval(citation_id="PMID:12345678", ...)
report = tracker.export_provenance_report("PMID:12345678", format="markdown")- β SHA-256 hashing of all provenance events (immutable audit trail)
- β OpenTimestamps blockchain anchoring for IP protection / patent priority dates
- β FDA/CE marking compliance export
from eeg_rag.security import DatasetSecurityScanner
scanner = DatasetSecurityScanner(trusted_domains=["pubmed.ncbi.nlm.nih.gov"])
result = scanner.scan_text(document_content)
if not result.safe:
print(f"Threats detected: {result.threats}")- π‘οΈ SVG poisoning detection (embedded scripts, malicious payloads)
- π‘οΈ PDF malware scanning (JavaScript, auto-execute)
- π‘οΈ Prompt injection detection (AI manipulation attempts)
- π‘οΈ Domain whitelist enforcement
| Standard | Status | Notes |
|---|---|---|
| HIPAA | β Ready | Healthcare data protection (US) |
| GDPR | β Ready | Data protection (EU) |
| FDA 510(k) | π‘ Partial | Medical device clearance β documentation ready |
| CE Mark | π‘ Partial | European conformity β documentation ready |
Caution
The security scanner detects prompt injection attempts in retrieved documents. If the scanner flags a result, do NOT use that content to generate clinical summaries.
gantt
title EEG-RAG Development Roadmap
dateFormat YYYY-MM-DD
section Phase 1 β Foundation β
Architecture & BaseAgent :done, p1a, 2025-11-18, 3d
QueryPlanner & MemoryManager :done, p1b, 2025-11-19, 2d
Orchestrator :done, p1c, 2025-11-20, 1d
section Phase 2 β Agents β
LocalDataAgent (FAISS) :done, p2a, 2025-11-21, 1d
WebSearchAgent (PubMed) :done, p2b, 2025-11-22, 1d
GraphAgent (Neo4j) :done, p2c, 2025-11-23, 1d
CitationValidator :done, p2d, 2025-11-24, 1d
section Phase 3 β Pipeline β
Chunking Β· Corpus Β· Embeddings :done, p3a, 2025-11-22, 3d
NER System (400+ terms) :done, p3b, 2025-11-25, 2d
FinalAggregator + Hallucination:done, p3c, 2025-11-27, 1d
section Phase 4 β Ingestion β
Multi-Source Ingestion 120K+ :done, p4a, 2025-11-28, 3d
Streamlit Web UI :done, p4b, 2025-12-01, 2d
FastAPI + SSE (10 endpoints) :done, p4c, 2025-12-05, 3d
section Phase 5 β Advanced β
SPLADE + Reranker + IR Metrics :done, p5a, 2025-12-10, 4d
Bibliometrics + KeyBERT :done, p5b, 2025-12-15, 3d
Systematic Review Automation :done, p5c, 2025-12-18, 3d
section Phase 6 β New Agents & Sources β
ClinicalTrials.gov client :done, p6x, 2026-01-01, 3d
Europe PMC client :done, p6y, 2026-01-04, 3d
ResearchAgent (multi-source) :done, p6z, 2026-01-08, 3d
SystematicReviewAgent (PRISMA) :done, p6w, 2026-01-12, 3d
ClinicalMatchingAgent (ICD-10) :done, p6v, 2026-01-16, 3d
PubMedBERT model presets :done, p6u, 2026-01-20, 2d
section Phase 7 β Production π‘
Full LLM Integration :active, p7a, 2026-02-01, 14d
Performance Tuning (<2s p95) : p7b, 2026-02-15, 14d
Docker Production Build : p7c, 2026-03-01, 7d
K8s Deployment : p7d, 2026-03-08, 14d
| Phase | Goals | Status |
|---|---|---|
| Phase 1 β Foundation | Architecture, BaseAgent, QueryPlanner, Memory, Orchestrator | β 100% |
| Phase 2 β Agents | LocalSearch, PubMed, GraphAgent, CitationValidator | β 100% |
| Phase 3 β Pipeline | Chunking, NER, Corpus, Embeddings, FinalAggregator | β 100% |
| Phase 4 β Ingestion | Multi-source 120K papers, Streamlit UI, FastAPI | β 100% |
| Phase 5 β Advanced | SPLADE, Reranker, IR Metrics, Bibliometrics, Systematic Review | β 100% |
| Phase 6 β New Agents & Sources | ClinicalTrials.gov, Europe PMC, ResearchAgent, SystematicReviewAgent, ClinicalMatchingAgent, PubMedBERT presets | β 100% |
| Phase 7 β Production | Full LLM, <2s p95 target, Docker prod, K8s | π‘ 10% |
| Metric | Target | Current |
|---|---|---|
| Unit tests | >85% coverage | 330+ passing (100% pass rate) |
| Query latency p95 | < 2s | ~1.8s (local FAISS, no LLM) |
| Cache hit rate | > 60% | TBD (Redis optional) |
| Retrieval Recall@10 | > 90% | ~91% (Hybrid+RRF) |
| Citation precision | > 95% | 99%+ (PMID regex + PubMed validation) |
| System uptime | > 99.5% | Target |
| Data sources | 4 | 6 (+ ClinicalTrials.gov + Europe PMC) |
| Agent count | 8 | 12 (+ Research + SystematicReview + ClinicalMatching + Citation) |
π Overall Progress: ββββββββββββββββββββββββ ~95%
π§ͺ Tests: 330+ passing (100% pass rate)
π Code: 18,000+ lines production code
π₯ Data support: 120K+ papers (6 academic sources)
π API: 10 REST endpoints + SSE streaming
π¨ UI: 12 AI agents real-time visualization
π©Ί Clinical: EEG pattern β ICD-10 diagnosis mapping
π Reviews: PRISMA-compliant systematic review automation
pip install -r requirements-dev.txt# All offline tests (fast, no network)
source .venv/bin/activate
python -m pytest tests/ -m "not integration and not slow" -v
# With coverage report
python -m pytest tests/ --cov=eeg_rag --cov-report=html
# Integration tests (requires network)
python -m pytest tests/ -m integration -v
# New validation suite only
python -m pytest tests/test_search_validation.py \
tests/test_paper_authenticity.py \
tests/test_source_health.py -vPress Ctrl+C to stop the test runner at any time.
black src/ tests/ # Auto-format (88 char line length)
pylint src/eeg_rag # Lint check
mypy src/eeg_rag # Static type checking- Style: PEP 8 + Black (88 char limit)
- Type hints: All function signatures must be annotated
- Docstrings: Google-style with Args / Returns / Raises
- Testing: β₯85% coverage for
core/andagents/; 100% forverification/ - NASA-grade headers: All modules in
agents/,retrieval/,verification/carry structured ID/Requirement/Purpose/Rationale/Constraints/Failure-Modes headers
The README's operational claims are backed by an automated audit rather than manual proofreading alone. The audit checks the shipped corpus metadata for internal consistency, runs a hybrid retrieval smoke test to ensure ranked results still preserve metadata, and separates benchmark-only promises from locally verifiable facts.
Run the verifier with:
pytest tests/test_project_claims.py -qπ Contribution Workflow β Click to Expand
- Fork the repository on GitHub
- Clone your fork:
git clone https://github.com/YOUR_USERNAME/eeg-rag.git - Branch:
git checkout -b feature/your-feature-name - Implement with tests: ensure
pytest tests/passes - Format: run
black src/ tests/ - Commit:
git commit -m "feat: add XYZ with tests" - Push:
git push origin feature/your-feature-name - Pull Request: open a PR against
mainwith a description of changes
- All existing tests must pass
- New features need β₯85% coverage
- Type hints on all new functions
- Google-style docstring on all new public functions
- Update
CHANGELOG.mdentry
Open a GitHub Issue with:
- Python version and OS
- Steps to reproduce
- Full error traceback
- Expected vs. actual behavior
The package is now properly installable via pip:
pip install eeg-rag # core
pip install "eeg-rag[api]" # + FastAPI server
pip install "eeg-rag[knowledge-graph]" # + Neo4j + Redis
pip install "eeg-rag[full]" # everything
pip install git+https://github.com/hkevin01/eeg-rag.git # latest from GitHubFixed issues that previously prevented installable builds:
packages = ["eeg_rag"]was replaced withfinddiscovery β all 50 subpackages now bundle correctly- Entry point pointed to non-existent
eeg_rag.cli.query:main; fixed toeeg_rag.cli.main:main - Added missing core dependencies:
httpx,rank-bm25,anyio - Added
pytest-asynciotodevextras - Added
apiextra (fastapi,uvicorn,sse-starlette) separate fromknowledge-graph - Version bumped to
0.4.1
Three CLI commands are installed automatically:
| Command | Description |
|---|---|
eeg-rag |
Main query interface, health check, stats |
eeg-rag-history |
Browse and replay search history |
eeg-rag-stats |
Detailed corpus + system stats dashboard |
Injected safety-critical structured comment blocks above every class and function across all 178 source files (2,359 blocks total). Each block documents: ID, Requirement, Purpose, Rationale, Inputs, Outputs, Pre/Postconditions, Assumptions, Side Effects, Failure Modes, Error Handling, Constraints, Verification, and References.
All three Mermaid diagrams in the Architecture section were broken on GitHub. Fixed:
\nescape sequences in quoted node labels β<br/>HTML line breaksΒ·middle-dot characters and&in node names that crashed the Mermaid parser&-chained multi-target edges (A --> B & C) β individual edge lines- Special characters (
{,",<,>) in sequence diagram message text
Replaced the single-shot retrieve-then-answer pattern with a full agentic loop (src/eeg_rag/rag/agentic_rag.py).
The orchestrator now decides whether to retrieve at all, reformulates the query when the first results are insufficient, can issue multiple retrieval passes across different sources, and verifies retrieved content before generating a response.
Key pieces:
RetrievalDecisionMakerβ classifies every incoming query (RETRIEVE,SKIP,VERIFY_CLAIM,DECOMPOSE) using EEG-specific regex patterns so definitional and abbreviation questions are answered from model knowledge without wasting a retrieval round-trip.SufficiencyEvaluatorβ scores the retrieved document set on count, relevance, and semantic coverage; returns aSufficiencyStatusthat drives the retry decision.QueryReformulatorβ selects a non-repeating reformulation strategy (EXPAND,PIVOT_DENSE,PIVOT_BM25,RELAX,NARROW,DECOMPOSE) and rewrites the query via the existingEEGQueryExpander.AgenticRAGOrchestratorβ ties everything together; exposes bothrun()(returnsAgenticRAGResult) andstream()(async token generator).
Added a RAGAS-style automated evaluation framework (src/eeg_rag/evaluation/ragas_metrics.py) that measures four orthogonal quality axes:
| Metric | What it measures |
|---|---|
| Faithfulness | Fraction of answer claims supported by retrieved context (hallucination score) |
| Answer Relevance | Semantic similarity between the query and the answer |
| Context Precision | Average precision of the retrieved chunk ranking |
| Context Recall | Coverage of ground-truth documents or sentences |
Two evaluation modes: EMBEDDING (offline, no API key needed) and LLM (GPT-4 / Claude / Ollama as judge). AUTO mode selects LLM when an API key is present and falls back to embedding silently. export_for_human_eval() produces annotation-ready HumanEvalRecord dicts.
Several previously placeholder components now have real implementations:
- Orchestrator adaptive replanning β when a dependency chain is broken by a failed agent node, the orchestrator drops the blocking dependency and retries the stalled node rather than aborting the whole plan.
- Review extractor LLM backend (
review/extractor.py) β calls Ollama, OpenAI, or Anthropic depending onllm_backend; handles JSON-fenced responses and type coercion. Auto-fetches from PubMed when nopaperslist is supplied. - Corpus builder PubMed fetching (
rag/corpus_builder.py) β full NCBI E-utilities implementation with batch size, rate-limiting, exponential back-off, and PMID deduplication. - Citation validator production DB β new
PubMedValidationDatabasevalidates PMIDs (retraction detection via MeSH) and DOIs via CrossRef; activated byuse_mock=False. - Graph agent Neo4j driver β
use_mock=Falsewires up a realneo4j.AsyncDriverconnection. - MCP agent real client β
use_mock=Falseconnects to an MCP server via HTTP transport.
tests/test_agentic_rag.pyβ 49 teststests/test_ragas_metrics.pyβ 66 tests- Total suite: 1,386 passing
License: MIT β see LICENSE for full terms. You are free to use, modify, and distribute this software for research and commercial purposes with attribution.
| Resource | Contribution |
|---|---|
| Microsoft Research | PubMedBERT β biomedical embeddings pre-trained on 14M PubMed abstracts |
| Facebook AI Research | FAISS β billion-scale vector similarity search |
| NCBI / NIH | PubMed E-utilities API β unrestricted access to 35M+ citations |
| Semantic Scholar (AI2) | Citation graph API β influence scores and citation networks |
| EEG Research Community | Domain expertise, test corpora, and validation of terminology |
| Cormack et al. 2009 | Reciprocal Rank Fusion algorithm underlying hybrid retrieval |
| Wang et al. 2025 | EEG-MedRAG methodology β hypergraph retrieval for clinical EEG |
Built with β€οΈ for the EEG research community.
GitHub Β· Issues Β· Discussions Β· API Docs
Footnotes
-
F. Neha et al., "Retrieval-Augmented Generation (RAG) in Healthcare," AI, 2025. https://www.mdpi.com/2673-2688/6/9/226 β© β©2 β©3
-
Y. Wang et al., "EEG-MedRAG: Enhancing EEG-based Clinical Decision-Making via Hierarchical Hypergraph Retrieval-Augmented Generation," arXiv:2508.13735, 2025. https://arxiv.org/abs/2508.13735 β© β©2