🧠 ALS-RAG

Retrieval-Augmented Generation for ALS Research Literature — evidence-grounded answers from PubMed, Semantic Scholar, arXiv, ClinicalTrials.gov, and Europe PMC.

Overview

ALS-RAG is a domain-specialised Retrieval-Augmented Generation (RAG) system for Amyotrophic Lateral Sclerosis (ALS) research. It aggregates scientific literature from five sources (PubMed, Semantic Scholar, arXiv, ClinicalTrials.gov, and Europe PMC), encodes it into a FAISS vector index, retrieves with hybrid dense + BM25 scoring, and serves evidence-grounded answers through a Streamlit multi-page UI or a single-command CLI.

The problem it solves: ALS researchers and clinicians need fast, citation-backed synthesis of a rapidly growing literature spanning genetics, biomarkers, clinical trials, and emerging therapeutics. General-purpose LLMs hallucinate domain-specific facts; ALS-RAG grounds its answers in indexed papers, and v2 adds a CitationVerificationAgent that flags sentences in a generated answer that cannot be traced back to a retrieved source.

Audience: ALS clinical researchers, neurologists, PhD students studying motor neuron disease, and bioinformatics teams building systematic review pipelines.

Important

ALS-RAG is a research tool. It is not a clinical decision support system and must not be used for patient diagnosis or treatment without independent clinical validation.


What's New in v2

Summary of Upgrades

#	Upgrade	Why Added	What It Adds	How to Use
1	Real BM25 hybrid retrieval (rank-bm25)	The previous keyword weight was a stub: retrieval was 100% dense despite the 0.3 keyword weight in config. Added to complete hybrid scoring and boost recall for exact ALS terminology (gene names, trial IDs, drug names).	Integrates BM25Okapi from rank_bm25. Lazy-builds a BM25 index from the metadata JSON at first query. Fuses with dense cosine: score = 0.7×dense + 0.3×bm25_normalised.	Automatic; no flags needed. Works on any als-rag query once the corpus is indexed.
2	ClinicalTrials.gov API v2	ALS clinical trial eligibility, outcomes, and interventions are not in PubMed abstracts. Added to give the corpus structured trial data for intervention and eligibility queries.	New ClinicalTrialsClient: fetches up to 200 ALS trials with NCT ID, phase, sponsor, eligibility criteria, primary outcomes, and intervention names. Cursor-paginated. No API key needed.	make ingest (included by default) or make ingest-all
3	Europe PMC REST API	40M+ biomedical articles, including content not in PubMed such as preprints, patents, and European clinical guidelines. 12 ALS-targeted queries sorted by citation count.	New EuropePMCClient with 12 domain queries covering TDP-43, SOD1, C9orf72, NfL, ALSFRS-R, ALS-FTD, neuroinflammation, stem cells, gene therapy, and survival analysis. Cursor-mark pagination.	make ingest (included by default)
4	PubMedBERT embedding support	all-MiniLM-L6-v2 is domain-agnostic. neuml/pubmedbert-base-embeddings (768d) is fine-tuned on 29M PubMed abstracts and scores significantly higher on biomedical semantic similarity benchmarks (BIOSSES, MedSTS).	EMBEDDING_MODEL and EMBEDDING_DIM are now env-var configurable. Switching to PubMedBERT requires one env change and a corpus rebuild.	EMBEDDING_MODEL=neuml/pubmedbert-base-embeddings EMBEDDING_DIM=768 make clean ingest
5	CitationVerificationAgent	LLMs can hallucinate facts not present in retrieved context. Added to provide an interpretable audit trail showing which sentences in a generated answer are backed by which source passages, and flagging those that are not.	Offline lexical overlap check (no extra API call). Splits the answer into sentences, computes domain-weighted claim-coverage overlap against all source chunks, returns per-claim verdicts, an overall coverage score (0–1), and a list of unsupported claims.	als-rag "SOD1 prognosis" --verify or make verify Q="SOD1 prognosis"
6	ResearchAgent	Previously the CLI called the retriever and generator directly, with no unified interface for external code. Added to provide a single callable that runs the full RAG pipeline and returns a structured result.	ResearchAgent.ask(query) returns a ResearchResult dataclass with .answer, .sources, .entities, .expanded_queries, and .format_citation_list().	from als_rag.agents import ResearchAgent
7	IngestionAgent	Multi-source ingestion previously required manual client calls with no progress reporting or error isolation per source.	Orchestrates all 5 sources, deduplicates by title prefix, calls the ingestion pipeline once, returns an IngestionReport with per-source article counts and total chunks indexed. Accepts an on_progress callback.	make ingest or --sources pubmed,clinicaltrials
8	ClinicalMatchingAgent	Clinical record → literature query conversion needed a unified agent to handle EMG/FVC/ALSFRS-R input, phenotype classification, longitudinal progression calculation, and retrieval in one call.	ClinicalMatchingAgent.match(clinical_record) converts a structured dict (ALSFRS-R, FVC, genetics, EMG, onset) into a phenotype-specific retrieval query, runs hybrid retrieval, and generates a clinical evidence summary. Computes ALSFRS-R progression rate from longitudinal series if provided.	from als_rag.agents import ClinicalMatchingAgent
9	SystematicReviewAgent	Single-query answering misses the breadth needed for systematic literature synthesis. Six structured sub-angles (epidemiology, pathophysiology, clinical trials, biomarkers, treatment, genetics) are needed per topic.	Runs 6 sub-queries per topic, merges and deduplicates up to 20 sources, tallies NER entity type counts, and synthesises a structured evidence summary with Background / Evidence Summary / Evidence Gaps / Clinical Implications sections via a custom generation prompt.	als-rag --review "tofersen SOD1 ALS" or make review T="tofersen SOD1"

`.env` New Variables

# PubMedBERT embedding switch (optional — rebuild corpus after changing)
EMBEDDING_MODEL=neuml/pubmedbert-base-embeddings
EMBEDDING_DIM=768

Warning

Changing EMBEDDING_MODEL requires make clean && make ingest to rebuild the FAISS index with the correct dimensionality. Mixing embedding models in one index causes incorrect similarity scores.
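
As an illustration of why the two variables must agree, the following is a minimal sketch of a start-up check that rejects a mismatched model/dimension pair before any index is opened. The function name `load_embedding_config` and the `KNOWN_DIMS` lookup are hypothetical and are not taken from the ALS-RAG codebase; only the variable names and the two model dimensions come from this README.

```python
import os

# Hypothetical lookup of output sizes for the two models named in this README.
KNOWN_DIMS = {
    "sentence-transformers/all-MiniLM-L6-v2": 384,
    "neuml/pubmedbert-base-embeddings": 768,
}


def load_embedding_config(env=None):
    """Read EMBEDDING_MODEL / EMBEDDING_DIM and reject inconsistent pairs."""
    env = os.environ if env is None else env
    model = env.get("EMBEDDING_MODEL", "sentence-transformers/all-MiniLM-L6-v2")
    dim = int(env.get("EMBEDDING_DIM", "384"))
    expected = KNOWN_DIMS.get(model)
    # Unknown models are passed through; only known mismatches are rejected.
    if expected is not None and expected != dim:
        raise ValueError(
            f"EMBEDDING_DIM={dim} does not match {model} ({expected}d)"
        )
    return model, dim
```

A check of this kind fails fast on configuration errors instead of letting a 768-dimensional query vector reach a 384-dimensional index.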


Agents

ALS-RAG v2 includes five autonomous agents that each orchestrate a distinct research workflow. All agents lazy-load their dependencies (retriever, generator, NER, expander) on first use.

Agent	Class	Focus	How It Works	Key Output
ResearchAgent	als_rag.agents.ResearchAgent	End-to-end Q&A with source attribution	Expands query → hybrid retrieval → NER annotation of top-3 sources → GPT-4o-mini generation with ALS expert prompt	ResearchResult with .answer, .sources, .entities, .expanded_queries, .format_citation_list()
IngestionAgent	als_rag.agents.IngestionAgent	Multi-source corpus refresh	Fetches from all 5 sources sequentially, catches per-source errors independently, deduplicates by title prefix (80 chars), calls ALSIngestionPipeline.ingest() once with all unique articles	IngestionReport with per-source article counts, total chunks indexed, and error list
ClinicalMatchingAgent	als_rag.agents.ClinicalMatchingAgent	Case-based literature from clinical records	Extracts features via ALSFeatureExtractor, classifies onset phenotype (bulbar / limb / respiratory / generalised), optionally computes ALSFRS-R progression rate from longitudinal series, forms a rich retrieval query, runs hybrid retrieval, generates clinical evidence summary	ClinicalMatchResult with .onset_phenotype, .progression_rate, .features_description, .sources, .answer, .summary()
SystematicReviewAgent	als_rag.agents.SystematicReviewAgent	Structured evidence synthesis	Runs 6 sub-queries per topic (epidemiology, pathophysiology, clinical trials, biomarkers, treatment, genetics), merges results by max-score dedup (cap 20 sources), counts NER entity types across all sources, generates a 4-section structured synthesis	SystematicReviewResult with .synthesis, .sources, .entity_counts, .sub_queries_run, .format_entity_summary(), .format_source_table()
CitationVerificationAgent	als_rag.agents.CitationVerificationAgent	Hallucination detection / citation audit	Splits answer into sentences, tokenises claim words (stop words removed), computes claim-coverage overlap against every source chunk, marks each sentence supported/unsupported, aggregates coverage score	CitationVerificationResult with .coverage_score, .claims, .unsupported, .flagged, .report()
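
The title-prefix deduplication described for IngestionAgent can be sketched roughly as follows. This is an assumption-laden illustration, not the project's code: the function name `dedupe_by_title_prefix`, the dict-based article shape, and the lowercase/whitespace normalisation are all hypothetical; only the 80-character prefix comes from the table above.

```python
def dedupe_by_title_prefix(articles, prefix_len=80):
    """Keep the first article seen for each normalised title prefix."""
    seen = set()
    unique = []
    for article in articles:
        # Assumed normalisation: lowercase and collapse runs of whitespace.
        key = " ".join(article.get("title", "").lower().split())[:prefix_len]
        if key and key in seen:
            continue  # same paper already collected from an earlier source
        if key:
            seen.add(key)
        unique.append(article)  # untitled records are kept, never merged
    return unique
```

Because the first occurrence wins, the order in which sources are fetched decides which copy of a duplicated paper survives.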

Agent Quick Reference

from als_rag.agents import (
    ResearchAgent,
    IngestionAgent,
    ClinicalMatchingAgent,
    SystematicReviewAgent,
    CitationVerificationAgent,
)

# --- 1. Research Q&A ---
agent = ResearchAgent()
result = agent.ask("What is the prognostic value of NfL in ALS?")
print(result.answer)
print(result.format_citation_list())

# --- 2. Ingest corpus ---
ingestor = IngestionAgent(sources=["pubmed", "clinicaltrials", "europepmc"])
report = ingestor.run(on_progress=print)
print(report.summary())

# --- 3. Clinical matching ---
clinical_agent = ClinicalMatchingAgent()
match = clinical_agent.match({
    "alsfrs_r_total": 36,
    "fvc_percent_predicted": 68,
    "c9orf72_repeat": True,
    "denervation_regions": ["bulbar", "cervical"],
    "alsfrs_r_series": [48, 44, 40, 36],
    "alsfrs_r_times_months": [0, 1, 2, 3],
})
print(match.summary())
print(match.answer)

# --- 4. Systematic review ---
reviewer = SystematicReviewAgent()
review = reviewer.review("tofersen SOD1 ALS antisense")
print(review.synthesis)
print(review.format_entity_summary())

# --- 5. Citation verification ---
verifier = CitationVerificationAgent()
vresult = verifier.verify(result.answer, result.sources)
print(vresult.report())

# Or combine directly with research result:
vresult = verifier.verify_from_research_result(result)
print(f"Coverage: {vresult.coverage_score:.0%}  Flagged: {vresult.flagged}")
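
For readers curious how SystematicReviewAgent's "max-score dedup (cap 20 sources)" step might work, here is a minimal sketch. The function name `merge_by_max_score` and the `chunk_id` / `score` field names are hypothetical; the real result objects may be shaped differently.

```python
def merge_by_max_score(result_lists, cap=20, key="chunk_id"):
    """Merge several ranked result lists, keeping each source's best score."""
    best = {}
    for results in result_lists:
        for hit in results:
            current = best.get(hit[key])
            # Keep whichever sub-query scored this source highest.
            if current is None or hit["score"] > current["score"]:
                best[hit[key]] = hit
    ranked = sorted(best.values(), key=lambda h: h["score"], reverse=True)
    return ranked[:cap]
```

Taking the maximum rather than the sum avoids favouring sources that merely appear under many sub-queries with mediocre scores.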


Key Features

Icon	Feature	Description	Status
📥	5-source ingestion	PubMed (18+ queries), Semantic Scholar (10), arXiv (7), ClinicalTrials.gov (200 trials), Europe PMC (12 queries)	✅ Stable
🔍	Real BM25 hybrid retrieval	BM25Okapi fused with dense cosine: 0.7×dense + 0.3×bm25 (no longer a stub)	✅ Stable
🤖	5 autonomous agents	Research, Ingestion, ClinicalMatching, SystematicReview, CitationVerification	✅ Stable
✅	Citation verification	Sentence-level hallucination detection with claim-coverage scores, no extra API call	✅ Stable
🔬	Domain NER	Gene, biomarker, drug, scale, subtype + numeric measurement extraction	✅ Stable
🧬	Clinical signal matching	ALSFRS-R, FVC, EMG, onset phenotype, progression rate → literature queries	✅ Stable
📖	Systematic review synthesis	6-angle sub-query retrieval with 4-section structured evidence synthesis	✅ Stable
🖥️	Streamlit web UI	Search, Corpus Stats, Clinical Features, About pages	✅ Stable
⌨️	CLI	--ingest, --review, --verify, --sources, --top-k, --no-generate	✅ Stable
🔠	PubMedBERT support	Switch to 768d domain-tuned embeddings via two env vars	⚙️ Configurable


Architecture

System Architecture

flowchart TD
    subgraph Sources["📚 Literature Sources (v2: 5 total)"]
        PM[PubMed\n18+ queries]
        SS[Semantic Scholar\n10 queries]
        AX[arXiv\n7 queries]
        CT[ClinicalTrials.gov\nAPI v2 · 200 trials]
        EP[Europe PMC\n12 ALS queries]
    end

    subgraph Ingestion["⚙️ IngestionAgent"]
        FETCH[Fetch & Deduplicate]
        CHUNK[Chunk 512w / 64 overlap]
        NER[ALS NER Extractor\nGENE · BIOMARKER · TREATMENT]
        EMBED[SentenceTransformer\nEmbedding]
        NORM[L2 Normalize]
    end

    subgraph Storage["💾 Storage"]
        FAISS[(FAISS IndexFlatIP)]
        META[(JSON Metadata)]
    end

    subgraph Agents["🤖 Agent Layer"]
        RA[ResearchAgent\nQ&A Pipeline]
        CMA[ClinicalMatchingAgent\nCase-based Retrieval]
        SRA[SystematicReviewAgent\n6-angle Synthesis]
        CVA[CitationVerificationAgent\nHallucination Guard]
    end

    subgraph Retrieval["🔍 HybridRetriever"]
        QE[ALSQueryExpander\nSynonym expansion]
        DENSE[Dense Cosine\nFAISS top-k]
        BM25[BM25Okapi\nLexical top-k]
        FUSE[Score Fusion\n0.7×dense + 0.3×bm25]
    end

    subgraph Generation["💬 ALSGenerator"]
        GPT[GPT-4o-mini\nALS Expert Prompt]
    end

    subgraph UI["🖥️ Interfaces"]
        WEB[Streamlit UI\n4 pages]
        CLI[CLI · als-rag]
    end

    PM & SS & AX & CT & EP --> FETCH --> CHUNK --> NER
    CHUNK --> EMBED --> NORM --> FAISS
    NER --> META

    CLI & WEB --> RA & CMA & SRA
    RA & CMA & SRA --> QE --> DENSE & BM25
    FAISS --> DENSE
    META --> BM25
    DENSE & BM25 --> FUSE --> GPT --> CVA
    CVA --> CLI & WEB

Data flow: Literature is fetched from five academic sources, chunked into 512-word passages, embedded, L2-normalised, and stored in FAISS. At query time agents expand the query, run dual dense+BM25 retrieval, fuse scores, and pass context to GPT-4o-mini. The CitationVerificationAgent can then audit any generated answer offline.
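
The 512-word / 64-word-overlap chunking step can be sketched as a sliding window. This is a minimal illustration that assumes whitespace tokenisation; the function name `chunk_words` is hypothetical and the actual pipeline may split text differently.

```python
def chunk_words(text, chunk_size=512, overlap=64):
    """Split text into word windows that share `overlap` words with the next."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # each window starts this many words later
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of indexing some words twice.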


Usage Flow

Query + Verification Sequence

sequenceDiagram
    actor User
    participant CLI
    participant RA as ResearchAgent
    participant HR as HybridRetriever
    participant FAISS
    participant BM25 as BM25Okapi
    participant GEN as ALSGenerator
    participant CVA as CitationVerificationAgent

    User->>CLI: als-rag "NfL prognosis ALS" --verify
    CLI->>RA: ask("NfL prognosis ALS")
    RA->>HR: retrieve(expanded_queries, top_k=8)
    HR->>FAISS: dense cosine search
    HR->>BM25: keyword score lookup
    HR-->>RA: fused ranked sources
    RA->>GEN: generate(query, sources)
    GEN-->>RA: answer text
    RA-->>CLI: ResearchResult
    CLI->>CVA: verify(answer, sources)
    CVA-->>CLI: CitationVerificationResult
    CLI-->>User: Answer + Sources + Entity list + Citation Coverage report


ALS Domain Coverage

Literature Source Distribution (v2)

pie title ALS Corpus Source Distribution
    "PubMed (18+ queries)" : 35
    "Europe PMC (12 queries)" : 25
    "Semantic Scholar (10 queries)" : 20
    "ClinicalTrials.gov (trials)" : 12
    "arXiv (7 queries)" : 8

Category	Examples	Entity Type
Genes	SOD1, C9orf72, FUS, TARDBP, TBK1, NEK1, UBQLN2, VCP, OPTN	GENE
Biomarkers	Neurofilament light (NfL), TDP-43, phospho-NfH, YKL-40, CK, IL-6	BIOMARKER
Clinical scales	ALSFRS-R, FVC, King's staging, MiToS, El Escorial, Awaji criteria	CLINICAL_SCALE
Treatments	Riluzole, Edaravone, Tofersen/BIIB067, AMX0035, ASOs, gene therapy	TREATMENT
Phenotypes	Bulbar onset, limb onset, ALS-FTD, PLS, PMA, familial ALS	ALS_SUBTYPE
Trial data	NCT IDs, eligibility criteria, primary outcomes, phases	(from ClinicalTrials.gov)


Technology Stack

Technology	Purpose	Why Chosen
sentence-transformers/all-MiniLM-L6-v2	Dense embedding (384d, default)	Fast, strong general semantic similarity
neuml/pubmedbert-base-embeddings	Dense embedding (768d, optional)	Fine-tuned on 29M PubMed abstracts; higher biomedical similarity scores
FAISS IndexFlatIP	Exact cosine nearest-neighbor search	Zero approximation error on L2-normalised vectors
rank-bm25 (BM25Okapi)	Lexical keyword retrieval	Completes hybrid scoring; excellent at exact ALS terms (gene names, trial IDs)
OpenAI GPT-4o-mini	Evidence-grounded generation	Cost-effective, strong instruction following
ClinicalTrials.gov REST API v2	ALS trial ingestion	Structured eligibility, outcomes, and intervention data unavailable in PubMed
Europe PMC REST API	Broad biomedical literature	40M+ articles, preprints, patents; sorted by citation count
Streamlit	Multi-page web UI	Rapid prototyping, minimal frontend code
PubMed E-utilities	Primary literature source	Gold-standard biomedical citations
Semantic Scholar API	Academic coverage	Citation graph enrichment, open access
arXiv REST API	Preprint source	Latest research, open access
pytest + pytest-mock	Testing (57 tests)	Industry-standard with rich mock ecosystem
uv	Package management	10–100× faster than pip, reproducible installs
ruff	Linting	Extremely fast, replaces flake8 + isort
mypy	Type checking	Static analysis on all source types


Setup & Installation

Prerequisites

Python 3.9 or higher
uv package manager (pip install uv)
OpenAI API key (required for answer generation)
PubMed contact email (required by NCBI ToS)

Installation

# 1. Clone the repository
git clone https://github.com/hkevin01/als-rag.git
cd als-rag

# 2. Configure environment variables
cp .env.example .env
# Edit .env with your API keys

.env configuration:

# Required
OPENAI_API_KEY=sk-...
CONTACT_EMAIL=you@domain.com

# Optional — raises rate limits
PUBMED_API_KEY=
SEMANTIC_SCHOLAR_API_KEY=

# Optional — override model
OPENAI_MODEL=gpt-4o-mini

# Optional — PubMedBERT domain embeddings (requires corpus rebuild)
# EMBEDDING_MODEL=neuml/pubmedbert-base-embeddings
# EMBEDDING_DIM=768

# 3. Install all dependencies
make install

# 4. Ingest ALS literature corpus (all 5 sources, ~10 minutes)
make ingest

# 5. Launch the Streamlit web UI
make run
# → Opens at http://localhost:8501

Tip

Run make ingest once to bootstrap the FAISS index. Subsequent queries are sub-second — the index persists at data/embeddings/als_faiss.index.


Usage

CLI Reference

# Standard research query
als-rag "What is the efficacy of tofersen in SOD1-ALS?"

# With citation verification — flags any unsupported LLM sentences
als-rag "C9orf72 repeat expansion cognitive decline" --verify

# Retrieval only (no OpenAI call)
als-rag "AMX0035 clinical trial results" --no-generate

# Systematic review — runs 6 sub-queries and synthesises
als-rag --review "neurofilament light chain ALS prognosis"

# Ingest from all 5 sources
als-rag --ingest

# Ingest from specific sources only
als-rag --ingest --sources pubmed,clinicaltrials,europepmc

# Ingest then query
als-rag --ingest "SOD1 antisense oligonucleotide survival"

# Increase retrieved context window
als-rag "ALSFRS-R domain scoring" --top-k 15

# Verbose mode — shows embedding, FAISS, and BM25 debug logs
als-rag "TDP-43 aggregation pathway" --verbose

Makefile Reference

make install          # Install package in editable mode with dev extras
make ingest           # Ingest from all 5 sources
make ingest-all       # Explicit alias for make ingest
make run              # Launch Streamlit UI on port 8501
make query Q="..."    # CLI research query
make review T="..."   # Systematic mini-review on topic T
make verify Q="..."   # Research query + citation verification
make test             # Run full pytest suite (57 tests)
make lint             # Ruff linting on src/ and tests/
make format           # Black auto-formatting
make typecheck        # mypy static type checking
make clean            # Delete FAISS index + metadata (forces re-ingest)

Web UI

Launch with make run and navigate using the sidebar:

Page	Description
Search	Enter a research question and get an AI-generated answer with source citations
Corpus Stats	Indexed document counts, source breakdown, year distribution
Clinical Features	Input ALSFRS-R, FVC, onset, genetics → case-matched literature retrieval
About	System information, model details, data sources

Project Structure

als-rag/
├── src/als_rag/
│   ├── agents/                     # Five autonomous agents
│   │   ├── research_agent.py       # ResearchAgent — full Q&A pipeline
│   │   ├── ingestion_agent.py      # IngestionAgent — 5-source corpus refresh
│   │   ├── clinical_agent.py       # ClinicalMatchingAgent — case-based retrieval
│   │   ├── review_agent.py         # SystematicReviewAgent — 6-angle synthesis
│   │   └── citation_agent.py       # CitationVerificationAgent — hallucination guard
│   ├── ingestion/                  # Literature source clients
│   │   ├── pubmed_client.py        # PubMed E-utilities
│   │   ├── scholar_client.py       # Semantic Scholar API
│   │   ├── arxiv_client.py         # arXiv REST API
│   │   ├── clinicaltrials_client.py  # ClinicalTrials.gov API v2  [NEW]
│   │   ├── europepmc_client.py     # Europe PMC REST API          [NEW]
│   │   ├── ner_extractor.py        # ALS domain NER
│   │   └── pipeline.py             # Chunk → embed → index pipeline
│   ├── retrieval/
│   │   ├── hybrid_retriever.py     # BM25Okapi + dense fusion     [UPGRADED]
│   │   ├── dense_retriever.py      # FAISS cosine retrieval
│   │   └── query_expander.py       # ALS synonym expansion
│   ├── generation/
│   │   └── generator.py            # GPT-4o-mini with ALS system prompt
│   ├── signals/
│   │   └── als_matcher.py          # ALSFeatureExtractor + phenotype classifier
│   ├── storage/
│   │   └── vector_db.py            # FAISS IndexFlatIP wrapper
│   ├── utils/
│   │   └── config.py               # Config with env-var embedding switch [UPGRADED]
│   └── web_ui/                     # Streamlit app + 4 pages
├── tests/
│   ├── test_agents.py              # 42 agent + BM25 + client tests [NEW]
│   ├── test_als_matcher.py
│   ├── test_ner_als.py
│   └── test_query_expander.py
├── Makefile
└── pyproject.toml


Core Capabilities

📥 Ingestion Pipeline (5 Sources)

IngestionAgent coordinates all five sources:

PubMed — 18+ targeted ALS queries via NCBI E-utilities; free with email, 10× higher rate limit with API key
Semantic Scholar — 10 ALS queries covering genetics, biomarkers, and trials; citation-enriched metadata
arXiv — 7 queries for ALS preprints (machine learning drug discovery, structural bioinformatics)
ClinicalTrials.gov — Up to 200 ALS trials fetched from the v2 REST API; provides eligibility criteria, primary outcomes, interventions, and phase data not available in PubMed
Europe PMC — 12 domain-specific queries sorted by citation count; covers preprints, patents, and European clinical data
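
As a rough illustration of the cursor pagination used for ClinicalTrials.gov, the sketch below loops over pages until the 200-trial cap is reached. It is not the ClinicalTrialsClient implementation: the function name `fetch_all_trials` and the injected `fetch_page` callable are hypothetical, and the parameter and field names (`query.cond`, `pageSize`, `pageToken`, `nextPageToken`, `studies`) are assumed from the public v2 API documentation and should be checked against the official reference.

```python
def fetch_all_trials(fetch_page, condition="amyotrophic lateral sclerosis",
                     limit=200, page_size=50):
    """Collect up to `limit` studies by following the next-page cursor.

    `fetch_page(params)` is a caller-supplied function (hypothetical) that
    performs the HTTP GET and returns the decoded JSON page as a dict.
    """
    studies, token = [], None
    while len(studies) < limit:
        params = {
            "query.cond": condition,
            "pageSize": min(page_size, limit - len(studies)),
        }
        if token:
            params["pageToken"] = token  # resume where the last page ended
        page = fetch_page(params)
        studies.extend(page.get("studies", []))
        token = page.get("nextPageToken")
        if not token:
            break  # no cursor returned: this was the last page
    return studies[:limit]
```

Injecting `fetch_page` keeps the pagination logic testable without network access, which matches the README's advice to mock external API calls.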

Note

Run make clean && make ingest to rebuild from scratch. Re-running make ingest without clean will add new articles incrementally (MD5 dedup by PMID/DOI/title).

🔍 Hybrid Retrieval — Real BM25 (v2 Upgrade)

HybridRetriever now implements full BM25Okapi scoring:

hybrid_score = 0.7 × cosine_similarity + 0.3 × bm25_normalised

Dense path: sentence-transformers embedding → FAISS IndexFlatIP cosine search
BM25 path: BM25Okapi (rank-bm25) lazy-built from metadata JSON on first query; tokenises chunk_text + title; normalises scores to 0–1 before fusion
Why it matters: BM25 rewards exact matches on ALS-specific identifiers (e.g., NCT03070951, BIIB067, AMX0035), which dense models can score almost the same as loosely related paraphrases
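
The fusion formula above can be sketched in a few lines. This is an illustration only: the README states that BM25 scores are normalised to 0–1 but not how, so min-max normalisation is an assumption here, and the function names `min_max_normalise` and `fuse_scores` are hypothetical.

```python
def min_max_normalise(scores):
    """Rescale raw scores to the 0-1 range (assumed normalisation scheme)."""
    if not scores:
        return []
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]  # no lexical signal to separate candidates
    return [(s - lo) / (hi - lo) for s in scores]


def fuse_scores(dense, bm25_raw, dense_weight=0.7):
    """Weighted sum of cosine similarity and normalised BM25, per candidate."""
    if len(dense) != len(bm25_raw):
        raise ValueError("dense and BM25 score lists must align by candidate")
    bm25 = min_max_normalise(bm25_raw)
    keyword_weight = 1.0 - dense_weight
    return [dense_weight * d + keyword_weight * b for d, b in zip(dense, bm25)]
```

With dense scores `[0.8, 0.6, 0.4]` and raw BM25 scores `[0.0, 5.0, 10.0]`, the third candidate overtakes the first after fusion, which is the behaviour wanted when a query contains an exact trial ID or drug code.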

⚙️ Tuning hybrid weights

Override at instantiation:

from als_rag.retrieval.hybrid_retriever import HybridRetriever
# More lexical weight for gene/drug name precision
retriever = HybridRetriever(dense_weight=0.5)

✅ Citation Verification (New in v2)

CitationVerificationAgent provides an interpretable hallucination guard:

Splits the generated answer into sentences using sentence-boundary detection
Removes stop words from both claim and source token sets to emphasise ALS domain terms
Computes claim-coverage overlap: |claim_tokens ∩ source_tokens| / |claim_tokens|
Marks a claim as supported if its best source overlap ≥ threshold (default 0.12)
Flags the entire result if coverage_score < 0.50 (configurable)

verifier = CitationVerificationAgent(
    support_threshold=0.12,      # Per-claim overlap to count as "supported"
    coverage_flag_threshold=0.50 # Whole-answer flag if < 50% claims supported
)

Example output:

Citation Coverage: 83%  (PASS)
Claims checked: 6  |  Supported: 5  |  Unsupported: 1

  [ 1] ✅ Supported  [0.41] — Tofersen SOD1 ALS Trial 2022
       › Tofersen reduces SOD1 mRNA in patients with ALS.
  [ 2] ✅ Supported  [0.29] — VALOR trial results 2022
       › The VALOR trial showed a 55% NfL reduction over 28 weeks versus placebo.
  [ 6] ⚠️  Unsupported [0.04]
       › Tofersen is currently approved in 47 countries worldwide.

━ Unsupported claims (potential hallucinations) ━
  ⚠  Tofersen is currently approved in 47 countries worldwide.
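
The overlap check listed at the start of this section can be sketched as follows. This is a simplified, unweighted illustration rather than the agent's code: the stop-word list, the regex sentence splitter, and the function names are assumptions, and the domain weighting mentioned elsewhere in this README is omitted.

```python
import re

# Small illustrative stop-word list (assumed; the real list is likely longer).
STOP_WORDS = {"the", "a", "an", "of", "in", "is", "are", "with", "and", "to",
              "for", "on", "by", "was", "were", "that", "this", "it", "as", "at"}


def tokens(text):
    """Lowercased word set with stop words removed."""
    words = re.findall(r"[a-z0-9][a-z0-9\-]*", text.lower())
    return {w for w in words if w not in STOP_WORDS}


def split_sentences(answer):
    """Naive split on sentence-ending punctuation followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]


def verify(answer, source_chunks, support_threshold=0.12,
           coverage_flag_threshold=0.50):
    source_sets = [tokens(chunk) for chunk in source_chunks]
    claims = []
    for sentence in split_sentences(answer):
        claim = tokens(sentence)
        # |claim ∩ source| / |claim|, taking the best-matching source chunk.
        best = max((len(claim & s) / len(claim) for s in source_sets),
                   default=0.0) if claim else 0.0
        claims.append({"sentence": sentence, "overlap": best,
                       "supported": best >= support_threshold})
    supported = sum(c["supported"] for c in claims)
    coverage = supported / len(claims) if claims else 0.0
    return {
        "coverage_score": coverage,
        "claims": claims,
        "unsupported": [c["sentence"] for c in claims if not c["supported"]],
        "flagged": coverage < coverage_flag_threshold,
    }
```

A lexical check like this cannot tell whether a sentence is true, only whether its vocabulary appears in a retrieved chunk, so a low threshold will pass claims that share just a drug or gene name with the sources.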

🔬 Domain NER

ALSNERExtractor applies vocabulary-based rule matching across 5 entity categories and regex extraction for numeric clinical measurements.

📋 Full NER Entity Vocabulary

Genes (24): SOD1, C9orf72, FUS, TARDBP/TDP-43, UBQLN2, VCP, OPTN, TBK1, SQSTM1, HNRNPA1, HNRNPA2B1, MATR3, TUBA4A, NEK1, KIF5A, SETX, ALS2/ALSIN, DCTN1, CHMP2B, ANG, VEGF, NEFH, PRPH

Biomarkers (20+): NfL, pNfH, TDP-43, YKL-40, CK, uric acid, creatinine, IL-6, TNF-alpha, MCP-1, miR-206, miR-133, miR-9

Clinical Scales (12): ALSFRS-R, FVC, SVC, ATLIS, SNP, MRC scale, El Escorial, Awaji criteria, Gold Coast criteria, grip strength, King's staging, MiToS staging

Treatments (18+): Riluzole, Edaravone, Tofersen/BIIB067, AMX0035, sodium phenylbutyrate, TUDCA, rasagiline, mexiletine, baclofen, NIV/BiPAP, PEG, ASO, gene therapy, stem cell, iPSC

Subtypes (10): Bulbar onset, limb onset, flail arm, flail leg, PLS, PMA, ALS-FTD, familial ALS, sporadic ALS, juvenile ALS
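
Vocabulary matching plus regex measurement extraction might look roughly like the sketch below. It uses a deliberately tiny vocabulary drawn from the lists above; the names `VOCAB`, `MEASUREMENT`, `extract_entities`, and `extract_measurements` are hypothetical and do not reflect the ALSNERExtractor interface.

```python
import re

# Tiny illustrative subset of the vocabulary listed above.
VOCAB = {
    "GENE": ["SOD1", "C9orf72", "FUS", "TARDBP"],
    "BIOMARKER": ["NfL", "pNfH", "YKL-40"],
    "TREATMENT": ["Riluzole", "Edaravone", "Tofersen"],
}

# Scale name, up to 20 non-digit characters, then a number and optional "%".
MEASUREMENT = re.compile(
    r"\b(ALSFRS-R|FVC)\b\D{0,20}?(\d+(?:\.\d+)?)\s*(%)?", re.IGNORECASE
)


def extract_entities(text):
    """Return (label, term, offset) for each whole-word vocabulary match."""
    found = []
    for label, terms in VOCAB.items():
        for term in terms:
            # Lookarounds stop "FUS" from matching inside words like "confused".
            pattern = r"(?<![A-Za-z0-9])" + re.escape(term) + r"(?![A-Za-z0-9])"
            for m in re.finditer(pattern, text, flags=re.IGNORECASE):
                found.append((label, term, m.start()))
    return sorted(found, key=lambda e: e[2])


def extract_measurements(text):
    """Return (scale, value, unit) tuples for numeric clinical measurements."""
    return [(m.group(1).upper(), float(m.group(2)), m.group(3) or "")
            for m in MEASUREMENT.finditer(text)]
```

Rule-based matching of this kind is transparent and fast, but it only finds terms that are already in the vocabulary, which is why the contribution guidelines ask for citations when the lists are extended.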

🧬 Clinical Signal Integration

ClinicalMatchingAgent converts a structured clinical record into a phenotype-specific retrieval query, classifies onset using Awaji criteria, and optionally computes ALSFRS-R progression rate from longitudinal data.

🏥 Supported clinical input fields

Field	Type	Description
alsfrs_r_total	int	Total ALSFRS-R score (0–48)
alsfrs_r_slope	float	Points/month change (negative = decline)
alsfrs_r_series + alsfrs_r_times_months	list	Longitudinal series for rate calculation
fvc_percent_predicted	float	FVC % predicted
c9orf72_repeat	bool	C9orf72 hexanucleotide repeat expansion
denervation_regions	list	EMG regions: ["bulbar","cervical","thoracic","lumbar"]
cognitive_impairment	bool	Cognitive impairment flag for ALS-FTD query
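
One plausible way to derive a progression rate from `alsfrs_r_series` and `alsfrs_r_times_months` is an ordinary least-squares slope, sketched below. The README does not state which method ClinicalMatchingAgent uses, so the least-squares fit and the function name `alsfrs_r_slope_from_series` are assumptions.

```python
def alsfrs_r_slope_from_series(scores, times_months):
    """Least-squares slope in ALSFRS-R points per month (negative = decline)."""
    n = len(scores)
    if n != len(times_months) or n < 2:
        raise ValueError("need at least two aligned (time, score) pairs")
    mean_t = sum(times_months) / n
    mean_s = sum(scores) / n
    var_t = sum((t - mean_t) ** 2 for t in times_months)
    if var_t == 0:
        raise ValueError("all measurements share the same time point")
    cov = sum((t - mean_t) * (s - mean_s)
              for t, s in zip(times_months, scores))
    return cov / var_t
```

For the series used in the Agent Quick Reference, scores `[48, 44, 40, 36]` at months `[0, 1, 2, 3]`, this gives a slope of -4.0 points per month.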

🤖 Evidence-Grounded Generation

ALSGenerator uses GPT-4o-mini with an ALS expert system prompt covering genetics, biomarkers, clinical scales, treatments, phenotypes, and pathophysiology. The model is instructed to:

Cite source titles and years for all specific claims
State when context is insufficient rather than speculate
Never output advice for individual patient management

Warning

If OPENAI_API_KEY is not set, the system runs in retrieval-only mode (ranked sources, no generated answer).

⚙️ Advanced Configuration

Variable	Default	Description
OPENAI_MODEL	gpt-4o-mini	Override generation model
EMBEDDING_MODEL	sentence-transformers/all-MiniLM-L6-v2	Embedding model
EMBEDDING_DIM	384	Must match chosen model
Config.chunk_size	512	Words per chunk
Config.chunk_overlap	64	Overlap words between chunks
Config.default_top_k	10	Retrieved passages per query
HybridRetriever(dense_weight=)	0.7	Dense vs. BM25 balance
Config.openai_temperature	0.2	Lower = more factual
CitationVerificationAgent(support_threshold=)	0.12	Claim overlap minimum
CitationVerificationAgent(coverage_flag_threshold=)	0.50	Flag below this coverage


Roadmap

gantt
    title ALS-RAG Development Roadmap
    dateFormat  YYYY-MM-DD
    section Foundation
        Core RAG pipeline               :done,    f1, 2025-01-01, 2025-03-01
        PubMed / Scholar ingestion      :done,    f2, 2025-03-01, 2025-04-01
        FAISS vector store              :done,    f3, 2025-04-01, 2025-05-01
    section v1 Features
        Streamlit UI                    :done,    f4, 2025-05-01, 2025-06-01
        ALS NER + clinical signals      :done,    f5, 2025-06-01, 2025-08-01
        Hybrid retrieval (stub)         :done,    f6, 2025-08-01, 2025-09-01
    section v2 Upgrades
        Real BM25 hybrid (rank-bm25)    :done,    v1, 2026-01-01, 2026-02-01
        ClinicalTrials.gov + EuropePMC  :done,    v2, 2026-01-01, 2026-02-15
        5 autonomous agents             :done,    v3, 2026-02-01, 2026-03-15
        CitationVerificationAgent       :done,    v4, 2026-03-01, 2026-04-01
    section Enhancement
        Knowledge graph (Neo4j)         :active,  e1, 2026-04-01, 2026-09-01
        Full-text PDF ingestion         :         e2, 2026-05-01, 2026-08-01
        Streamlit review page           :         e3, 2026-04-01, 2026-06-01
    section Scale
        Redis caching layer             :         s1, 2026-08-01, 2026-11-01
        Multi-disease extension (MS/PD) :         s2, 2026-10-01, 2027-02-01


Development Status

Component	Version	Stability	Tests	Notes
IngestionAgent (5 sources)	0.2.0	Beta	✅ Unit	Per-source error isolation
ClinicalTrialsClient	0.2.0	Beta	✅ Unit (mocked)	No API key required
EuropePMCClient	0.2.0	Beta	✅ Unit (mocked)	12 ALS queries
HybridRetriever (BM25)	0.2.0	Beta	✅ Unit	Real BM25Okapi fusion
ResearchAgent	0.2.0	Beta	✅ Unit	Full pipeline in one call
ClinicalMatchingAgent	0.2.0	Beta	✅ Unit	Longitudinal progression rate
SystematicReviewAgent	0.2.0	Beta	✅ Unit	6 sub-queries per topic
CitationVerificationAgent	0.2.0	Beta	✅ Unit	Offline, no API call needed
ALS NER Extractor	0.1.0	Alpha	✅ Unit	Rule-based only
FAISS Vector DB	0.1.0	Alpha	Integration	Flat index
ALSGenerator	0.1.0	Alpha	Mocked	Requires OpenAI key
Streamlit UI	0.1.0	Alpha	Manual	Single-user
Total tests	-	-	57 passing	-

make test         # 57 tests across 4 test files
make typecheck    # mypy static analysis
make lint         # Ruff code quality


Contributing

Contributions are welcome! Please follow this workflow:

Fork the repository on GitHub
Create a feature branch: git checkout -b feature/your-feature-name
Commit using Conventional Commits: feat:, fix:, docs:, test:
Push and open a Pull Request against main

📐 Development Guidelines

Code style:

Formatting: black (line length 88)
Linting: ruff
Type hints: required on all new code; checked with mypy
Import order: isort (enforced via ruff)

Testing:

Add pytest tests for all new features under tests/
Mock external API calls (PubMed, ClinicalTrials.gov, OpenAI) using pytest-mock
All agents should have unit tests with generate=False to avoid OpenAI calls

Domain contributions:

Expanding ALS_GENES, ALS_BIOMARKERS, or ALS_SYNONYMS — cite source paper in PR description
New literature queries — document clinical rationale

# Full pre-commit check
make lint && make typecheck && make test


License & Acknowledgements

License: MIT — see LICENSE for full terms.

Architecture adapts patterns from eeg-rag.

Data sources:

NCBI PubMed — National Library of Medicine
Semantic Scholar — Allen Institute for AI
arXiv.org — Cornell University
ClinicalTrials.gov — U.S. National Library of Medicine
Europe PMC — European Bioinformatics Institute

Key dependencies: FAISS (Facebook AI Research), sentence-transformers (UKP Lab / Hugging Face), rank-bm25 (Dorian Brown), OpenAI Python SDK, Streamlit, PyTorch.

Note

This tool is for research use only. Clinical decisions must always involve qualified medical professionals. ALS literature may contain preliminary findings that have not been independently replicated.
