MedLabelIQ is a production-oriented, evidence-grounded medication question-answering system that combines:
- official DailyMed SPL drug-label evidence,
- RxNorm medication identity reasoning,
- source-aware orchestration,
- and mixed-source query decomposition + grounded synthesis.
The system answers medication questions only when its selected knowledge source directly supports the response. Otherwise, it returns a deterministic insufficient_evidence result.
MedLabelIQ was designed as a modern replacement for an earlier web-scraped medical QA prototype, addressing its core limitations:
- weak knowledge-source organization,
- unstructured document handling,
- shallow retrieval,
- unreliable chatbot responses,
- limited grounding,
- and lack of production observability.
For a detailed project walkthrough, UI demonstrations, API examples, evaluation proof, and observability analysis, see:
MedLabelIQ Comprehensive Documentation
MedLabelIQ uses two distinct knowledge channels:
| Source | Purpose |
|---|---|
| DailyMed SPL labels | Clinical label-grounded QA: indications, warnings, interactions, adverse reactions, dosage, etc. |
| RxNorm | Medication identity reasoning: brand/generic equivalence, active ingredients, brand-name lookup, identity definitions. |
Examples:
Is Eliquis the same as apixaban?
→ RxNorm identity route
Can apixaban be taken with aspirin?
→ DailyMed clinical-label route
Is Eliquis the same as apixaban and can it prevent stroke?
→ Mixed-source composed route
Before answering, the system performs:
- drug mention detection,
- optional RxNorm-based drug normalization,
- retrieval-family planning,
- source-route planning,
- execution of one of three answer branches:
  - RxNorm identity branch,
  - DailyMed label branch,
  - Mixed-source composition branch.
The router can select:
rxnorm_identity
dailymed_label
multi_source_composed
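The routing decision can be illustrated with a minimal sketch. The cue patterns and the `plan_source` function below are hypothetical and are not taken from the MedLabelIQ codebase; they only show how a query could map to one of the three route names listed above.

```python
import re

# Hypothetical cue patterns (assumption): the project's real routing rules
# are not shown in this README.
IDENTITY_CUES = re.compile(
    r"\b(same as|generic name|active ingredient|brand name)\b", re.IGNORECASE
)
CLINICAL_CUES = re.compile(
    r"\b(used for|taken with|cause|treat|prevent|dose|dosage|warning|side effect)\w*",
    re.IGNORECASE,
)


def plan_source(query: str) -> str:
    """Return one of the three route names for a query."""
    has_identity = bool(IDENTITY_CUES.search(query))
    has_clinical = bool(CLINICAL_CUES.search(query))
    if has_identity and has_clinical:
        return "multi_source_composed"
    if has_identity:
        return "rxnorm_identity"
    # Default to the label route; whether to abstain is decided later,
    # once evidence has been retrieved.
    return "dailymed_label"


print(plan_source("Is Eliquis the same as apixaban and can it prevent stroke?"))
```

A keyword heuristic like this is easy to test deterministically, but it would likely need a broader vocabulary to handle paraphrased questions.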
For compound questions that combine identity and clinical intent, MedLabelIQ decomposes the original query into branch-specific subqueries.
Example:
Original:
Is Eliquis the same as apixaban and can it prevent stroke?
Identity subquery:
Is Eliquis the same as apixaban?
Clinical subquery:
Can apixaban prevent stroke?
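A minimal sketch of this decomposition, assuming a single "identity question and clinical question" conjunction and an ingredient name already resolved by an earlier normalization step. The `Decomposition` class and `decompose` function are illustrative names, not the project's API.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Decomposition:
    identity_subquery: str
    clinical_subquery: str


def decompose(query: str, resolved_ingredient: str) -> Optional[Decomposition]:
    """Split '<identity question> and <clinical question>' into two subqueries.

    Returns None when the query does not contain the conjunction pattern.
    """
    parts = re.split(r"\s+and\s+", query.strip().rstrip("?"), maxsplit=1)
    if len(parts) != 2:
        return None
    identity, clinical = parts
    # Replace the pronoun so the clinical subquery stands on its own.
    clinical = re.sub(r"\bit\b", resolved_ingredient, clinical, flags=re.IGNORECASE)
    clinical = clinical[0].upper() + clinical[1:]
    return Decomposition(f"{identity}?", f"{clinical}?")


result = decompose(
    "Is Eliquis the same as apixaban and can it prevent stroke?", "apixaban"
)
print(result.identity_subquery)
print(result.clinical_subquery)
```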
The system then:
- answers the identity branch using RxNorm,
- answers the clinical branch using DailyMed label evidence,
- synthesizes one grounded final answer,
- preserves both:
  - `R*` citations for RxNorm identity support,
  - `E*` citations for DailyMed label evidence.
Example composed answer:
Yes. RxNorm maps Eliquis and apixaban to the same ingredient concept: apixaban.
Yes. Apixaban is indicated to reduce the risk of stroke and systemic embolism
in patients with nonvalvular atrial fibrillation.
The DailyMed pipeline ingests SPL XML labels rather than loosely scraped web pages.
It preserves:
- label metadata,
- SET IDs,
- label versions,
- product and ingredient records,
- section hierarchy,
- section codes,
- retrieval-family mappings,
- and evidence provenance.
MedLabelIQ parses nested label sections and maps them to canonical retrieval families such as:
- `warnings_and_precautions`
- `boxed_warning`
- `indications_and_usage`
- `adverse_reactions`
- `drug_interactions`
- `dosage_and_administration`
- `contraindications`
- `clinical_studies`
- `medication_guide`
This allows retrieval to respect the clinical structure of the label rather than treating labels as flat documents.
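One way to picture the mapping is a lookup from SPL section codes to retrieval families, with nested subsections inheriting the family of their parent. This is a sketch under assumptions: the LOINC codes below are the ones commonly seen in SPL labels and should be verified against the FDA section-code list, and `map_section` is a hypothetical helper rather than the project's parser.

```python
from typing import Optional

# Assumption: LOINC section codes as commonly used in SPL labels.
# Verify against the FDA SPL section-code list before relying on them.
SECTION_CODE_TO_FAMILY = {
    "34066-1": "boxed_warning",
    "34067-9": "indications_and_usage",
    "34068-7": "dosage_and_administration",
    "34070-3": "contraindications",
    "43685-7": "warnings_and_precautions",
    "34084-4": "adverse_reactions",
    "34073-7": "drug_interactions",
}


def map_section(code: str, parent_family: Optional[str] = None) -> Optional[str]:
    """Return the retrieval family for a section code.

    Subsections with an unmapped code inherit the family of their parent
    section, which keeps nested text attached to its clinical context.
    """
    return SECTION_CODE_TO_FAMILY.get(code, parent_family)


print(map_section("34067-9"))
print(map_section("42229-5", parent_family="warnings_and_precautions"))
```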
The DailyMed QA branch uses:
- PostgreSQL lexical retrieval,
- Qdrant dense vector retrieval,
- hybrid Reciprocal Rank Fusion,
- drug-concept filtering,
- retrieval-family filtering,
- compact evidence-pack selection.
This improves relevance while reducing redundant prompt context.
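Reciprocal Rank Fusion merges ranked lists by summing `1 / (k + rank)` for each item across the lists. The sketch below uses `k = 60`, the constant from the original RRF paper; the value MedLabelIQ uses is not stated here, and the chunk IDs are placeholders.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked ID lists into one list ordered by RRF score."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so the output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


lexical = ["c1", "c2", "c3"]  # placeholder lexical ranking
dense = ["c3", "c1", "c4"]    # placeholder dense ranking
print(reciprocal_rank_fusion([lexical, dense]))
```

Because RRF uses only ranks, it does not require lexical and dense scores to be on a comparable scale.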
The DailyMed answer branch uses a Groq-hosted LLM with:
- strict grounded answer schema,
- explicit evidence citations,
- evidence summaries,
- verifier integration,
- deterministic safety-note insertion.
Clinical answers cite evidence IDs such as:
E1, E2, E3
RxNorm identity answers cite:
R1, R2
Mixed-source composed answers may cite both:
R1, R2, E1
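The citation prefixes make it straightforward to separate the two evidence types and to reject malformed IDs. The helper below is an illustrative sketch; `split_citations` is not a function from the project.

```python
import re
from typing import Dict, Iterable, List

CITATION_PATTERN = re.compile(r"([RE])(\d+)")


def split_citations(citations: Iterable[str]) -> Dict[str, List[str]]:
    """Group citation IDs by source: R* for RxNorm, E* for DailyMed."""
    grouped: Dict[str, List[str]] = {"rxnorm": [], "dailymed": []}
    for citation in citations:
        match = CITATION_PATTERN.fullmatch(citation.strip())
        if match is None:
            raise ValueError(f"Unrecognized citation ID: {citation!r}")
        source = "rxnorm" if match.group(1) == "R" else "dailymed"
        grouped[source].append(match.group(0))
    return grouped


print(split_citations(["R1", "R2", "E1"]))
```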
MedLabelIQ is intentionally conservative.
When evidence is not sufficient, it returns:
{
"status": "insufficient_evidence",
"answer": "The retrieved drug-label evidence is not sufficient to answer this question reliably.",
"citations": [],
"evidence_summary": "No retrieved evidence directly established the requested claim."
}
The system includes:
- deterministic insufficient-evidence fallbacks,
- post-generation verifier support,
- guardrails for unsupported high-certainty claims,
- guardrails for unsupported negative treatment-use claims,
- conservative behavior when branch-specific support is incomplete.
MedLabelIQ includes:
- FastAPI backend,
- Streamlit front end,
- PostgreSQL persistence,
- Qdrant vector store,
- Dockerized local stack,
- observability request logs,
- source-aware analytics exports,
- evaluation harnesses,
- pytest suite,
- GitHub Actions CI.
Medication QA is high-stakes. A system that retrieves topically related text but fabricates unsupported conclusions is not trustworthy.
MedLabelIQ is built around a stricter principle:
Only answer when the selected knowledge source directly supports the response. Otherwise, abstain.
This project demonstrates practical engineering across:
- domain-grounded RAG,
- structured medical data ingestion,
- biomedical entity normalization,
- multi-source orchestration,
- query decomposition,
- hybrid search,
- LLM grounding,
- deterministic safety controls,
- API design,
- observability,
- evaluation,
- and containerized deployment.
flowchart TD
A[User Question] --> B[FastAPI /qa/answer]
B --> C[Drug Mention Detection]
C --> D[RxNorm Drug Normalization]
D --> E[Retrieval-Family Planner]
E --> F[Source Router]
F -->|Identity query| G[RxNorm Identity Branch]
F -->|Clinical label query| H[DailyMed Label QA Branch]
F -->|Mixed identity + clinical query| I[Mixed-Source Composition Branch]
subgraph DailyMed Knowledge Pipeline
J[DailyMed SPL APIs] --> K[Label Discovery and History Fetch]
K --> L[SPL XML Download]
L --> M[Structured XML Parser]
M --> N[Canonical Section Mapping]
N --> O[PostgreSQL Metadata Store]
N --> P[Section-Aware Chunk Builder]
P --> Q[PostgreSQL Lexical Index]
P --> R[Qdrant Dense Vector Index]
end
H --> S[Hybrid Retrieval]
Q --> S
R --> S
S --> T[Compact Evidence Pack]
T --> U[Groq Grounded Answer Generator]
U --> V[Verifier and Deterministic Guardrails]
G --> W[Structured RxNorm Identity Answer]
I --> X[Identity Subquery]
I --> Y[Clinical Subquery]
X --> G
Y --> H
W --> Z[Mixed / Final Answer Synthesis]
V --> Z
Z --> AA[Final Answer or Abstention]
AA --> AB[Streamlit UI]
AA --> AC[QA Request Logs]
T --> AD[DailyMed Evidence Logs]
AC --> AE[Source-Aware Analytics]
AD --> AE
Used for identity-style questions such as:
Is Eliquis the same as apixaban?
What is the generic name of Glucophage?
What is the active ingredient in Eliquis?
Is Glucophage a brand name?
Flow:
Query
→ Identity intent detection
→ RxNorm term resolution
→ Ingredient / brand concept traversal
→ Deterministic structured answer
→ R-citations
Used for clinical label questions such as:
What is omeprazole used for?
Can apixaban be taken with aspirin?
Can metformin cause lactic acidosis?
Does apixaban treat bacterial infections?
Flow:
Query
→ Drug detection / normalization
→ Retrieval-family planning
→ Hybrid label retrieval
→ Compact evidence pack
→ Grounded answer generation
→ Verification + guardrails
→ E-citations or abstention
Used for compound questions such as:
Is Glucophage the same as metformin and what is it used for?
Is Eliquis the same as apixaban and can it prevent stroke?
Is Glucophage a brand name and what is it used for?
Flow:
Original query
→ Mixed-source route detection
→ Identity subquery decomposition
→ Clinical subquery decomposition
→ RxNorm identity execution
→ DailyMed clinical execution
→ Evidence-aware synthesis
→ R-citations + E-citations
The ingestion pipeline builds a reproducible smoke corpus of 12 representative medication concepts:
- acetaminophen
- ibuprofen
- metformin
- lisinopril
- atorvastatin
- amoxicillin
- sertraline
- albuterol
- omeprazole
- apixaban
- isotretinoin
- methotrexate
The pipeline:
- discovers label metadata,
- retrieves label version history,
- downloads SPL XML packages,
- stores manifests and checksums,
- validates artifact consistency,
- parses hierarchical SPL sections,
- chunks retrievable clinical text,
- indexes chunks for lexical and dense retrieval.
| Metric | Value |
|---|---|
| Drugs in smoke corpus | 12 |
| Retrievable sections processed | 520 |
| Chunks created | 867 |
| Maximum words per chunk | 220 |
| Chunk overlap | 40 words |
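
The chunk size and overlap in the table correspond to a sliding word window. The sketch below shows only the windowing step; it assumes whitespace tokenization and leaves out the section-aware logic that the real chunk builder applies. `chunk_words` is a hypothetical name.

```python
from typing import List


def chunk_words(text: str, max_words: int = 220, overlap: int = 40) -> List[str]:
    """Split text into windows of at most max_words words.

    Consecutive windows share `overlap` words.
    """
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(400))  # synthetic 400-word text
print([len(chunk.split()) for chunk in chunk_words(sample)])
```

With these defaults, each new window starts 180 words after the previous one.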
uv run python -m medlabeliq.retrieval.search_cli `
--query "acid-mediated GERD" `
--drug omeprazole `
--family indications_and_usage `
--limit 5

| Metric | Score |
|---|---|
| Cases | 12 |
| Hit@1 | 1.000 |
| Hit@5 | 1.000 |
| MRR | 1.000 |
| Metric | Score |
|---|---|
| Cases | 12 |
| Hit@1 | 0.333 |
| Hit@5 | 0.333 |
| MRR | 0.333 |
This gap motivated the addition of dense retrieval and hybrid Reciprocal Rank Fusion.
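For reference, Hit@k and MRR can be computed as follows. This is a generic sketch of the standard definitions, not the project's evaluation harness, and the example cases are synthetic.

```python
from typing import Dict, List, Sequence, Set, Tuple


def hit_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """1.0 if any relevant ID appears in the top k results, else 0.0."""
    return float(any(item in relevant_ids for item in ranked_ids[:k]))


def reciprocal_rank(ranked_ids: Sequence[str], relevant_ids: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant_ids:
            return 1.0 / rank
    return 0.0


def evaluate(cases: List[Tuple[Sequence[str], Set[str]]], k: int = 5) -> Dict[str, float]:
    """Average Hit@1, Hit@k, and reciprocal rank over all cases."""
    n = len(cases)
    return {
        "hit@1": sum(hit_at_k(r, rel, 1) for r, rel in cases) / n,
        f"hit@{k}": sum(hit_at_k(r, rel, k) for r, rel in cases) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in cases) / n,
    }


# Synthetic cases: (ranked result IDs, set of relevant IDs).
cases = [(["a", "b"], {"a"}), (["x", "y", "z"], {"z"}), (["p"], {"q"})]
print(evaluate(cases))
```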
| Metric | Score |
|---|---|
| Overall pass | 12/12 |
| Status accuracy | 12/12 |
| Answered-case pass | 8/8 |
| Abstention-case pass | 4/4 |
| Citation-policy pass | 12/12 |
| Cited-heading pass | 12/12 |
| Safety-note pass | 12/12 |
| Metric | Score |
|---|---|
| Overall pass | 16/16 |
| Status accuracy | 16/16 |
| Answered-case pass | 10/10 |
| Abstention-case pass | 6/6 |
| Citation-policy pass | 16/16 |
| Cited-heading pass | 16/16 |
| Safety-note pass | 16/16 |
The challenge set includes:
- paraphrased answerable questions,
- negative unsupported treatment claims,
- unsupported claims requiring abstention,
- guarantee-style overgeneralization traps,
- medically sensitive warning and contraindication questions.
| Metric | Score |
|---|---|
| Cases | 11 |
| Overall pass | 11/11 |
| Status accuracy | 11/11 |
| Source-route accuracy | 11/11 |
| Source-route-status accuracy | 11/11 |
| Family-plan-status accuracy | 11/11 |
| Retrieval-family accuracy | 3/3 |
| Citation-policy pass | 11/11 |
| Citation-reference pass | 11/11 |
| Safety-note pass | 11/11 |
| Metric | Score |
|---|---|
| Cases | 19 |
| Overall pass | 19/19 |
| Status accuracy | 19/19 |
| Source-route accuracy | 19/19 |
| Source-route-status accuracy | 19/19 |
| Family-plan-status accuracy | 19/19 |
| Retrieval-family accuracy | 9/9 |
| Citation-policy pass | 19/19 |
| Citation-reference pass | 19/19 |
| Safety-note pass | 19/19 |
The challenge benchmark covers:
- supported RxNorm identity queries,
- unsupported identity queries requiring abstention,
- brand-name clinical questions,
- interaction and indication routing,
- ambiguous clinical queries,
- mixed-source identity + clinical questions,
- composed answers requiring both `R*` and `E*` citations.
When support is insufficient, MedLabelIQ returns:
{
"status": "insufficient_evidence",
"answer": "The retrieved drug-label evidence is not sufficient to answer this question reliably.",
"citations": [],
"evidence_summary": "No retrieved evidence directly established the requested claim."
}
Example:
Does metformin guarantee weight loss?
The system abstains unless retrieved label evidence explicitly supports guarantee-level certainty.
Example:
Does apixaban treat bacterial infections?
The system does not infer a negative claim merely because the retrieved label lists other uses. If the target claim is not explicitly established, it abstains.
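The guarantee-style case can be approximated with a deterministic post-check, sketched below. The cue list and `apply_certainty_guardrail` are hypothetical; a real guardrail would need a broader, tested vocabulary. The fallback payload is copied from the abstention example above, and the evidence text is a placeholder, not label content.

```python
import re
from typing import Dict, List

# Payload copied from the abstention example in this README.
INSUFFICIENT_EVIDENCE = {
    "status": "insufficient_evidence",
    "answer": "The retrieved drug-label evidence is not sufficient to answer this question reliably.",
    "citations": [],
    "evidence_summary": "No retrieved evidence directly established the requested claim.",
}

# Hypothetical cue list (assumption), intentionally small.
CERTAINTY_CUES = re.compile(
    r"\b(guarantee\w*|always|never|certainly|definitely)\b", re.IGNORECASE
)


def apply_certainty_guardrail(question: str, evidence_texts: List[str], proposed: Dict) -> Dict:
    """Replace the proposed answer with the fallback when the question asks
    for certainty that none of the evidence texts state explicitly."""
    asks_for_certainty = CERTAINTY_CUES.search(question) is not None
    evidence_states_it = any(CERTAINTY_CUES.search(text) for text in evidence_texts)
    if asks_for_certainty and not evidence_states_it:
        return dict(INSUFFICIENT_EVIDENCE)
    return proposed


proposed = {"status": "answered", "answer": "placeholder answer", "citations": ["E1"]}
result = apply_certainty_guardrail(
    "Does metformin guarantee weight loss?", ["placeholder evidence text"], proposed
)
print(result["status"])
```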
For a mixed query, both branches must produce sufficient support:
Identity branch must be supported
+
Clinical label branch must be supported
Otherwise, the system returns insufficient_evidence rather than composing a partial answer.
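The all-or-nothing rule can be sketched as a small gate in front of the synthesis step. The dictionary shape and `compose_mixed` are assumptions for illustration; the answers and citations reuse the Eliquis example from earlier in this README.

```python
from typing import Dict


def compose_mixed(identity: Dict, clinical: Dict) -> Dict:
    """Compose a final answer only when both branches are supported."""
    if identity["status"] != "answered" or clinical["status"] != "answered":
        # No partial answers: one unsupported branch fails the whole query.
        return {"status": "insufficient_evidence", "citations": []}
    return {
        "status": "answered",
        "answer": f'{identity["answer"]} {clinical["answer"]}',
        # R-citations first, then E-citations.
        "citations": identity["citations"] + clinical["citations"],
    }


identity = {
    "status": "answered",
    "answer": "Yes. RxNorm maps Eliquis and apixaban to the same ingredient concept: apixaban.",
    "citations": ["R1", "R2"],
}
clinical = {
    "status": "answered",
    "answer": "Yes. Apixaban is indicated to reduce the risk of stroke and systemic embolism "
              "in patients with nonvalvular atrial fibrillation.",
    "citations": ["E1"],
}
print(compose_mixed(identity, clinical)["citations"])
```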
The FastAPI backend exposes:
| Method | Endpoint | Purpose |
|---|---|---|
| GET | `/` | Service overview |
| GET | `/health` | PostgreSQL, Qdrant, and LLM health |
| GET | `/drugs` | Indexed drug concept summaries |
| GET | `/families` | Retrieval-family summaries |
| GET | `/corpus/stats` | Corpus build and indexing statistics |
| GET | `/rxnorm/version` | RxNorm API version metadata |
| POST | `/normalize/drug` | Normalize a brand, generic, or noisy medication mention |
| POST | `/qa/answer` | Grounded medication QA |
| POST | `/retrieval/debug` | Retrieval-only evidence inspection |
$body = @{
query = "Can metformin cause dangerous acid buildup in the blood?"
drug = "metformin"
family = "warnings_and_precautions"
include_evidence = $true
include_diagnostics = $true
} | ConvertTo-Json
Invoke-RestMethod `
-Method Post `
-Uri "http://127.0.0.1:8011/qa/answer" `
-ContentType "application/json" `
-Body $body |
ConvertTo-Json -Depth 40

$body = @{
query = "Is Eliquis the same as apixaban?"
include_evidence = $true
include_diagnostics = $true
} | ConvertTo-Json
Invoke-RestMethod `
-Method Post `
-Uri "http://127.0.0.1:8011/qa/answer" `
-ContentType "application/json" `
-Body $body |
ConvertTo-Json -Depth 80

Expected high-level behavior:
planned_source = rxnorm_identity
result.status = answered
citations = R1, R2
$body = @{
query = "Is Eliquis the same as apixaban and can it prevent stroke?"
include_evidence = $true
include_diagnostics = $true
} | ConvertTo-Json
Invoke-RestMethod `
-Method Post `
-Uri "http://127.0.0.1:8011/qa/answer" `
-ContentType "application/json" `
-Body $body |
ConvertTo-Json -Depth 100

Expected high-level behavior:
planned_source = multi_source_composed
result.status = answered
citations = R1, R2, E1
identity_evidence = present
evidence = present
mixed_source_composition.status = composed_answered
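The same requests can be sent from Python using only the standard library. This is a sketch that mirrors the PowerShell bodies above; `build_qa_request` and `ask` are hypothetical helper names, and `ask` requires the backend to be running on port 8011.

```python
import json
import urllib.request


def build_qa_request(query: str, base_url: str = "http://127.0.0.1:8011", **options):
    """Build a POST request for /qa/answer with the same body fields as above."""
    payload = {
        "query": query,
        "include_evidence": True,
        "include_diagnostics": True,
        **options,  # optional fields such as drug= or family=
    }
    return urllib.request.Request(
        url=f"{base_url}/qa/answer",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(query: str, **options) -> dict:
    """Send the request and decode the JSON response (needs a running backend)."""
    with urllib.request.urlopen(build_qa_request(query, **options), timeout=60) as resp:
        return json.load(resp)


request = build_qa_request(
    "Can metformin cause dangerous acid buildup in the blood?",
    drug="metformin",
    family="warnings_and_precautions",
)
print(request.full_url)
```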
The Streamlit front end includes:
- backend health panel,
- corpus snapshot,
- drug and retrieval-family filters,
- six example prompts,
- grounded answer display,
- status pills,
- source-route badges,
- citation chips,
- citation legend:
  - `E*` = DailyMed label evidence,
  - `R*` = RxNorm identity evidence,
- DailyMed evidence expanders,
- RxNorm identity evidence expanders,
- routing and source-plan expander,
- mixed-source decomposition panel,
- verifier and guardrail diagnostics,
- raw diagnostics JSON,
- retrieval-debug tab,
- recent query history.
Local UI:
http://127.0.0.1:8501
Every QA request can be logged to PostgreSQL using `qa_request_log` and `qa_evidence_log`. Logged fields include:
- query text,
- requested and resolved drug,
- drug-resolution status,
- detected drug mention,
- drug-mention detection status,
- requested and planned retrieval family,
- family-plan status and intent,
- planned source,
- source-plan status and intent,
- mixed-source composition status,
- final answer status,
- citations,
- evidence summary,
- safety note,
- proposed answer status,
- verifier verdict and rationale,
- guardrail state,
- DailyMed evidence count,
- RxNorm identity evidence count,
- API latency,
- timestamp.
uv run python -m medlabeliq.observability.generate_qa_analytics

Outputs:
data/interim/qa_analytics/
outputs/qa_analytics/
- final answer status counts,
- latency summary statistics,
- intervention counts,
- verifier verdict distribution,
- requests by planned source,
- source-plan status distribution,
- family-plan status distribution,
- mixed-source composition status distribution,
- final answer status by source type,
- latency by planned source,
- identity-evidence count distribution,
- total support-evidence count distribution,
- evidence-family usage,
- cited evidence-family usage,
- daily request volume,
- CSV exports,
- PNG plots.
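Several of these aggregates reduce to simple group-by operations over the request log. The sketch below assumes each log row is a dictionary with `planned_source`, `final_status`, and `latency_ms` keys; those field names and the sample rows are illustrative, not the project's schema or data.

```python
from collections import Counter, defaultdict
from statistics import median
from typing import Dict, Iterable


def summarize(rows: Iterable[Dict]) -> Dict:
    """Count final statuses and compute median latency per planned source."""
    status_by_source = defaultdict(Counter)
    latency_by_source = defaultdict(list)
    for row in rows:
        source = row["planned_source"]
        status_by_source[source][row["final_status"]] += 1
        latency_by_source[source].append(row["latency_ms"])
    return {
        "status_by_source": {s: dict(c) for s, c in status_by_source.items()},
        "median_latency_ms_by_source": {
            s: median(values) for s, values in latency_by_source.items()
        },
    }


# Illustrative rows only; not real measurements.
rows = [
    {"planned_source": "rxnorm_identity", "final_status": "answered", "latency_ms": 100},
    {"planned_source": "dailymed_label", "final_status": "answered", "latency_ms": 900},
    {"planned_source": "dailymed_label", "final_status": "insufficient_evidence", "latency_ms": 700},
]
print(summarize(rows))
```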
Launch the full stack:
docker compose up --build -d

Services:
| Service | Port |
|---|---|
| PostgreSQL | 55432 |
| Qdrant | 6333 |
| FastAPI backend | 8011 |
| Streamlit UI | 8501 |
After startup:
API: http://127.0.0.1:8011
Docs: http://127.0.0.1:8011/docs
UI: http://127.0.0.1:8501
Check health:
Invoke-RestMethod `
-Method Get `
-Uri "http://127.0.0.1:8011/health" |
ConvertTo-Json -Depth 10

git clone <YOUR_REPOSITORY_URL>
cd MedLabelIQ

Copy-Item .env.example .env

Fill in:
LLM_API_KEY=<your-groq-api-key>
uv sync

docker compose up -d postgres qdrant

uv run python -m medlabeliq.db.create_observability_schema

uv run uvicorn medlabeliq.api.app:app --host 127.0.0.1 --port 8011 --reload

uv run streamlit run src\medlabeliq\ui\streamlit_app.py --server.port 8501

uv run python -m medlabeliq.validation.validate_step3_artifacts

uv run python -m medlabeliq.parsing.parse_smoke_set

uv run python -m medlabeliq.validation.validate_section_hierarchy

uv run python -m medlabeliq.chunking.build_section_chunks

uv run python -m medlabeliq.chunking.validate_section_chunks

uv run python -m medlabeliq.evaluation.evaluate_lexical_retrieval

uv run python -m medlabeliq.evaluation.evaluate_grounded_qa

uv run python -m medlabeliq.evaluation.evaluate_grounded_qa `
--eval-set data\evaluation\qa_generation_eval_challenge.yaml `
--output data\interim\grounded_qa_eval_challenge_results.csv

uv run python -m medlabeliq.evaluation.evaluate_multisource_orchestration

uv run python -m medlabeliq.evaluation.evaluate_multisource_orchestration `
--eval-set data\evaluation\multisource_orchestration_eval_challenge.yaml `
--output data\interim\multisource_orchestration_eval_challenge_results.csv

uv run python -m medlabeliq.observability.generate_qa_analytics

Run tests locally:

uv run pytest

Current local test result:
64 passed
The repository includes GitHub Actions CI to automatically run tests on pushes and pull requests.
MedLabelIQ/
├── .github/
│ └── workflows/
│ └── ci.yml
├── data/
│ ├── evaluation/
│ ├── interim/
│ └── raw/
├── outputs/
├── src/
│ └── medlabeliq/
│ ├── api/
│ ├── chunking/
│ ├── config/
│ ├── db/
│ ├── evaluation/
│ ├── generation/
│ ├── ingestion/
│ ├── observability/
│ ├── orchestration/
│ ├── parsing/
│ ├── qdrant_store/
│ ├── retrieval/
│ ├── rxnorm/
│ ├── ui/
│ └── validation/
├── tests/
├── Dockerfile
├── docker-compose.yml
├── pyproject.toml
├── uv.lock
└── README.md
- Language: Python 3.12
- Dependency management: uv
- API: FastAPI
- UI: Streamlit
- Structured database: PostgreSQL
- Vector database: Qdrant
- Medication identity source: RxNorm
- Medication label source: DailyMed SPL
- LLM provider: Groq
- Embeddings: sentence-transformer-based dense retrieval
- Containerization: Docker, Docker Compose
- Testing: pytest
- CI: GitHub Actions
- The current DailyMed corpus is a curated 12-drug smoke set, not the full DailyMed universe.
- RxNorm identity routing is deterministic and scoped to identity-style questions currently supported by the orchestration logic.
- Mixed-source composition supports intentionally structured identity + clinical conjunction patterns rather than arbitrary multi-hop natural language decomposition.
- The system summarizes official label evidence; it is not a diagnosis, prescribing, or clinical decision tool.
- Evaluation sets are project benchmarks rather than large-scale clinician-authored gold standards.
- Guardrails target observed failure modes and can be expanded further.
- Scale the DailyMed corpus beyond the 12-drug smoke set.
- Add larger clinician-reviewed benchmark suites.
- Broaden mixed-source decomposition patterns.
- Add support for more complex multi-branch query plans.
- Extend operational dashboards beyond CSV/PNG analytics outputs.
- Introduce authentication, rate limiting, and deployment hardening.
- Add continuous ingestion for updated DailyMed label versions.
- Explore retrieval reranking and evidence sufficiency scoring improvements.
MedLabelIQ is an educational and research-oriented medication question-answering system.
It summarizes retrieved medication identity relationships and official drug-label evidence and is not a substitute for medical advice from a qualified clinician or pharmacist.