LiveDB — Continuous Literature Ingestion, Gap Analysis, and Multi-Agent Research Assistant

A production-ready async pipeline that discovers the latest biomedical papers, classifies abstracts for PICOS-style eligibility, acquires legal full-text, ingests into a pgvector-backed knowledge base, identifies research gaps via LLM-powered analysis, and exposes a multi-agent research assistant (Knowledge / SQL / Reasoning / General / Gap Analysis) coordinated via Agno Team/AgentOS.


Overview

LiveDB continuously surfaces recent literature for a given query, performs eligibility triage using a fine-tuned multi-head classifier, downloads full-text, indexes into a pgvector-backed knowledge base, and exposes that corpus through a multi-agent research assistant layer coordinated via Agno Team + AgentOS.

It also includes a standalone Gap Analysis pipeline that takes a natural language research question, live-fetches papers from OpenAlex and PubMed, clusters them by theme, and uses LLM reasoning to identify research gaps — producing interactive HTML dashboards and PDF reports.

It is designed to be:

  • Asynchronous & resilient (httpx, asyncio, tenacity retries; bounded concurrency for downloads)
  • Legally compliant (Open Access first; BioC fallback; PMC OA FTP for licensed content)
  • RAG-ready (semantic chunking; reference removal; pgvector hybrid search)
  • Operable (Prefect orchestration, structured logging, configurable concurrency)
  • Agentic (specialist agents + coordinator for interactive post-indexing querying)
  • Gap-aware (LLM-powered research gap identification with interactive reports)

Key Features

  • Multi‑source discovery
    • OpenAlex: newest articles by publication_date (filters for OA & language).
    • PubMed + PMC: PMID discovery + legal full‑text via PMC OA utilities.
  • Abstract triage
    • Multi‑task classifier (P_AB, I_AB, C_AB, O_AB, S_AB) → yes/maybe/no.
    • Simple rule for final_pred: if S_AB_pred == "no" then final_pred = "no" else "yes" (customizable).
  • Full‑text acquisition
    • Direct OA PDF via oa_pdf (OpenAlex) or PMC OA FTP.
    • BioC fallback → reconstructs text and renders to PDF to preserve the ingestion contract.
  • Chunking
    • Semantic chunking with CustomChunking that stops at “References” to avoid noisy embeddings.
  • Indexing
    • Agno Knowledge layer → Postgres/pgvector for hybrid retrieval + separate contents store.
  • Orchestration & Observability
    • Prefect flow with caching, retries, bounded concurrency, and rotating log files.
  • Agentic post-indexing interface
    • 5 specialist agents (Knowledge / SQL / Reasoning / General / Gap Analysis) coordinated via Agno Team + AgentOS.
    • Makes the ingested corpus queryable immediately after ingestion (RAG-style, but without re-fetching PDFs).
  • Research gap analysis
    • Natural language query → live paper fetch → structured extraction → thematic clustering → LLM gap identification.
    • Outputs interactive HTML dashboard (Chart.js), PDF report, and JSON for programmatic access.

Architecture (ETL)

                ┌──────────────────────────────────────────────────────────┐
                │                           User                           │
                └───────────────┬──────────────────────────────────────────┘
                                │ query="dementia", window, max_records
                                ▼
                     ┌──────────────────────┐
                     │  Prefect Flow        │
                     │  (main.livedb_flow)  │
                     └─────────┬────────────┘
          ┌────────────────────┼────────────────────┐
          ▼                    ▼                    ▼
┌─────────────────┐   ┌──────────────────┐   ┌───────────────────────┐
│ OpenAlex Fetch  │   │ PubMed ESearch   │   │ Classifier (multihead)│
│ (OA results)    │   │ + EFetch (meta)  │   │  P/I/C/O/S -> yes/no  │
└───────┬─────────┘   └─────────┬────────┘   └───────────┬───────────┘
        │                       │                        │ filter(final_pred=="yes")
        │                       │                        ▼
        │                       │             ┌────────────────────────┐
        │                       │             │ Full-text Acquisition  │
        │                       │             │ OA PDF | PMC OA | BioC │
        │                       │             └───────────┬────────────┘
        │                       │                         ▼
        │                       │              ┌─────────────────────┐
        │                       │              │ Chunk & Embed       │
        │                       │              │ (CustomChunking)    │
        │                       │              └──────────┬──────────┘
        │                       │                         ▼
        │                       │           ┌────────────────────────────┐
        │                       └──────────▶│ Agno Knowledge (pgvector)  │
        │                                   │  + Contents Postgres       │
        │                                   └────────────────────────────┘
        ▼
   Logs/Prefect UI

Data Flow

  1. Discovery

    • OpenAlex search + filter window: (start_day, days_back) defines a moving window (e.g., “from 31 to 30 days ago”).
    • PubMed ESearch → PMIDs for the same window; EFetch returns structured metadata.
  2. Triage

    • livedb/CheckAbsModel.py loads the saved multi-task model (5 heads) from Config.MODEL_DIR and predicts per-task labels + confidences.
    • final_pred policy is currently simple and can be replaced with your inclusion logic.
  3. Acquisition

    • OpenAlex: try oa_pdf with headers, fall back to Playwright with stealth when needed (handles CF/JS‑gated flows).
    • PMC: OA FTP packages (.tar.gz) are downloaded and extracted to PDFs; if not available, fetch BioC XML and render to PDF text.
  4. Ingestion

    • CustomChunking removes everything after a “References” sentinel.
    • Metadata (year, authors, journal, P/I/C/O/S flags, etc.) is stored alongside embedded chunks in pgvector and contents tables.
  5. QA

    • Agno Knowledge layer provides hybrid search + contents Postgres for retrieval.
    • ResearchAssistantTeam coordinates specialist agents (Knowledge, SQL, Reasoning, General, Gap Analysis) for complex queries.
    • Shared memory/state between coordinator and agents; conversational history enabled.

Gap Analysis

The gap analysis pipeline takes a natural language research question and produces a structured report identifying contradictions, under-explored areas, methodological limitations, population gaps, missing comparisons, and future research directions.

Pipeline Phases

User Question ──► Query Translation ──► Fetch Papers ──► Extract Findings ──► Cluster Themes ──► Analyze Gaps ──► Reports
  (natural          (LLM → 2-4           (OpenAlex +      (Batched LLM        (UMAP + HDBSCAN    (Two-pass LLM     (HTML +
   language)         keywords)            PubMed)          JSON extraction)    + LLM labeling)     gap analysis)      PDF + JSON)
| Phase | Module | Description |
|-------|--------|-------------|
| 0 | fetch.py:translate_query | LLM converts natural language into 2-4 optimized keyword queries |
| 1 | fetch.py:fetch_papers | Parallel fetch from OpenAlex + PubMed, deduplicate, optional PICOS filter |
| 2 | extract.py:extract_papers | Batched LLM extraction of claims, methodology, PICO elements, limitations |
| 3 | cluster.py:cluster_papers | OpenAI embeddings → UMAP reduction → HDBSCAN clustering → LLM theme labels |
| 4 | analyze.py:analyze_gaps | Within-cluster gap analysis (parallel) + cross-cluster synthesis (single call) |
| 5 | report.py | Interactive HTML dashboard (Chart.js), styled PDF, machine-readable JSON |

Gap Types Identified

  • Contradiction — findings that disagree across papers
  • Under-explored — subtopics with insufficient investigation
  • Methodological — limitations in study designs used
  • Population — demographics or patient groups not covered
  • Missing comparison — interventions not compared head-to-head
  • Future direction — what authors explicitly say needs more research

Usage

CLI (Option 3):

python main.py
# Select [3] Gap Analysis → enter research question → configure scope

API:

# Trigger analysis (returns immediately, runs in background)
curl -X POST http://localhost:7777/gap-analysis \
  -H "Content-Type: application/json" \
  -d '{"query": "mRNA vaccines for cancer immunotherapy", "max_records": 100, "days_back": 180}'

# List reports
curl http://localhost:7777/gap-analysis

# View interactive dashboard
curl http://localhost:7777/gap-analysis/{report_id}

Configuration

| Setting | Default | Description |
|---------|---------|-------------|
| GAP_LLM_BATCH_SIZE | 5 | Papers per LLM extraction call |
| GAP_LLM_CONCURRENCY | 5 | Max concurrent LLM calls |
| GAP_DEFAULT_SCOPE | 100 | Default number of papers to analyze |
| GAP_DEFAULT_DAYS | 180 | Default lookback window in days |
| GAP_REPORTS_DB_URL | same as PGVECTOR_CONTENTS_URL | Postgres URL for gap report storage |
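
These settings follow the same env-var binding as the rest of Config.py, so they can be overridden in the .env used elsewhere. A hypothetical fragment (values illustrative, names from the table above) that doubles the extraction batch size and widens the default window:

# Gap analysis tuning (illustrative values)
GAP_LLM_BATCH_SIZE=10
GAP_LLM_CONCURRENCY=8
GAP_DEFAULT_SCOPE=200
GAP_DEFAULT_DAYS=365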

Repository Layout

agents/
  Agents.py              # single-agent definitions & entrypoints
  RunTeam.py             # orchestrates multi-agent execution + gap analysis routes
  Teams.py               # team definitions, roles, routing rules / topology
dbs/
  IngestToDB.py          # Agno Knowledge setup + async ingestion + gap report storage
  utils.py               # CustomChunking (SemanticChunking subclass)
gap_analysis/
  __init__.py            # shared AsyncOpenAI client
  models.py              # Pydantic models (PaperMetadata, PaperExtraction, ThemeCluster, ResearchGap, GapReport)
  prompts.py             # LLM prompt templates for all phases
  fetch.py               # Phase 0-1: query translation + multi-source paper fetching
  extract.py             # Phase 2: batched LLM structured extraction
  cluster.py             # Phase 3: UMAP + HDBSCAN clustering + LLM theme labeling
  analyze.py             # Phase 4: within-cluster + cross-cluster gap analysis
  report.py              # Phase 5: PDF + HTML report generation
  pipeline.py            # Prefect flow orchestrating all phases
  templates/
    dashboard.html       # Jinja2 interactive dashboard template (Chart.js)
livedb/
  CheckAbsModel.py       # Multi-head classifier load + async inference
  GetLatestPapers.py     # PubMed/PMC helpers, FTP, BioC -> PDF, utilities
  OpenAlexDownload.py    # OpenAlex client + robust PDF downloader
  utils.py               # save_text_as_pdf_async (ReportLab)
.gitignore
.python-version
Config.py                # Pydantic config model + env var binding
main.py                  # Prefect flow: ETL [1], Agent [2], Gap Analysis [3]

Quick Start

Prerequisites

  • Python: 3.13 (per .python-version)
  • Postgres with the pgvector extension enabled
  • Playwright for Python (Chromium binaries are managed by Playwright itself; no Node.js required)
  • System packages commonly needed by Playwright and ReportLab (varies by OS):
# Ubuntu/Debian (example)
sudo apt-get update
sudo apt-get install -y libglib2.0-0 libnss3 libatk1.0-0 libatk-bridge2.0-0 \
    libx11-xcb1 libxcomposite1 libxdamage1 libxfixes3 libxrandr2 libgbm1 \
    libasound2 fonts-liberation libxshmfence1

Environment Variables

Create a .env in repo root:

# NCBI
NCBI_EMAIL=you@example.com
NCBI_API_KEY=

# OpenAlex
OPENALEX_MAILTO=you@example.com

# PMC FTP (optional if anonymous)
FTP_USER=anonymous
FTP_PASSWORD=anonymous@

# OpenAI / Embeddings
OPENAI_API_KEY=sk-...
# Choose models in Config.py
# MODEL_NAME=gpt-4.1-mini
# EMBEDDING_MODEL=text-embedding-3-small

# Postgres / pgvector
PGVECTOR_URL=postgresql+psycopg://user:pass@host:5432/dbname
PGVECTOR_TABLE=knowledge_vectors
PGVECTOR_CONTENTS_URL=postgresql+psycopg://user:pass@host:5432/dbname
PGVECTOR_CONTENTS_TABLE=knowledge_contents

Config.py sets sensible defaults (e.g., API endpoints, headers, paths). Adjust the table names to match your schema.

Install

python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
python -m pip install --upgrade pip

# Core deps (pin versions as needed)
pip install httpx[http2] tenacity arrow lxml aiofiles aioftp pandas loguru pydantic python-dotenv \
            reportlab tqdm prefect playwright playwright-stealth torch transformers \
            agno pgvector psycopg[binary] sqlalchemy \
            openai hdbscan umap-learn scikit-learn jinja2

# Install browser binaries for Playwright
python -m playwright install chromium

Run

python main.py
# [1] ETL Pipeline — fetch, triage, download, and ingest papers
# [2] AI Agent    — start FastAPI + AgentOS on http://localhost:7777
# [3] Gap Analysis — identify research gaps from a natural language question

Optionally run prefect server start in a separate terminal for flow visualization.


Configuration

Config.py (Pydantic) reads environment variables and exposes:

  • External APIs: E‑Utils, BioC, OpenAlex
  • Storage: PDF_DIR, MODEL_DIR
  • Models: MODEL_NAME, EMBEDDING_MODEL
  • DB: PGVECTOR_URL, PGVECTOR_TABLE, PGVECTOR_CONTENTS_URL, PGVECTOR_CONTENTS_TABLE
  • HTTP Headers: COMMON_HEADERS for PDF downloads
  • HEADLESS mode for Playwright

You can override via .env or environment variables.
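
For orientation, a minimal sketch of what such a settings model can look like, assuming pydantic-settings; the field names mirror the variables above, but the real Config.py may differ in defaults and structure:

from pydantic_settings import BaseSettings, SettingsConfigDict

class Config(BaseSettings):
    # Bind fields to environment variables, reading .env from the repo root
    model_config = SettingsConfigDict(env_file=".env", extra="ignore")

    # Models
    MODEL_NAME: str = "gpt-4.1-mini"
    EMBEDDING_MODEL: str = "text-embedding-3-small"

    # Postgres / pgvector (URL is required, so startup fails fast if unset)
    PGVECTOR_URL: str
    PGVECTOR_TABLE: str = "knowledge_vectors"
    PGVECTOR_CONTENTS_URL: str
    PGVECTOR_CONTENTS_TABLE: str = "knowledge_contents"

    # Playwright
    HEADLESS: bool = True

config = Config()  # reads .env and the process environment at import time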


ETL Details

OpenAlex Fetch

  • livedb/OpenAlexDownload.py::fetch_openalex_latest(...)
  • Filters on publication_date window; only_articles=True; only_oa=True; language=en
  • Returns a de‑duplicated DataFrame with fields:
    • id, pmid, title, publication_date, pub_year, journal, doi, is_oa, oa_pdf, authors, concepts, url, abstract
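
For reference, a minimal sketch of the underlying OpenAlex request, assuming the public /works endpoint and its standard filter keys; the real fetch_openalex_latest adds pagination, retries, and DataFrame assembly:

import httpx

async def openalex_page(query: str, start: str, end: str, mailto: str) -> dict:
    # Restrict to OA English articles inside the date window, newest first
    params = {
        "search": query,
        "filter": (
            f"from_publication_date:{start},to_publication_date:{end},"
            "is_oa:true,language:en,type:article"
        ),
        "sort": "publication_date:desc",
        "per-page": 50,
        "mailto": mailto,  # identifies you for OpenAlex's polite pool
    }
    async with httpx.AsyncClient(timeout=30) as client:
        r = await client.get("https://api.openalex.org/works", params=params)
        r.raise_for_status()
        return r.json()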

PubMed/PMC Fetch

  • pubmed_esearch → PMIDs for the window (edat filter); retmax configurable
  • pubmed_efetch → Medline XML parsed into: pmid, pmcid, doi, title, journal, pub_year, authors, url, abstract
  • try_fetch_pmc_fulltext_pdf
    • Calls PMC OA service for licensed links (FTP: oa_package/oa_pdf)
    • Extracts PDFs from *.tar.gz or downloads direct .pdf
    • If no PDF, tries BioC → reconstructs text and persists as PDF
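
A sketch of the ESearch step against the public E-Utilities endpoint (the repository's pubmed_esearch presumably adds NCBI_API_KEY handling and tenacity retries):

import httpx

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

async def esearch_pmids(term: str, mindate: str, maxdate: str, retmax: int = 200) -> list[str]:
    # datetype=edat filters by Entrez date; dates use YYYY/MM/DD.
    # api_key / email params are omitted here for brevity.
    params = {
        "db": "pubmed", "term": term, "retmode": "json",
        "datetype": "edat", "mindate": mindate, "maxdate": maxdate,
        "retmax": retmax,
    }
    async with httpx.AsyncClient(timeout=30) as client:
        r = await client.get(f"{EUTILS}/esearch.fcgi", params=params)
        r.raise_for_status()
        return r.json()["esearchresult"]["idlist"]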

Abstract Classification

  • livedb/CheckAbsModel.py
    • Loads tokenizer + encoder from MODEL_DIR
    • Multi‑head linear classifiers (one per task) → logits → softmax
    • Returns task predictions & confidences
  • Current decision rule:
    • final_pred = "no" iff S_AB_pred == "no", else "yes"
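
The decision rule itself is a one-liner; a sketch of applying it over the per-head labels (key names follow the P/I/C/O/S convention above):

def final_pred(preds: dict[str, str]) -> str:
    # preds maps head name -> label, e.g. {"P_AB_pred": "yes", ..., "S_AB_pred": "no"}
    # Current policy: exclude a paper only when the study-design head says "no".
    return "no" if preds["S_AB_pred"] == "no" else "yes"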

Full‑text Acquisition

  • download_pdf_async first tries HTTP (HEAD/GET) with realistic headers & referer
  • On 401/403 or non‑PDF responses: Playwright (Chromium) + Stealth fallback to handle bot mitigations
  • PMC path tries legal OA routes and licenses; BioC produces a text PDF via ReportLab
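
The non-PDF guard referenced in Troubleshooting (_looks_like_pdf) can be as simple as sniffing magic bytes; a minimal sketch of the idea (the actual guard may check more):

def looks_like_pdf(content: bytes) -> bool:
    # PDFs begin with "%PDF-". Content-type headers alone are unreliable:
    # some servers return HTML error pages labeled application/pdf.
    return content[:5] == b"%PDF-"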

Chunking & Ingestion

  • dbs/utils.py::CustomChunking derives from SemanticChunking and truncates at “References” (see the sketch after this list)
  • dbs/IngestToDB.py wires:
    • Agno OpenAIEmbedder
    • PDFReader (split pages, read images, chunking strategy)
    • PgVector (hybrid search) + PostgresDb for contents
  • Per‑record metadata stored with chunks:
    • publication_year, date_added, author, title, journal, abstract,
    • population_flag, intervention_flag, comparator_flag, outcome_flag, study_design_flag, qualification_flag
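
Conceptually, the “References” truncation in CustomChunking is a pre-pass over the extracted text, as in this sketch (Agno's SemanticChunking hooks are not shown; the regex is illustrative):

import re

def strip_references(text: str) -> str:
    # Cut at the first standalone "References" heading so bibliography
    # entries never reach the embedder.
    match = re.search(r"^\s*references\s*$", text, flags=re.IGNORECASE | re.MULTILINE)
    return text[: match.start()] if match else text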

Database Expectations

You should provision:

  • pgvector extension: CREATE EXTENSION IF NOT EXISTS vector;
  • Tables (example; Agno can manage its own schema—adjust if you manage DDL yourself):
    • knowledge_vectors (id, doc_id, chunk, embedding, metadata, ...)
    • knowledge_contents (doc_id, path, metadata, ...)

Ensure the PGVECTOR_* URLs and table names in .env match your deployment.
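
If you do manage DDL yourself, a starting-point sketch (column sets are illustrative; the vector dimension assumes text-embedding-3-small's 1536 dimensions):

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS knowledge_vectors (
    id        TEXT PRIMARY KEY,
    doc_id    TEXT NOT NULL,
    chunk     TEXT NOT NULL,
    embedding vector(1536),
    metadata  JSONB DEFAULT '{}'::jsonb
);

CREATE TABLE IF NOT EXISTS knowledge_contents (
    doc_id   TEXT PRIMARY KEY,
    path     TEXT,
    metadata JSONB DEFAULT '{}'::jsonb
);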


Operational Guidance

Logging

  • Rotating file: logs/livedb.log (10 MB rotation, 10-day retention)
  • Prefect‑compatible sink forwards Loguru records to the task/flow run logger
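
In Loguru terms, that sink is configured roughly as follows (the repo's exact format string and the Prefect-forwarding sink are omitted):

from loguru import logger

logger.add(
    "logs/livedb.log",
    rotation="10 MB",     # roll the file at 10 MB
    retention="10 days",  # delete rotated files after 10 days
    enqueue=True,         # safe to call from async tasks
)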

Performance Tuning

  • Concurrency:
    • Downloads: DOWNLOAD_SEM = 8 (increase cautiously—remote servers may throttle)
    • Playwright: BROWSER_SEM = 3 (each Chromium is heavy)
  • Retries & timeouts are defined via tenacity decorators and task‑level Prefect options
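
The two knobs compose as in this sketch, using the DOWNLOAD_SEM value from above (retry parameters are illustrative, not the repo's exact settings):

import asyncio
import httpx
from tenacity import retry, stop_after_attempt, wait_exponential

DOWNLOAD_SEM = asyncio.Semaphore(8)  # bounded download concurrency

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, max=30))
async def fetch_bytes(url: str) -> bytes:
    async with DOWNLOAD_SEM:  # at most 8 downloads in flight
        async with httpx.AsyncClient(timeout=60, follow_redirects=True) as client:
            r = await client.get(url)
            r.raise_for_status()
            return r.content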

Troubleshooting

| Symptom | Likely Cause | Fix |
|---------|--------------|-----|
| 403/401 on PDF | Bot/CF protection | Ensure Playwright is installed; keep HEADLESS=True; the fallback path triggers automatically |
| Non-PDF bytes saved | Misleading content-type or redirect to HTML | The _looks_like_pdf guard logs errors and falls back to Playwright |
| Empty OA results | Filter window too narrow | Increase max_records or widen start_day/days_back |
| CUDA OOM / slow CPU | Model too large or environment mismatch | Set DEVICE=cpu in the classifier call or reduce batch size |
| No pgvector table | Missing DDL | Create tables or let Agno manage them; verify the PGVECTOR_* configs |
| BioC fallback garbled | Unicode font missing | Pass ttf_font_path to save_text_as_pdf_async |

Security & Compliance Notes

  • Only ingest legally accessible content (Open Access or licensed through PMC OA services).
  • .gitignore avoids committing .env, pdfs/, models/, and logs.
  • The packed repository (“Repomix” output) may include sensitive references. Treat it as read‑only and avoid re‑distributing raw dumps.

Agents

This build includes a coordinator team with five specialist agents and shared state/backends:

  • KnowledgeAgent

    • Uses a Knowledge base backed by PgVector (hybrid search) and a contents Postgres store.
    • Embeddings via OpenAI (config.EMBEDDING_MODEL).
    • Produces sectioned answers with inline citations; says “I don’t know” when context is missing.
  • SQLAgent

    • Uses PostgresTools (parsed from config.SQL_DATABASE_URL) to inspect schema, preview rows, and run queries.
    • Guidance enforces safe exploration (schema-first, small LIMITs; avoid destructive statements).
    • Explains which tables/columns were used.
  • ReasoningAgent

    • Structured, stepwise analysis with ReasoningTools.
    • Decomposes problems and returns concise rationales.
  • GeneralAgent

    • Handles broad queries or synthesizes outputs from other agents into a single response.
  • GapAnalysisAgent

    • Queries past gap analysis reports stored in Postgres.
    • Answers questions about previously identified research gaps, themes, and trends.

Coordination

  • ResearchAssistantTeam (coordinator)
    • Routes by intent: Knowledge → KnowledgeAgent; SQL → SQLAgent; Reasoning → ReasoningAgent; Gap Analysis → GapAnalysisAgent; otherwise → GeneralAgent.
    • Aggregates members’ findings into one coherent answer with clickable citations.
    • Shares interactions among members for context; leverages conversation memory.
    • Agent runtime exposed via AgentOS (run_team(session_state) returns (AgentOS, FastAPI app)).
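
Launching that runtime looks roughly like the sketch below; the import path and empty session_state are assumptions based on the repository layout:

import uvicorn
from agents.RunTeam import run_team  # path per Repository Layout (assumption)

agent_os, app = run_team(session_state={})  # returns (AgentOS, FastAPI app)
uvicorn.run(app, host="0.0.0.0", port=7777)  # matches the AI Agent option's port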

Shared Runtime

  • Model: OpenAIChat(id=config.MODEL_NAME)
  • Memory: Postgres (config.PGVECTOR_MEMORY_URL, table config.PGVECTOR_MEMORY_TABLE)
  • Vector DB: PgVector (config.PGVECTOR_TABLE, config.PGVECTOR_URL, hybrid search)
  • Contents DB: Postgres (config.PGVECTOR_CONTENTS_URL, table config.PGVECTOR_CONTENTS_TABLE)
  • Common Settings: Markdown outputs, chat history enabled, shared context (num_history_runs≈4), exponential backoff.

Roadmap

  • Configurable inclusion logic (thresholds using confidences, composite rules with P/I/C/O/S)
  • Batch scheduling & backfill by month/quarter
  • Multi‑tenant schema (namespaced knowledge bases)
  • Inline OCR for scanned PDFs (e.g., Tesseract, PaddleOCR)
  • Evaluation harness (precision/recall of inclusion vs human labels)
  • Metrics → Prometheus/Grafana

FAQ

Q: Can I run without a GPU?
Yes. The classifier uses cuda only if it is available; otherwise it runs on the CPU.

Q: How do I widen the freshness window?
Adjust start_day and days_back in livedb_flow(...). Example: start_day=7, days_back=7 (from 14 to 7 days ago).

Q: How do I change the embedding model?
Edit Config.py (EMBEDDING_MODEL) and ensure your embedder supports it.

Q: Where do PDFs go?
Config.PDF_DIR (defaults to ./pdfs).

Q: Can I skip Playwright?
Set HEADLESS=True (default) and rely on httpx first; if you remove Playwright you may lose some gated PDFs.


License

This project is licensed under the MIT License.


Citations & Attribution

  • OpenAlex — community‑maintained index of scholarly works
  • NCBI E‑utils / PubMed / PMC OA — programmatic biomedical literature access
  • Agno — abstraction layer providing Postgres-backed knowledge, pgvector hybrid search, agent memory, and team orchestration
  • pgvector — high‑dimensional vector similarity for Postgres
  • Playwright and playwright‑stealth — browser automation for bot‑gated flows
  • ReportLab — PDF generation for BioC text fallback and gap analysis reports
  • HDBSCAN — density-based clustering for thematic grouping
  • UMAP — dimensionality reduction for embedding-based clustering
  • Chart.js — interactive visualizations in gap analysis dashboards
