Chinese documentation is available in README_zh.md, and a longer guide is available in README_detailed.md. Architecture documentation is available in English (docs/architecture.md) and Chinese (docs/architecture_zh.md).
This project implements a runnable Python prototype for target-centric biomedical competitive intelligence. It collects data from public sources, normalizes the records, caches query results, and generates structured Markdown/HTML reports.
- Pluggable data source architecture.
- ClinicalTrials.gov v2 API collector.
- PubMed E-utilities collector.
- Offline demo fixtures for deterministic grading and local testing without public API or LLM calls.
- SQLite cache with TTL and report version history.
- Markdown and HTML report output, with Chinese as the default report language.
- HTML reports convert common Markdown formatting from LLM-generated text into real HTML.
- Optional LLM analysis layer for target overview, pipeline summary, research dynamics, and competitive assessment.
- Inline SVG charts for trial phase distribution and publication trend.
- Basic logging and resilient network/API error handling.
- Unit tests for source orchestration, report generation, LLM fallback behavior, and Markdown-to-HTML rendering.
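The "SQLite cache with TTL" feature above can be sketched in standalone form. This is an illustrative example only, assuming a simple key/value table with a stored-at timestamp; the actual schema in `cache.py` (which also tracks report versions) will differ.

```python
import sqlite3
import time

class TTLCache:
    """Minimal SQLite key/value cache with a per-instance TTL (illustrative only)."""

    def __init__(self, path=":memory:", ttl_seconds=3600):
        self.ttl = ttl_seconds
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, value TEXT, stored_at REAL)"
        )

    def get(self, key):
        row = self.conn.execute(
            "SELECT value, stored_at FROM cache WHERE key = ?", (key,)
        ).fetchone()
        if row is None or time.time() - row[1] > self.ttl:
            return None  # missing or expired
        return row[0]

    def put(self, key, value):
        self.conn.execute(
            "INSERT OR REPLACE INTO cache (key, value, stored_at) VALUES (?, ?, ?)",
            (key, value, time.time()),
        )
        self.conn.commit()
```

Expired entries are simply treated as misses on read, which keeps the write path simple and lets the collector re-fetch and overwrite stale rows.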
Create a fresh virtual environment from the project root. The commands below install the app and its runtime dependencies, including the OpenAI Python SDK used for optional LLM analysis.
```powershell
python -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
python -m pip install -e .
```

If PowerShell blocks activation scripts, use the venv interpreter directly:
```powershell
.\.venv\Scripts\python.exe -m pip install --upgrade pip
.\.venv\Scripts\python.exe -m pip install -e .
```

Generate a deterministic offline demo report. Offline mode uses the built-in demo records only; it does not call public APIs or the LLM, even when an LLM API key is configured.
```powershell
python -m research_intel --target HER2 --offline --format both
```

Generated files are written to `reports/`. Reports are generated in Chinese by default. Offline HER2 sample reports are included at `reports/sample_HER2_offline.md` and `reports/sample_HER2_offline.html`; live/online HER2 sample reports are included at `reports/sample_HER2.md` and `reports/sample_HER2.html`.
To generate the report in English, pass `--language english`:

```powershell
python -m research_intel --target HER2 --offline --format both --language english
```

To run against live public APIs:

```powershell
python -m research_intel --target "PD-L1" --format both
```

The same language option works with live API runs:

```powershell
python -m research_intel --target "PD-L1" --format both --language english
```

Optional PubMed email/tool identification can be configured with environment variables:
```powershell
$env:PUBMED_EMAIL="your.email@example.com"
$env:PUBMED_TOOL="research-intel-prototype"
```

To use LLM-generated analysis in the four main report sections, configure an OpenAI-compatible chat API key:
```powershell
$env:OPENAI_API_KEY="your-api-key"
```

DashScope's OpenAI-compatible endpoint is also supported:

```powershell
$env:DASHSCOPE_API_KEY="your-dashscope-api-key"
$env:RESEARCH_INTEL_LLM_MODEL="qwen-plus"
$env:RESEARCH_INTEL_LLM_ENDPOINT="https://dashscope.aliyuncs.com/compatible-mode/v1"
```

Optional LLM settings:
```powershell
$env:RESEARCH_INTEL_LLM_MODEL="gpt-4o-mini"
$env:RESEARCH_INTEL_LLM_ENDPOINT="https://api.openai.com/v1"
```

No API keys are hard-coded. If the LLM key is not configured or the LLM call fails, the report falls back to deterministic local summaries.
Run the unit tests:

```powershell
python -m unittest discover -s tests
```

You can also use the installed console command:
```powershell
research-intel --target HER2 --offline --format markdown
```

Useful CLI options:

- `--target HER2` sets the target or biomarker to research.
- `--offline` uses built-in demo records only; it does not call public APIs or the LLM.
- `--format markdown|html|both` controls the output file type.
- `--language chinese|english` controls the report language. The default is `chinese`.
- `--verbose` enables debug logging.
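The option surface above could be declared with `argparse` roughly as below. This is a hypothetical mirror of the CLI, not the project's actual parser; only the `--language` default is documented, so other defaults are left unset.

```python
import argparse

def build_parser():
    """Sketch of the research-intel CLI options (illustrative, not the real parser)."""
    parser = argparse.ArgumentParser(prog="research-intel")
    parser.add_argument("--target", required=True,
                        help="target or biomarker to research")
    parser.add_argument("--offline", action="store_true",
                        help="use built-in demo records only; no public API or LLM calls")
    parser.add_argument("--format", choices=["markdown", "html", "both"],
                        help="output file type")
    parser.add_argument("--language", choices=["chinese", "english"], default="chinese",
                        help="report language")
    parser.add_argument("--verbose", action="store_true", help="enable debug logging")
    return parser
```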
New collectors should follow the existing DataSource interface in `research_intel/sources/base.py`.

- Create a new module under `research_intel/sources/`.
- Implement a class with a `name` and a `fetch(target: str) -> SourceResult` method.
- Convert source-specific API responses into the normalized models in `research_intel/models.py`, usually `TrialRecord` or `PublicationRecord`.
- Return a `SourceResult` with records, warnings, and cache status.
- Add the new source to the live `sources` list in `IntelligencePipeline.from_settings()`.
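The steps above can be sketched as a minimal collector. The stand-in dataclasses below exist only to make the example self-contained; the real types live in `research_intel/models.py` and `research_intel/sources/base.py`, and their exact fields may differ.

```python
from dataclasses import dataclass, field

# Stand-in types so the sketch runs on its own (field names are assumptions).
@dataclass
class PublicationRecord:
    title: str
    source: str

@dataclass
class SourceResult:
    records: list = field(default_factory=list)
    warnings: list = field(default_factory=list)
    from_cache: bool = False

class ConferenceAbstractSource:
    """Hypothetical collector following the DataSource shape: a name plus fetch()."""
    name = "conference-abstracts"

    def fetch(self, target: str) -> SourceResult:
        # A real collector would call an HTTP API here and normalize the payload.
        raw = [{"title": f"{target} bispecific update"}]
        records = [PublicationRecord(title=item["title"], source=self.name)
                   for item in raw]
        return SourceResult(records=records, warnings=[])
```

An instance of this class would then be appended to the live `sources` list in `IntelligencePipeline.from_settings()`.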
The current pipeline automatically deduplicates and renders `TrialRecord` and `PublicationRecord`. For new data categories such as approvals, patents, conference abstracts, or company pipeline pages, either map them into the closest existing record type for a quick prototype or add a new model type and update the pipeline and report renderer.
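The "add a new model type" route could look like the sketch below. `ApprovalRecord` and its renderer hook are hypothetical; nothing with these names exists in `research_intel/models.py` today.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ApprovalRecord:
    """Hypothetical new normalized model for regulatory approvals."""
    target: str
    drug_name: str
    agency: str
    approval_date: Optional[str] = None

    def to_markdown_row(self) -> str:
        # Keeping a per-model renderer hook localizes the changes needed in report.py.
        return f"| {self.drug_name} | {self.agency} | {self.approval_date or 'unknown'} |"
```

The pipeline's dedup step and the report renderer would then need a section that iterates over `ApprovalRecord` instances alongside trials and publications.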
```text
.
├── .gitignore                  Git ignore rules for local/runtime artifacts
├── pyproject.toml              Package metadata and console script definition
├── requirements.txt            Runtime dependency list for pip installs
├── README.md                   Main English quick-start documentation
├── README_detailed.md          Detailed English documentation
├── README_zh.md                Chinese documentation
├── written_test.md             Original project requirements
├── research_intel/
│   ├── __init__.py             Package marker
│   ├── __main__.py             CLI entrypoint
│   ├── app.py                  Pipeline orchestration
│   ├── cache.py                SQLite cache and report versions
│   ├── config.py               Settings loaded from environment
│   ├── http.py                 Resilient stdlib HTTP client
│   ├── llm.py                  Optional LLM report analyzer
│   ├── models.py               Normalized dataclasses
│   ├── report.py               Markdown/HTML report renderer
│   └── sources/                Pluggable collectors
│       ├── base.py             DataSource interface
│       ├── clinical_trials.py  ClinicalTrials.gov collector
│       ├── offline.py          Built-in demo fixture source
│       └── pubmed.py           PubMed collector
├── docs/
│   ├── architecture.md         System design document
│   └── architecture_zh.md      Chinese system design document
├── reports/
│   ├── sample_HER2.md          Example Markdown report generated from live/online sources
│   ├── sample_HER2.html        Example HTML report generated from live/online sources
│   ├── sample_HER2_offline.md / .html  Example reports generated from offline fixtures
│   ├── *.md / *.html           Generated report outputs
│   └── *.sqlite3*              Local cache/report metadata databases
└── tests/
    └── test_pipeline.py        Offline pipeline and report tests
```