A bioinformatics pipeline for downloading and analyzing genes associated with specific phenotypes from multiple biological databases. The package provides R scripts for gene retrieval and Python tools for downstream analysis with publication-quality visualizations.
PhenotypeToGeneDownloaderR integrates multiple biological databases to provide a unified approach to phenotype-specific gene discovery and analysis. In the current repository state, there are 12 active database downloader scripts plus a master coordinator.
- R Gene Retrieval Pipeline: phenotype-first database retrieval
- Python Analysis Suite: downstream overlap, frequency, enrichment, and summary analytics
- Standardized Output: per-source CSVs plus combined outputs
- Publication-Quality Visualizations: high-resolution plots and tabular reports
| Database | Script | Description | Gene Extraction Method |
|---|---|---|---|
| PubMed + PubTator3 | pubmed.R | Literature mining | PMID search + PubTator3 gene annotations |
| OMIM | omim.R | Mendelian disorders | OMIM API first, HTML scraping fallback |
| STRING-DB | string_db.R | Protein interactions | API-based interaction network analysis |
| DisGeNET | disgenet.R | Gene-disease associations | UMLS disease lookup + disease2gene scoring |
| ClinVar | clinvar.R | Clinical variants | NCBI ClinVar VCV XML structured parsing |
| Reactome | reactome_pathways.R | Biological pathways | Pathway title matching + gene mapping |
| KEGG | kegg.R | Pathways | Pathway-title scoring + direct GENE-field parsing |
| HPO | hpo.R | Human phenotypes | Local ontology/annotation matching + fallback known genes |
| GTEx | gtex.R | Gene expression | Dynamic tissue ranking by PubMed + eQTL filtering |
| UniProt | uniprot.R | Protein knowledgebase | Multi-query REST retrieval of reviewed human entries |
| Open Targets | opentargets.R | Disease-target associations | GraphQL disease search + associated targets pagination |
| GWAS Catalog | gwasrapidd.R | Variant-gene mappings | Expanded term search via EFO/reported trait APIs |
- R ≥ 4.0.0
- Python ≥ 3.8
- Internet connection for database access
- Clone the repository:
git clone https://github.com/MuhammadMuneeb007/PhenotypeToGeneDownloaderR.git
cd PhenotypeToGeneDownloaderR
- Install R dependencies:
Rscript requirements.R
- Install Python dependencies:
pip install -r requirements.txt
# OR using conda:
conda env create -f environment.yml
conda activate gene-analysis

Download genes for a specific phenotype from all databases:
# Download genes for migraine
Rscript download_genes.R migraine
# Download genes for diabetes
Rscript download_genes.R diabetes
# Force re-download (ignore existing files)
Rscript download_genes.R migraine --force

Output: Creates the AllPackagesGenes/ directory with CSV files:
- migraine_pubmed_pubtator.csv
- migraine_pubmed_genes.csv
- migraine_omim_genes.csv
- migraine_string_db_genes.csv
- ... (one file per database)
Run comprehensive analysis on downloaded genes:
# Analyze migraine genes
python download_genes_analysis.py migraine
# Analyze diabetes genes
python download_genes_analysis.py diabetes

Purpose: Retrieves phenotype-linked publications, then extracts genes from PubTator3 annotations.
Required Packages: httr, jsonlite, xml2
Command: Rscript pubmed.R <phenotype>
Primary outputs:
- {phenotype}_pubmed_pubtator.csv
- {phenotype}_pubmed_pubtator_detailed.csv
- {phenotype}_pubmed_pubtator_metadata.csv
- {phenotype}_pubmed_genes.csv
flowchart TD
A[Phenotype] --> B[Build PubMed search queries]
B --> C[NCBI ESearch returns PMIDs]
C --> D[NCBI EFetch article metadata]
C --> E[PubTator3 BioCJSON annotations]
E --> F[Extract gene mentions and Entrez IDs]
D --> G[Merge metadata with annotations]
F --> G
G --> H[Summarize per gene by PMID support]
H --> I[Write summary, detailed, metadata, genes-only CSVs]
How it works (text explanation):
- The script creates multiple PubMed search strategies for the same phenotype, so it captures broader literature coverage.
- It first collects PMIDs, then fetches article metadata and PubTator3 gene annotations separately.
- Gene mentions are normalized and merged with article context, then summarized by evidence strength (PMID_Count).
- Final outputs include a summary table, detailed mention-level evidence, metadata, and a genes-only file.
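
The per-gene summarization step can be illustrated with a minimal Python sketch. It is not the code in pubmed.R: the input keys (`pmid`, `gene`, `entrez_id`), the upper-casing used for normalization, and the sample rows are assumptions made for illustration. Only the output column names and the `PubMed+PubTator3` source label are taken from this README.

```python
from collections import defaultdict

def summarize_mentions(mentions, phenotype):
    """Collapse mention-level rows into one row per gene, ranked by PMID support."""
    pmids_by_gene = defaultdict(set)
    entrez_by_gene = {}
    for m in mentions:
        gene = m["gene"].strip().upper()  # simple normalization (assumption)
        pmids_by_gene[gene].add(m["pmid"])  # a set counts each PMID once per gene
        entrez_by_gene.setdefault(gene, m["entrez_id"])
    rows = [
        {
            "Gene": gene,
            "Entrez_ID": entrez_by_gene[gene],
            "PMID_Count": len(pmids),
            "Supporting_PMIDs": ";".join(sorted(pmids)),
            "Phenotype": phenotype,
            "Source": "PubMed+PubTator3",
        }
        for gene, pmids in pmids_by_gene.items()
    ]
    # Strongest literature support first; ties broken alphabetically
    rows.sort(key=lambda r: (-r["PMID_Count"], r["Gene"]))
    return rows

# Made-up mention rows, for illustration only
example_mentions = [
    {"pmid": "111", "gene": "CACNA1A", "entrez_id": "773"},
    {"pmid": "222", "gene": "cacna1a", "entrez_id": "773"},
    {"pmid": "111", "gene": "CACNA1A", "entrez_id": "773"},
    {"pmid": "333", "gene": "SCN1A", "entrez_id": "6323"},
]
summary = summarize_mentions(example_mentions, "migraine")
```

Counting distinct PMIDs rather than raw mentions keeps a single paper that names a gene many times from inflating its rank.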
Purpose: Uses OMIM API when available; falls back to OMIM web scraping if needed.
Required Packages: httr, jsonlite, rvest, stringr
Command: Rscript omim.R <phenotype>
Primary outputs:
- {phenotype}_omim.csv
- {phenotype}_omim_genes.csv
flowchart TD
A[Phenotype] --> B[Build OMIM search terms]
B --> C{OMIM API key available?}
C -->|Yes| D[Search OMIM API entries]
D --> E[Extract geneMap symbols and phenotype mapping]
C -->|No or empty| F[Scrape OMIM search pages]
F --> G[Extract candidate gene symbols from page text]
E --> H[Deduplicate and sort]
G --> H
H --> I[Write OMIM full and genes-only CSVs]
How it works (text explanation):
- The script builds phenotype-specific OMIM search terms and attempts the official OMIM API first.
- If API access is unavailable or returns no hits, it falls back to OMIM search-page scraping.
- Both paths produce candidate gene symbols, which are cleaned, deduplicated, and sorted.
- It writes one detailed OMIM result file and one genes-only file.
Purpose: Resolves phenotype text to STRING proteins, then expands via interaction partners.
Required Packages: httr, jsonlite
Command: Rscript string_db.R <phenotype>
Primary outputs:
- {phenotype}_string_db.csv
- {phenotype}_string_db_genes.csv
flowchart TD
A[Phenotype] --> B[STRING get_string_ids]
B --> C[Seed proteins preferredName]
C --> D[Loop seeds: interaction_partners API]
D --> E[Filter human partners score >= 400]
E --> F[Collect preferredName_B and metadata]
F --> G[Deduplicate by Gene and rank by score]
G --> H[Write STRING full and genes-only CSVs]
How it works (text explanation):
- The phenotype text is resolved into STRING seed proteins using get_string_ids.
- For each seed, the script queries interaction partners and keeps human partners above the confidence threshold.
- Partner genes are collected with interaction scores and STRING IDs.
- Genes are deduplicated and ranked by interaction strength before export.
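
The filter, deduplicate, and rank steps can be sketched in Python as follows. This is an illustration under stated assumptions rather than the logic of string_db.R: the `taxon_id` and `score` keys are hypothetical, scores are assumed to be on a 0 to 1000 scale so that the 400 cutoff from the flowchart applies directly, and the partner rows are invented.

```python
HUMAN_TAXON = 9606  # NCBI taxonomy ID for Homo sapiens

def rank_partners(partner_rows, min_score=400):
    """Keep human partners at or above min_score, one row per gene, best score first."""
    best = {}
    for row in partner_rows:
        if row["taxon_id"] != HUMAN_TAXON or row["score"] < min_score:
            continue
        gene = row["preferredName_B"]
        # Deduplicate by gene, keeping the strongest interaction seen so far
        if gene not in best or row["score"] > best[gene]["score"]:
            best[gene] = row
    return sorted(best.values(), key=lambda r: r["score"], reverse=True)

# Invented partner rows, for illustration only
example_partners = [
    {"preferredName_B": "GENE_A", "score": 900, "taxon_id": 9606},
    {"preferredName_B": "GENE_A", "score": 700, "taxon_id": 9606},
    {"preferredName_B": "GENE_B", "score": 350, "taxon_id": 9606},
    {"preferredName_B": "GENE_C", "score": 650, "taxon_id": 10090},
    {"preferredName_B": "GENE_D", "score": 400, "taxon_id": 9606},
]
ranked = rank_partners(example_partners)
```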
Purpose: Resolves disease identifiers, then fetches scored gene-disease associations.
Required Packages: dplyr, devtools, disgenet2r
Command: Rscript disgenet.R <phenotype> [database] [min_score] [max_score]
Primary outputs:
- {phenotype}_disgenet.csv
- {phenotype}_disgenet_genes.csv
flowchart TD
A[Phenotype] --> B[get_umls_from_vocabulary]
B --> C[Pick best disease row and ID]
C --> D[Build query candidates ID/UMLS_/phenotype]
D --> E[Try disease2gene until first successful hit]
E --> F[Extract associations table]
F --> G[Build genes-only from gene_symbol column]
G --> H[Write DisGeNET full and genes-only CSVs]
How it works (text explanation):
- The phenotype is mapped to disease identifiers through DisGeNET/UMLS lookup.
- The script builds multiple candidate disease queries (ID variants and phenotype text fallback).
- It runs disease2gene with score filtering and keeps the first successful association table.
- Results are exported as a full association file plus a genes-only list.
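
The "try candidates until the first successful hit" pattern can be sketched generically in Python. The `fetch` callable below is a stand-in for the real disease2gene call, and the identifier `C0000000` is a placeholder rather than a real UMLS concept; both are assumptions for illustration.

```python
def first_successful(candidates, fetch):
    """Return (query, rows) for the first candidate whose fetch yields rows."""
    for query in candidates:
        try:
            rows = fetch(query)
        except Exception:
            # A failed lookup is not fatal: move on to the next candidate
            continue
        if rows:
            return query, rows
    return None, []

def fake_fetch(query):
    """Hypothetical stand-in for a gene-disease association lookup."""
    if query == "C0000000":
        raise RuntimeError("lookup failed")
    if query == "UMLS_C0000000":
        return []
    return [{"gene_symbol": "GENE_A", "score": 0.7}]

used_query, associations = first_successful(
    ["C0000000", "UMLS_C0000000", "migraine"], fake_fetch
)
```

Ordering the candidates from most to least specific means the free-text phenotype is used only when the identifier-based queries fail or return nothing.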
Purpose: Searches ClinVar IDs and parses structured VCV XML fields for gene symbols.
Required Packages: rentrez, xml2, dplyr
Command: Rscript clinvar.R <phenotype> [max_results]
Primary outputs:
- {phenotype}_clinvar.csv
- {phenotype}_clinvar_genes.csv
flowchart TD
A[Phenotype] --> B[Generate ClinVar search term variants]
B --> C[Entrez search gets variation IDs]
C --> D[Batch Entrez fetch VCV XML]
D --> E[Parse structured fields only]
E --> F[Extract genes, variation IDs, significance, conditions]
F --> G[Deduplicate by Gene and VariationID]
G --> H[Write ClinVar full and genes-only CSVs]
How it works (text explanation):
- The script searches ClinVar using multiple phenotype-oriented query forms.
- It fetches VCV XML in batches and parses structured tags (not loose free-text regex extraction).
- It extracts gene symbols plus variant IDs, conditions, and clinical significance metadata.
- Gene-variant records are deduplicated and written to full + genes-only outputs.
Purpose: Matches phenotype text to human Reactome pathway names, then maps pathway genes.
Required Packages: ReactomePA, reactome.db, org.Hs.eg.db, AnnotationDbi, dplyr, stringr
Command: Rscript reactome_pathways.R <phenotype>
Primary outputs:
- {phenotype}_reactome_pathways.csv
- {phenotype}_reactome_pathways_genes.csv
flowchart TD
A[Phenotype] --> B[Load all human Reactome pathways]
B --> C[Build phrase and word regex patterns]
C --> D[Match relevant pathway IDs]
D --> E[Limit pathway set max 50]
E --> F[Map PATHID to Entrez IDs]
F --> G[Map Entrez IDs to gene symbols]
G --> H[Deduplicate genes and write CSVs]
How it works (text explanation):
- The script loads all human Reactome pathways from local annotation databases.
- It builds regex patterns from the phenotype text and finds matching pathway names.
- Matched pathway IDs are converted to Entrez IDs, then mapped to gene symbols.
- Genes are deduplicated and exported with pathway context.
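
The phrase-and-word pattern matching can be sketched in Python. This is only an approximation of the idea: the minimum word length of three characters, the pathway IDs, and the example names are assumptions, and the exact patterns used by reactome_pathways.R may differ. The cap of 50 pathways comes from the flowchart above.

```python
import re

def build_patterns(phenotype):
    """One pattern for the whole phrase plus one whole-word pattern per word."""
    phrase = phenotype.lower().strip()
    words = [w for w in re.split(r"\W+", phrase) if len(w) >= 3]  # length cutoff is an assumption
    patterns = [re.compile(re.escape(phrase))]
    patterns += [re.compile(r"\b" + re.escape(w) + r"\b") for w in words]
    return patterns

def match_pathways(pathway_names, phenotype, limit=50):
    """Return IDs of pathways whose name matches any pattern, capped at `limit`."""
    patterns = build_patterns(phenotype)
    hits = [
        pathway_id
        for pathway_id, name in pathway_names.items()
        if any(p.search(name.lower()) for p in patterns)
    ]
    return hits[:limit]

# Placeholder IDs and names, for illustration only
example_pathways = {
    "R-HSA-0000001": "Insulin receptor signalling cascade",
    "R-HSA-0000002": "Regulation of insulin secretion",
    "R-HSA-0000003": "Cell cycle checkpoints",
}
matched = match_pathways(example_pathways, "insulin resistance")
```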
Purpose: Scores pathway-title relevance to phenotype, then parses genes directly from matched pathways.
Required Packages: KEGGREST
Command: Rscript kegg.R <phenotype>
Primary outputs:
- {phenotype}_kegg.csv
- {phenotype}_kegg_genes.csv
- {phenotype}_kegg_pathways.csv
flowchart TD
A[Phenotype] --> B[List all hsa KEGG pathways]
B --> C[Score title match phrase plus word overlap]
C --> D[Select matched pathways]
D --> E[Fetch each pathway with keggGet]
E --> F[Parse alternating GENE id and description pairs]
F --> G[Extract gene symbols and KEGG gene IDs]
G --> H[Deduplicate and write full, genes-only, pathways-only CSVs]
How it works (text explanation):
- KEGG pathway titles are scored against the phenotype text (exact phrase + keyword overlap).
- Only matched pathways are fetched in detail; this is the key indirect mapping step.
- Genes are parsed directly from KEGG pathway GENE fields (ID/description pairs).
- Outputs include full pathway-gene mappings, genes-only, and pathways-only tables.
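
Parsing alternating ID/description pairs can be sketched in Python. The sketch assumes the GENE field arrives as a flat list in which even positions hold gene IDs and odd positions hold descriptions of the form `SYMBOL; description`; the sample list is illustrative and trimmed.

```python
def parse_gene_field(gene_field):
    """Turn [id, description, id, description, ...] into gene records."""
    records = []
    # Pair each even-indexed ID with the description that follows it
    for gene_id, description in zip(gene_field[0::2], gene_field[1::2]):
        symbol = description.split(";", 1)[0].strip()
        if symbol:
            records.append({"KEGG_Gene_ID": gene_id, "Gene": symbol})
    return records

# Illustrative, trimmed GENE field
example_gene_field = ["3630", "INS; insulin", "3643", "INSR; insulin receptor"]
kegg_genes = parse_gene_field(example_gene_field)
```

Because `zip` stops at the shorter slice, a trailing ID without a description is dropped rather than raising an error.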
Purpose: Uses downloaded HPO resources with cached parsing and pattern-based matching.
Required Packages: ontologyIndex, httr, utils
Command: Rscript hpo.R <phenotype>
Primary outputs:
- {phenotype}_hpo.csv
- {phenotype}_hpo_genes.csv
flowchart TD
A[Phenotype] --> B[Download/cache HPO files]
B --> C[Load gene-phenotype annotations]
C --> D[Create phenotype search patterns]
D --> E[Match HPO terms and disease names]
E --> F{Any matches?}
F -->|Yes| G[Score, deduplicate, rank associations]
F -->|No| H[Fallback to known phenotype-gene list]
G --> I[Write HPO full and genes-only CSVs]
H --> I
How it works (text explanation):
- HPO resources are downloaded/cached locally (ontology + annotation files).
- The phenotype is expanded into search patterns and matched against HPO term and disease annotations.
- Hits are scored by match type (exact/contains/pattern), then deduplicated.
- If no ontology hits are found, a fallback known-association gene set is used.
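
Scoring by match type, with a fallback when nothing matches, can be sketched in Python. The numeric weights, the definition of a "pattern" match (every phenotype word appears at a word boundary), and the gene and term names are all assumptions for illustration; hpo.R may score differently.

```python
import re

MATCH_SCORES = {"exact": 3, "contains": 2, "pattern": 1}  # hypothetical weights

def classify_match(phenotype, term_name):
    """Classify how a term name relates to the phenotype text."""
    p = phenotype.lower().strip()
    t = term_name.lower().strip()
    if t == p:
        return "exact"
    if p in t:
        return "contains"
    words = p.split()
    if words and all(re.search(r"\b" + re.escape(w), t) for w in words):
        return "pattern"
    return None

def score_genes(phenotype, annotations, fallback_genes=()):
    """Keep each gene's best match; use the fallback list if nothing matched."""
    best = {}
    for gene, term_name in annotations:
        kind = classify_match(phenotype, term_name)
        if kind is None:
            continue
        score = MATCH_SCORES[kind]
        if score > best.get(gene, (0, None))[0]:
            best[gene] = (score, kind)
    if not best:
        return [(gene, 0, "fallback") for gene in fallback_genes]
    return sorted(
        ((gene, score, kind) for gene, (score, kind) in best.items()),
        key=lambda r: (-r[1], r[0]),
    )

# Invented gene-term annotations, for illustration only
example_annotations = [
    ("GENE_A", "Hearing loss"),
    ("GENE_B", "Progressive hearing loss"),
    ("GENE_C", "Loss of hearing"),
    ("GENE_A", "Loss of hearing"),
    ("GENE_D", "Seizure"),
]
hpo_hits = score_genes("hearing loss", example_annotations)
```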
Purpose: Ranks tissues by phenotype-specific literature support, then collects significant eGenes.
Required Packages: gtexr, dplyr, httr, jsonlite
Command: Rscript gtex.R <phenotype> [q_threshold] [max_genes_per_tissue] [max_tissues]
Primary outputs:
- {phenotype}_gtex_ranked_tissues.csv
- {phenotype}_gtex_tissue_eqtls.csv
- {phenotype}_gtex_prioritized_genes.csv
- {phenotype}_gtex_genes.csv
flowchart TD
A[Phenotype] --> B[Fetch available GTEx tissues]
B --> C[Query PubMed counts for phenotype plus tissue]
C --> D[Rank tissues and select top N]
D --> E[Fetch tissue eGenes using gtexr]
E --> F[Filter by q-value or p-value threshold]
F --> G[Aggregate per-gene across tissues]
G --> H[Write ranked tissues, eQTL rows, prioritized genes, genes-only CSVs]
How it works (text explanation):
- The script first ranks GTEx tissues by PubMed co-mention counts with the phenotype.
- It queries eGenes for top-ranked tissues and filters using significance thresholds.
- Tissue-level gene signals are aggregated into prioritized gene summaries.
- It exports ranked tissues, full tissue-eQTL rows, prioritized genes, and genes-only files.
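
The threshold filter and per-gene aggregation can be sketched in Python. The row keys, the default threshold of 0.05, the ranking rule (more tissues first, then smaller q-value), and the sample rows are assumptions for illustration, not the defaults of gtex.R.

```python
def prioritize_genes(egene_rows, q_threshold=0.05):
    """Filter eGene rows by q-value, then aggregate per gene across tissues."""
    per_gene = {}
    for row in egene_rows:
        if row["qvalue"] > q_threshold:
            continue
        entry = per_gene.setdefault(
            row["gene"], {"tissues": set(), "best_q": row["qvalue"]}
        )
        entry["tissues"].add(row["tissue"])
        entry["best_q"] = min(entry["best_q"], row["qvalue"])
    rows = [
        {
            "Gene": gene,
            "Tissue_Count": len(entry["tissues"]),
            "Tissues": sorted(entry["tissues"]),
            "Best_Q": entry["best_q"],
        }
        for gene, entry in per_gene.items()
    ]
    # Genes significant in more tissues rank first; ties go to the smaller q-value
    rows.sort(key=lambda r: (-r["Tissue_Count"], r["Best_Q"], r["Gene"]))
    return rows

# Invented eGene rows, for illustration only
example_egenes = [
    {"gene": "GENE_A", "tissue": "Brain_Cortex", "qvalue": 0.001},
    {"gene": "GENE_A", "tissue": "Whole_Blood", "qvalue": 0.02},
    {"gene": "GENE_B", "tissue": "Brain_Cortex", "qvalue": 0.0005},
    {"gene": "GENE_C", "tissue": "Whole_Blood", "qvalue": 0.2},
]
prioritized = prioritize_genes(example_egenes)
```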
Purpose: Executes multiple reviewed-human UniProt queries and extracts primary/synonym gene names.
Required Packages: httr, jsonlite
Command: Rscript uniprot.R <phenotype>
Primary outputs:
- {phenotype}_uniprot.csv
- {phenotype}_uniprot_genes.csv
flowchart TD
A[Phenotype] --> B[Build four UniProt query templates]
B --> C[Run REST search reviewed human entries]
C --> D[Extract geneName and synonyms from each result]
D --> E[Attach UniProt accession and protein name]
E --> F[Deduplicate genes across query types]
F --> G[Write UniProt full and genes-only CSVs]
How it works (text explanation):
- Multiple UniProt query templates are generated to capture different annotation fields.
- Searches are restricted to reviewed human records for quality control.
- Primary and synonym gene names are extracted from each hit and merged.
- Final deduplicated genes are written with UniProt metadata.
Purpose: Finds disease ID by phenotype term, then paginates associated targets.
Required Packages: httr, jsonlite
Command: Rscript opentargets.R <phenotype>
Primary outputs:
- {phenotype}_opentargets.csv
- {phenotype}_opentargets_genes.csv
flowchart TD
A[Phenotype] --> B[GraphQL disease search query]
B --> C[Take top matched disease ID]
C --> D[Paginate disease associatedTargets]
D --> E[Extract approved symbol, ID, name, score]
E --> F[Write OpenTargets full and genes-only CSVs]
How it works (text explanation):
- A GraphQL search resolves phenotype text to the closest disease entry.
- The selected disease ID is used to request associated targets page-by-page.
- Target symbol, target ID, gene name, and association score are extracted.
- Results are saved as a full table and a genes-only file.
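
Page-by-page retrieval can be sketched in Python with the network call abstracted away. The `fetch_page(index, size)` callable and its return shape (`count` plus `rows`) are assumptions chosen for illustration; the fake fetcher below stands in for the real GraphQL request, and the page size of 100 is arbitrary.

```python
def collect_all_targets(fetch_page, page_size=100):
    """Request pages until every reported row has been collected."""
    rows = []
    index = 0
    while True:
        page = fetch_page(index, page_size)
        rows.extend(page["rows"])
        # Stop on an empty page or once the reported total is reached
        if not page["rows"] or len(rows) >= page["count"]:
            break
        index += 1
    return rows

# Hypothetical data source with 250 targets, for illustration only
ALL_TARGETS = [{"symbol": f"GENE_{i}", "score": 1.0 - i / 1000} for i in range(250)]

def fake_fetch_page(index, size):
    start = index * size
    return {"count": len(ALL_TARGETS), "rows": ALL_TARGETS[start:start + size]}

targets = collect_all_targets(fake_fetch_page)
```

Checking for an empty page as well as the reported count guards against an endless loop if the total changes between requests.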
Purpose: Expands phenotype terms and queries both EFO and reported traits for variant-gene contexts.
Required Packages: gwasrapidd, dplyr
Command: Rscript gwasrapidd.R <phenotype>
Primary outputs:
- {phenotype}_gwasrapidd.csv
- {phenotype}_gwasrapidd_genes.csv
flowchart TD
A[Phenotype] --> B[Build expanded phenotype search terms]
B --> C[Discover additional trait terms from GWAS]
C --> D[Run EFO-trait and reported-trait variant queries]
D --> E[Extract genomic_context gene mappings]
E --> F[Combine and deduplicate gene-variant rows]
F --> G[Write GWAS full and genes-only CSVs]
How it works (text explanation):
- The phenotype is expanded into additional trait phrases (generic and phenotype-specific expansions).
- The script queries GWAS both by EFO trait and reported trait for each candidate term.
- Variant genomic-context mappings are converted to gene-centric rows.
- Combined rows are deduplicated and exported as full GWAS mappings plus genes-only output.
Purpose: Executes configured database scripts in a coordinated manner.
Features:
- Smart file existence checking
- Force re-download option (--force)
- Progress tracking and error handling
- Comprehensive execution summary
- Combined outputs (*_ALL_SOURCES_GENES.csv, *_SOURCES_SUMMARY.csv)
Command:
Rscript download_genes.R <phenotype>
Rscript download_genes.R <phenotype> --force

Execution Flow:
- Validates phenotype input
- Creates AllPackagesGenes/ output directory
- Executes each database script sequentially
- Captures success/failure status and runtime
- Generates execution summary report
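
The file-existence check combined with the --force option can be sketched in Python. This mirrors the behaviour described above but is not taken from download_genes.R: the function name is hypothetical, and treating an empty file as missing is an assumption.

```python
import tempfile
from pathlib import Path

def needs_download(out_dir, phenotype, suffix, force=False):
    """Return True when a source should be (re)run for this phenotype."""
    if force:
        return True
    target = Path(out_dir) / f"{phenotype}_{suffix}"
    # Assumption: an empty file counts as a failed earlier run
    return not (target.exists() and target.stat().st_size > 0)

# Demonstration in a temporary directory so no real outputs are touched
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "migraine_omim_genes.csv").write_text("Gene\nCACNA1A\n")
    run_existing = needs_download(tmp, "migraine", "omim_genes.csv")
    run_missing = needs_download(tmp, "migraine", "kegg_genes.csv")
    run_forced = needs_download(tmp, "migraine", "omim_genes.csv", force=True)
```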
Repository alignment note:
- The coordinator currently includes legacy references to gene_ontology.R and pubmed_pubtator.R.
- The current repository includes pubmed.R and does not include gene_ontology.R.
- If needed, update the scripts_info mapping in download_genes.R to keep code and docs fully synchronized.
PhenotypeToGeneDownloaderR/
├── README.md                          # This comprehensive documentation
├── requirements.R                     # R package installer script
├── requirements.txt                   # Python dependencies
├── environment.yml                    # Conda environment file
├── download_genes.R                   # Master R coordination script
├── download_genes_analysis.py         # Master Python analysis script
│
├── Individual R Scripts (active in this repository)/
│   ├── pubmed.R                       # PubMed literature mining
│   ├── omim.R                         # OMIM genetic disorders
│   ├── string_db.R                    # STRING protein interactions
│   ├── disgenet.R                     # DisGeNET gene-disease associations
│   ├── clinvar.R                      # ClinVar clinical variants
│   ├── reactome_pathways.R            # Reactome biological pathways
│   ├── kegg.R                         # KEGG metabolic pathways
│   ├── hpo.R                          # HPO human phenotypes
│   ├── gtex.R                         # GTEx gene expression
│   ├── uniprot.R                      # UniProt protein database
│   ├── opentargets.R                  # OpenTargets drug targets
│   └── gwasrapidd.R                   # GWAS Catalog associations
│
├── GenePlots/                         # Python analysis modules
│   ├── Analysis1_SourceComparison.py  # Database performance analysis
│   ├── Analysis2_GeneFrequency.py     # Gene frequency distributions
│   ├── Analysis3_DatabaseOverlap.py   # Venn diagrams and overlaps
│   ├── Analysis4_GeneSetEnrichment.py # Pathway enrichment analysis
│   └── Analysis6_StatisticalSummary.py # Comprehensive statistics
│
├── AllPackagesGenes/                  # R script outputs (CSV files)
└── AllAnalysisGene/                   # Python analysis outputs
    ├── plots/                         # Publication-quality visualizations
    ├── reports/                       # Statistical reports and summaries
    └── data/                          # Processed gene lists and metadata
Install all dependencies automatically:
Rscript requirements.R

Manual Installation:
Core CRAN Packages:
install.packages(c(
"dplyr", "readr", "stringr", "httr", "jsonlite", "xml2",
"rvest", "rentrez", "ontologyIndex", "gtexr", "gwasrapidd", "devtools"
))

Bioconductor Packages:
# requireNamespace() checks for the package without attaching it
if (!requireNamespace("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
BiocManager::install(c(
"ReactomePA", "reactome.db", "org.Hs.eg.db", "AnnotationDbi",
"KEGGREST"
))
# DisGeNET package is installed from GitLab in disgenet.R
devtools::install_gitlab("medbio/disgenet2r")

Install Python analysis dependencies:
pip install -r requirements.txt

Or using conda:
conda env create -f environment.yml
conda activate gene-analysis

Outputs are source-specific (not a single universal schema). Each script writes at least one full CSV and a genes-only CSV.
Core per-source outputs:
- PubMed: {phenotype}_pubmed_pubtator.csv, {phenotype}_pubmed_genes.csv
- OMIM: {phenotype}_omim.csv, {phenotype}_omim_genes.csv
- STRING-DB: {phenotype}_string_db.csv, {phenotype}_string_db_genes.csv
- DisGeNET: {phenotype}_disgenet.csv, {phenotype}_disgenet_genes.csv
- ClinVar: {phenotype}_clinvar.csv, {phenotype}_clinvar_genes.csv
- Reactome: {phenotype}_reactome_pathways.csv, {phenotype}_reactome_pathways_genes.csv
- KEGG: {phenotype}_kegg.csv, {phenotype}_kegg_genes.csv, {phenotype}_kegg_pathways.csv
- HPO: {phenotype}_hpo.csv, {phenotype}_hpo_genes.csv
- GTEx: {phenotype}_gtex_ranked_tissues.csv, {phenotype}_gtex_tissue_eqtls.csv, {phenotype}_gtex_prioritized_genes.csv, {phenotype}_gtex_genes.csv
- UniProt: {phenotype}_uniprot.csv, {phenotype}_uniprot_genes.csv
- Open Targets: {phenotype}_opentargets.csv, {phenotype}_opentargets_genes.csv
- GWAS Catalog: {phenotype}_gwasrapidd.csv, {phenotype}_gwasrapidd_genes.csv
Combined coordinator outputs:
- {phenotype}_ALL_SOURCES_GENES.csv
- {phenotype}_SOURCES_SUMMARY.csv
Example real column sets:
- *_pubmed_pubtator.csv: Gene, Entrez_ID, PMID_Count, Supporting_PMIDs, Phenotype, Source
- *_clinvar.csv: Gene, ClinVar_VariationID, ClinVar_Accession, Query_Phenotype, Matched_Condition, Clinical_Significance, Search_Term, Source
- *_kegg.csv: Phenotype, Pathway_ID, Pathway_Name, KEGG_Gene_ID, Gene, Source
- *_gwasrapidd.csv: Gene, Variant_ID, Chromosome, Position, Distance, Is_Mapped_Gene, Is_Closest_Gene, ...
Directory Structure:
AllAnalysisGene/
├── plots/      # Publication-quality visualizations (PNG, PDF)
├── reports/    # Statistical summaries and data tables (CSV, TXT)
└── data/       # Processed gene lists and metadata
Generated Files:
- {phenotype}_Analysis2_GeneFrequency.png/pdf
- {phenotype}_Analysis3_DatabaseOverlap.png/pdf
- {phenotype}_Analysis3_AllPairwiseVenns.png/pdf
- {phenotype}_Analysis3_ThreeWayVenns.png/pdf
- {phenotype}_Analysis3_TopDatabaseVenns.png/pdf
- {phenotype}_Analysis3_VennDiagrams.png/pdf
- {phenotype}_Analysis3_SimilarityHeatmap.png/pdf
- {phenotype}_Analysis4_GeneSetEnrichment.png/pdf
- {phenotype}_Analysis6_StatisticalSummary.png/pdf
- {phenotype}_ExecutiveSummary.txt
- {phenotype}_Analysis2_GeneFrequency_Report.csv
- {phenotype}_Analysis3_OverlapSummary.csv
- {phenotype}_Analysis4_EnrichmentResults.csv
- {phenotype}_Analysis6_FinalSummary.csv
- {phenotype}_all_genes.csv
- {phenotype}_summary_stats.csv
# Complete pipeline for migraine research
Rscript download_genes.R migraine
python download_genes_analysis.py migraine
# Complete pipeline for diabetes research
Rscript download_genes.R diabetes
python download_genes_analysis.py diabetes

# Query specific databases
Rscript pubmed.R migraine
Rscript omim.R migraine
Rscript string_db.R migraine

# Ignore existing files and re-download
Rscript download_genes.R migraine --force

# Load and run individual scripts
source("pubmed.R")
migraine_results <- download_pubmed_pubtator_genes("migraine")
# View results
head(migraine_results$summary)
table(migraine_results$summary$Source)

The Python analysis pipeline generates visualizations and statistical reports:
1. Source Comparison Analysis
- Database performance metrics
- Gene count comparisons
- Success rate analysis
- Execution time tracking
Example: Database performance comparison showing gene counts and success rates across all available databases
2. Gene Frequency Analysis
- Gene occurrence across databases
- Most frequently identified genes
- Database-specific gene distributions
- Statistical frequency analysis
Example: Gene frequency distribution showing how often genes appear across different databases
3. Database Overlap Analysis
- Comprehensive Venn Diagrams: All pairwise combinations
- Three-way overlaps: Top-performing database triplets
- Jaccard Similarity Heatmaps: Statistical overlap measurements
- Database-specific legends: Color-coded identification
Example: Comprehensive Venn diagram analysis showing gene overlaps between databases
Example: Jaccard similarity heatmap showing statistical overlap between all database pairs
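
The Jaccard similarity behind the heatmap is the size of the intersection of two gene sets divided by the size of their union. A minimal Python sketch follows; the database names and gene sets are invented, and returning 0.0 for two empty sets is a convention chosen here, not necessarily the one used in Analysis3_DatabaseOverlap.py.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B|; 0.0 when both sets are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def pairwise_jaccard(gene_sets):
    """Similarity for every unordered pair of sources, keyed by (name, name)."""
    return {
        (x, y): jaccard(gene_sets[x], gene_sets[y])
        for x, y in combinations(sorted(gene_sets), 2)
    }

# Invented gene sets, for illustration only
example_sets = {
    "OMIM": {"GENE_A", "GENE_B", "GENE_C"},
    "KEGG": {"GENE_B", "GENE_C", "GENE_D"},
    "HPO": {"GENE_X"},
}
similarities = pairwise_jaccard(example_sets)
```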
4. Gene Set Enrichment Analysis
- Pathway enrichment testing
- Functional annotation clustering
- GO term over-representation
- Statistical significance testing
5. Statistical Summary
- Executive summary reports
- Comprehensive statistics
- Publication-ready tables
- Key findings and recommendations
All visualizations are generated with:
- High Resolution: 300 DPI for publication
- Multiple Formats: PNG and PDF outputs
- Professional Styling: Consistent color schemes and fonts
- Statistical Annotations: P-values and confidence intervals
The pipeline includes sophisticated Venn diagram analysis:
- All Pairwise Comparisons: Every database pair analyzed
- Three-Way Overlaps: Top-performing database combinations
- Comprehensive Analysis: pairwise and multi-database overlap views in one workflow
- Statistical Annotations: Jaccard similarity coefficients
- Database-Specific Legends: Color-coded database identification
Example: All pairwise Venn diagrams showing detailed overlap analysis between database pairs
Individual Gene CSV Files (AllPackagesGenes/)
Each source writes its own schema. Example from PubMed summary output:
Gene,Entrez_ID,PMID_Count,Supporting_PMIDs,Phenotype,Source
CACNA1A,773,24,"12345;23456;34567",migraine,PubMed+PubTator3
SCN1A,6323,18,"11111;22222",migraine,PubMed+PubTator3
TRPV1,7442,14,"33333;44444",migraine,PubMed+PubTator3

Comprehensive Analysis Output (AllAnalysisGene/)
The analysis pipeline generates multiple visualization types:
- Gene count distribution across available databases, showing retrieval efficiency
- Most frequently identified genes across databases, with occurrence frequencies
- Database-specific performance metrics, including gene counts, success rates, and execution times
- Detailed overlap matrix showing shared genes between all database pairs
- Statistical dashboard with key findings and recommendations
1. Missing R Packages
# Automatic installation
Rscript requirements.R
# Manual troubleshooting
install.packages("package_name")
BiocManager::install("bioc_package_name")

2. Network and API Issues
- All scripts include timeout handling
- Rate limiting with automatic delays
- Graceful failure handling
- Retry mechanisms for transient errors
3. Empty Results
- Check phenotype spelling and terminology
- Some databases may lack associations for rare conditions
- Verify internet connectivity
- Check database service status
4. Permission and Directory Issues
# Re-run coordinator; it creates required output directories automatically
Rscript download_genes.R migraine

Enable detailed logging:
# Python analysis prints execution details to console
python download_genes_analysis.py migraine

Large-scale processing:
- Scripts handle memory efficiently
- Results are streamed and processed incrementally
- Automatic cleanup of temporary files
- Progress tracking for long-running operations
1. Literature Mining (PubMed)
- Natural language processing of abstracts
- Gene mention extraction and validation
- Co-occurrence analysis with phenotype terms
- Evidence strength based on publication frequency
2. Clinical Databases (ClinVar, OMIM)
- Curated clinical variant annotations
- Genetic disorder classifications
- Pathogenicity assessments
- Clinical significance scores
3. Functional Genomics (GO, KEGG, Reactome)
- Pathway-based gene discovery
- Functional annotation analysis
- Biological process categorization
- Molecular function mapping
4. Network Biology (STRING)
- Protein-protein interaction networks
- Network topology analysis
- Interaction confidence scoring
- Network-based gene prioritization
5. Expression Analysis (GTEx)
- Tissue-specific expression profiles
- Expression quantitative trait loci (eQTLs)
- Co-expression network analysis
- Expression level thresholding
Overlap Analysis:
- Jaccard similarity coefficients
- Hypergeometric enrichment testing
- Multiple testing correction (FDR)
- Bootstrap confidence intervals
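
Two of the listed methods, hypergeometric overlap testing and FDR correction, can be sketched with the Python standard library. The sketch shows the textbook formulas only: how the analysis modules define the gene universe is not stated in this README, so the universe size is a parameter the caller must supply, and the Benjamini-Hochberg procedure is used here as one common FDR method.

```python
from math import comb

def hypergeom_overlap_pvalue(universe, size_a, size_b, overlap):
    """P(X >= overlap) when size_b genes are drawn from `universe` genes,
    of which size_a belong to the other set (upper-tail hypergeometric)."""
    upper = min(size_a, size_b)
    total = comb(universe, size_b)
    tail = sum(
        comb(size_a, k) * comb(universe - size_a, size_b - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Toy numbers, for illustration only
p_no_overlap_needed = hypergeom_overlap_pvalue(10, 5, 5, 0)
p_full_overlap = hypergeom_overlap_pvalue(10, 5, 5, 5)
adjusted = benjamini_hochberg([0.01, 0.04, 0.03])
```

With an overlap threshold of zero the tail covers every possible outcome, so the p-value is 1; requiring all five genes to overlap leaves a single favourable draw out of `comb(10, 5)`.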
Quality Metrics:
- Database coverage assessment
- Gene annotation completeness
- Cross-database validation
- Evidence aggregation scoring
This project is licensed under the MIT License. See LICENSE file for full details.
Please also cite the original databases:
- PubMed: NCBI Resource Coordinators (2018) Database resources of the National Center for Biotechnology Information. Nucleic Acids Research.
- OMIM: Amberger et al. (2019) OMIM.org: leveraging knowledge across phenotype-gene relationships. Nucleic Acids Research.
- STRING: Szklarczyk et al. (2023) The STRING database in 2023: protein-protein association networks and functional enrichment analyses for any sequenced genome of interest. Nucleic Acids Research.
- DisGeNET: Piñero et al. (2020) The DisGeNET knowledge platform for disease genomics: 2019 update. Nucleic Acids Research.
- ClinVar: Landrum et al. (2020) ClinVar: improvements to accessing data. Nucleic Acids Research.
- Reactome: Gillespie et al. (2022) The reactome pathway knowledgebase 2022. Nucleic Acids Research.
- KEGG: Kanehisa et al. (2023) KEGG for taxonomy-based analysis of pathways and genomes. Nucleic Acids Research.
- HPO: Köhler et al. (2021) The Human Phenotype Ontology in 2021. Nucleic Acids Research.
- GTEx: GTEx Consortium (2020) The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science.
- UniProt: UniProt Consortium (2023) UniProt: the Universal Protein Knowledgebase in 2023. Nucleic Acids Research.
- Open Targets: Ochoa et al. (2023) The next-generation Open Targets Platform: reimagined, redesigned, rebuilt. Nucleic Acids Research.
- GWAS Catalog: Sollis et al. (2023) The NHGRI-EBI GWAS Catalog: knowledgebase and deposition resource. Nucleic Acids Research.
- Gene Ontology: Ashburner et al. (2000) Gene ontology: tool for the unification of biology. Nature Genetics.
Documentation:
- This comprehensive README
- Inline code comments
- Script-specific help messages
Community Support:
- Name: Muhammad Muneeb
- Affiliation: The University of Queensland
- Email: m.muneeb@uq.edu.au
- Gmail: muneebsiddique007@gmail.com
- GitHub: GitHub Profile
- Google Scholar: Google Scholar
- ResearchGate: ResearchGate Profile
- Supervisor: David Ascher
- Group Webpage: BioSig Lab
Direct Contact:
- Maintainer: Muhammad Muneeb
- Institution: The University of Queensland
- Research Group: BioSig Lab
When reporting issues, please include:
- System Information: OS, R version, Python version
- Error Messages: Complete error logs
- Reproducible Example: Minimal working example
- Expected Behavior: What should have happened
- Actual Behavior: What actually happened
For new features, please describe:
- Use Case: Scientific problem to solve
- Proposed Solution: How it should work
- Alternative Approaches: Other ways to address the need
- Implementation Details: Technical considerations
Last Updated: April 9, 2026
Version: 1.0.0
Maintainer: Muhammad Muneeb
License: MIT License