HeartBioPortal DataHub is a version-controlled collection of cardiovascular omics datasets. Each dataset includes standardised metadata and provenance information so that analyses can be reproduced and referenced.
git clone <repo-url>
cd DataHub
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
make validate
# or using docker
docker compose up validation
Use this flow on servers (AWS, HPC login node, on-prem VM):
git clone https://github.com/HeartBioPortal/DataHub.git
cd DataHub
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
If you run pip from the parent directory instead of from inside the DataHub directory, use:
pip install -r DataHub/requirements.txt
This repository uses Git LFS for storing large binary datasets. Install Git LFS before cloning:
git lfs install
Datasets are organised under public/ for open data or private/ for embargoed submissions. A typical dataset directory contains:
<dataset>/
metadata.json # descriptive metadata
provenance.json # processing provenance
data files...
The JSON schemas that describe these files live under schemas/ and are also rendered in the documentation.
Use the helper script to check a dataset before opening a pull request:
tools/hbp-validate public/example_fh_vcf
Or run all tests with make validate.
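If you want to pre-check a dataset's metadata programmatically before running the full validation, a sketch along these lines can help. It assumes the files under schemas/ are standard JSON Schema documents and uses the third-party jsonschema package; the schema filename is illustrative:

# Minimal sketch: validate a dataset's metadata.json against a schema from schemas/.
# The schema filename "metadata.schema.json" is an assumption -- check schemas/ for
# the real names. Requires: pip install jsonschema
import json
from pathlib import Path

from jsonschema import ValidationError, validate

dataset_dir = Path("public/example_fh_vcf")
schema = json.loads(Path("schemas/metadata.schema.json").read_text())
metadata = json.loads((dataset_dir / "metadata.json").read_text())

try:
    validate(instance=metadata, schema=schema)
    print("metadata.json conforms to the schema")
except ValidationError as err:
    print(f"metadata.json failed validation: {err.message}")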
DataHub now includes a modular OOP pipeline under src/datahub/ with the following components (a toy composition sketch follows the list):
- adapters (source-specific ingestion),
- contracts and quality policies (required fields + missing-value handling),
- enrichment hooks with configurable source priority,
- storage backends (DuckDB + Parquet),
- publishers that emit legacy HeartBioPortal-compatible association JSON.
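These pieces are meant to compose as adapter, contract check, then storage/publish. The toy classes below only illustrate that flow; they are not the real src/datahub API, and the input filename is made up:

# Illustrative sketch of the adapter -> contract -> publisher flow only;
# the real classes live under src/datahub/ and have different names/signatures.
import csv
import json
from dataclasses import dataclass, field


@dataclass
class CsvAdapter:                      # source-specific ingestion
    path: str

    def read(self):
        with open(self.path, newline="") as fh:
            yield from csv.DictReader(fh)


@dataclass
class Contract:                        # required fields + missing-value handling
    required_fields: list = field(default_factory=list)

    def accepts(self, record):
        return all(record.get(f) not in (None, "") for f in self.required_fields)


@dataclass
class JsonPublisher:                   # emits association-style JSON
    output_path: str

    def publish(self, records):
        with open(self.output_path, "w") as fh:
            json.dump(list(records), fh, indent=2)


adapter = CsvAdapter("raw_associations.csv")                 # placeholder input file
contract = Contract(required_fields=["rsid", "pval", "gene"])
JsonPublisher("associations.json").publish(
    r for r in adapter.read() if contract.accepts(r)
)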
To build association outputs from the legacy aggregated CSV while keeping the current frontend payload shape:
scripts/build_legacy_association.py \
--csv-path ../DataManager/raw_data/final_aggregated_results.csv \
--output-root ../DataManager/analyzed_data \
--duckdb-path ./artifacts/canonical.duckdb \
--parquet-path ./artifacts/canonical.parquet \
--publish-redis
Field requirements are configurable via --required-fields and phenotype
routing can be overridden via --phenotype-map-json. Redis loading uses the
existing DataManager exporter so backend cache behavior remains unchanged.
Use --ancestry-precision and default ancestry-point deduplication to reduce
JSON payload size while keeping the same response shape.
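For intuition on why this shrinks payloads: rounding allele-frequency values lets near-duplicate ancestry points collapse into one. The helper below is a standalone illustration of that idea, not the code behind --ancestry-precision:

# Standalone illustration of ancestry-point rounding + deduplication; the actual
# behaviour is controlled by --ancestry-precision in build_legacy_association.py.
def dedupe_ancestry(points, precision=3):
    """Round allele frequencies and drop exact duplicates, preserving order."""
    seen, kept = set(), []
    for p in points:
        key = (p["ancestry"], round(p["af"], precision))
        if key not in seen:
            seen.add(key)
            kept.append({"ancestry": p["ancestry"], "af": round(p["af"], precision)})
    return kept


points = [
    {"ancestry": "EUR", "af": 0.123456},
    {"ancestry": "EUR", "af": 0.123401},   # collapses with the first at precision=3
    {"ancestry": "AFR", "af": 0.201000},
]
print(dedupe_ancestry(points))             # two points remain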
DataHub includes a profile-driven preparation layer to clean legacy/raw source columns into a stable schema before aggregation.
Built-in preparation profiles:
- legacy_cvd_raw
- legacy_trait_raw
Run preparation:
scripts/prepare_association_raw.py \
--input-csv /path/to/raw_cvd.csv \
--output-csv /path/to/prepared_cvd.csv \
--profile legacy_cvd_raw
Prepared output columns are standardized and include:
rsid,pval,gene,phenotype,functional_class,var_class,clinical_significance,most_severe_consequence,allele_string,protein_start,protein_end,ancestry_data
plus provenance fields such as study, pmid, and study_genome_build.
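Before feeding a prepared file into aggregation, it can be worth confirming those columns are actually present. A minimal check using pandas (not part of DataHub's own tooling; the path is a placeholder):

# Quick sanity check for a prepared CSV; not part of DataHub's own tooling.
import pandas as pd

EXPECTED = {
    "rsid", "pval", "gene", "phenotype", "functional_class", "var_class",
    "clinical_significance", "most_severe_consequence", "allele_string",
    "protein_start", "protein_end", "ancestry_data",
}

df = pd.read_csv("/path/to/prepared_cvd.csv", nrows=100)   # header plus a small sample
missing = EXPECTED - set(df.columns)
print("missing columns:", missing or "none")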
Dataset-type validation profiles are first-class JSON configs in
config/profiles/:
- association.json
- expression.json
- single_cell.json
These profiles can be loaded through DatasetProfileLoader and converted into
runtime DatasetContract objects for pipeline validation.
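To see what a given profile declares, you can inspect the JSON directly; in the pipeline this load-and-convert step is handled by DatasetProfileLoader, whose exact API lives in src/datahub/:

# Orientation only: inspect a validation profile JSON directly. In the pipeline,
# DatasetProfileLoader performs this load and builds a runtime DatasetContract.
import json
from pathlib import Path

profile = json.loads(Path("config/profiles/association.json").read_text())
print(json.dumps(profile, indent=2))   # shows the fields and policies the profile declares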
DataHub includes a pluggable adapter registry so community submissions can add new ingestion adapters without modifying core orchestration code; a registration sketch follows the list of built-in IDs below.
Built-in adapter IDs:
- legacy_association_csv
- gwas_association
- ensembl_association
- clinvar_association
- mvp_association
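Because the registry is pluggable, a community adapter only needs to register itself under a new ID. The snippet below illustrates the registry pattern with placeholder names; the actual registration hook in src/datahub/ may look different:

# Illustration of the plugin-registry pattern; the real registry in src/datahub/
# has its own registration API, so these names are placeholders.
ADAPTER_REGISTRY = {}

def register_adapter(adapter_id):
    """Decorator that makes an adapter discoverable by its string ID."""
    def wrap(cls):
        ADAPTER_REGISTRY[adapter_id] = cls
        return cls
    return wrap


@register_adapter("my_cohort_association")
class MyCohortAdapter:
    def read(self, input_path):
        # parse cohort-specific files into canonical association records
        return []


adapter_cls = ADAPTER_REGISTRY["my_cohort_association"]
print(adapter_cls)   # community adapters resolve by ID without touching core code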
DataHub also includes a source-management layer under src/datahub/sources/
that maps source manifests in config/sources/ to adapter instances.
Built-in source IDs:
- gwas_catalog
- ensembl_variation
- clinvar
Additional source manifests are included for cataloging major databases across these categories:
- cvd_focused_portals
- gwas_statistical_genetics
- population_reference_variation
- clinical_variant_interpretation
- bulk_transcriptomics_qtl
- single_cell_spatial
- epigenomics_regulatory
- proteomics
- metabolomics_lipidomics
- pathways_interactions_networks
- drug_target_translational
- ontologies_standards
Most of these are marked catalog_only until a dedicated adapter is wired.
Run configurable ingestion with:
scripts/run_ingestion.py --config path/to/ingestion.json
The JSON config can define profile, adapter list, source list, optional plugin adapters and source connectors, storage, enrichment source priority, and publishers.
Built-in publisher IDs:
- legacy_association
- legacy_redis
- phenotype_rollup
Example source-driven config:
{
"profile": "association",
"sources": [
{
"id": "gwas_catalog",
"params": {
"input_paths": "/path/to/gwas_results.csv",
"dataset_id": "hbp_gwas_snapshot_2026_02"
}
}
],
"publishers": [
{
"name": "legacy_association",
"params": {
"output_root": "/path/to/analyzed_data"
}
}
]
}
Dataset-specific entrypoints are organized under:
scripts/dataset_specific_scripts/
MVP integration scripts are available in:
- scripts/dataset_specific_scripts/mvp/run_mvp_pipeline.py
- scripts/dataset_specific_scripts/mvp/export_mvp_prepared_raw.py
- scripts/dataset_specific_scripts/mvp/README.md
Unified DuckDB-first scripts (legacy raw + MVP) are available in:
- scripts/dataset_specific_scripts/unified/ingest_legacy_raw_duckdb.py
- scripts/dataset_specific_scripts/unified/publish_unified_from_duckdb.py
- scripts/dataset_specific_scripts/unified/run_unified_pipeline.py
- scripts/dataset_specific_scripts/unified/README.md
Runtime execution profiles for local/AWS/HPC orchestration:
config/runtime_profiles/unified_pipeline_profiles.json
The unified runner supports the same pipeline on local servers, AWS, and HPC with profile-based configuration:
python3 scripts/dataset_specific_scripts/unified/run_unified_pipeline.py \
--profile local_laptop \
--step all \
--reset-publish-output \
--log-level INFO
A BigRed200 profile is available as bigred200_hpc and supports Slurm
dependency chaining:
python3 scripts/dataset_specific_scripts/unified/run_unified_pipeline.py \
--profile bigred200_hpc \
--mode slurm \
--submit-slurm \
--reset-publish-checkpoint \
--reset-publish-output \
--log-level INFO
Use --set key=value for machine-specific overrides without changing source:
python3 scripts/dataset_specific_scripts/unified/run_unified_pipeline.py \
--profile bigred200_hpc \
--dry-run \
--set publish.per_gene_shards=2048 \
--set slurm.partition=cpu
The runner uses the same Python interpreter that launches it (so running from
an activated .venv carries into Slurm jobs). Override with
--python-executable when needed.
For HPC module environments, add setup commands per job with
--slurm-setup-command "module load python/3.11".
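For intuition, --set overrides of the form section.key=value map dotted keys onto nested profile settings. The snippet below is a generic illustration of that dotted-key pattern, not the runner's actual override code:

# Generic illustration of dotted "key=value" overrides on a nested config dict;
# the unified runner's real override logic may differ (e.g. in value parsing).
def apply_override(config, assignment):
    key_path, _, raw_value = assignment.partition("=")
    node = config
    *parents, leaf = key_path.split(".")
    for part in parents:
        node = node.setdefault(part, {})
    node[leaf] = raw_value


profile = {"publish": {"per_gene_shards": 1024}, "slurm": {"partition": "general"}}
for override in ("publish.per_gene_shards=2048", "slurm.partition=cpu"):
    apply_override(profile, override)
print(profile)   # values arrive as strings: {'publish': {'per_gene_shards': '2048'}, ...}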
For high-throughput publish on HPC, the unified publish script supports
deterministic unit partitioning via --unit-partitions and
--unit-partition-index so multiple jobs can process disjoint shard subsets
in parallel. Use --resume-seed-checkpoint when moving from a previous
single-run checkpoint to partitioned publish so completed units are not redone.
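The reason partitioned jobs stay disjoint is that each unit is assigned to exactly one partition by a deterministic rule. The sketch below illustrates the idea with a stable hash; it is not the publish script's implementation:

# Illustration of deterministic unit partitioning; not the publish script's code.
import hashlib

def owns_unit(unit_id, partitions, partition_index):
    """Stable assignment: the same unit always maps to the same partition."""
    digest = hashlib.sha1(unit_id.encode()).hexdigest()
    return int(digest, 16) % partitions == partition_index

units = [f"gene_shard_{i:04d}" for i in range(8)]
for idx in range(2):   # e.g. two jobs launched with --unit-partitions 2
    mine = [u for u in units if owns_unit(u, partitions=2, partition_index=idx)]
    print(idx, mine)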
We welcome new datasets and improvements. See CONTRIBUTING.md for a walkthrough of the submission process and consult the files in the docs/ directory for more details.