Alchemia — Open-Source Molecular Data Bank


Early Access

We're processing millions of molecules. While we prepare the full dataset for public release, sign up for early access — we'll notify you by email when we launch.

→ preview.alchemiadatabase.com/pt ←
Be among the first to access the Alchemia dataset.


Overview

ALCHEMIA is an open-source molecular data bank built for ultra-large virtual screening (ULVS) in drug discovery. It aggregates molecules from 6 public databases, standardizes them through a reproducible, GPU-accelerated Python pipeline, and delivers research-ready datasets at scale.

Phase 01 deliverables:

  • Standardized chemical structures — SMILES, InChI, InChIKey, cross-source deduplication (ALC_XXXXXXX compound keys)
  • Energy-minimized 3D conformers — AIMNet2 neural network potential (near-QM accuracy) via Auto3D, MMFF94s fallback
  • ADMET property predictions — 30+ endpoints via ADMET-AI (absorption, distribution, metabolism, excretion, toxicity)
  • Molecular fingerprints — ECFP4/6, MACCS, RDKit FP, AtomPair, Torsion — binary Parquet, PostgreSQL bfp compatible
  • Drug-likeness & structural filters — Lipinski, Veber, Ghose, QED, PAINS (480+), Brenk, NIH, toxicophores
  • PDBQT-ready ligands — meeko-based preparation for AutoDock/Vina workflows
  • PostgreSQL database — RDKit cartridge + pgvector + pg_trgm for similarity search at scale
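
As a rough illustration of the drug-likeness flags listed above, the sketch below applies Lipinski's rule of five to precomputed descriptors. The thresholds are the published rule-of-five limits; the `Descriptors` class, the function names, and the choice to tolerate one violation are assumptions made for this example, not the pipeline's confirmed behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptors:
    """Hypothetical container for precomputed 2D descriptors."""
    mw: float    # molecular weight, Da
    logp: float  # octanol-water partition coefficient
    hbd: int     # hydrogen-bond donors
    hba: int     # hydrogen-bond acceptors


def lipinski_violations(d: Descriptors) -> int:
    """Count rule-of-five violations (MW > 500, logP > 5, HBD > 5, HBA > 10)."""
    return sum((d.mw > 500, d.logp > 5, d.hbd > 5, d.hba > 10))


def lipinski_pass(d: Descriptors, max_violations: int = 1) -> bool:
    # Assumption: one violation is tolerated, which is a common convention.
    return lipinski_violations(d) <= max_violations


# Illustrative values only, not taken from the Alchemia dataset.
small = Descriptors(mw=180.2, logp=1.2, hbd=1, hba=4)
bulky = Descriptors(mw=812.0, logp=6.3, hbd=7, hba=12)
print(lipinski_pass(small), lipinski_pass(bulky))
```

Returning the violation count alongside the boolean fits the "flags, not exclusions" approach described for the structural filters: downstream users can apply a stricter or looser cutoff without recomputing descriptors.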

All pipeline stages are chunked, checkpointed, and resumable. No stage ever loads a full SDF/MOL2/CSV into memory.
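
The chunked, checkpointed, resumable pattern can be sketched with the standard library alone. This is a minimal illustration, not the code in `src/alchemia/io/`: the function names, the JSON checkpoint layout, and the `chunks_done` field are all hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Callable, Iterable, Iterator, List


def iter_sdf_records(path: Path) -> Iterator[str]:
    """Yield one SDF record at a time; only the current record is in memory."""
    lines: List[str] = []
    with path.open("r", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if line.strip() == "$$$$":  # SDF record terminator
                yield "".join(lines)
                lines = []
            else:
                lines.append(line)
    if any(item.strip() for item in lines):  # trailing record, no terminator
        yield "".join(lines)


def iter_chunks(records: Iterable[str], chunk_size: int) -> Iterator[List[str]]:
    """Group records into lists of at most chunk_size items."""
    chunk: List[str] = []
    for record in records:
        chunk.append(record)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def process_resumable(sdf: Path, checkpoint: Path, chunk_size: int,
                      handle_chunk: Callable[[List[str]], None]) -> int:
    """Process chunks not yet recorded in the checkpoint; return how many ran."""
    done = 0
    if checkpoint.exists():
        done = json.loads(checkpoint.read_text())["chunks_done"]
    processed = 0
    for index, chunk in enumerate(iter_chunks(iter_sdf_records(sdf), chunk_size)):
        if index < done:
            continue  # finished in an earlier run
        handle_chunk(chunk)
        # Write to a temp file, then rename, so a crash never leaves a
        # half-written checkpoint behind.
        tmp_path = checkpoint.with_suffix(".tmp")
        tmp_path.write_text(json.dumps({"chunks_done": index + 1}))
        os.replace(tmp_path, checkpoint)
        processed += 1
    return processed


# Demo on a throwaway file with five fake records.
with tempfile.TemporaryDirectory() as tmp:
    sdf = Path(tmp) / "demo.sdf"
    sdf.write_text("".join(f"mol{i}\n$$$$\n" for i in range(5)))
    ckpt = Path(tmp) / "demo.ckpt.json"
    seen: List[List[str]] = []
    first_run = process_resumable(sdf, ckpt, 2, seen.append)
    second_run = process_resumable(sdf, ckpt, 2, seen.append)
```

In this sketch a crash between `handle_chunk` and the checkpoint write causes that chunk to be processed again on resume, so the handler should be idempotent (for example, by overwriting a per-chunk output file).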


Phase 01 Pipeline


Raw sources (SDF / MOL2 / CSV)
  → Audit & Inventory             (source manifest, file checksums)
  → Standardization + Cross-Dedup (RDKit MolStandardize, InChI keys, ALC_XXXXXXX)
  → 2D Properties + Drug-Likeness (MW, logP, TPSA, HBD/HBA, QED, Lipinski/Veber/Ghose)
  → Structural Filters            (PAINS 480+, Brenk, NIH, toxicophores — flags, not exclusions)
  → Fingerprints                  (ECFP4/6, MACCS, RDKit FP, AtomPair, Torsion — binary Parquet)
  → Conformer Generation          (ETKDGv3, MMFF94s geometry refinement, 3D QC)
  → AIMNet2 Energy Minimization   (GPU, near-QM accuracy, H/C/N/O/F/S/Cl)
  → 3D SDF Export                 (AIMNet2-ok first, MMFF94s fallback)
  → ADMET Predictions             (GPU, 30+ endpoints, ADMET-AI)
  → PDBQT Preparation             (meeko, AutoDock4 format)
  → Molecular Classification      (196 SMARTS classes)
  → Visualization                 (RDKit mol grid images)
  → PostgreSQL Load               (RDKit cartridge + pgvector + pg_trgm)
  → Validation & Audit            (QC log, qc_failures.parquet)

The pipeline is orchestrated with Snakemake (14 rules) and runs on Docker images optimized for NVIDIA DGX clusters (8× A100 80GB) or local GPU workstations (RTX 5070+).


Source Databases

| Database | Description | Compounds | Phase 01 Status | Website |
|---|---|---|---|---|
| BrNPDB | Brazilian Natural Products | 9,215 | ✅ All stages complete + PDBQT + PostgreSQL | brnpdb.org |
| NCI | National Cancer Institute | 85,495 | ✅ All stages complete + PDBQT + PostgreSQL | cactus.nci.nih.gov |
| COCONUT | Natural products | 725,267 | Std–Props–FP–ADMET–Class ✅ · AIMNet2 🔄 7/8 shards (merge + PDBQT pending) | coconut.naturalproducts.net |
| ChEMBL | Bioactive molecules | ~2,854,815 std | Std–Props–FP–Class ✅ · ADMET 🔄 ~18.7% · AIMNet2 🔄 shards 0-1 active | ebi.ac.uk/chembl |
| Enamine | Screening compounds | ~5,000,000+ std | Std–Props–FP–Class ✅ · ADMET 🔄 running · AIMNet2 🔄 shards 0-6 active | enamine.net |
| Molport | Commercial catalog | ~7,000,000+ | ⏳ Not started | molport.com |

Standardized to date: ~8.7M+ compounds across 5 databases, deduplicated by InChI. Full pipeline (all stages) complete for BrNPDB + NCI; COCONUT completing AIMNet2 minimization; ChEMBL + Enamine ADMET and AIMNet2 running on 8× A100.


Features

| Feature | Details |
|---|---|
| Multi-Source | COCONUT, NCI, BrNPDB, ChEMBL, Enamine, Molport: 6 databases, one unified schema |
| AIMNet2 Minimization | GPU-accelerated 3D energy minimization, near-QM accuracy (supports H, C, N, O, F, S, Cl) |
| ADMET Predictions | 30+ endpoints: absorption, distribution, metabolism, excretion, toxicity via ADMET-AI |
| 5 Fingerprint Types | ECFP4/6, MACCS, RDKit FP, AtomPair, Torsion; binary Parquet + PostgreSQL bfp |
| Structural Filters | PAINS (480+), Brenk, NIH, toxicophores; warning flags, not hard exclusions |
| Drug-Likeness | Lipinski, Veber, Ghose, lead-like, fragment-like, QED |
| 196-Class SMARTS | Chemical taxonomy from utils/smarts.json |
| PDBQT Prep | meeko-based AutoDock/Vina-ready PDBQT generation (pure Python 3.11, no MGLTools) |
| Docker Pipeline | 5 specialized images: base (3.35 GB) · cpu (3.38 GB) · gpu (36.2 GB) · pdbqt (3.43 GB) · snakemake (2.52 GB) |
| Streaming I/O | Chunked, checkpointed, resumable; never OOM on multi-million-compound sources |
| Deterministic Keys | compound_key = ALC_XXXXXXX (SHA256 of InChI); stable join key across all tables |
| PostgreSQL-Ready | RDKit cartridge + pgvector (similarity search) + pg_trgm + pgcrypto |
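
The Deterministic Keys entry above says only that the key is derived from a SHA256 of the InChI. The sketch below shows one way such a key could be built; how the digest maps to the seven-character suffix is not documented here, so the truncation length, the uppercase hex encoding, and the function name `compound_key` are assumptions for illustration.

```python
import hashlib


def compound_key(inchi: str, hex_chars: int = 7) -> str:
    """Derive a stable ALC_-prefixed key from an InChI string.

    Assumption: the suffix is the first `hex_chars` hex digits of the
    SHA-256 digest, uppercased. The real mapping may differ.
    """
    digest = hashlib.sha256(inchi.encode("utf-8")).hexdigest()
    return "ALC_" + digest[:hex_chars].upper()


methane = "InChI=1S/CH4/h1H4"
water = "InChI=1S/H2O/h1H2"

key_a = compound_key(methane)
key_b = compound_key(methane)  # same input, same key, on any machine
print(key_a, compound_key(water))
```

Because the key depends only on the InChI, the same structure arriving from two sources receives the same key, which is what makes it usable as a cross-table join column. A seven-hex-digit truncation would be expected to collide at multi-million-compound scale, so treat the length used here as a placeholder rather than the project's actual scheme.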

Data Outputs

Each source produces a suite of Parquet files, SDF exports, and PDBQT files:

| File | Key Columns |
|---|---|
| {source}_unique_compounds.parquet | compound_key, canonical_smiles, inchi, inchikey, source_compound_id |
| {source}_properties.parquet | compound_key, mw, logp, tpsa, hbd, hba, qed, lipinski_pass |
| {source}_fingerprints.parquet | compound_key, ecfp4, ecfp6, maccs, rdkit_fp, atompair, torsion (binary) |
| {source}_admet.parquet | compound_key, admet_model_name, admet_model_version, predictions_json |
| {source}_complete_3d.sdf | 3D molblocks: AIMNet2-minimized first, MMFF94s fallback, _Name = compound_key |
| {source}_complete_minimized.parquet | compound_key, minimization_method, energy_kcal_mol, min_status |
| {source}_pdbqt_manifest.parquet | compound_key, pdbqt_path, sha256, num_torsions, pdbqt_status |
| {source}_classification.parquet | compound_key, matched_classes (JSON), primary_class, num_classes |
| master_compounds.parquet | Cross-source deduplicated: compound_key (ALC_*), source_name, inchi, canonical_smiles |
| cross_reference.parquet | source_compound_key, source_name, source_compound_id, master_compound_key |
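
The binary fingerprint columns are meant for similarity search. As a sketch of what that comparison involves, the function below computes the Tanimoto coefficient between two packed bit vectors. It assumes each fingerprint is available as raw bytes of equal length; the actual on-disk encoding of the Parquet columns is not specified here, and the function name is hypothetical.

```python
def tanimoto(fp_a: bytes, fp_b: bytes) -> float:
    """Tanimoto similarity of two equal-length packed bit vectors.

    Assumption: each byte string is a plain packed bit vector. A
    library-specific serialization would need decoding first.
    """
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    a = int.from_bytes(fp_a, "big")
    b = int.from_bytes(fp_b, "big")
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0  # both empty: define similarity as 0
    return bin(a & b).count("1") / union


# Toy 8-bit fingerprints: 2 shared bits out of 6 set in either.
x = bytes([0b11110000])
y = bytes([0b00111100])
print(tanimoto(x, y))
```

At database scale this comparison would normally be delegated to the RDKit cartridge rather than run in Python; the sketch only shows the arithmetic behind the score.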

3D Structure Pipeline

Input: raw SMILES / existing 3D coordinates
  1. ETKDGv3 conformer generation (RDKit)
  2. MMFF94s geometry refinement (UFF fallback)
  3. AIMNet2 energy minimization via Auto3D
     — Supported elements: H, C, N, O, F, S, Cl
     — GPU: NVIDIA A100 / RTX 5070 (CUDA 12.8)
     — Graceful skip for unsupported elements (MMFF94s result kept)
  4. Complete 3D SDF export (AIMNet2-ok first, MMFF94s fallback)
  5. PDBQT generation via meeko (AutoDock4 format)
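
The element check in step 3 and the export priority in step 4 can be expressed as two small decisions. The sketch below is only an illustration of that logic: the supported-element set comes from the list above, while the function names and the returned labels are hypothetical.

```python
from typing import Iterable, Optional

# Elements listed as supported by AIMNet2 in this README.
AIMNET2_ELEMENTS = frozenset({"H", "C", "N", "O", "F", "S", "Cl"})


def choose_minimizer(elements: Iterable[str]) -> str:
    """Return 'aimnet2' if every element is supported, else 'mmff94s'."""
    return "aimnet2" if set(elements) <= AIMNET2_ELEMENTS else "mmff94s"


def pick_for_export(aimnet2_ok: bool, mmff94s_ok: bool) -> Optional[str]:
    """Prefer the AIMNet2 geometry, fall back to MMFF94s, else export nothing."""
    if aimnet2_ok:
        return "aimnet2"
    if mmff94s_ok:
        return "mmff94s"
    return None


print(choose_minimizer(["C", "H", "O", "N"]))   # all supported
print(choose_minimizer(["C", "H", "Br"]))       # Br is not in the set
print(pick_for_export(aimnet2_ok=False, mmff94s_ok=True))
```

Routing unsupported molecules to the force-field result instead of failing keeps every compound with a valid MMFF94s geometry in the exported SDF.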

Installation

Prerequisites

  • Python 3.11, conda/mamba
  • NVIDIA GPU + CUDA 12.8 (for AIMNet2 minimization and ADMET-AI)
  • Docker + NVIDIA Container Toolkit (for containerized pipeline)

Local Setup

git clone https://github.com/alchemia-db/alchemia.git
cd alchemia

# Create conda environment (RDKit, Polars, PyArrow included)
conda env create -f environment.yml
conda activate alchemia-ph1

# PyTorch with CUDA 12.8 (RTX 5070 Blackwell / A100 Ampere)
pip install torch --index-url https://download.pytorch.org/whl/cu128

# Editable install
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

Docker Pipeline

# Build all images (base → cpu → gpu → pdbqt → snakemake)
bash docker/build.sh

# CPU stage (standardization, properties, fingerprints)
docker run --rm -v "$PWD:/workspace" alchemia/cpu \
  python scripts/run_standardization.py --source BrNPDB --sample 1000

# GPU stage (AIMNet2 minimization)
docker run --rm --gpus all -v "$PWD:/workspace" alchemia/gpu \
  python scripts/run_minimization.py --source BrNPDB --device cuda

# GPU stage (ADMET predictions)
docker run --rm --gpus all -v "$PWD:/workspace" alchemia/gpu \
  python scripts/run_admet.py --source BrNPDB

# PDBQT preparation
docker run --rm -v "$PWD:/workspace" alchemia/pdbqt \
  python scripts/run_pdbqt.py --source BrNPDB

# Full Snakemake pipeline (dry-run first)
docker run --rm -v "$PWD:/workspace" alchemia/snakemake \
  snakemake --cores 8 --dry-run

Pipeline Scripts

All scripts support --sample N, --dry-run, --resume:

python scripts/audit_repository.py          # Audit & inventory
python scripts/run_standardization.py       # Standardize + cross-dedup
python scripts/run_properties.py            # 2D descriptors + drug-likeness
python scripts/run_filters.py               # PAINS / Brenk / NIH flags
python scripts/run_fingerprints.py          # ECFP4/6 / MACCS / RDKit FP / AtomPair / Torsion
python scripts/run_conformers.py            # ETKDGv3 conformer generation (--shard-start/--shard-end/--output-suffix for parallel shards)
python scripts/patch_auto3d.py              # Patch Auto3D np.min([]) crash — must run before every AIMNet2 container
python scripts/run_minimization.py          # AIMNet2 energy minimization (--shard-start/--shard-end/--output-suffix)
python scripts/merge_minimized_sdf.py       # Complete 3D SDF export (--num-shards N)
python scripts/run_admet.py                 # ADMET-AI predictions
python scripts/run_pdbqt.py                 # PDBQT via meeko
python scripts/run_classification.py        # 196-class SMARTS taxonomy
python scripts/run_viz.py                   # Molecular image grids
python scripts/load_postgres.py             # PostgreSQL load
python scripts/orchestrate_gpus.py          # GPU watchdog — auto-dispatches pipeline tasks to idle GPUs
python scripts/dashboard.py                 # Terminal TUI — real-time pipeline progress and GPU monitoring
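
The conformer and minimization scripts accept shard flags so that work can be spread across GPUs. The exact meaning of `--shard-start` and `--shard-end` is not documented here; the sketch below assumes a source is split into contiguous, half-open row ranges of near-equal size, and the helper name `shard_bounds` is hypothetical.

```python
from typing import List, Tuple


def shard_bounds(total: int, num_shards: int) -> List[Tuple[int, int]]:
    """Split `total` rows into `num_shards` contiguous [start, end) ranges.

    Assumption: shards differ in size by at most one row, with the
    remainder spread over the first shards.
    """
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    base, extra = divmod(total, num_shards)
    bounds = []
    start = 0
    for index in range(num_shards):
        size = base + (1 if index < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


# Example: the COCONUT compound count from the table above, split 8 ways.
coconut_shards = shard_bounds(725_267, 8)
print(coconut_shards[0], coconut_shards[-1])
```

With ranges like these, each GPU can work on its own slice and write output under its own suffix, and a later merge step only needs to know how many shards to expect.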

Repository Structure

Only GitHub-tracked files are shown (databases, pipeline outputs, and local agent files are gitignored):

alchemia/
├── .gitignore
├── LICENSE
├── README.md
├── docker-compose.yml          ← PostgreSQL, Redis, API services
├── environment.yml             ← Conda env: Python 3.11, RDKit, Polars, CUDA
├── pyproject.toml              ← Package definition + dev dependencies
│
├── assets/                     ← Logos, banners, pipeline diagrams
│   ├── logos/                  ← 6 SVG logo variants (white, dark, icon-only)
│   ├── alchemia_banner_v2.svg  ← Animated hero banner (this README)
│   └── alchemia_phase01_pipeline.svg  ← Phase 01 pipeline diagram
│
├── configs/
│   ├── hardware.yaml           ← CPU/GPU/RAM tuning, checkpoint cadence
│   ├── pipeline.yaml           ← Stage settings, batch sizes, paths
│   └── sources/                ← Per-source YAML configs (6 files)
│
├── docker/
│   ├── build.sh                ← Build all 5 images in dependency order
│   ├── base/                   ← Miniconda + conda-forge
│   ├── cpu/                    ← RDKit, Polars, Snakemake
│   ├── gpu/                    ← PyTorch 2.8 + ADMET-AI + AIMNet2 (CUDA 12.8)
│   ├── pdbqt/                  ← Python 3.11 + meeko + gemmi
│   ├── snakemake/              ← Snakemake orchestrator
│   └── postgres/               ← RDKit cartridge init SQL
│
├── pipeline/
│   ├── Snakefile               ← Main DAG entry point
│   ├── config/pipeline.yaml
│   ├── config/profiles/dgx/   ← NVIDIA DGX A100 Snakemake profile
│   └── rules/                  ← 14 .smk rule files (00_audit → 13_validate)
│
├── scripts/                    ← Thin CLI entrypoints (delegate to src/)
│   └── (13 scripts, one per pipeline stage)
│
├── src/
│   └── alchemia/               ← Main Python package
│       ├── admet/              ← ADMET-AI runner, chunked + checkpointed
│       ├── classification/     ← 196-class SMARTS classifier
│       ├── conformers/         ← ETKDGv3 generator + 3D QC
│       ├── descriptors/        ← 2D properties + drug-likeness
│       ├── filters/            ← PAINS, Brenk, NIH, toxicophores
│       ├── fingerprints/       ← ECFP4/6, MACCS, RDKit FP, AtomPair, Torsion
│       ├── io/                 ← Streaming SDF/MOL2/CSV readers
│       ├── minimization/       ← AIMNet2 via Auto3D + MMFF94s fallback
│       ├── pdbqt/              ← meeko PDBQT preparation
│       ├── postgres/           ← Schema, staging, loaders
│       ├── sources/            ← Per-source profiler + parser
│       ├── standardization/    ← RDKit MolStandardize + InChI dedup
│       ├── utils/              ← Logging, checksums, QC logger
│       └── viz/                ← Molecular image grid generator
│
├── tests/unit/                 ← 22 test modules, 63+ tests
│
├── utils/
│   ├── pains.json              ← 480+ PAINS SMARTS patterns
│   ├── smarts.json             ← 196-class chemical taxonomy
│   └── unwanted_substructures.csv
│
├── docs/
│   ├── decisions/              ← Architecture Decision Records
│   ├── runbooks/               ← current_state.md, next_actions.md
│   └── superpowers/            ← Implementation plans + design specs
│
└── data/                       ← Pipeline outputs (gitignored; .gitkeep only)
    ├── admet/ · classification/ · conformers/ · fingerprints/
    ├── minimization/ · pdbqt/ · properties/ · standardized/
    └── tables/ · viz/

References

Tools & Models

| Tool | Citation |
|---|---|
| RDKit | Landrum G. RDKit: Open-source cheminformatics. rdkit.org |
| Auto3D | Liu Z et al. J Chem Inf Model. 2022;62:5373. doi:10.1021/acs.jcim.2c00817 |
| AIMNet2 | Anstine DM et al. ChemRxiv 2023. doi:10.26434/chemrxiv-2023-296ch |
| ADMET-AI | Swanson K et al. Bioinformatics 2024. doi:10.1093/bioinformatics/btae416 |
| meeko | Forli S et al. AutoDock Meeko. github.com/forlilab/meeko |
| ECFP | Rogers D, Hahn M. J Chem Inf Model. 2010;50:742. doi:10.1021/ci100050t |
| PAINS | Baell JB, Holloway GA. J Med Chem. 2010;53:2719. doi:10.1021/jm901137j |

Databases

| Database | Citation |
|---|---|
| COCONUT | Sorokina M et al. J Cheminform. 2021;13:2. doi:10.1186/s13321-020-00478-9 |
| ChEMBL | Mendez D et al. Nucleic Acids Res. 2019;47:D930. doi:10.1093/nar/gky1075 |

ALCHEMIA Database · Accelerating Drug Discovery Through Open Science
preview.alchemiadatabase.com/pt · Sign up for early access
