We're processing millions of molecules. While we prepare the full dataset for public release, sign up for early access — we'll notify you by email when we launch.
→ preview.alchemiadatabase.com/pt ←
Be among the first to access the Alchemia dataset.
ALCHEMIA is an open-source molecular data bank built for ultra-large virtual screening (ULVS) in drug discovery. It aggregates molecules from 6 public databases, standardizes them through a reproducible, GPU-accelerated Python pipeline, and delivers research-ready datasets at scale.
Phase 01 deliverables:
- Standardized chemical structures — SMILES, InChI, InChIKey, cross-source deduplication (ALC_XXXXXXX compound keys)
- Energy-minimized 3D conformers — AIMNet2 neural network potential (near-QM accuracy) via Auto3D, MMFF94s fallback
- ADMET property predictions — 30+ endpoints via ADMET-AI (absorption, distribution, metabolism, excretion, toxicity)
- Molecular fingerprints — ECFP4/6, MACCS, RDKit FP, AtomPair, Torsion — binary Parquet, PostgreSQL `bfp`-compatible
- Drug-likeness & structural filters — Lipinski, Veber, Ghose, QED, PAINS (480+), Brenk, NIH, toxicophores
- PDBQT-ready ligands — meeko-based preparation for AutoDock/Vina workflows
- PostgreSQL database — RDKit cartridge + pgvector + pg_trgm for similarity search at scale
All pipeline stages are chunked, checkpointed, and resumable. No stage ever loads a full SDF/MOL2/CSV into memory.
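As a rough illustration of what "chunked, checkpointed, and resumable" can look like for an SDF source, here is a minimal standard-library sketch. It is not the project's actual reader: the function names, the JSON checkpoint layout, and the chunk size are all assumptions made for this example. The only format fact it relies on is that SDF records end with a `$$$$` line.

```python
import io
import json
import os
import tempfile
from typing import Iterator, List, TextIO


def iter_sdf_records(handle: TextIO) -> Iterator[str]:
    """Yield one SDF record at a time; only the current record is buffered."""
    lines: List[str] = []
    for line in handle:
        lines.append(line)
        if line.strip() == "$$$$":  # SDF record terminator
            yield "".join(lines)
            lines = []


def iter_chunks(records: Iterator[str], chunk_size: int) -> Iterator[List[str]]:
    """Group records into lists of at most chunk_size items."""
    chunk: List[str] = []
    for record in records:
        chunk.append(record)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def process_resumable(handle: TextIO, checkpoint_path: str, chunk_size: int) -> int:
    """Process chunks, skipping those a previous run already finished.

    The checkpoint file format ({"chunks_done": N}) is hypothetical.
    Returns the number of records handled in this run.
    """
    done = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as fh:
            done = json.load(fh)["chunks_done"]
    processed = 0
    chunks = iter_chunks(iter_sdf_records(handle), chunk_size)
    for index, chunk in enumerate(chunks):
        if index < done:
            continue  # finished in an earlier run
        processed += len(chunk)  # placeholder for the real per-chunk work
        with open(checkpoint_path, "w") as fh:
            json.dump({"chunks_done": index + 1}, fh)
    return processed


# Demo with five dummy records (not chemically meaningful).
demo_sdf = "".join(f"mol{i}\n  demo\n\nM  END\n$$$$\n" for i in range(5))
with tempfile.TemporaryDirectory() as tmp:
    ckpt = os.path.join(tmp, "checkpoint.json")
    first_run = process_resumable(io.StringIO(demo_sdf), ckpt, chunk_size=2)
    second_run = process_resumable(io.StringIO(demo_sdf), ckpt, chunk_size=2)
print(first_run, second_run)  # 5 0
```

The second call does no work because the checkpoint already records all three chunks as done, which is the behavior a `--resume` style flag would depend on.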
Raw sources (SDF / MOL2 / CSV)
→ Audit & Inventory (source manifest, file checksums)
→ Standardization + Cross-Dedup (RDKit MolStandardize, InChI keys, ALC_XXXXXXX)
→ 2D Properties + Drug-Likeness (MW, logP, TPSA, HBD/HBA, QED, Lipinski/Veber/Ghose)
→ Structural Filters (PAINS 480+, Brenk, NIH, toxicophores — flags, not exclusions)
→ Fingerprints (ECFP4/6, MACCS, RDKit FP, AtomPair, Torsion — binary Parquet)
→ Conformer Generation (ETKDGv3, MMFF94s geometry refinement, 3D QC)
→ AIMNet2 Energy Minimization (GPU, near-QM accuracy, H/C/N/O/F/S/Cl)
→ 3D SDF Export (AIMNet2-ok first, MMFF94s fallback)
→ ADMET Predictions (GPU, 30+ endpoints, ADMET-AI)
→ PDBQT Preparation (meeko, AutoDock4 format)
→ Molecular Classification (196 SMARTS classes)
→ Visualization (RDKit mol grid images)
→ PostgreSQL Load (RDKit cartridge + pgvector + pg_trgm)
→ Validation & Audit (QC log, qc_failures.parquet)
The pipeline is orchestrated with Snakemake (14 rules) and runs on Docker images optimized for NVIDIA DGX clusters (8× A100 80GB) or local GPU workstations (RTX 5070+).
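To make the drug-likeness stage more concrete, the sketch below applies the published Lipinski and Veber thresholds to precomputed 2D properties. It is an illustration, not the pipeline's code: the `Props2D` class, the `rotatable_bonds` field, and the choice to tolerate one Lipinski violation are assumptions, and the example values are made up rather than taken from the dataset.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass(frozen=True)
class Props2D:
    """Hypothetical container for precomputed 2D properties."""
    mw: float             # molecular weight (Da)
    logp: float           # octanol-water partition coefficient
    hbd: int              # hydrogen-bond donors
    hba: int              # hydrogen-bond acceptors
    tpsa: float           # topological polar surface area (A^2)
    rotatable_bonds: int  # assumed field name, not listed in the README


def lipinski_violations(p: Props2D) -> int:
    """Count rule-of-five violations (MW, logP, HBD, HBA)."""
    return sum([p.mw > 500, p.logp > 5, p.hbd > 5, p.hba > 10])


def veber_pass(p: Props2D) -> bool:
    """Veber criteria: at most 10 rotatable bonds and TPSA <= 140 A^2."""
    return p.rotatable_bonds <= 10 and p.tpsa <= 140


def drug_likeness_flags(p: Props2D, max_violations: int = 1) -> Dict[str, Union[int, bool]]:
    """Return flags only; nothing is filtered out here.

    max_violations=1 is an assumed convention, not a documented setting.
    """
    violations = lipinski_violations(p)
    return {
        "lipinski_violations": violations,
        "lipinski_pass": violations <= max_violations,
        "veber_pass": veber_pass(p),
    }


# Illustrative values only.
small = Props2D(mw=300.0, logp=2.5, hbd=2, hba=5, tpsa=80.0, rotatable_bonds=4)
large = Props2D(mw=900.0, logp=7.0, hbd=6, hba=12, tpsa=200.0, rotatable_bonds=15)
print(drug_likeness_flags(small))
print(drug_likeness_flags(large))
```

Returning flags instead of dropping rows mirrors how the structural filters are described (flags, not exclusions), which leaves the final cut-off to whoever consumes the data.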
| Database | Description | Compounds | Phase 01 Status | Website |
|---|---|---|---|---|
| BrNPDB | Brazilian Natural Products | 9,215 | ✅ All stages complete + PDBQT + PostgreSQL | brnpdb.org |
| NCI | National Cancer Institute | 85,495 | ✅ All stages complete + PDBQT + PostgreSQL | cactus.nci.nih.gov |
| COCONUT | Natural products | 725,267 | Std–Props–FP–ADMET–Class ✅ · AIMNet2 🔄 7/8 shards (merge + PDBQT pending) | coconut.naturalproducts.net |
| ChEMBL | Bioactive molecules | ~2,854,815 std | Std–Props–FP–Class ✅ · ADMET 🔄 ~18.7% · AIMNet2 🔄 shards 0-1 active | ebi.ac.uk/chembl |
| Enamine | Screening compounds | ~5,000,000+ std | Std–Props–FP–Class ✅ · ADMET 🔄 running · AIMNet2 🔄 shards 0-6 active | enamine.net |
| Molport | Commercial catalog | ~7,000,000+ | ⏳ Not started | molport.com |
Standardized to date: ~8.7M+ compounds across 5 databases, deduplicated by InChI. Full pipeline (all stages) complete for BrNPDB + NCI; COCONUT completing AIMNet2 minimization; ChEMBL + Enamine ADMET and AIMNet2 running on 8× A100.
| Feature | Details |
|---|---|
| Multi-Source | COCONUT, NCI, BrNPDB, ChEMBL, Enamine, Molport — 6 databases, one unified schema |
| AIMNet2 Minimization | GPU-accelerated 3D energy minimization, near-QM accuracy (supports H, C, N, O, F, S, Cl) |
| ADMET Predictions | 30+ endpoints: absorption, distribution, metabolism, excretion, toxicity via ADMET-AI |
| 5 Fingerprint Types | ECFP4/6, MACCS, RDKit FP, AtomPair, Torsion — binary Parquet + PostgreSQL bfp |
| Structural Filters | PAINS (480+), Brenk, NIH, toxicophores — warning flags, not hard exclusions |
| Drug-Likeness | Lipinski, Veber, Ghose, lead-like, fragment-like, QED |
| 196-Class SMARTS | Chemical taxonomy from utils/smarts.json |
| PDBQT Prep | meeko-based AutoDock/Vina-ready PDBQT generation (pure Python 3.11, no MGLTools) |
| Docker Pipeline | 5 specialized images: base (3.35 GB) · cpu (3.38 GB) · gpu (36.2 GB) · pdbqt (3.43 GB) · snakemake (2.52 GB) |
| Streaming I/O | Chunked, checkpointed, resumable — never OOM on multi-million-compound sources |
| Deterministic Keys | compound_key = ALC_XXXXXXX (SHA256 of InChI) — stable join key across all tables |
| PostgreSQL-Ready | RDKit cartridge + pgvector (similarity search) + pg_trgm + pgcrypto |
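The Deterministic Keys row says that `compound_key` is derived from a SHA256 of the InChI, but it does not say how the digest is mapped onto the seven-character `XXXXXXX` part. The sketch below shows one possible mapping, purely as an assumption: take the first `width` hex characters of the digest. The function name and the `width` parameter are hypothetical.

```python
import hashlib


def compound_key(inchi: str, width: int = 7) -> str:
    """Derive a stable key from an InChI string.

    Assumption: the key is "ALC_" plus the first `width` hex characters
    of the SHA256 digest, upper-cased. The real mapping may differ.
    """
    digest = hashlib.sha256(inchi.encode("utf-8")).hexdigest()
    return "ALC_" + digest[:width].upper()


methane = compound_key("InChI=1S/CH4/h1H4")
water = compound_key("InChI=1S/H2O/h1H2")
print(methane, water)
```

Because the key depends only on the InChI, the same structure gets the same key regardless of which source it came from, which is what makes it usable as a join key. Note that seven hex characters allow only about 268 million distinct values, so a plain truncation like this would be expected to collide at multi-million-compound scale; a real scheme would need a longer suffix or explicit collision handling.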
Each source produces a suite of Parquet files, SDF exports, and PDBQT files:
| File | Key Columns / Contents |
|---|---|
| `{source}_unique_compounds.parquet` | compound_key, canonical_smiles, inchi, inchikey, source_compound_id |
| `{source}_properties.parquet` | compound_key, mw, logp, tpsa, hbd, hba, qed, lipinski_pass |
| `{source}_fingerprints.parquet` | compound_key, ecfp4, ecfp6, maccs, rdkit_fp, atompair, torsion (binary) |
| `{source}_admet.parquet` | compound_key, admet_model_name, admet_model_version, predictions_json |
| `{source}_complete_3d.sdf` | 3D molblocks: AIMNet2-minimized first, MMFF94s fallback; `_Name` = compound_key |
| `{source}_complete_minimized.parquet` | compound_key, minimization_method, energy_kcal_mol, min_status |
| `{source}_pdbqt_manifest.parquet` | compound_key, pdbqt_path, sha256, num_torsions, pdbqt_status |
| `{source}_classification.parquet` | compound_key, matched_classes (JSON), primary_class, num_classes |
| `master_compounds.parquet` | Cross-source deduplicated: compound_key (`ALC_*`), source_name, inchi, canonical_smiles |
| `cross_reference.parquet` | source_compound_key, source_name, source_compound_id, master_compound_key |
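The fingerprint columns above are stored as binary values. As a sketch of how such values can be compared without any cheminformatics library, the code below packs bit positions into bytes and computes Tanimoto similarity on the packed form. The byte layout (bit `i` stored in byte `i // 8`, least-significant bit first) and the 2048-bit length are assumptions for this example; the actual layout of the Parquet columns is not stated here.

```python
from typing import Iterable


def pack_bits(on_bits: Iterable[int], n_bits: int) -> bytes:
    """Pack the indices of set bits into a fixed-length byte string."""
    buf = bytearray((n_bits + 7) // 8)
    for bit in on_bits:
        buf[bit // 8] |= 1 << (bit % 8)
    return bytes(buf)


def popcount(data: bytes) -> int:
    """Number of set bits in a byte string."""
    return bin(int.from_bytes(data, "little")).count("1")


def tanimoto(a: bytes, b: bytes) -> float:
    """Tanimoto similarity |A and B| / |A or B| on packed fingerprints."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    both = popcount(bytes(x & y for x, y in zip(a, b)))
    either = popcount(bytes(x | y for x, y in zip(a, b)))
    return both / either if either else 0.0


# Two made-up 2048-bit fingerprints sharing three of their four set bits.
fp_a = pack_bits([1, 5, 9, 200], n_bits=2048)
fp_b = pack_bits([1, 5, 77, 200], n_bits=2048)
similarity = tanimoto(fp_a, fp_b)
print(len(fp_a), similarity)  # 256 0.6
```

Three shared bits out of five distinct bits gives 3/5 = 0.6. For large-scale search, the README points to the PostgreSQL RDKit cartridge and pgvector, not pure-Python loops like this one.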
Input: raw SMILES / existing 3D coordinates
1. ETKDGv3 conformer generation (RDKit)
2. MMFF94s geometry refinement (UFF fallback)
3. AIMNet2 energy minimization via Auto3D
— Supported elements: H, C, N, O, F, S, Cl
— GPU: NVIDIA A100 / RTX 5070 (CUDA 12.8)
— Graceful skip for unsupported elements (MMFF94s result kept)
4. Complete 3D SDF export (AIMNet2-ok first, MMFF94s fallback)
5. PDBQT generation via meeko (AutoDock4 format)
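The "graceful skip" in step 3 can be pictured as a simple element check before a molecule is sent to AIMNet2. The sketch below is an assumption about how such a gate could look, not the project's implementation: the function name and the method labels `"aimnet2"` and `"mmff94s"` are hypothetical, while the supported element set is the one listed above.

```python
from typing import Iterable, List, Tuple

# Elements listed as supported for AIMNet2 minimization in this README.
AIMNET2_ELEMENTS = frozenset({"H", "C", "N", "O", "F", "S", "Cl"})


def select_minimizer(elements: Iterable[str]) -> Tuple[str, List[str]]:
    """Pick a minimization method from a molecule's element symbols.

    Returns (method, unsupported_elements). The method labels are
    illustrative and may not match the values stored by the pipeline.
    """
    unsupported = sorted(set(elements) - AIMNET2_ELEMENTS)
    if unsupported:
        # Keep the MMFF94s geometry instead of failing the molecule.
        return "mmff94s", unsupported
    return "aimnet2", []


print(select_minimizer(["C", "C", "O", "H", "H"]))  # ('aimnet2', [])
print(select_minimizer(["C", "H", "Br", "Si"]))     # ('mmff94s', ['Br', 'Si'])
```

Reporting the unsupported elements alongside the decision makes it easy to log why a given compound kept its force-field geometry.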
- Python 3.11, conda/mamba
- NVIDIA GPU + CUDA 12.8 (for AIMNet2 minimization and ADMET-AI)
- Docker + NVIDIA Container Toolkit (for containerized pipeline)
git clone https://github.com/alchemia-db/alchemia.git
cd alchemia
# Create conda environment (RDKit, Polars, PyArrow included)
conda env create -f environment.yml
conda activate alchemia-ph1
# PyTorch with CUDA 12.8 wheels (RTX 5070 Blackwell / A100 Ampere)
pip install torch --index-url https://download.pytorch.org/whl/cu128
# Editable install
pip install -e ".[dev]"
# Run tests
pytest tests/ -v
# Build all images (base → cpu → gpu → pdbqt → snakemake)
bash docker/build.sh
# CPU stage (standardization, properties, fingerprints)
docker run --rm -v "$PWD:/workspace" alchemia/cpu \
python scripts/run_standardization.py --source BrNPDB --sample 1000
# GPU stage (AIMNet2 minimization)
docker run --rm --gpus all -v "$PWD:/workspace" alchemia/gpu \
python scripts/run_minimization.py --source BrNPDB --device cuda
# GPU stage (ADMET predictions)
docker run --rm --gpus all -v "$PWD:/workspace" alchemia/gpu \
python scripts/run_admet.py --source BrNPDB
# PDBQT preparation
docker run --rm -v "$PWD:/workspace" alchemia/pdbqt \
python scripts/run_pdbqt.py --source BrNPDB
# Full Snakemake pipeline (dry-run first)
docker run --rm -v "$PWD:/workspace" alchemia/snakemake \
snakemake --cores 8 --dry-run
All scripts support `--sample N`, `--dry-run`, and `--resume`:
python scripts/audit_repository.py # Audit & inventory
python scripts/run_standardization.py # Standardize + cross-dedup
python scripts/run_properties.py # 2D descriptors + drug-likeness
python scripts/run_filters.py # PAINS / Brenk / NIH flags
python scripts/run_fingerprints.py # ECFP4/6 / MACCS / RDKit FP / AtomPair / Torsion
python scripts/run_conformers.py # ETKDGv3 conformer generation (--shard-start/--shard-end/--output-suffix for parallel shards)
python scripts/patch_auto3d.py # Patch Auto3D np.min([]) crash — must run before every AIMNet2 container
python scripts/run_minimization.py # AIMNet2 energy minimization (--shard-start/--shard-end/--output-suffix)
python scripts/merge_minimized_sdf.py # Complete 3D SDF export (--num-shards N)
python scripts/run_admet.py # ADMET-AI predictions
python scripts/run_pdbqt.py # PDBQT via meeko
python scripts/run_classification.py # 196-class SMARTS taxonomy
python scripts/run_viz.py # Molecular image grids
python scripts/load_postgres.py # PostgreSQL load
python scripts/orchestrate_gpus.py # GPU watchdog — auto-dispatches pipeline tasks to idle GPUs
python scripts/dashboard.py # Terminal UI: real-time pipeline progress and GPU monitoring
Only GitHub-tracked files are shown (databases, pipeline outputs, and local agent files are gitignored):
alchemia/
├── .gitignore
├── LICENSE
├── README.md
├── docker-compose.yml ← PostgreSQL, Redis, API services
├── environment.yml ← Conda env: Python 3.11, RDKit, Polars, CUDA
├── pyproject.toml ← Package definition + dev dependencies
│
├── assets/ ← Logos, banners, pipeline diagrams
│ ├── logos/ ← 6 SVG logo variants (white, dark, icon-only)
│ ├── alchemia_banner_v2.svg ← Animated hero banner (this README)
│ └── alchemia_phase01_pipeline.svg ← Phase 01 pipeline diagram
│
├── configs/
│ ├── hardware.yaml ← CPU/GPU/RAM tuning, checkpoint cadence
│ ├── pipeline.yaml ← Stage settings, batch sizes, paths
│ └── sources/ ← Per-source YAML configs (6 files)
│
├── docker/
│ ├── build.sh ← Build all 5 images in dependency order
│ ├── base/ ← Miniconda + conda-forge
│ ├── cpu/ ← RDKit, Polars, Snakemake
│ ├── gpu/ ← PyTorch 2.8 + ADMET-AI + AIMNet2 (CUDA 12.8)
│ ├── pdbqt/ ← Python 3.11 + meeko + gemmi
│ ├── snakemake/ ← Snakemake orchestrator
│ └── postgres/ ← RDKit cartridge init SQL
│
├── pipeline/
│ ├── Snakefile ← Main DAG entry point
│ ├── config/pipeline.yaml
│ ├── config/profiles/dgx/ ← NVIDIA DGX A100 Snakemake profile
│ └── rules/ ← 14 .smk rule files (00_audit → 13_validate)
│
├── scripts/ ← Thin CLI entrypoints (delegate to src/)
│ └── (13 scripts, one per pipeline stage)
│
├── src/
│ └── alchemia/ ← Main Python package
│ ├── admet/ ← ADMET-AI runner, chunked + checkpointed
│ ├── classification/ ← 196-class SMARTS classifier
│ ├── conformers/ ← ETKDGv3 generator + 3D QC
│ ├── descriptors/ ← 2D properties + drug-likeness
│ ├── filters/ ← PAINS, Brenk, NIH, toxicophores
│ ├── fingerprints/ ← ECFP4/6, MACCS, RDKit FP, AtomPair, Torsion
│ ├── io/ ← Streaming SDF/MOL2/CSV readers
│ ├── minimization/ ← AIMNet2 via Auto3D + MMFF94s fallback
│ ├── pdbqt/ ← meeko PDBQT preparation
│ ├── postgres/ ← Schema, staging, loaders
│ ├── sources/ ← Per-source profiler + parser
│ ├── standardization/ ← RDKit MolStandardize + InChI dedup
│ ├── utils/ ← Logging, checksums, QC logger
│ └── viz/ ← Molecular image grid generator
│
├── tests/unit/ ← 22 test modules, 63+ tests
│
├── utils/
│ ├── pains.json ← 480+ PAINS SMARTS patterns
│ ├── smarts.json ← 196-class chemical taxonomy
│ └── unwanted_substructures.csv
│
├── docs/
│ ├── decisions/ ← Architecture Decision Records
│ ├── runbooks/ ← current_state.md, next_actions.md
│ └── superpowers/ ← Implementation plans + design specs
│
└── data/ ← Pipeline outputs (gitignored; .gitkeep only)
├── admet/ · classification/ · conformers/ · fingerprints/
├── minimization/ · pdbqt/ · properties/ · standardized/
└── tables/ · viz/
| Tool | Citation |
|---|---|
| RDKit | Landrum G. RDKit: Open-source cheminformatics. rdkit.org |
| Auto3D | Liu Z et al. J Chem Inf Model. 2022;62:5373. doi:10.1021/acs.jcim.2c00817 |
| AIMNet2 | Anstine DM et al. ChemRxiv 2023. doi:10.26434/chemrxiv-2023-296ch |
| ADMET-AI | Swanson K et al. Bioinformatics 2024. doi:10.1093/bioinformatics/btae416 |
| meeko | Forli S et al. AutoDock Meeko. github.com/forlilab/meeko |
| ECFP | Rogers D, Hahn M. J Chem Inf Model. 2010;50:742. doi:10.1021/ci100050t |
| PAINS | Baell JB, Holloway GA. J Med Chem. 2010;53:2719. doi:10.1021/jm901137j |
| Database | Citation |
|---|---|
| COCONUT | Sorokina M et al. J Cheminform. 2021;13:2. doi:10.1186/s13321-020-00478-9 |
| ChEMBL | Mendez D et al. Nucleic Acids Res. 2019;47:D930. doi:10.1093/nar/gky1075 |
ALCHEMIA Database · Accelerating Drug Discovery Through Open Science
preview.alchemiadatabase.com/pt · Sign up for early access