alchemia-db/alchemia: Alchemia Molecular Database: Accelerating Drug Discovery with AI

Early Access

We're processing millions of molecules. While we prepare the full dataset for public release, sign up for early access — we'll notify you by email when we launch.

→ preview.alchemiadatabase.com/pt ←
Be among the first to access the Alchemia dataset.

Overview

ALCHEMIA is an open-source molecular data bank built for ultra-large virtual screening (ULVS) in drug discovery. It aggregates molecules from 6 public databases, standardizes them through a reproducible, GPU-accelerated Python pipeline, and delivers research-ready datasets at scale.

Phase 01 deliverables:

Standardized chemical structures — SMILES, InChI, InChIKey, cross-source deduplication (ALC_XXXXXXX compound keys)
Energy-minimized 3D conformers — AIMNet2 neural network potential (near-QM accuracy) via Auto3D, MMFF94s fallback
ADMET property predictions — 30+ endpoints via ADMET-AI (absorption, distribution, metabolism, excretion, toxicity)
Molecular fingerprints — ECFP4/6, MACCS, RDKit FP, AtomPair, Torsion — binary Parquet, PostgreSQL bfp compatible
Drug-likeness & structural filters — Lipinski, Veber, Ghose, QED, PAINS (480+), Brenk, NIH, toxicophores
PDBQT-ready ligands — meeko-based preparation for AutoDock/Vina workflows
PostgreSQL database — RDKit cartridge + pgvector + pg_trgm for similarity search at scale

All pipeline stages are chunked, checkpointed, and resumable. No stage ever loads a full SDF/MOL2/CSV into memory.
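As an illustration of the streaming approach described above, here is a minimal sketch of a chunked SDF reader. It is not the project's actual reader in `src/alchemia/io/`; the function names `iter_sdf_records` and `iter_chunks` are hypothetical. It relies only on the SDF convention that each record ends with a `$$$$` line, so memory use is bounded by the chunk size rather than the file size.

```python
import io
from itertools import islice
from typing import Iterable, Iterator, List, TextIO


def iter_sdf_records(handle: TextIO) -> Iterator[str]:
    """Yield one SDF record at a time without reading the whole file."""
    lines: List[str] = []
    for line in handle:
        if line.strip() == "$$$$":  # SDF record terminator
            yield "".join(lines)
            lines = []
        else:
            lines.append(line)
    if any(part.strip() for part in lines):  # tolerate a missing final terminator
        yield "".join(lines)


def iter_chunks(records: Iterable[str], size: int) -> Iterator[List[str]]:
    """Group a record stream into lists of at most `size` records."""
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# Tiny in-memory stand-in for a large SDF file (hypothetical content).
demo = io.StringIO("mol_a\n$$$$\nmol_b\n$$$$\nmol_c\n$$$$\n")
for chunk in iter_chunks(iter_sdf_records(demo), size=2):
    print(len(chunk))  # prints 2, then 1; a real stage would process and checkpoint here
```

A real stage would write a checkpoint after each chunk so that `--resume` can skip completed chunks.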

Phase 01 Pipeline

Raw sources (SDF / MOL2 / CSV)
  → Audit & Inventory             (source manifest, file checksums)
  → Standardization + Cross-Dedup (RDKit MolStandardize, InChI keys, ALC_XXXXXXX)
  → 2D Properties + Drug-Likeness (MW, logP, TPSA, HBD/HBA, QED, Lipinski/Veber/Ghose)
  → Structural Filters            (PAINS 480+, Brenk, NIH, toxicophores — flags, not exclusions)
  → Fingerprints                  (ECFP4/6, MACCS, RDKit FP, AtomPair, Torsion — binary Parquet)
  → Conformer Generation          (ETKDGv3, MMFF94s geometry refinement, 3D QC)
  → AIMNet2 Energy Minimization   (GPU, near-QM accuracy, H/C/N/O/F/S/Cl)
  → 3D SDF Export                 (AIMNet2-ok first, MMFF94s fallback)
  → ADMET Predictions             (GPU, 30+ endpoints, ADMET-AI)
  → PDBQT Preparation             (meeko, AutoDock4 format)
  → Molecular Classification      (196 SMARTS classes)
  → Visualization                 (RDKit mol grid images)
  → PostgreSQL Load               (RDKit cartridge + pgvector + pg_trgm)
  → Validation & Audit            (QC log, qc_failures.parquet)
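
The drug-likeness stage in the diagram above can be sketched from already-computed 2D properties. This is a minimal illustration, not the pipeline's code. It uses the standard published thresholds: Lipinski (MW ≤ 500, logP ≤ 5, HBD ≤ 5, HBA ≤ 10) and Veber (rotatable bonds ≤ 10, TPSA ≤ 140 Å²). The `Props` class, the `rotatable_bonds` field, and the `max_lipinski_violations` parameter are assumptions; the source does not say how many violations `lipinski_pass` tolerates.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass(frozen=True)
class Props:
    """Hypothetical row of 2D properties for one compound."""
    mw: float
    logp: float
    tpsa: float
    hbd: int
    hba: int
    rotatable_bonds: int  # assumed column, not listed in the README


def lipinski_violations(p: Props) -> int:
    """Count how many of the four rule-of-five thresholds are exceeded."""
    return sum([p.mw > 500, p.logp > 5, p.hbd > 5, p.hba > 10])


def veber_pass(p: Props) -> bool:
    """Veber criteria: at most 10 rotatable bonds and TPSA at most 140."""
    return p.rotatable_bonds <= 10 and p.tpsa <= 140


def drug_likeness_flags(
    p: Props, max_lipinski_violations: int = 1
) -> Dict[str, Union[int, bool]]:
    """Return flags only; nothing is excluded from the dataset."""
    violations = lipinski_violations(p)
    return {
        "lipinski_violations": violations,
        "lipinski_pass": violations <= max_lipinski_violations,
        "veber_pass": veber_pass(p),
    }


# Illustrative input values, not pipeline output.
example = Props(mw=720.0, logp=6.1, tpsa=190.0, hbd=7, hba=12, rotatable_bonds=14)
print(drug_likeness_flags(example))
```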

The pipeline is orchestrated with Snakemake (14 rules) and runs on Docker images optimized for NVIDIA DGX clusters (8× A100 80GB) or local GPU workstations (RTX 5070+).

Source Databases

Database	Description	Compounds	Phase 01 Status	Website
BrNPDB	Brazilian Natural Products	9,215	✅ All stages complete + PDBQT + PostgreSQL	brnpdb.org
NCI	National Cancer Institute	85,495	✅ All stages complete + PDBQT + PostgreSQL	cactus.nci.nih.gov
COCONUT	Natural products	725,267	Std–Props–FP–ADMET–Class ✅ · AIMNet2 🔄 7/8 shards (merge + PDBQT pending)	coconut.naturalproducts.net
ChEMBL	Bioactive molecules	~2,854,815 std	Std–Props–FP–Class ✅ · ADMET 🔄 ~18.7% · AIMNet2 🔄 shards 0-1 active	ebi.ac.uk/chembl
Enamine	Screening compounds	~5,000,000+ std	Std–Props–FP–Class ✅ · ADMET 🔄 running · AIMNet2 🔄 shards 0-6 active	enamine.net
Molport	Commercial catalog	~7,000,000+	⏳ Not started	molport.com

Standardized to date: ~8.7M+ compounds across 5 databases, deduplicated by InChI. Full pipeline (all stages) complete for BrNPDB + NCI; COCONUT completing AIMNet2 minimization; ChEMBL + Enamine ADMET and AIMNet2 running on 8× A100.

Features

Feature	Details
Multi-Source	COCONUT, NCI, BrNPDB, ChEMBL, Enamine, Molport — 6 databases, one unified schema
AIMNet2 Minimization	GPU-accelerated 3D energy minimization, near-QM accuracy (supports H, C, N, O, F, S, Cl)
ADMET Predictions	30+ endpoints: absorption, distribution, metabolism, excretion, toxicity via ADMET-AI
5 Fingerprint Types	ECFP4/6, MACCS, RDKit FP, AtomPair, Torsion — binary Parquet + PostgreSQL `bfp`
Structural Filters	PAINS (480+), Brenk, NIH, toxicophores — warning flags, not hard exclusions
Drug-Likeness	Lipinski, Veber, Ghose, lead-like, fragment-like, QED
196-Class SMARTS	Chemical taxonomy from `utils/smarts.json`
PDBQT Prep	meeko-based AutoDock/Vina-ready PDBQT generation (pure Python 3.11, no MGLTools)
Docker Pipeline	5 specialized images: `base` (3.35 GB) · `cpu` (3.38 GB) · `gpu` (36.2 GB) · `pdbqt` (3.43 GB) · `snakemake` (2.52 GB)
Streaming I/O	Chunked, checkpointed, resumable — never OOM on multi-million-compound sources
Deterministic Keys	`compound_key` = ALC_XXXXXXX (SHA256 of InChI) — stable join key across all tables
PostgreSQL-Ready	RDKit cartridge + pgvector (similarity search) + pg_trgm + pgcrypto
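
To make the "Deterministic Keys" row concrete, here is a minimal sketch of deriving an `ALC_`-prefixed key from the SHA-256 digest of an InChI string. The README does not state how the digest maps to the seven `X` characters, so the truncation below (first seven hex characters, upper-cased) is an assumption for illustration only. The function name `make_compound_key` is also hypothetical.

```python
import hashlib


def make_compound_key(inchi: str, width: int = 7) -> str:
    """Derive a stable key from an InChI string (illustrative scheme only)."""
    digest = hashlib.sha256(inchi.encode("utf-8")).hexdigest()
    # Assumption: keep the first `width` hex characters of the digest.
    return "ALC_" + digest[:width].upper()


methane = "InChI=1S/CH4/h1H4"
key = make_compound_key(methane)
print(key)
print(key == make_compound_key(methane))  # same InChI always gives the same key
```

Because the key depends only on the InChI, the same structure arriving from two sources gets the same key, which is what makes cross-source deduplication and joins possible. Note that seven hex characters give only 16^7 (about 268 million) distinct values, so this particular truncation would produce collisions at multi-million-compound scale and should not be read as the real scheme.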

Data Outputs

Each source produces a suite of Parquet files, SDF exports, and PDBQT files:

File	Key Columns
`{source}_unique_compounds.parquet`	`compound_key`, `canonical_smiles`, `inchi`, `inchikey`, `source_compound_id`
`{source}_properties.parquet`	`compound_key`, `mw`, `logp`, `tpsa`, `hbd`, `hba`, `qed`, `lipinski_pass`
`{source}_fingerprints.parquet`	`compound_key`, `ecfp4`, `ecfp6`, `maccs`, `rdkit_fp`, `atompair`, `torsion` (binary)
`{source}_admet.parquet`	`compound_key`, `admet_model_name`, `admet_model_version`, `predictions_json`
`{source}_complete_3d.sdf`	3D molblocks — AIMNet2-minimized first, MMFF94s fallback, `_Name` = `compound_key`
`{source}_complete_minimized.parquet`	`compound_key`, `minimization_method`, `energy_kcal_mol`, `min_status`
`{source}_pdbqt_manifest.parquet`	`compound_key`, `pdbqt_path`, `sha256`, `num_torsions`, `pdbqt_status`
`{source}_classification.parquet`	`compound_key`, `matched_classes` (JSON), `primary_class`, `num_classes`
`master_compounds.parquet`	Cross-source deduplicated — `compound_key (ALC_*)`, `source_name`, `inchi`, `canonical_smiles`
`cross_reference.parquet`	`source_compound_key`, `source_name`, `source_compound_id`, `master_compound_key`
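
The fingerprint columns above are described as binary, which suits bitwise similarity search. The sketch below computes Tanimoto similarity between two bit vectors held as `bytes`. It assumes each cell contains raw packed bits of equal length; the README does not specify the actual serialization, so real files may need decoding first. The helper names are hypothetical.

```python
def popcount(data: bytes) -> int:
    """Number of set bits in a byte string."""
    return bin(int.from_bytes(data, "big")).count("1")


def tanimoto(a: bytes, b: bytes) -> float:
    """Tanimoto coefficient: |A and B| / |A or B| over the set bits."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    shared = popcount(bytes(x & y for x, y in zip(a, b)))
    total = popcount(bytes(x | y for x, y in zip(a, b)))
    return shared / total if total else 0.0


# Two toy 8-bit fingerprints (illustrative values, not real ECFP bits).
fp_a = bytes([0b11110000])
fp_b = bytes([0b00111100])
print(round(tanimoto(fp_a, fp_b), 3))  # 2 shared bits / 6 set bits in total = 0.333
```

For large-scale queries the same measure is normally computed server-side by the RDKit PostgreSQL cartridge on `bfp` columns rather than in Python.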

3D Structure Pipeline

Input: raw SMILES / existing 3D coordinates
  1. ETKDGv3 conformer generation (RDKit)
  2. MMFF94s geometry refinement (UFF fallback)
  3. AIMNet2 energy minimization via Auto3D
     — Supported elements: H, C, N, O, F, S, Cl
     — GPU: NVIDIA A100 / RTX 5070 (CUDA 12.8)
     — Graceful skip for unsupported elements (MMFF94s result kept)
  4. Complete 3D SDF export (AIMNet2-ok first, MMFF94s fallback)
  5. PDBQT generation via meeko (AutoDock4 format)
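
The fallback logic in steps 3 and 4 can be summarized in a few lines. This is a sketch of the decision rule only, assuming the element list given above; it does not call Auto3D or RDKit. The function names and the string labels for the methods are hypothetical.

```python
from typing import Iterable, Optional

# Elements the README lists as supported by the AIMNet2 stage.
AIMNET2_ELEMENTS = frozenset({"H", "C", "N", "O", "F", "S", "Cl"})


def plan_minimization(elements: Iterable[str]) -> str:
    """Choose AIMNet2 only if every element in the molecule is supported."""
    unsupported = set(elements) - AIMNET2_ELEMENTS
    return "MMFF94s" if unsupported else "AIMNet2"


def pick_export(aimnet2_ok: bool, mmff94s_ok: bool) -> Optional[str]:
    """Prefer the AIMNet2 geometry, fall back to MMFF94s, else export nothing."""
    if aimnet2_ok:
        return "AIMNet2"
    if mmff94s_ok:
        return "MMFF94s"
    return None


print(plan_minimization(["C", "H", "O", "N"]))  # AIMNet2
print(plan_minimization(["C", "H", "Br"]))      # MMFF94s (Br is not in the supported set)
print(pick_export(aimnet2_ok=False, mmff94s_ok=True))  # MMFF94s
```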

Installation

Prerequisites

Python 3.11, conda/mamba
NVIDIA GPU + CUDA 12.8 (for AIMNet2 minimization and ADMET-AI)
Docker + NVIDIA Container Toolkit (for containerized pipeline)

Local Setup

git clone https://github.com/alchemia-db/alchemia.git
cd alchemia

# Create conda environment (RDKit, Polars, PyArrow included)
conda env create -f environment.yml
conda activate alchemia-ph1

# PyTorch with CUDA 12.8 (RTX 5070 Blackwell / A100 Ampere)
pip install torch --index-url https://download.pytorch.org/whl/cu128

# Editable install
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

Docker Pipeline

# Build all images (base → cpu → gpu → pdbqt → snakemake)
bash docker/build.sh

# CPU stage (standardization, properties, fingerprints)
docker run --rm -v "$PWD:/workspace" alchemia/cpu \
  python scripts/run_standardization.py --source BrNPDB --sample 1000

# GPU stage (AIMNet2 minimization)
docker run --rm --gpus all -v "$PWD:/workspace" alchemia/gpu \
  python scripts/run_minimization.py --source BrNPDB --device cuda

# GPU stage (ADMET predictions)
docker run --rm --gpus all -v "$PWD:/workspace" alchemia/gpu \
  python scripts/run_admet.py --source BrNPDB

# PDBQT preparation
docker run --rm -v "$PWD:/workspace" alchemia/pdbqt \
  python scripts/run_pdbqt.py --source BrNPDB

# Full Snakemake pipeline (dry-run first)
docker run --rm -v "$PWD:/workspace" alchemia/snakemake \
  snakemake --cores 8 --dry-run

Pipeline Scripts

All scripts support --sample N, --dry-run, --resume:

python scripts/audit_repository.py          # Audit & inventory
python scripts/run_standardization.py       # Standardize + cross-dedup
python scripts/run_properties.py            # 2D descriptors + drug-likeness
python scripts/run_filters.py               # PAINS / Brenk / NIH flags
python scripts/run_fingerprints.py          # ECFP4/6 / MACCS / RDKit / AtomPair / Torsion
python scripts/run_conformers.py            # ETKDGv3 conformer generation (--shard-start/--shard-end/--output-suffix for parallel shards)
python scripts/patch_auto3d.py              # Patch Auto3D np.min([]) crash — must run before every AIMNet2 container
python scripts/run_minimization.py          # AIMNet2 energy minimization (--shard-start/--shard-end/--output-suffix)
python scripts/merge_minimized_sdf.py       # Complete 3D SDF export (--num-shards N)
python scripts/run_admet.py                 # ADMET-AI predictions
python scripts/run_pdbqt.py                 # PDBQT via meeko
python scripts/run_classification.py        # 196-class SMARTS taxonomy
python scripts/run_viz.py                   # Molecular image grids
python scripts/load_postgres.py             # PostgreSQL load
python scripts/orchestrate_gpus.py          # GPU watchdog — auto-dispatches pipeline tasks to idle GPUs
python scripts/dashboard.py                 # Terminal TUI — real-time pipeline progress and GPU monitoring
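
For orientation, the shared flags listed above could be wired up roughly as follows. This is a hedged sketch, not the contents of any script in `scripts/`. The flag names come from this README, but their types and defaults are assumptions. So is the reading of `--shard-start`/`--shard-end` as half-open row offsets, and `shard_bounds` is a hypothetical helper.

```python
import argparse
from typing import List, Tuple


def build_parser() -> argparse.ArgumentParser:
    """Common CLI surface assumed for the stage scripts."""
    parser = argparse.ArgumentParser(description="Pipeline stage (sketch)")
    parser.add_argument("--source", required=True, help="source name, e.g. BrNPDB")
    parser.add_argument("--sample", type=int, default=None, metavar="N",
                        help="process only the first N compounds")
    parser.add_argument("--dry-run", action="store_true",
                        help="report planned work without writing outputs")
    parser.add_argument("--resume", action="store_true",
                        help="skip chunks already recorded in the checkpoint")
    parser.add_argument("--shard-start", type=int, default=None)
    parser.add_argument("--shard-end", type=int, default=None)
    parser.add_argument("--output-suffix", default="")
    return parser


def shard_bounds(total: int, num_shards: int) -> List[Tuple[int, int]]:
    """Split `total` rows into contiguous half-open ranges of near-equal size."""
    base, extra = divmod(total, num_shards)
    bounds = []
    start = 0
    for index in range(num_shards):
        end = start + base + (1 if index < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds


args = build_parser().parse_args(["--source", "BrNPDB", "--sample", "1000", "--resume"])
print(args.source, args.sample, args.resume, args.dry_run)
print(shard_bounds(10, 3))  # [(0, 4), (4, 7), (7, 10)]
```

Each `(start, end)` pair could then be passed to one GPU worker, with `--output-suffix` keeping the per-shard outputs apart until they are merged.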

Repository Structure

Only GitHub-tracked files are shown (databases, pipeline outputs, and local agent files are gitignored):

alchemia/
├── .gitignore
├── LICENSE
├── README.md
├── docker-compose.yml          ← PostgreSQL, Redis, API services
├── environment.yml             ← Conda env: Python 3.11, RDKit, Polars, CUDA
├── pyproject.toml              ← Package definition + dev dependencies
│
├── assets/                     ← Logos, banners, pipeline diagrams
│   ├── logos/                  ← 6 SVG logo variants (white, dark, icon-only)
│   ├── alchemia_banner_v2.svg  ← Animated hero banner (this README)
│   └── alchemia_phase01_pipeline.svg  ← Phase 01 pipeline diagram
│
├── configs/
│   ├── hardware.yaml           ← CPU/GPU/RAM tuning, checkpoint cadence
│   ├── pipeline.yaml           ← Stage settings, batch sizes, paths
│   └── sources/                ← Per-source YAML configs (6 files)
│
├── docker/
│   ├── build.sh                ← Build all 5 images in dependency order
│   ├── base/                   ← Miniconda + conda-forge
│   ├── cpu/                    ← RDKit, Polars, Snakemake
│   ├── gpu/                    ← PyTorch 2.8 + ADMET-AI + AIMNet2 (CUDA 12.8)
│   ├── pdbqt/                  ← Python 3.11 + meeko + gemmi
│   ├── snakemake/              ← Snakemake orchestrator
│   └── postgres/               ← RDKit cartridge init SQL
│
├── pipeline/
│   ├── Snakefile               ← Main DAG entry point
│   ├── config/pipeline.yaml
│   ├── config/profiles/dgx/   ← NVIDIA DGX A100 Snakemake profile
│   └── rules/                  ← 14 .smk rule files (00_audit → 13_validate)
│
├── scripts/                    ← Thin CLI entrypoints (delegate to src/)
│   └── (13 scripts, one per pipeline stage)
│
├── src/
│   └── alchemia/               ← Main Python package
│       ├── admet/              ← ADMET-AI runner, chunked + checkpointed
│       ├── classification/     ← 196-class SMARTS classifier
│       ├── conformers/         ← ETKDGv3 generator + 3D QC
│       ├── descriptors/        ← 2D properties + drug-likeness
│       ├── filters/            ← PAINS, Brenk, NIH, toxicophores
│       ├── fingerprints/       ← ECFP4/6, MACCS, RDKit FP, AtomPair, Torsion
│       ├── io/                 ← Streaming SDF/MOL2/CSV readers
│       ├── minimization/       ← AIMNet2 via Auto3D + MMFF94s fallback
│       ├── pdbqt/              ← meeko PDBQT preparation
│       ├── postgres/           ← Schema, staging, loaders
│       ├── sources/            ← Per-source profiler + parser
│       ├── standardization/    ← RDKit MolStandardize + InChI dedup
│       ├── utils/              ← Logging, checksums, QC logger
│       └── viz/                ← Molecular image grid generator
│
├── tests/unit/                 ← 22 test modules, 63+ tests
│
├── utils/
│   ├── pains.json              ← 480+ PAINS SMARTS patterns
│   ├── smarts.json             ← 196-class chemical taxonomy
│   └── unwanted_substructures.csv
│
├── docs/
│   ├── decisions/              ← Architecture Decision Records
│   ├── runbooks/               ← current_state.md, next_actions.md
│   └── superpowers/            ← Implementation plans + design specs
│
└── data/                       ← Pipeline outputs (gitignored; .gitkeep only)
    ├── admet/ · classification/ · conformers/ · fingerprints/
    ├── minimization/ · pdbqt/ · properties/ · standardized/
    └── tables/ · viz/

References

Tools & Models

Tool	Citation
RDKit	Landrum G. RDKit: Open-source cheminformatics. rdkit.org
Auto3D	Liu Z et al. J Chem Inf Model. 2022;62:5373. doi:10.1021/acs.jcim.2c00817
AIMNet2	Anstine DM et al. ChemRxiv 2023. doi:10.26434/chemrxiv-2023-296ch
ADMET-AI	Swanson K et al. Bioinformatics 2024. doi:10.1093/bioinformatics/btae416
meeko	Forli S et al. AutoDock Meeko. github.com/forlilab/meeko
ECFP	Rogers D, Hahn M. J Chem Inf Model. 2010;50:742. doi:10.1021/ci100050t
PAINS	Baell JB, Holloway GA. J Med Chem. 2010;53:2719. doi:10.1021/jm901137j

Databases

Database	Citation
COCONUT	Sorokina M et al. J Cheminform. 2021;13:2. doi:10.1186/s13321-020-00478-9
ChEMBL	Mendez D et al. Nucleic Acids Res. 2019;47:D930. doi:10.1093/nar/gky1075

ALCHEMIA Database · Accelerating Drug Discovery Through Open Science
preview.alchemiadatabase.com/pt · Sign up for early access
