An interactive web-based database for exploring chemical compounds across multiple toxicological and pharmacological categories. Browse 3000+ unique chemicals with SMILES notation, toxicity predictions, and custom classification scores.
This database may contain classification errors. The data was aggregated from multiple sources and some compounds may be incorrectly categorized. If you find a misclassified compound, please open an issue with the compound name and the correct classification.
This is an experimental reference tool, not an authoritative source. Predictions are based on molecular structure pattern matching and have significant limitations. Do not base medical or safety decisions on this data alone.
- 3000+ Chemical Compounds across 4 major categories (4,177 training entries, deduplicated by SMILES)
- SMILES Notation for each compound with molecular structure data
- Toxicity Predictions using EtoxPred model (tox-score & SA-score)
- Custom Classification Scores for carcinogenicity, psychoactivity, and endocrine disruption
- Interactive Search & Filtering with real-time results
- Dark-Themed UI optimized for readability
- Sortable Columns for easy data exploration
- Static SQLite Database - runs entirely client-side (via sql.js)
chemical-database-repo/
├── site/
│ ├── index.html # Main web interface
│ └── chemicals.db # SQLite database with all chemical data
├── training_data/
│ ├── carcinogens.smi # 736 known carcinogens
│ ├── endocrine_disruptors.smi # 732 endocrine disruptors
│ ├── nootropics.smi # 887 nootropic compounds
│ └── psychoactive_drugs.smi # 1616 psychoactive substances
├── etoxpred/
│ ├── custom_scores.csv # Classification predictions
│ └── results_*.csv # EtoxPred toxicity scores per category
└── README.md
Simply open site/index.html in any modern web browser. No server required - the entire database runs client-side using sql.js.
# Option 1: Direct file
open site/index.html
# Option 2: Local server (optional)
cd site && python3 -m http.server 8000
# Visit http://localhost:8000The database contains a single chemicals table:
| Column | Type | Description |
|---|---|---|
id |
INTEGER | Primary key |
name |
TEXT | Chemical name (e.g., "Caffeine") |
smiles |
TEXT | SMILES notation (e.g., "CN1C=NC2=C1C(=O)N(C(=O)N2C)C") |
category |
TEXT | Comma-separated categories (nootropic, carcinogen, etc.) |
tox_score |
REAL | EtoxPred toxicity prediction (0-1, higher = more toxic) |
sa_score |
REAL | Synthetic accessibility score (0-1, normalized, higher = harder to synthesize) |
carc_score |
REAL | Carcinogenicity prediction (0-1) |
psych_score |
REAL | Psychoactivity prediction (0-1) |
endo_score |
REAL | Endocrine disruption prediction (0-1) |
A single-page application that runs entirely in the browser:
- sql.js: WebAssembly SQLite engine loads the database
- No backend required: Everything runs client-side
- Real-time search: Filters as you type
- Sortable columns: Click headers to sort
- Category filtering: Quick filtering by compound class
- Score visualization: Color-coded toxicity indicators
Features:
- Dark theme optimized for chemical data
- Responsive design (mobile-friendly)
- Inline help guide with scoring explanations
- Displays SMILES, categories, and all prediction scores
| Category | Count | Description |
|---|---|---|
| Nootropics | 887 | Cognitive enhancers, smart drugs, memory supplements |
| Psychoactive Drugs | 1616 | CNS-active compounds (stimulants, depressants, hallucinogens) |
| Carcinogens | 736 | Known or suspected cancer-causing agents |
| Endocrine Disruptors | 732 | Compounds that interfere with hormone systems |
All training data uses simplified molecular-input line-entry system (SMILES):
C1=CC=C(C=C1)O # Phenol
CN1C=NC2=C1C(=O)N(C(=O)N2C)C # Caffeine
CC(C)CC1=CC=C(C=C1)C(C)C(=O)O # Ibuprofen
- Tab-separated values:
SMILES<tab>Name - Canonical SMILES from PubChem ensures consistency
- Each compound appears once per category file
- Compounds may exist in multiple category files
The database includes predictions from the EtoxPred model:
-
Tox-score (0-1): Overall toxicity prediction
- 0.0-0.3: Low toxicity (green)
- 0.3-0.7: Moderate toxicity (yellow)
- 0.7-1.0: High toxicity (red)
-
SA-score (1-10): Synthetic accessibility
- 1-3: Easy to synthesize
- 4-7: Moderate difficulty
- 8-10: Very difficult to synthesize
Machine learning models trained on the labeled training data:
- carc_score: Carcinogenicity prediction (0-1)
- psych_score: Psychoactivity prediction (0-1)
- endo_score: Endocrine disruption prediction (0-1)
Each classifier is trained as a binary one-vs-rest model:
- Features: Morgan fingerprints (ECFP4, radius=2, 1024 bits) via RDKit
- Positives: Compounds from the target category's
.smifile - Negatives: ~4,900 FDA-approved drugs only (not compounds from other categories, to avoid label leakage)
- Balancing: Negatives downsampled to 2:1 ratio vs positives
- Split: 80/20 train/test with scaffold split (Murcko decomposition) to prevent structural leakage
- Model: ExtraTreesClassifier (400 trees, min_samples_split=10, min_samples_leaf=3)
- Output: Probability score (0-1) from
predict_proba
Nootropics are not used in training (too heterogeneous) — they're only scored by the trained models.
| Model | ROC-AUC | PR-AUC | Accuracy |
|---|---|---|---|
| Carcinogen | 0.92 | 0.91 | 76% |
| Psychoactive | 0.86 | 0.72 | 84% |
| Endocrine Disruptor | 0.82 | 0.69 | 82% |
Note: Recall on positives is moderate (42-53%) due to conservative predictions — the models favor precision over recall.
- Prodrugs invisible: Structure-based fingerprints can't predict metabolic activation (e.g., 1,4-butanediol → GHB scores only 0.36 despite being psychoactive via metabolism)
- Small training sets: ~700-1100 positives per category limits generalization
- FDA negatives bias: Negatives are biased toward drug-like molecules
- None! Pure HTML/CSS/JavaScript
- sql.js loaded from CDN
The training script is in scripts/. Uses uv for dependency management:
cd scripts
uv sync
uv run python train_classifiers.pyOr with pip:
pip install rdkit scikit-learn pandas numpy joblib
python scripts/train_classifiers.py- Chemical Names: Curated from scientific literature, databases, and regulatory lists
- SMILES Notation: PubChem REST API
- Toxicity Predictions: EtoxPred (pre-trained model)
- Classifications: Labeled training data from multiple sources
Contributions welcome! Ways to help:
- Add more chemicals: Submit
.smifiles or chemical name lists - Improve predictions: Train better classification models
- Enhance UI: Add visualizations, filters, or export features
- Documentation: Add compound descriptions or sources
- Code: GNU General Public License v3.0 - see LICENSE
- Data: Chemical structures from PubChem (public domain)
- Models: EtoxPred and custom classifiers are provided as-is for research use
This database is for research and educational purposes only. Toxicity predictions are computational estimates and should not be used for:
- Medical decision-making
- Regulatory compliance
- Consumer product safety assessments
- Legal or forensic purposes
Always consult authoritative sources (EPA, FDA, ECHA) and perform proper experimental validation for any toxicological or pharmacological claims.
Built with 🧪 chemistry and ⚡ code