ML pipeline for CPP classification and cellular uptake regression, built on the POSEIDON experimental database. Developed as part of the DataCon 3.0 hackathon (AI in Chemistry track).
Cell-penetrating peptides (CPPs) are short amino acid sequences capable of crossing cell membranes, enabling intracellular delivery of drugs, nucleic acids, and proteins. The goal is to use ML models to design CPPs with superior activity, reducing the need for costly wet-lab screening.
Data
♡ Source: POSEIDON database (~2 000 experimental records)
♡ Preprocessing: parsing of inconsistently formatted uptake values (±, <, / notation),
time and temperature normalization, IQR-based outlier removal, filtering of modified sequences
♡ Feature engineering: RDKit molecular descriptors (MolWt, TPSA, MolLogP, BertzCT,
BalabanJ, HeavyAtomCount, NHOHCount, NOCount, RingCount, Ipc, LabuteASA, MolMR, qed),
molecular mass via molmass, binary CPP label from curated FASTA/txt sources
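The uptake parsing and IQR filtering steps listed above could look roughly like the sketch below. It is not the project's actual code: the names `parse_uptake` and `iqr_filter`, the reading of each notation (`a ± b` keeps the mean, `<x` keeps the reported bound, `a/b` averages the listed values), and the 1.5 × IQR fence are all assumptions made for illustration.

```python
import re
import statistics

# Plain decimal number, optionally signed (scientific notation is not handled).
_NUM = r"[-+]?\d+(?:\.\d+)?"


def parse_uptake(raw):
    """Return one float for a raw uptake string, or None if it cannot be parsed.

    Assumed conventions (not confirmed by the source data):
      "12.5 ± 1.3" -> 12.5  (keep the mean, drop the spread)
      "<5"         -> 5.0   (keep the reported bound)
      "10/20"      -> 15.0  (average the listed values)
    """
    text = raw.strip().replace("+/-", "±")
    if not text:
        return None
    if "±" in text:
        text = text.split("±")[0]
    text = text.lstrip("<>≤≥~ ")
    values = []
    for part in text.split("/"):
        part = part.strip()
        if not re.fullmatch(_NUM, part):
            return None
        values.append(float(part))
    return statistics.fmean(values)


def iqr_filter(values, k=1.5):
    """Keep only values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


# Hypothetical raw entries, made up for this example.
raw = ["12.5 ± 1.3", "<5", "10/20", "n.d.", "14", "900"]
parsed = [v for v in map(parse_uptake, raw) if v is not None]
clean = iqr_filter(parsed)
print(parsed, clean)
```

Returning `None` for unparseable entries keeps the decision to drop or impute them outside the parser, which makes the assumption about each notation easy to revisit.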
Task 1 — CPP Classification
♡ Model: Random Forest (MinMaxScaler + SimpleImputer pipeline)
♡ Features: RDKit descriptors
♡ Target: CPP / non-CPP binary label
| Metric | Score |
|---|---|
| Accuracy | 0.802 |
| Precision | 0.804 |
| Recall | 0.802 |
| F1-score | 0.801 |
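A classifier of this shape can be sketched with scikit-learn as follows. This is an illustrative setup rather than the project's code: the synthetic 13-column descriptor matrix, the median imputation strategy, the order of the pipeline steps, the stratified 80/20 split, and the hyperparameters are assumptions. The averaging mode for precision, recall, and F1 is also an assumption.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the descriptor table (13 columns, one per descriptor).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 13))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # made-up CPP / non-CPP label
X[rng.random(X.shape) < 0.05] = np.nan         # simulate missing descriptor values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

clf = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # fill missing descriptors
    ("scale", MinMaxScaler()),                     # rescale each feature to [0, 1]
    ("model", RandomForestClassifier(n_estimators=200, random_state=0)),
])
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, y_pred, average="weighted"
)
print(f"acc={accuracy:.3f} prec={precision:.3f} rec={recall:.3f} f1={f1:.3f}")
```

With `average="weighted"`, recall is always equal to accuracy. That would be consistent with the identical accuracy and recall values in the table, although the source does not state which averaging was used.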
Task 2 — Cellular Uptake Regression
♡ Model: CatBoost (grid search over iterations, learning rate, depth)
♡ Features: descriptors + experimental conditions (cell line, cargo, method, time, temp)
♡ Target: normalized uptake mean
| Metric | Score |
|---|---|
| R² | 0.79 |
| MAE | 69.56 |
| RMSE | 225.13 |
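For reference, the three regression metrics can be computed as in the minimal sketch below. The `y_true` and `y_pred` values are made up for illustration and are not project data; `regression_metrics` is a hypothetical helper name.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute R², MAE and RMSE from paired true and predicted values."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)             # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mae": sum(abs(r) for r in residuals) / n,
        "rmse": math.sqrt(ss_res / n),
    }


# Four small errors and one large miss on a high-uptake sample.
y_true = [10.0, 20.0, 30.0, 40.0, 500.0]
y_pred = [12.0, 18.0, 33.0, 39.0, 400.0]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

Because RMSE squares each residual before averaging, a few large misses raise it far more than they raise MAE. That is one plausible reading of the gap between the two scores in the table, but it would need to be checked against the actual residuals.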
Data sources: POSEIDON database · CPPBase (FASTA) · Experimental CPP/non-CPP sequence lists