tame2tame/datacon2024

Cell-Penetrating Peptide Activity Prediction (CPP)

ML pipeline for CPP classification and cellular uptake regression, built on the POSEIDON experimental database. Developed as part of the DataCon 3.0 hackathon (AI in Chemistry track).

Problem

Cell-penetrating peptides (CPPs) are short amino acid sequences capable of crossing cell membranes, enabling intracellular delivery of drugs, nucleic acids, and proteins. The goal is to use ML models to design CPPs with superior activity, reducing the need for costly wet-lab screening.

Pipeline

Data
♡ Source: POSEIDON database (~2 000 experimental records)
♡ Preprocessing: parsing of inconsistently formatted uptake values (±, <, and / notation), time and temperature normalization, IQR-based outlier removal, filtering of modified sequences
♡ Feature engineering: RDKit molecular descriptors (MolWt, TPSA, MolLogP, BertzCT, BalabanJ, HeavyAtomCount, NHOHCount, NOCount, RingCount, Ipc, LabuteASA, MolMR, qed), molecular mass via molmass, binary CPP label from curated FASTA/txt sources
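The preprocessing steps above can be sketched in plain Python. This is a minimal illustration, not the repository's code: the function names `parse_uptake` and `iqr_filter` are hypothetical, and the meaning given to each notation (`±` separates a mean from its spread, `<` marks an upper bound, `/` separates replicate values) is an assumption, since the README does not define them.

```python
import re
import statistics

_NUMBER = re.compile(r"[-+]?\d+(?:\.\d+)?")


def parse_uptake(raw):
    """Turn a raw uptake string into one float, or None if no number is found."""
    text = raw.strip()
    # Assumption: "<x" is an upper bound; keep the bound itself.
    text = text.lstrip("<").strip()
    # Assumption: "a ± b" is mean ± spread; keep only the mean.
    text = text.split("±")[0]
    # Assumption: "a/b" lists replicate values; average them.
    values = []
    for part in text.split("/"):
        match = _NUMBER.search(part)
        if match:
            values.append(float(match.group()))
    return statistics.mean(values) if values else None


def iqr_filter(values, k=1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR]; k=1.5 is the usual default."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if low <= v <= high]


raw_records = ["12.3 ± 1.5", "<5", "10/20", "n/a"]
parsed = [parse_uptake(r) for r in raw_records]
kept = iqr_filter([10, 12, 11, 13, 12, 11, 500])
```

In a pandas workflow the same logic would typically be applied column-wise, with the IQR bounds computed on the training split only.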

Task 1 — CPP Classification
♡ Model: Random Forest (MinMaxScaler + SimpleImputer pipeline)
♡ Features: RDKit descriptors
♡ Target: CPP / non-CPP binary label
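A classifier of this shape can be assembled with scikit-learn as sketched below. The step order (imputation before scaling), the median imputation strategy, the number of trees, and the synthetic data are all assumptions made for illustration; the README names only the three components.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler


def build_cpp_classifier(random_state=0):
    """Imputer -> scaler -> Random Forest; hyperparameters are placeholders."""
    return make_pipeline(
        SimpleImputer(strategy="median"),  # assumed strategy
        MinMaxScaler(),
        RandomForestClassifier(n_estimators=200, random_state=random_state),
    )


# Synthetic stand-in for a table of 13 descriptors per peptide (not real data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 13))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # toy CPP / non-CPP label
X[rng.random(X.shape) < 0.05] = np.nan    # simulate missing descriptor values

clf = build_cpp_classifier().fit(X[:150], y[:150])
predictions = clf.predict(X[150:])
```

Wrapping the imputer and scaler in the pipeline means both are fit on the training rows only, which avoids leaking test-set statistics into the model.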

| Metric | Score |
|---|---|
| Accuracy | 0.802 |
| Precision | 0.804 |
| Recall | 0.802 |
| F1-score | 0.801 |
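The README does not say how the per-class scores were averaged. The reported recall equals the reported accuracy, which is what support-weighted averaging always produces, so the sketch below assumes weighted averaging. The function name `weighted_scores` and the toy labels are hypothetical.

```python
from collections import Counter


def weighted_scores(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall, and F1."""
    n = len(y_true)
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / n
    precision = recall = f1 = 0.0
    for label, support in Counter(y_true).items():
        tp = sum(t == label and p == label for t, p in pairs)
        predicted = sum(p == label for p in y_pred)
        p_c = tp / predicted if predicted else 0.0
        r_c = tp / support
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = support / n  # each class counts in proportion to its support
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


scores = weighted_scores([1, 1, 1, 0, 0, 0, 0, 1], [1, 0, 1, 0, 0, 1, 0, 1])
```

With weighted averaging, each class's recall is multiplied by its share of the samples, so the sum reduces to total correct predictions over total samples, which is the accuracy.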

Task 2 — Cellular Uptake Regression
♡ Model: CatBoost (grid search over iterations, learning rate, depth)
♡ Features: descriptors + experimental conditions (cell line, cargo, method, time, temp)
♡ Target: normalized uptake mean
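A grid search over these three hyperparameters could look like the sketch below. The grid values, the helper names `expand_grid` and `tune_catboost`, and the 3-fold setting are assumptions; the README states only which parameters were tuned. CatBoost is imported inside the function so the grid helper can be used without it installed.

```python
from itertools import product

# Hypothetical search space; the actual values are not given in the README.
PARAM_GRID = {
    "iterations": [500, 1000],
    "learning_rate": [0.03, 0.1],
    "depth": [4, 6, 8],
}


def expand_grid(grid):
    """List every hyperparameter combination the search would evaluate."""
    keys = list(grid)
    return [dict(zip(keys, combo)) for combo in product(*(grid[k] for k in keys))]


def tune_catboost(X, y, cat_features, grid=PARAM_GRID):
    """Run CatBoost's built-in grid search and return (fitted model, best params).

    cat_features names the categorical columns, e.g. cell line, cargo, method.
    """
    from catboost import CatBoostRegressor  # lazy import, see note above

    model = CatBoostRegressor(
        loss_function="RMSE", cat_features=cat_features, verbose=False
    )
    # grid_search refits the model on the best combination by default.
    result = model.grid_search(grid, X=X, y=y, cv=3, verbose=False)
    return model, result["params"]


combinations = expand_grid(PARAM_GRID)
```

Passing the categorical experimental conditions through `cat_features` lets CatBoost encode them internally, so no one-hot encoding step is needed beforehand.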

| Metric | Score |
|---|---|
| R² | 0.79 |
| MAE | 69.56 |
| RMSE | 225.13 |
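For reference, the three regression metrics can be computed as in the sketch below. The function name `regression_scores` and the toy values are hypothetical and unrelated to the POSEIDON data. The example includes one large error to show how RMSE can sit well above MAE, the same pattern as in the reported scores.

```python
import math


def regression_scores(y_true, y_pred):
    """Return R², MAE, and RMSE for paired true and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    squared_error = sum(e * e for e in errors)
    mean_true = sum(y_true) / n
    total_variance = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "r2": 1.0 - squared_error / total_variance,
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(squared_error / n),
    }


# Toy values: three small errors and one large one.
scores = regression_scores([10, 20, 30, 400], [12, 18, 33, 300])
```

Because RMSE squares each error before averaging, a few badly predicted records dominate it, whereas MAE weights all errors linearly.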

Stack

Python Pandas RDKit BioPython molmass scikit-learn CatBoost

Data Sources

POSEIDON database · CPPBase (FASTA) · Experimental CPP/non-CPP sequence lists

Materials

About

CPP classification (F1 0.80) and cellular uptake regression (R² 0.79) on the POSEIDON database · DataCon 3.0 · Python · RDKit · CatBoost
