Cell-Penetrating Peptide Activity Prediction (CPP)

ML pipeline for CPP classification and cellular uptake regression, built on the POSEIDON experimental database. Developed as part of the DataCon 3.0 hackathon (AI in Chemistry track).

Problem

Cell-penetrating peptides (CPPs) are short amino acid sequences capable of crossing cell membranes, enabling intracellular delivery of drugs, nucleic acids, and proteins. The goal is to use ML models to identify and design CPPs with superior activity, reducing the need for costly wet-lab screening.

Pipeline

Data
♡ Source: POSEIDON database (~2 000 experimental records)
♡ Preprocessing: parsing of inconsistently formatted uptake values (±, <, and / notation), normalization of time and temperature, IQR-based outlier removal, filtering of modified sequences
♡ Feature engineering: RDKit molecular descriptors (MolWt, TPSA, MolLogP, BertzCT, BalabanJ, HeavyAtomCount, NHOHCount, NOCount, RingCount, Ipc, LabuteASA, MolMR, qed), molecular mass via molmass, binary CPP label from curated FASTA/txt sources
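
The uptake parsing and IQR-based outlier removal listed above can be sketched in plain Python. This is a minimal illustration and not the notebook's actual code: the function names `parse_uptake` and `iqr_filter` are hypothetical, and the interpretation of each notation (keep the mean of "a ± b", take the bound of "<a", average "a/b") is an assumption, as is the conventional 1.5 × IQR fence.

```python
import re
import statistics

_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def parse_uptake(raw):
    """Return one float from a messy uptake string, or None if no number is found."""
    text = raw.strip()
    # "12.5 ± 3.1" or "12.5 +/- 3.1": keep the mean, drop the spread (assumption).
    text = re.split(r"±|\+/-", text)[0]
    # "<5": treat the reported bound as the value (assumption).
    text = text.lstrip("<>≤≥ ")
    # "10/20": average the slash-separated values (assumption).
    matches = [_NUMBER.search(part) for part in text.split("/")]
    values = [float(m.group()) for m in matches if m]
    return statistics.fmean(values) if values else None


def iqr_filter(values, k=1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


raw_values = ["12.5 ± 3.1", "<5", "10/20", "n/a"]
parsed = [parse_uptake(v) for v in raw_values]
cleaned = iqr_filter([10, 12, 11, 13, 12, 500])
```

Unparseable entries come back as `None`, so they can be dropped or imputed explicitly rather than silently coerced to a number.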

Task 1 — CPP Classification
♡ Model: Random Forest (MinMaxScaler + SimpleImputer pipeline)
♡ Features: RDKit descriptors
♡ Target: CPP / non-CPP binary label

Metric	Score
Accuracy	0.802
Precision	0.804
Recall	0.802
F1-score	0.801
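
The README does not say how precision, recall, and F1 were averaged across the two classes. Recall matching accuracy to three decimals is consistent with support-weighted averaging (the `average="weighted"` mode in scikit-learn), because weighted recall always equals accuracy. The sketch below, which assumes that averaging mode and uses made-up toy labels, computes the weighted scores from scratch to show the relationship.

```python
from collections import Counter


def weighted_scores(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall, and F1."""
    n = len(y_true)
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / n
    precision = recall = f1 = 0.0
    for label, support in Counter(y_true).items():
        tp = sum(t == label and p == label for t, p in pairs)
        predicted = sum(p == label for p in y_pred)
        p_c = tp / predicted if predicted else 0.0
        r_c = tp / support
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = support / n  # each class counts in proportion to its support
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels for illustration only (1 = CPP, 0 = non-CPP); not project data.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
scores = weighted_scores(y_true, y_pred)
```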

Task 2 — Cellular Uptake Regression
♡ Model: CatBoost (grid search over iterations, learning rate, depth)
♡ Features: descriptors + experimental conditions (cell line, cargo, method, time, temp)
♡ Target: normalized uptake mean
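
An exhaustive grid search over the three hyperparameters named above can be sketched as follows. The grid values, the `grid_search` helper, and the `toy_evaluate` stand-in are all hypothetical; in the real pipeline the evaluation function would fit a CatBoost regressor with the given parameters and return a validation error, and the README does not state which values or which search routine were actually used.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Try every parameter combination and return the one with the lowest score."""
    names = list(param_grid)
    best_params, best_score = None, float("inf")
    for combo in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = evaluate(params)  # lower is better, e.g. validation RMSE
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical grid; the actual search ranges are not given in the README.
param_grid = {
    "iterations": [200, 500],
    "learning_rate": [0.03, 0.1],
    "depth": [4, 6, 8],
}


def toy_evaluate(params):
    """Stand-in for 'fit the model, return validation error' (illustration only)."""
    return (
        abs(params["depth"] - 6)
        + abs(params["learning_rate"] - 0.1)
        + abs(params["iterations"] - 500) / 1000
    )


best_params, best_score = grid_search(param_grid, toy_evaluate)
```

The cost grows multiplicatively with each added value (here 2 × 2 × 3 = 12 fits), which is worth keeping in mind before widening the grid.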

Metric	Score
R²	0.79
MAE	69.56
RMSE	225.13
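
For reference, the three regression metrics can be computed as below. RMSE is never smaller than MAE, and a large gap between them, as in the table above, generally suggests that a small number of large errors dominate the squared loss. The toy values are invented to show that effect and are not project data.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R², MAE, and RMSE for paired true and predicted values."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    ss_res = sum(r * r for r in residuals)               # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    return {
        "r2": 1 - ss_res / ss_tot,
        "mae": sum(abs(r) for r in residuals) / n,
        "rmse": math.sqrt(ss_res / n),
    }


# Toy data: four small errors and one large one.
y_true = [10, 20, 30, 40, 1000]
y_pred = [12, 18, 33, 37, 700]
metrics = regression_metrics(y_true, y_pred)
```

In this toy case the single large residual pulls RMSE well above MAE while R² stays high, the same qualitative pattern as the reported scores.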
