This is a college project in which we investigated whether RNA sequencing (RNA-seq) data alone can be used to detect Triple-Negative Breast Cancer (TNBC) using machine learning techniques. TNBC is an aggressive subtype of breast cancer that responds particularly poorly to treatment. Using data from the TCGA-BRCA project, we developed a binary classification model to distinguish TNBC from non-TNBC cases. The goal was to evaluate different feature selection methods, compare model performance, and analyze the explainability of the best-performing models.
Ruben Holthuijsen (@rhlt) • Sander van Swieten (@1063788) • Vince van Doorn (@Vinciepincie) • Victor de Sousa Gama (@VictordeSousaGama) • Kevin Hartman (@Kevin4032)
- Python 3.8+
- Jupyter Notebook 7.2+
- Required packages:
  - Pandas
  - scikit-learn
- Data from the TCGA-BRCA project
Below are the steps of the workflow we implemented. Each step (except Documentation) has its own folder containing one or more Jupyter notebooks where that step happens. Every notebook imports existing data from the Data folder, works on it, and exports the result to a new file in the Data folder; these results are not committed to the repository. For every step, different variants may exist, for example for developing different models or selecting different features.
- Data acquisition: Case data from the TCGA-BRCA project should go in the Data folder. See the README in that folder for instructions on how to acquire the data.
- Data loading: Cleaning, filtering, and classification (label generation) of the data is done in the DataLoading folder. This step also links the clinical data to the correct RNA-seq files (see the labeling sketch after this list).
- Preprocessing: Gene expression data is scaled and standardised in the Preprocessing folder (see the preprocessing sketch after this list).
- Features: Features (genes) are selected based on literature and other methods in the Features folder (the LASSO variant is sketched after this list).
- Model development: Different models are trained in the Model folder (a Random Forest sketch follows this list).
- Model evaluation: Performance of the trained models is evaluated in the Evaluation folder (metrics are computed in the same sketch).
- Model analysis: The Analysis folder contains the results of our analysis of the different models and methods, which we used to select the best-performing model.
- Documentation and reporting: The final report, as well as the original proposal, can be found in the top folder (here).
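As an illustration of the label generation in the Data loading step, here is a minimal sketch. The file path and receptor-status column names are hypothetical (the actual TCGA-BRCA clinical fields have different, longer names); only the rule itself is fixed: a case is TNBC when ER, PR, and HER2 status are all negative.

```python
import pandas as pd

# Hypothetical column names; the real TCGA-BRCA clinical fields are longer
# (e.g. "breast_carcinoma_estrogen_receptor_status").
RECEPTOR_COLS = ["er_status", "pr_status", "her2_status"]

def label_tnbc(clinical: pd.DataFrame) -> pd.Series:
    """Label a case TNBC (1) when ER, PR, and HER2 are all negative."""
    negative = clinical[RECEPTOR_COLS].eq("Negative")
    return negative.all(axis=1).astype(int)

clinical = pd.read_csv("Data/clinical.csv")  # hypothetical file name
clinical["tnbc"] = label_tnbc(clinical)
```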
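The Preprocessing and Features steps fit together naturally: standardise each gene, then let an L1-penalised model pick the genes. A minimal scikit-learn sketch of LASSO-based selection, assuming hypothetical file names and an illustrative regularisation strength (the notebooks may use different settings):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel

# X: samples x genes expression matrix, y: TNBC labels (hypothetical files)
X = pd.read_csv("Data/expression.csv", index_col=0)
y = pd.read_csv("Data/labels.csv", index_col=0)["tnbc"]

# Standardise each gene to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X)

# L1-penalised logistic regression drives most coefficients to zero;
# the genes with non-zero coefficients form the LASSO-selected feature set.
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
selector = SelectFromModel(lasso).fit(X_scaled, y)
selected_genes = X.columns[selector.get_support()]
print(f"{len(selected_genes)} genes selected")
```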
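For the Model development and Model evaluation steps, a minimal Random Forest sketch that continues from the feature selection sketch above. The split ratio and hyperparameters here are illustrative, not the project's actual settings:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score

# Restrict the (unscaled) expression matrix to the LASSO-selected genes;
# Random Forests are insensitive to feature scaling.
X_sel = X[selected_genes]

X_train, X_test, y_train, y_test = train_test_split(
    X_sel, y, test_size=0.2, stratify=y, random_state=42
)

rf = RandomForestClassifier(n_estimators=500, random_state=42)
rf.fit(X_train, y_train)

y_pred = rf.predict(X_test)
print(f"Precision: {precision_score(y_test, y_pred):.2%}")
print(f"Recall:    {recall_score(y_test, y_pred):.2%}")
print(f"F1-score:  {f1_score(y_test, y_pred):.2%}")
```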
The whole process and all results are documented in the final report included in this repository. In summary, the best-performing model in our tests was a Random Forest using features selected via LASSO: it achieved a precision of 75.86%, a recall of 95.65%, and an F1-score of 84.62%. Additional explainability was achieved by applying LIME and SHAP; the SHAP part is sketched below.
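A minimal sketch of the SHAP side of the explainability analysis (the exact plots in the notebooks may differ), assuming the `shap` package plus the `rf` model and `X_test` split from the Random Forest sketch above:

```python
import shap

# TreeExplainer computes SHAP values efficiently for tree ensembles
# such as the Random Forest trained above.
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_test)

# Depending on the shap version, a binary classifier yields a list of
# per-class arrays or a single 3-D array; take the positive (TNBC) class.
vals = shap_values[1] if isinstance(shap_values, list) else shap_values[..., 1]

# Rank genes by their mean absolute contribution to the TNBC prediction.
shap.summary_plot(vals, X_test)
```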