This is a college project in which we investigated whether RNA sequencing (RNA-seq) data alone can be used to detect Triple-Negative Breast Cancer (TNBC) using machine learning techniques. TNBC is an aggressive subtype of breast cancer that responds particularly poorly to treatment. Using data from the TCGA-BRCA project, we developed a binary classification model to distinguish TNBC from non-TNBC cases. The goal was to evaluate different feature selection methods, compare model performance, and analyze the explainability of the best-performing models.
Ruben Holthuijsen (@rhlt) • Sander van Swieten (@1063788) • Vince van Doorn (@Vinciepincie) • Victor de Sousa Gama (@VictordeSousaGama) • Kevin Hartman (@Kevin4032)
- Python 3.8+
- Jupyter Notebook 7.2+
- Required packages:
  - Pandas
  - scikit-learn
- Data from the TCGA-BRCA project
Below are the steps of the workflow we implemented. Each step (except Documentation) has its own folder containing one or more Jupyter notebooks where that step happens. Every notebook imports existing data from the Data folder, works on it, and exports the result to a new file in the Data folder; these results are not committed to the repository. For every step, different variants may exist, for example for developing different models or selecting different features.
- Data acquisition: Case data from the TCGA-BRCA project should go in the Data folder. See the README in that folder for instructions on how to acquire the data.
- Data loading: Cleaning, filtering, and classification (label generation) of the data is done in the DataLoading folder. This step also links the clinical data to the correct RNA-seq files (see the labeling sketch after this list).
- Preprocessing: Gene expression data is scaled and standardised in the Preprocessing folder (see the preprocessing sketch after this list).
- Features: Features (genes) are selected based on literature and other methods in the Features folder (the LASSO variant is sketched after this list).
- Model development: Different models are trained in the Model folder (a Random Forest sketch follows this list).
- Model evaluation: Performance of the trained models is evaluated in the Evaluation folder (metrics are computed in the same sketch).
- Model analysis: The Analysis folder contains the results of our analysis of the different models and methods, which we used to select the best-performing model.
- Documentation and reporting: The final report, as well as the original proposal, can be found in the top folder (here).
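As an illustration of the label generation in the Data loading step, here is a minimal sketch. The file path and receptor-status column names are hypothetical (the actual TCGA-BRCA clinical fields have different, longer names); only the rule itself is fixed: a case is TNBC when ER, PR, and HER2 status are all negative.

```python
import pandas as pd

# Hypothetical column names; the real TCGA-BRCA clinical fields are longer
# (e.g. "breast_carcinoma_estrogen_receptor_status").
RECEPTOR_COLS = ["er_status", "pr_status", "her2_status"]

def label_tnbc(clinical: pd.DataFrame) -> pd.Series:
    """Label a case TNBC (1) when ER, PR, and HER2 are all negative."""
    negative = clinical[RECEPTOR_COLS].eq("Negative")
    return negative.all(axis=1).astype(int)

clinical = pd.read_csv("Data/clinical.csv")  # hypothetical file name
clinical["tnbc"] = label_tnbc(clinical)
```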
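The Preprocessing and Features steps fit together naturally: standardise each gene, then let an L1-penalised model pick the genes. A minimal scikit-learn sketch of LASSO-based selection, assuming hypothetical file names and an illustrative regularisation strength (the notebooks may use different settings):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel

# X: samples x genes expression matrix, y: TNBC labels (hypothetical files)
X = pd.read_csv("Data/expression.csv", index_col=0)
y = pd.read_csv("Data/labels.csv", index_col=0)["tnbc"]

# Standardise each gene to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X)

# L1-penalised logistic regression drives most coefficients to zero;
# the genes with non-zero coefficients form the LASSO-selected feature set.
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
selector = SelectFromModel(lasso).fit(X_scaled, y)
selected_genes = X.columns[selector.get_support()]
print(f"{len(selected_genes)} genes selected")
```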
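For the Model development and Model evaluation steps, a minimal Random Forest sketch that continues from the feature selection sketch above. The split ratio and hyperparameters here are illustrative, not the project's actual settings:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score

# Restrict the (unscaled) expression matrix to the LASSO-selected genes;
# Random Forests are insensitive to feature scaling.
X_sel = X[selected_genes]

X_train, X_test, y_train, y_test = train_test_split(
    X_sel, y, test_size=0.2, stratify=y, random_state=42
)

rf = RandomForestClassifier(n_estimators=500, random_state=42)
rf.fit(X_train, y_train)

y_pred = rf.predict(X_test)
print(f"Precision: {precision_score(y_test, y_pred):.2%}")
print(f"Recall:    {recall_score(y_test, y_pred):.2%}")
print(f"F1-score:  {f1_score(y_test, y_pred):.2%}")
```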
The whole process and all results are documented in the final report included in this repository. In summary, the best-performing model in our tests was a Random Forest using features selected via LASSO: it achieved a precision of 75.86%, a recall of 95.65%, and an F1-score of 84.62%. Additional explainability was achieved by applying LIME and SHAP; the SHAP part is sketched below.
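A minimal sketch of the SHAP side of the explainability analysis (the exact plots in the notebooks may differ), assuming the `shap` package plus the `rf` model and `X_test` split from the Random Forest sketch above:

```python
import shap

# TreeExplainer computes SHAP values efficiently for tree ensembles
# such as the Random Forest trained above.
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_test)

# Depending on the shap version, a binary classifier yields a list of
# per-class arrays or a single 3-D array; take the positive (TNBC) class.
vals = shap_values[1] if isinstance(shap_values, list) else shap_values[..., 1]

# Rank genes by their mean absolute contribution to the TNBC prediction.
shap.summary_plot(vals, X_test)
```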