This project uses machine learning models to estimate a patient's chance of intubation in the Intensive Care Unit (ICU), based on the MIMIC-IV (Medical Information Mart for Intensive Care IV) dataset. Before any model is trained, an exploratory data analysis is carried out to gain insight into the dataset. Data preprocessing and feature selection are then applied to make the dataset more usable. Finally, three predictive models are trained: Logistic Regression, Decision Tree and Random Forest.
The complete code of the project can be found in ./code/ICU_Intubation_Prediction.ipynb.
The environment needed to run the code can be set up from ./requirements.txt.
The MIMIC-IV (Medical Information Mart for Intensive Care IV) database is a vast repository of de-identified health-related data from more than 40,000 patients who were admitted to critical care units at the Beth Israel Deaconess Medical Center. The database is available free of charge and covers a broad range of information, including demographics, vital sign measurements taken at the bedside (approximately one data point per hour), laboratory test results, procedures, medications, caregiver notes, imaging reports, and mortality both inside and outside the hospital. The latest complete version of the MIMIC-IV dataset can be found here.
For this project, we use a pre-extracted dataset from MIMIC-IV. It contains 36,489 patient healthcare records with 60 indicators. The pre-extracted dataset can be found at ./dataset/MIMIC_IV.csv.
In this part, I take several steps to inspect the basic properties of the dataset and apply several data preprocessing methods.
These steps include:
- Check which features are included in the dataset.
- Check the ratio of the `True` label to the `False` label. (This step shows that the dataset is imbalanced, so oversampling should be applied to keep the models from simply favoring the majority class.)
- Check the ratio of missing values in each column. (Columns with over `80%` missing values are removed from the dataset, since they cannot provide much information to the models.)
- Calculate the correlation between each pair of features and draw the correlation heatmap. (Columns with a correlation over `95%` are removed to reduce the number of features.)
- Use visualizations to check the distribution of each feature. (Most of the features are not normally distributed, so it is better to replace missing values with the median than with the mean.)
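The two filtering steps above can be sketched as follows. This is a minimal, dependency-free illustration and not the notebook's code: the column names, the toy values and the helper functions `missing_ratio`, `pearson` and `filter_columns` are all hypothetical, and the choice to keep the first column of a correlated pair is an assumption.

```python
from math import sqrt

def missing_ratio(column):
    """Fraction of entries in a column that are missing (None)."""
    return sum(v is None for v in column) / len(column)

def pearson(xs, ys):
    """Pearson correlation over the rows where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    vx = sum((x - mx) ** 2 for x, _ in pairs)
    vy = sum((y - my) ** 2 for _, y in pairs)
    return cov / sqrt(vx * vy)

def filter_columns(table, max_missing=0.80, max_corr=0.95):
    """Drop sparse columns, then drop the later column of each highly correlated pair."""
    kept = [name for name, col in table.items() if missing_ratio(col) <= max_missing]
    selected = []
    for name in kept:
        if all(abs(pearson(table[name], table[other])) <= max_corr for other in selected):
            selected.append(name)
    return selected

# Toy data with hypothetical column names; None marks a missing value.
table = {
    "heart_rate": [80, 95, 110, 72, 88, 101],
    "heart_rate_copy": [81, 96, 111, 73, 89, 102],    # perfectly correlated with heart_rate
    "lactate": [1.2, None, 3.4, 0.9, 2.8, 1.1],
    "rare_lab": [None, None, None, None, None, 5.0],  # 5/6 = 83% missing
}
selected_columns = filter_columns(table)
print(selected_columns)
```

Here `rare_lab` is dropped for being too sparse and `heart_rate_copy` for duplicating `heart_rate`, which leaves two columns.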
To further reduce the number of features used to train the predictive models, I applied a Genetic Algorithm (GA) for feature selection, separately for each predictive model. The most important GA parameters are: population_size = 30, generations = 40.
For each predictive model, I recorded the accuracy score over 20 runs before and after feature selection to compare performance. I also generated the confusion matrix of each model to check that the oversampling method works properly and that the model is not simply predicting the majority class.
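As a rough illustration of GA-based feature selection, the sketch below evolves boolean feature masks with the population size and generation count quoted above. Everything else is an assumption made for the example: the tournament selection, single-point crossover, mutation rate, elitism, and especially `toy_fitness`, which stands in for "accuracy of a model trained on the selected features".

```python
import random

def genetic_feature_selection(n_features, fitness, population_size=30,
                              generations=40, mutation_rate=0.05, seed=0):
    """Evolve boolean feature masks and return the fittest mask found."""
    rng = random.Random(seed)
    population = [[rng.random() < 0.5 for _ in range(n_features)]
                  for _ in range(population_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        next_population = [best[:]]  # elitism: carry the best mask forward
        while len(next_population) < population_size:
            # Tournament selection: the fitter of three random masks becomes a parent.
            parent_a = max(rng.sample(population, 3), key=fitness)
            parent_b = max(rng.sample(population, 3), key=fitness)
            # Single-point crossover followed by bit-flip mutation.
            cut = rng.randrange(1, n_features)
            child = parent_a[:cut] + parent_b[cut:]
            child = [(not bit) if rng.random() < mutation_rate else bit for bit in child]
            next_population.append(child)
        population = next_population
        best = max(population, key=fitness)
    return best

# Hypothetical fitness: features 0-4 are informative, every selected feature costs a little.
INFORMATIVE = {0, 1, 2, 3, 4}

def toy_fitness(mask):
    gain = sum(1.0 for i, on in enumerate(mask) if on and i in INFORMATIVE)
    cost = 0.1 * sum(mask)
    return gain - cost

best_mask = genetic_feature_selection(n_features=20, fitness=toy_fitness)
selected = [i for i, on in enumerate(best_mask) if on]
print(selected, round(toy_fitness(best_mask), 2))
```

In the project, the fitness function would train and score the actual model on the masked columns, which is why the selection is run once per model.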
In this part, we train three predictive models for the prediction:
- Logistic Regression
- Decision Tree
- Random Forest
For each model, I fill the missing values with the mean and with the median, which gives two datasets, and train the model on each of them. Because the EDA showed that the data is imbalanced, I also applied SMOTE (Synthetic Minority Oversampling Technique) to oversample the data and balance the number of True and False labels.
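The core idea of SMOTE is to create new minority-class points by interpolating between an existing minority point and one of its nearest minority neighbours. The sketch below shows only that idea on toy data; the function `smote`, its parameters and the data are hypothetical, and a real project would normally rely on a library implementation such as the one in imbalanced-learn.

```python
import random

def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (squared Euclidean distance).
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        # The new point lies at a random position on the segment base -> neighbour.
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Toy imbalanced data: 8 negative rows, 3 positive rows, two features each.
negatives = [[float(i), float(i % 3)] for i in range(8)]
positives = [[10.0, 10.0], [11.0, 10.5], [10.5, 12.0]]

new_points = smote(positives, n_new=len(negatives) - len(positives), k=2)
balanced_positives = positives + new_points
print(len(negatives), len(balanced_positives))  # both classes now have 8 rows
```

Because every synthetic point lies between two real minority points, the new rows stay inside the region already occupied by the minority class.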
Because the dataset is imbalanced and I used SMOTE to oversample it, some data points do not exist in the original dataset. I therefore did not use K-Fold validation: if data points generated by SMOTE end up in the testing split, that split can no longer test the accuracy of the model on real records. Instead, I trained and tested each model 20 times before and after feature selection and used the mean accuracy over the 20 runs to represent the overall accuracy of the model. I also calculated the standard deviation of the 20 accuracy scores to compare the stability of each model before and after feature selection.
After training all the models separately and recording the accuracy score of each model over 20 runs, I drew visualizations to compare the performance of all the predictive models.
The visualizations include:
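The repeated evaluation can be sketched as below: the model is trained and tested on 20 different random splits, and the mean and standard deviation of the 20 accuracy scores are reported. The threshold classifier, the toy data, the 25% test fraction and the use of the sample standard deviation are assumptions chosen to keep the example self-contained; they are not taken from the notebook.

```python
import random
import statistics

def train_threshold(rows):
    """Toy 1-D classifier: threshold halfway between the two class means."""
    pos = [x for x, y in rows if y]
    neg = [x for x, y in rows if not y]
    return (statistics.mean(pos) + statistics.mean(neg)) / 2

def accuracy(threshold, rows):
    """Share of rows whose label matches the prediction `x > threshold`."""
    return sum((x > threshold) == y for x, y in rows) / len(rows)

def repeated_holdout(rows, runs=20, test_fraction=0.25, seed=0):
    """Train and test on `runs` different random splits; return one accuracy per run."""
    rng = random.Random(seed)
    scores = []
    for _ in range(runs):
        shuffled = rows[:]
        rng.shuffle(shuffled)
        n_test = int(len(shuffled) * test_fraction)
        test, train = shuffled[:n_test], shuffled[n_test:]
        scores.append(accuracy(train_threshold(train), test))
    return scores

# Hypothetical toy data: one feature, label True when the patient was intubated.
data_rng = random.Random(42)
rows = [(data_rng.gauss(0.0, 1.0), False) for _ in range(100)]
rows += [(data_rng.gauss(2.0, 1.0), True) for _ in range(100)]

scores = repeated_holdout(rows)
mean_accuracy = statistics.mean(scores)
std_accuracy = statistics.stdev(scores)
print(f"mean={mean_accuracy:.2f} std={std_accuracy:.2f}")
```

A low standard deviation across the runs indicates that the model's accuracy does not depend much on which rows happen to fall into the test split.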
- The overall accuracy score of all the models.
- The mean accuracy score comparison of all the models.
- The standard deviation of accuracy score of all the models.
Based on the overall and mean accuracy scores of all the models, we can determine which model is best for predicting a patient's chance of intubation in the ICU. With the standard deviation of the accuracy scores, we can analyze which model is more stable when dealing with different data.
Mean accuracy of the models with missing values filled by the mean:
| Model | Mean Accuracy Before Feature Selection | Mean Accuracy After Feature Selection |
|---|---|---|
| Logistic Regression | 0.76 | 0.77 |
| Decision Tree | 0.64 | 0.58 |
| Random Forest | 0.72 | 0.72 |
Mean accuracy of the models with missing values filled by the median:
| Model | Mean Accuracy Before Feature Selection | Mean Accuracy After Feature Selection |
|---|---|---|
| Logistic Regression | 0.76 | 0.77 |
| Decision Tree | 0.92 | 0.90 |
| Random Forest | 0.96 | 0.96 |
Standard deviation of accuracy for the models with missing values filled by the mean:
| Model | Standard Deviation Before Feature Selection | Standard Deviation After Feature Selection |
|---|---|---|
| Logistic Regression | 0.01 | 0.00 |
| Decision Tree | 0.21 | 0.29 |
| Random Forest | 0.20 | 0.29 |
Standard deviation of accuracy for the models with missing values filled by the median:
| Model | Standard Deviation Before Feature Selection | Standard Deviation After Feature Selection |
|---|---|---|
| Logistic Regression | 0.01 | 0.01 |
| Decision Tree | 0.01 | 0.03 |
| Random Forest | 0.00 | 0.01 |
(The visualization results can be found in ./results or in ./code/ICU_Intubation_Prediction.ipynb)
Based on the mean accuracy and standard deviation of the predictive models, the best model for predicting a patient's chance of intubation in the ICU is the random forest trained on the dataset with missing values filled by the median. Over its 20 test runs, the mean accuracy is 0.96 and the standard deviation is 0.01. It achieves this performance with only 25 features after applying the genetic algorithm for feature selection.
Comparing the accuracy of all the models, the random forest achieves the best performance when missing values are filled with the median, followed by the decision tree and then logistic regression. (With mean-filled data the order differs: logistic regression has the highest mean accuracy, 0.76 to 0.77, while the tree-based models score lower and are unstable.) The main reason logistic regression does not exceed 80% accuracy might be that the problem is not linearly separable, which makes it difficult to find a good linear decision boundary between the data points. Tree-based models are more suitable for this problem because they split the feature space step by step, using information entropy to choose each split, and can therefore represent non-linear boundaries. In addition, thanks to ensemble learning, the random forest achieves better stability than a single decision tree.
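To make the entropy argument concrete, the sketch below computes the information gain of one candidate split on a toy feature whose positive class sits in the middle of the value range. The data and the helper names `entropy` and `information_gain` are hypothetical; the point is only that a tree can chain several such splits where a single linear boundary cannot separate the classes.

```python
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    probs = [labels.count(c) / total for c in set(labels)]
    return -sum(p * log2(p) for p in probs)

def information_gain(values, labels, threshold):
    """Entropy reduction from splitting the rows on `value <= threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    weighted = sum(len(part) / len(labels) * entropy(part)
                   for part in (left, right) if part)
    return entropy(labels) - weighted

# Positive labels occupy the middle of the range, so one threshold cannot separate them.
values = [1, 2, 3, 4, 5, 6, 7, 8]
labels = [0, 0, 1, 1, 1, 1, 0, 0]

gain_first_split = information_gain(values, labels, threshold=2)
print(round(entropy(labels), 3), round(gain_first_split, 3))
```

After the first split at 2, a second split at 6 on the remaining rows isolates the positive class completely, which is the kind of stepwise boundary a decision tree builds.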
One phenomenon we noticed is that the accuracy of the tree-based models is extremely unstable when missing values are filled with the mean (standard deviation between 0.20 and 0.29, against at most 0.03 with the median). This is not surprising: in the EDA part we saw that most of the features are not normally distributed, so the mean does not represent a typical value well in this case. Models trained on the median-filled data are therefore expected to be more stable, which matches the results.
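A small example shows why the choice of fill value matters on skewed data. The column below is hypothetical: most readings are small and one is extreme, so the mean is pulled far away from a typical value while the median is not.

```python
import statistics

def impute(column, strategy):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in column if v is not None]
    if strategy == "mean":
        fill = statistics.mean(observed)
    else:
        fill = statistics.median(observed)
    return [fill if v is None else v for v in column]

# Hypothetical right-skewed lab value with two missing readings.
column = [1.0, 1.2, 0.9, 1.1, None, 1.3, 25.0, None]

mean_filled = impute(column, "mean")
median_filled = impute(column, "median")
print(mean_filled[4], median_filled[4])
```

The mean-filled rows receive a value of about 5.08, which no typical patient in this toy column has, whereas the median fill of about 1.15 sits among the common readings.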
For feature selection, logistic regression performs slightly better in both accuracy and stability after the genetic algorithm is applied, while the tree-based models may lose a little stability but still reach a good mean accuracy. Although feature selection may slightly affect the stability of the tree-based models, the number of features needed for training is roughly halved while the mean accuracy stays almost the same. If computing resources are limited, feature selection can be used to reduce the number of features. If computing resources are sufficient and the goal is a model with good accuracy and stability, feature selection is not necessary.


