This project involves the processing and modeling of ICU data to predict death or ICU admission based on various physiological measurements and other time series based features. The pipeline involves multiple steps including data extraction, transformation, feature engineering, and model training to achieve reliable predictions.
An API deployment of the models trained here can be found at https://github.com/ATayls/DEWS_fastapi
- Installation
- Data Pipeline
- Configuration
- Usage
- Results
- Plots
- Export
- Files and Directories
- [Citation](#citation)
To run this project, you need Python installed along with the project dependencies, which can be installed with either pip or Poetry.
To install dependencies using pip with a `requirements.txt` file, follow these steps:

1. **Ensure you have Python and pip installed.** You can check by running:

   ```
   python --version
   pip --version
   ```

2. **Install dependencies.** Navigate to the project directory where the `requirements.txt` file is located and run:

   ```
   pip install -r requirements.txt
   ```

   This command will install all the dependencies listed in the `requirements.txt` file.
To install dependencies using Poetry with a `pyproject.toml` file, follow these steps:

1. **Install Poetry.** If you haven't already installed Poetry, you can do so by running:

   ```
   curl -sSL https://install.python-poetry.org | python3 -
   ```

2. **Navigate to your project directory.** Ensure you are in the directory where your `pyproject.toml` file is located.

3. **Install dependencies.** Run the following command to install all dependencies specified in the `pyproject.toml` file:

   ```
   poetry install
   ```

4. **Activate the virtual environment (optional).** To work within the virtual environment managed by Poetry, run:

   ```
   poetry shell
   ```

   This will create and activate a virtual environment with all the dependencies installed.
The ETL (Extract, Transform, Load) process handles loading, preprocessing, and feature engineering of the data. The ETL function:
- Loads the data from the specified filename.
- Applies preprocessing steps.
- Creates additional features.
- Saves or loads the processed dataset.
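The steps above can be sketched as a single cached ETL function. This is an illustrative outline, not the repo's actual code; the function name, cache path, and preprocessing steps are assumptions.

```python
from pathlib import Path

import pandas as pd


def etl(filename: str, processed_dir: str = "processed") -> pd.DataFrame:
    """Illustrative ETL sketch: load raw data, preprocess, and cache the result."""
    cache = Path(processed_dir) / f"{Path(filename).stem}_processed.csv"
    if cache.exists():
        # Load the previously processed dataset instead of recomputing it.
        return pd.read_csv(cache)
    df = pd.read_csv(filename)                       # extract
    df = df.dropna(subset=["patient_id"])            # minimal preprocessing example
    df = df.sort_values(["patient_id", "timestamp"]) # order per-patient time series
    # ... feature engineering would be applied here ...
    cache.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(cache, index=False)                    # save the processed dataset
    return df
```

Caching the processed dataset avoids repeating the expensive feature-engineering step on every run.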
The feature engineering process involves creating time series features, calculating rolling averages, standard deviations, and slopes. It includes functions like:
- `create_time_delta`: Creates a time variable.
- `create_diff`: Calculates differences from previous values.
- `create_rolling`: Calculates rolling averages and standard deviations.
- `create_expanding`: Calculates expanding averages and standard deviations.
- `create_ts_base_features`: Combines multiple time series features.
- `create_slopes_cached`: Calculates slopes for variables.
The model training process involves training logistic regression models using cross-validation and bootstrapping techniques. It includes:
- `run_lr_train`: Trains a logistic regression model.
- `train_logistic_model_cv`: Performs cross-validation.
- `train_logistic_model_bootstrapped`: Uses bootstrapping for model training.
- `train_logistic_model_CV_grouped`: Cross-validation with non-overlapping patient groups.
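The grouped cross-validation idea can be sketched with scikit-learn's `GroupKFold`, which guarantees that rows from one patient never appear in both the train and test folds. This is a minimal sketch under the assumption that the project uses scikit-learn; the function name and signature are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score


def train_logistic_cv_grouped(X, y, groups, n_splits=5):
    """Sketch of grouped CV: patient rows never span train and test folds."""
    model = LogisticRegression(max_iter=1000)
    cv = GroupKFold(n_splits=n_splits)
    # Score each held-out fold with AUROC, splitting by patient group
    scores = cross_val_score(model, X, y, groups=groups, cv=cv, scoring="roc_auc")
    return model.fit(X, y), scores
```

Grouping by patient matters because repeated measurements from the same patient are correlated; ordinary K-fold CV would leak that correlation and inflate the reported AUROC.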
Configuration settings are handled in the settings.py file, including directory paths for data, processed data, saved results, plots, and models.
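A typical layout for such a settings module might look like the following. The `SAVED_RESULTS_DIR` and `PLOTS_DIR` names appear elsewhere in this README; the remaining names and the directory layout are assumptions for illustration:

```python
# settings.py — illustrative layout; the actual values live in the repo's settings.py
from pathlib import Path

BASE_DIR = Path(__file__).resolve().parent
DATA_DIR = BASE_DIR / "data"
PROCESSED_DATA_DIR = DATA_DIR / "processed"
SAVED_RESULTS_DIR = BASE_DIR / "results"
PLOTS_DIR = BASE_DIR / "plots"
MODELS_DIR = BASE_DIR / "models"
```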
To run the main experiment, execute the run.py script:

```
python run.py
```

This script will perform the following steps:
- Load and preprocess the training and testing data.
- Perform feature engineering on the data.
- Train logistic regression models using both cross-validation and bootstrapping.
- Evaluate the models on the test set.
- Save the results, models, and plots.
The results of the model training and evaluation are saved in CSV format in the SAVED_RESULTS_DIR directory. The results include metrics such as AUROC and AUPRC along with confidence intervals.
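One common way to attach a confidence interval to AUROC is a percentile bootstrap over the test set, sketched below. This is an illustration of the general technique, not the repo's exact implementation:

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def bootstrap_auroc_ci(y_true, y_score, n_boot=1000, alpha=0.05, seed=0):
    """Illustrative percentile-bootstrap confidence interval for AUROC."""
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))  # resample with replacement
        if len(np.unique(y_true[idx])) < 2:
            continue  # AUROC is undefined when a resample has a single class
        stats.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return roc_auc_score(y_true, y_score), (lo, hi)
```

The same pattern applies to AUPRC by swapping in `average_precision_score`.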
The script generates several plots to visualize the model performance and feature importance:
- ROC and PR curves for cross-validation and test sets.
- Permutation importance plots.
- SHAP value summaries.
These plots are saved in the PLOTS_DIR directory.
The processed data, model predictions, and metrics can be exported to Excel files for further analysis. This is handled by the export_as_excel.py module.
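The core of such an export is pandas' `ExcelWriter`, which writes several DataFrames to one workbook, one sheet each. The function name and dict-of-tables API below are assumptions, not the actual interface of `export_as_excel.py`:

```python
import pandas as pd


def export_as_excel(tables: dict, path: str) -> None:
    """Sketch: write each DataFrame in `tables` to its own sheet of one workbook."""
    with pd.ExcelWriter(path) as writer:
        for sheet_name, df in tables.items():
            df.to_excel(writer, sheet_name=sheet_name, index=False)
```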
- `run.py`: Main script to run the experiment.
- `preprocessing.py`: Contains preprocessing functions.
- `feature_engineering.py`: Contains feature engineering functions.
- `train.py`: Contains model training functions.
- `settings.py`: Configuration settings.
- `plots.py`: Functions to generate plots.
- `export_as_excel.py`: Functions to export results to Excel.
- `evaluation.py`: Utilities for model evaluation.
- `utils.py`: General utilities.
If you use this repository, please cite it as follows:
Taylor, A. (2022). ICU Data Analysis [Source code]. GitHub. https://github.com/ATayls/ICU_data_analysis
This repository is part of the work published in Respiratory Research:
Gonem, S., Taylor, A., et al. (2022). Dynamic early warning scores for predicting clinical deterioration in patients with respiratory disease. Respiratory Research, 23(1), Article 130. https://doi.org/10.1186/s12931-022-02130-6
This project is licensed under the MIT License. See the LICENSE file for more details.