End-to-end template machine learning workflow orchestrated with Jupyter notebooks and Papermill. The project demonstrates how to modularise a typical Kaggle-style pipeline (Titanic survival prediction) into reusable notebooks that can be parameterised and executed programmatically. The workflow supports switching between Pandas, Modin (Ray), and Dask dataframe engines, and uses Optuna for model hyperparameter optimisation.
notebookml/ # Shared Python utilities (backend abstraction)
notebooks/ # Parameterised workflow notebooks
01_data_preparation.ipynb
02_feature_engineering.ipynb
03_model_building.ipynb
04_model_evaluation.ipynb
05_orchestrator.ipynb # Runs the entire pipeline via Papermill
requirements.txt # Python dependencies
The pipeline stages persist their outputs in the data/ and models/ folders. These directories are ignored by default so that artefacts generated during execution are not tracked by Git.
-
Create a virtual environment (recommended):
python -m venv .venv source .venv/bin/activate -
Install dependencies:
pip install -r requirements.txt
-
Run the orchestrator to execute the full workflow with Papermill:
papermill notebooks/05_orchestrator.ipynb notebooks/runs/latest/orchestrator-output.ipynb \ -p engine pandas \ -p modin_engine ray \ -p dataset_url https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv \ -p n_trials 20
The orchestrator will sequentially execute all stage notebooks using the provided parameters. Outputs include:
- Cleaned dataset (
data/processed.csv) - Feature engineered train/test splits (
data/train_features.csv,data/test_features.csv) - Optimised model artefacts (
models/random_forest.pkl,models/optuna_trials.csv,models/best_params.json) - Evaluation metrics (
models/metrics.json)
- Cleaned dataset (
-
Customise the execution backend by changing the
engineparameter to one ofpandas,modin, ordask. Additional parameters (e.g.test_size,random_state,n_trials) can also be provided when calling Papermill.
| Notebook | Purpose | Key technologies |
|---|---|---|
01_data_preparation.ipynb |
Downloads and cleans the Titanic dataset, deriving helper features and summaries. | BackendManager, Pandas/Modin/Dask |
02_feature_engineering.ipynb |
Generates model-ready features, train/test splits, and metadata. | Pandas, scikit-learn |
03_model_building.ipynb |
Tunes a RandomForestClassifier with Optuna and persists the trained model. | Optuna, scikit-learn |
04_model_evaluation.ipynb |
Computes standard classification metrics on the held-out test set. | scikit-learn |
05_orchestrator.ipynb |
Orchestrates the pipeline using Papermill. | Papermill |
- Add new feature engineering transformations in the second notebook and expose new parameters through Papermill.
- Swap the model in
03_model_building.ipynbfor alternative estimators or additional Optuna search spaces. - Integrate experiment tracking, model registries, or deployment notebooks by extending the orchestrator sequence.
This project is released under the terms of the MIT License. See LICENSE for details.