Datalad Diabetes Project

This README provides a step-by-step guide for setting up and running a Datalad project focused on a diabetes dataset. You'll learn how to manage data and code using Datalad, run scripts, and track changes in a reproducible manner.

Prerequisites

Install Datalad and ensure it's available in your PATH.
Have Python installed to run the scripts.
Install the tree command for directory visualization (optional).

Verify your Datalad installation:

datalad

Check Datalad version:

datalad --version

Setting Up the Project

Create a new directory for the project and initialize it as a Datalad dataset with the yoda configuration procedure, which applies the YODA (YODAs Organigram on Data Analysis) data management principles:

mkdir diabetes
cd diabetes
datalad create -c yoda .

View the directory structure:

tree

Initializing the Project Structure

Create the necessary directories:

mkdir -p code data/raw data/processed/train data/processed/test results params

Configure Git attributes so that files under results/ and params/ are stored directly in Git rather than in git-annex:

echo 'results/**/* annex.largefiles=nothing' >> .gitattributes
echo 'params/**/* annex.largefiles=nothing' >> .gitattributes

Save the changes to .gitattributes:

datalad save -m "Initialize project structure and edit .gitattributes" .gitattributes

View the updated directory structure:

tree

Downloading Code Files

Download the necessary Python scripts into the code directory:

datalad download-url -O code/ https://raw.githubusercontent.com/bhanuprasanna2001/datalad-demo/master/code/download_data.py -m "Downloaded download_data.py"
datalad download-url -O code/ https://raw.githubusercontent.com/bhanuprasanna2001/datalad-demo/master/code/evaluate.py -m "Downloaded evaluate.py"
datalad download-url -O code/ https://raw.githubusercontent.com/bhanuprasanna2001/datalad-demo/master/code/process_data.py -m "Downloaded process_data.py"
datalad download-url -O code/ https://raw.githubusercontent.com/bhanuprasanna2001/datalad-demo/master/code/train.py -m "Downloaded train.py"

View the directory structure with the downloaded code:

tree

Check the Git commit log:

git log

Adding Parameter Configurations

Create a config.json file with model parameters:

echo '{
    "model": "DecisionTree",
    "parameters": {
        "criterion": "gini",
        "splitter": "best",
        "max_depth": 7,
        "min_samples_split": 2,
        "min_samples_leaf": 1,
        "random_state": 42
    }
}' > params/config.json
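The downloaded scripts are not reproduced in this README, so how train.py consumes this file is not shown here. As a minimal sketch, assuming a script reads the file with Python's standard json module, loading and sanity-checking the configuration might look like this. load_config and REQUIRED_KEYS are hypothetical names used for illustration, not part of the repository.

```python
import json
from pathlib import Path

# Hypothetical check: the top-level keys of the config.json shown above.
REQUIRED_KEYS = {"model", "parameters"}


def load_config(path="params/config.json"):
    """Read the JSON config and verify the expected top-level keys exist."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = REQUIRED_KEYS - set(config)
    if missing:
        raise KeyError(f"config is missing keys: {sorted(missing)}")
    return config
```

Failing early on a missing key keeps a malformed config from producing a recorded but meaningless run.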

Running the Data Download Script

Attempt to run the data download script:

datalad run -m "Run Download Data Script. Add Raw Diabetes Dataset." \
    --output "data/raw/diabetes_raw.csv" \
    "python code/download_data.py"

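Note: This command fails because the dataset contains unsaved changes (params/config.json is still untracked). By default, datalad run requires a clean dataset so that it can attribute every change to the command it executes. Use datalad status to list unsaved changes.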

Save the config.json file to track it with Git:

datalad save -m "Add parameter configurations" params/config.json

Now, run the data download script again:

datalad run -m "Run Download Data Script. Add Raw Diabetes Dataset." \
    --output "data/raw/diabetes_raw.csv" \
    "python code/download_data.py"

View the directory structure:

tree

Processing the Data

Process the raw data into training and test sets:

datalad run -m "Process raw data into training and test sets" \
    --input data/raw/diabetes_raw.csv \
    --output data/processed/train/diabetes_train.csv \
    --output data/processed/test/diabetes_test.csv \
    "python code/process_data.py"

View the directory structure:

tree
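What process_data.py does internally is not shown in this README. For a recorded run to be worth re-executing, the split should be deterministic. The following is a rough sketch of a seeded train/test split using only the standard library. split_rows, the 0.2 test fraction, and the seed value are assumptions for illustration and may differ from what the actual script does.

```python
import random


def split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle rows with a fixed seed and split them into (train, test)."""
    shuffled = list(rows)  # copy so the caller's data is left untouched
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]
```

With a fixed seed, calling the function twice on the same input yields the same split, so repeated runs write identical train and test files.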

Check the Git commit log:

git log

Training and Evaluating Models

Set an experiment name:

EXPERIMENT_NAME="decision_tree"
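The experiment name appears in every output path used by the commands in this section (results/${EXPERIMENT_NAME}/...). The scripts' internals are not shown here. As a rough sketch, assuming train.py and evaluate.py derive their output locations from the name passed on the command line, the mapping could look like the following. experiment_paths is a hypothetical helper, not a function from the repository; the file names are taken from the --output options in this README.

```python
from pathlib import Path


def experiment_paths(experiment_name, results_dir="results"):
    """Return the output paths for one experiment, mirroring the --output options."""
    base = Path(results_dir) / experiment_name
    return {
        "model": base / "model.joblib",
        "metrics": base / "metrics.json",
        "predictions": base / "predictions.csv",
        "roc_curve": base / "roc_curve.png",
    }


if __name__ == "__main__":
    # Print the layout for the experiment name used in this README.
    for name, path in experiment_paths("decision_tree").items():
        print(f"{name}: {path.as_posix()}")
```

Keeping each experiment under its own directory means a second experiment name would not overwrite the first one's outputs.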

Train the Decision Tree model:

datalad run -m "Train Decision Tree model for ${EXPERIMENT_NAME}" \
    --input data/processed/train/diabetes_train.csv \
    --input params/config.json \
    --output results/${EXPERIMENT_NAME}/model.joblib \
    "python code/train.py ${EXPERIMENT_NAME}"

Evaluate the model and generate metrics and plots:

datalad run -m "Evaluate model for ${EXPERIMENT_NAME} with ROC curve plot" \
    --input data/processed/test/diabetes_test.csv \
    --input results/${EXPERIMENT_NAME}/model.joblib \
    --output results/${EXPERIMENT_NAME}/metrics.json \
    --output results/${EXPERIMENT_NAME}/predictions.csv \
    --output results/${EXPERIMENT_NAME}/roc_curve.png \
    "python code/evaluate.py ${EXPERIMENT_NAME}"

Modifying Parameters and Re-running

Change the max_depth parameter in config.json:

echo '{
    "model": "DecisionTree",
    "parameters": {
        "criterion": "gini",
        "splitter": "best",
        "max_depth": 5,
        "min_samples_split": 2,
        "min_samples_leaf": 1,
        "random_state": 42
    }
}' > params/config.json

View the directory structure:

tree

Save the changes to config.json:

datalad save -m "Change max_depth in parameters ${EXPERIMENT_NAME}" params/config.json

Re-train the model with the new parameters:

datalad run -m "Train Decision Tree model for ${EXPERIMENT_NAME}" \
    --input data/processed/train/diabetes_train.csv \
    --input params/config.json \
    --output results/${EXPERIMENT_NAME}/model.joblib \
    "python code/train.py ${EXPERIMENT_NAME}"

Re-evaluate the model:

datalad run -m "Evaluate model for ${EXPERIMENT_NAME} with ROC curve plot" \
    --input data/processed/test/diabetes_test.csv \
    --input results/${EXPERIMENT_NAME}/model.joblib \
    --output results/${EXPERIMENT_NAME}/metrics.json \
    --output results/${EXPERIMENT_NAME}/predictions.csv \
    --output results/${EXPERIMENT_NAME}/roc_curve.png \
    "python code/evaluate.py ${EXPERIMENT_NAME}"

Comparing Results

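Compare the metrics and parameters between the two experiments. The steps above commit everything to a single branch, so branch_1 and branch_2 below are placeholders: replace them with the commit hashes (from git log) of the first and second evaluation runs, or with branch or tag names if you created them:

git diff branch_1 branch_2 -- results/${EXPERIMENT_NAME}/metrics.json params/config.json

This command shows how the model metrics and parameter configuration differ between the two experiments.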
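A textual diff can be hard to read when several metrics change at once. The following is a minimal sketch of a numeric comparison, assuming metrics.json is a flat JSON object mapping metric names to numbers (its actual layout is not shown in this README). read_json_at and metric_deltas are hypothetical helpers. Reading with git show only works for files stored directly in Git, which is what the .gitattributes rules for results/ and params/ arrange.

```python
import json
import subprocess


def read_json_at(ref, path):
    """Return the parsed JSON content of `path` as committed at Git ref `ref`."""
    # `git show <ref>:<path>` prints the file as stored in that commit.
    text = subprocess.run(
        ["git", "show", f"{ref}:{path}"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(text)


def metric_deltas(old, new):
    """Return new - old for every numeric metric present in both dicts."""
    deltas = {}
    for key in sorted(old.keys() & new.keys()):
        a, b = old[key], new[key]
        if isinstance(a, (int, float)) and isinstance(b, (int, float)):
            deltas[key] = b - a
    return deltas
```

To use it, call read_json_at twice with the same path and the two refs you would pass to git diff, then pass both results to metric_deltas.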

Follow these steps to replicate the project and see how Datalad helps manage data and code reproducibly. The commands are listed one at a time so they can be run step by step in a code-along demonstration.
