This README provides a step-by-step guide for setting up and running a Datalad project focused on a diabetes dataset. You'll learn how to manage data and code using Datalad, run scripts, and track changes in a reproducible manner.
- Install Datalad and ensure it's available in your PATH.
- Have Python installed to run the scripts.
- Install the tree command for directory visualization (optional).
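
To confirm the prerequisites before starting, a short Python check can report which of the tools are visible on your PATH. This is only a sketch: the helper name find_tools is illustrative and not part of Datalad, and on some systems the interpreter is installed as python3 rather than python.

```python
import shutil


def find_tools(names):
    """Map each command name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    # "tree" is optional in this guide; "datalad" and "python" are required.
    for name, path in find_tools(["datalad", "python", "tree"]).items():
        print(f"{name}: {path if path else 'NOT FOUND on PATH'}")
```

Any tool reported as not found should be installed, or added to PATH, before continuing.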
Verify your Datalad installation:
datalad

Check Datalad version:
datalad --version

Create a new directory for the project and initialize it with Datalad, following the YODA (YODA's Organigram on Data Analysis) principles:
mkdir diabetes
cd diabetes
datalad create -c yoda .

View the directory structure:
tree

Create the necessary directories:
mkdir -p code data/raw data/processed/train data/processed/test results params

Configure Git attributes to track certain files with Git instead of git-annex:
echo 'results/**/* annex.largefiles=nothing' >> .gitattributes
echo 'params/**/* annex.largefiles=nothing' >> .gitattributes

Save the changes to .gitattributes:
datalad save -m "Initialize project structure and edit .gitattributes" .gitattributes

View the updated directory structure:
tree

Download the necessary Python scripts into the code directory:
datalad download-url -O code/ https://raw.githubusercontent.com/bhanuprasanna2001/datalad-demo/master/code/download_data.py -m "Downloaded download_data.py"
datalad download-url -O code/ https://raw.githubusercontent.com/bhanuprasanna2001/datalad-demo/master/code/evaluate.py -m "Downloaded evaluate.py"
datalad download-url -O code/ https://raw.githubusercontent.com/bhanuprasanna2001/datalad-demo/master/code/process_data.py -m "Downloaded process_data.py"
datalad download-url -O code/ https://raw.githubusercontent.com/bhanuprasanna2001/datalad-demo/master/code/train.py -m "Downloaded train.py"

View the directory structure with the downloaded code:
tree

Check the Git commit log:
git log

Create a config.json file with model parameters:
echo '{
"model": "DecisionTree",
"parameters": {
"criterion": "gini",
"splitter": "best",
"max_depth": 7,
"min_samples_split": 2,
"min_samples_leaf": 1,
"random_state": 42
}
}' > params/config.json

Attempt to run the data download script:
datalad run -m "Run Download Data Script. Add Raw Diabetes Dataset." \
--output "data/raw/diabetes_raw.csv" \
"python code/download_data.py"

Note: The above command is expected to fail. By default, datalad run refuses to execute while the dataset has unsaved changes, and params/config.json has not been saved yet.
Save the config.json file to track it with Git:
datalad save -m "Add parameter configurations" params/config.json

Now, run the data download script again:
datalad run -m "Run Download Data Script. Add Raw Diabetes Dataset." \
--output "data/raw/diabetes_raw.csv" \
"python code/download_data.py"

View the directory structure:
tree

Process the raw data into training and test sets:
datalad run -m "Process raw data into training and test sets" \
--input data/raw/diabetes_raw.csv \
--output data/processed/train/diabetes_train.csv \
--output data/processed/test/diabetes_test.csv \
"python code/process_data.py"

View the directory structure:
tree

Check the Git commit log:
git log

Set an experiment name:
EXPERIMENT_NAME="decision_tree"

Train the Decision Tree model:
datalad run -m "Train Decision Tree model for ${EXPERIMENT_NAME}" \
--input data/processed/train/diabetes_train.csv \
--input params/config.json \
--output results/${EXPERIMENT_NAME}/model.joblib \
"python code/train.py ${EXPERIMENT_NAME}"

Evaluate the model and generate metrics and plots:
datalad run -m "Evaluate model for ${EXPERIMENT_NAME} with ROC curve plot" \
--input data/processed/test/diabetes_test.csv \
--input results/${EXPERIMENT_NAME}/model.joblib \
--output results/${EXPERIMENT_NAME}/metrics.json \
--output results/${EXPERIMENT_NAME}/predictions.csv \
--output results/${EXPERIMENT_NAME}/roc_curve.png \
"python code/evaluate.py ${EXPERIMENT_NAME}"

Change the max_depth parameter in config.json:
echo '{
"model": "DecisionTree",
"parameters": {
"criterion": "gini",
"splitter": "best",
"max_depth": 5,
"min_samples_split": 2,
"min_samples_leaf": 1,
"random_state": 42
}
}' > params/config.json

View the directory structure:
tree

Save the changes to config.json:
datalad save -m "Change max_depth in parameters ${EXPERIMENT_NAME}" params/config.json

Re-train the model with the new parameters:
datalad run -m "Train Decision Tree model for ${EXPERIMENT_NAME}" \
--input data/processed/train/diabetes_train.csv \
--input params/config.json \
--output results/${EXPERIMENT_NAME}/model.joblib \
"python code/train.py ${EXPERIMENT_NAME}"

Re-evaluate the model:
datalad run -m "Evaluate model for ${EXPERIMENT_NAME} with ROC curve plot" \
--input data/processed/test/diabetes_test.csv \
--input results/${EXPERIMENT_NAME}/model.joblib \
--output results/${EXPERIMENT_NAME}/metrics.json \
--output results/${EXPERIMENT_NAME}/predictions.csv \
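--output results/${EXPERIMENT_NAME}/roc_curve.png \
"python code/evaluate.py ${EXPERIMENT_NAME}"

Compare the metrics and parameters between the two experiments. Here branch_1 and branch_2 are placeholders for the two Git revisions (branch names, tags, or commit hashes from git log) that hold each experiment. The steps above commit both runs to the same branch, so substitute the corresponding commit hashes:
git diff branch_1 branch_2 -- results/${EXPERIMENT_NAME}/metrics.json params/config.json

This command shows the differences in the model metrics and parameter configurations between the two experiments.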
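
For a numeric view of the same comparison, the two versions of metrics.json can be loaded and subtracted. The sketch below is an illustration under assumptions: the helper names load_json_at and metric_deltas are hypothetical, the metric names and values in the example are placeholders rather than results from this project, and reading files with git show relies on results/ being stored in Git rather than git-annex, as configured in .gitattributes earlier.

```python
import json
import subprocess


def load_json_at(revision, path):
    """Read a JSON file as it existed at a Git revision, via `git show <rev>:<path>`."""
    text = subprocess.run(
        ["git", "show", f"{revision}:{path}"],
        capture_output=True, text=True, check=True,
    ).stdout
    return json.loads(text)


def metric_deltas(old, new):
    """Return {name: new - old} for numeric metrics present in both dicts."""
    deltas = {}
    for name in sorted(old.keys() & new.keys()):
        a, b = old[name], new[name]
        if isinstance(a, (int, float)) and isinstance(b, (int, float)):
            deltas[name] = b - a
    return deltas


# Placeholder values for illustration only; not output of evaluate.py.
old_metrics = {"accuracy": 0.5, "roc_auc": 0.75}
new_metrics = {"accuracy": 0.75, "roc_auc": 0.75}
print(metric_deltas(old_metrics, new_metrics))
```

Inside the dataset, you could pass the two commit hashes from git log to load_json_at together with the path results/decision_tree/metrics.json, then hand both results to metric_deltas.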
Follow these steps to replicate the project and see how Datalad helps manage data and code in a reproducible way. Each command is listed separately so you can run it step by step during a code-along demonstration.
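
The reproducibility described above rests on the record that datalad run stores in each commit message. The sketch below shows one way to pull that record out of a message, assuming the layout that datalad run typically writes: a JSON object between the two marker lines defined as START and END. The function name parse_run_record and the example message are illustrative, and the fields shown (cmd, inputs, outputs) are an assumed subset of the record.

```python
import json

# Marker lines assumed to surround the JSON run record in the commit message.
START = "=== Do not change lines below ==="
END = "^^^ Do not change lines above ^^^"


def parse_run_record(commit_message):
    """Return the JSON run record embedded in a commit message, or None if absent."""
    if START not in commit_message or END not in commit_message:
        return None
    body = commit_message.split(START, 1)[1].split(END, 1)[0]
    return json.loads(body)


# Hand-written example message for illustration; not copied from a real dataset.
example = "\n".join([
    "[DATALAD RUNCMD] Process raw data into training and test sets",
    "",
    START,
    "{",
    ' "cmd": "python code/process_data.py",',
    ' "inputs": ["data/raw/diabetes_raw.csv"],',
    ' "outputs": ["data/processed/train/diabetes_train.csv"]',
    "}",
    END,
])

record = parse_run_record(example)
print(record["cmd"])
print(record["inputs"])
```

The raw message of a commit can be obtained with git log -1 --format=%B followed by the commit hash. To re-execute a recorded command, Datalad provides datalad rerun.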