This document describes the data versioning strategy for the MLOps project.
For this learning project, we'll use a simplified data versioning approach:
- Training Data: Stored in the `data/` directory (not committed to git)
- Data Loading: Scripts in `src/data/` handle data loading
- Data Preprocessing: Preprocessing steps are versioned as code
- Data Validation: Basic validation to ensure data quality
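
As an illustration of what "basic validation" could look like, here is a minimal sketch using only the standard library. The function name `validate_rows` and the column names `feature` and `label` are hypothetical and not taken from the project; the actual checks in `src/data/` may differ.

```python
import csv
import io


def validate_rows(rows, required_columns):
    """Return a list of human-readable problems found in `rows`.

    `rows` is an iterable of dicts, e.g. the output of csv.DictReader.
    An empty list means no problems were detected.
    """
    problems = []
    count = 0
    for index, row in enumerate(rows):
        count += 1
        for column in required_columns:
            value = row.get(column)
            # Treat absent columns and blank strings as missing values.
            if value is None or str(value).strip() == "":
                problems.append(f"row {index}: missing value for '{column}'")
    if count == 0:
        problems.append("dataset is empty")
    return problems


# Usage with an in-memory CSV; a real script would open a file under data/.
sample = io.StringIO("feature,label\n1.0,0\n,1\n")
issues = validate_rows(csv.DictReader(sample), ["feature", "label"])
print(issues)  # → ["row 1: missing value for 'feature'"]
```

Returning a list of problems instead of raising on the first one lets a loading script report every issue in a single run.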
As the project evolves, we may add:
- DVC (Data Version Control) for data versioning
- Data validation schemas
- Data lineage tracking
- Automated data quality checks
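
Until a tool such as DVC is adopted, one lightweight way to notice that the data has changed is to record a content hash for each file. The sketch below is an assumption about how that might be done. It is not DVC and does not use DVC's API, and the names `file_sha256` and `build_manifest` are hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_sha256(path, chunk_size=65536):
    """Hash a file in chunks so large datasets do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir):
    """Map each file's path (relative to data_dir) to its SHA-256 hash."""
    root = Path(data_dir)
    return {
        path.relative_to(root).as_posix(): file_sha256(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


if __name__ == "__main__":
    # Demonstration on a temporary directory standing in for data/.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "train.csv").write_text("feature,label\n1.0,0\n")
        print(json.dumps(build_manifest(tmp), indent=2))
```

Because the manifest is small plain text, it could be committed to git alongside the code even though the data itself is not, which gives a simple record of which data a given commit was run against.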