DataGuard is an end-to-end data quality monitoring framework that automatically detects, measures, and alerts on data quality issues. It combines statistical checks, machine learning-based anomaly detection, time-series drift monitoring, and an interactive dashboard, all in a single, extensible Python package.
| Feature | Description |
|---|---|
| 6 Quality Dimensions | Completeness, Uniqueness, Validity, Consistency, Timeliness, Accuracy |
| 4 Anomaly Methods | IQR, Z-score, Modified Z-score (MAD), Isolation Forest |
| Drift Detection | Rolling window analysis across 30-day time series |
| Interactive Dashboard | 6-page Streamlit dashboard with drill-downs and live charts |
| Data Catalog | Auto-generated schema documentation with column profiles & quality flags |
| Alerting | Slack webhook + SMTP email + console alerts with configurable thresholds |
| Scheduling | Built-in continuous loop + cron-style for CI/CD integration |
| 72+ Unit Tests | Comprehensive test coverage across all modules |
# Clone
git clone https://github.com/Yash-Patil-1/dataguard.git
cd dataguard
# Install
pip install -r requirements.txt
# Generate sample data with intentionally injected quality issues
python src/data_generator.py
# Run quality checks
python src/validators.py
# Run anomaly detection
python src/detectors.py
# Launch the dashboard
streamlit run dashboard.py

Or use the all-in-one runner:
python run_pipeline.py
python run_pipeline.py --alert # with Slack/Email alerts
python run_pipeline.py --schedule # every hour

DataGuard/
├── src/
│   ├── config.py            # Configuration & data quality issue definitions
│   ├── data_generator.py    # Synthetic data generator with quality issues
│   ├── validators.py        # 6 modular quality checkers + pipeline
│   ├── detectors.py         # 4 anomaly/drift detection methods + pipeline
│   ├── data_catalog.py      # Auto-column profiling & data catalog
│   ├── alerts.py            # Slack/Email/Console alert dispatcher
│   └── utils.py             # QualityReport, scoring, reporting utilities
├── tests/
│   ├── test_validators.py   # 14 tests: completeness, uniqueness, validity, etc.
│   ├── test_detectors.py    # 9 tests: IQR, Z-score, IForest, drift
│   └── test_utils.py        # 8 tests: report, alerts, formatting
├── config/
│   ├── thresholds.yaml      # Configurable pass/fail thresholds
│   └── alerts.yaml          # Alert channel configuration
├── data/                    # Generated datasets (auto-created)
├── reports/                 # Quality reports (auto-created)
├── dashboard.py             # Streamlit interactive dashboard
├── run_pipeline.py          # CLI pipeline orchestrator
└── README.md                # You are here
Data Generation ──► Quality Checks ──► Anomaly Detection ──► Alerts
       │                   │                   │                │
       ▼                   ▼                   ▼                ▼
   CSV Files         QualityReport      Anomaly Scores     Slack/Email
   (30 days)         (JSON + TXT)        (per method)       + Console
       │                   │                   │
       ▼                   ▼                   ▼
┌──────────────────────────────────────────────────────┐
│            Streamlit Dashboard (6 pages)             │
│  Overview │ Data Quality │ Anomalies │ Drift │       │
│  Data Explorer │ Data Catalog                        │
└──────────────────────────────────────────────────────┘
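
The stage flow in the diagram can be sketched as a small ordered runner. This is only an illustration under stated assumptions, not the contents of `run_pipeline.py`: the stage names `validate` and `detect` come from the `--stages validate,detect` example in this README, while `generate`, `alert`, and every function name below are hypothetical placeholders.

```python
import argparse

# Hypothetical stage functions; each just records that it ran.
def generate(log):
    log.append("generate")

def validate(log):
    log.append("validate")

def detect(log):
    log.append("detect")

def alert(log):
    log.append("alert")

# Insertion order mirrors the diagram: generation, checks, detection, alerts.
STAGES = {"generate": generate, "validate": validate, "detect": detect, "alert": alert}

def parse_stages(value):
    """Turn 'validate,detect' into known stage names, sorted into pipeline order."""
    requested = [s.strip() for s in value.split(",") if s.strip()]
    unknown = [s for s in requested if s not in STAGES]
    if unknown:
        raise argparse.ArgumentTypeError(f"unknown stage(s): {', '.join(unknown)}")
    return [name for name in STAGES if name in requested]

def run(stage_names):
    """Run the selected stages in order and return the execution log."""
    log = []
    for name in stage_names:
        STAGES[name](log)
    return log

def main(argv=None):
    parser = argparse.ArgumentParser(description="Pipeline stage runner (sketch)")
    parser.add_argument("--stages", type=parse_stages, default=list(STAGES))
    args = parser.parse_args(argv)
    return run(args.stages)
```

Sorting the requested stages into pipeline order means `--stages detect,validate` still runs validation first, which seems the safer default when later stages read the output of earlier ones.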
Each dimension is implemented as a modular checker class in src/validators.py:
| Dimension | Checker | What It Detects |
|---|---|---|
| Completeness | `CompletenessChecker` | Null rates per column, empty strings |
| Uniqueness | `UniquenessChecker` | Exact row duplicates, key column duplicates |
| Validity | `ValidityChecker` | Email format (regex), future dates, numeric range violations |
| Consistency | `ConsistencyChecker` | Referential integrity (orphan IDs), amount calculation accuracy, city-state mapping |
| Timeliness | `TimelinessChecker` | Data freshness (hours since last update), recent data ratio |
| Accuracy | `AccuracyChecker` | Distribution shift vs ground truth (mean/std comparison) |
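
To make one dimension concrete, here is a minimal sketch of a completeness check: a per-column null rate compared against a limit, with a `_default` fallback like the one in `config/thresholds.yaml`. The function names and the row-of-dicts input are assumptions made for illustration; this is not the actual `CompletenessChecker` interface.

```python
def null_rates(rows, columns):
    """Fraction of rows where each column is None or an empty/whitespace string."""
    total = len(rows)
    rates = {}
    for col in columns:
        missing = sum(
            1 for row in rows
            if row.get(col) is None or str(row.get(col)).strip() == ""
        )
        rates[col] = missing / total if total else 0.0
    return rates

def check_completeness(rows, columns, max_null_rate):
    """Compare each column's null rate to its limit, falling back to '_default'."""
    results = {}
    for col, rate in null_rates(rows, columns).items():
        limit = max_null_rate.get(col, max_null_rate["_default"])
        results[col] = {"null_rate": rate, "limit": limit, "passed": rate <= limit}
    return results

# Tiny illustrative dataset (made up for this sketch).
rows = [
    {"customer_email": "a@example.com", "unit_price": 9.5},
    {"customer_email": "", "unit_price": 4.0},
    {"customer_email": None, "unit_price": 7.25},
    {"customer_email": "d@example.com", "unit_price": 3.0},
]
limits = {"_default": 0.10, "customer_email": 0.15}
report = check_completeness(rows, ["customer_email", "unit_price"], limits)
```

Here `customer_email` is missing in two of four rows, so it fails its 0.15 limit, while `unit_price` has no missing values and passes under the default.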
Scoring: Each check produces a 0.0–1.0 score. Dimension scores are weighted and aggregated into an Overall Quality Score with PASS (≥75%), WARNING (50–75%), or CRITICAL (<50%) status.
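
The aggregation can be sketched as a weighted mean followed by a threshold lookup. The weights below are the `dimension_weights` from `config/thresholds.yaml` and the scores are the five weighted dimensions from the sample report in this README; the function names are hypothetical and the real logic in `src/utils.py` may differ.

```python
WEIGHTS = {
    "completeness": 0.25,
    "validity": 0.25,
    "uniqueness": 0.20,
    "consistency": 0.15,
    "timeliness": 0.15,
}

def overall_score(dimension_scores, weights=WEIGHTS):
    """Weighted mean over the dimensions that have both a score and a weight."""
    used = {d: w for d, w in weights.items() if d in dimension_scores}
    total_weight = sum(used.values())
    if total_weight == 0:
        raise ValueError("no weighted dimensions to aggregate")
    return sum(dimension_scores[d] * w for d, w in used.items()) / total_weight

def status(score, warning=0.75, critical=0.50):
    """Map a 0.0-1.0 score to PASS / WARNING / CRITICAL."""
    if score >= warning:
        return "PASS"
    if score >= critical:
        return "WARNING"
    return "CRITICAL"

# Dimension scores taken from the sample report (as fractions).
sample = {
    "completeness": 0.747,
    "validity": 0.676,
    "uniqueness": 0.0,
    "consistency": 0.863,
    "timeliness": 0.524,
}
print(f"{overall_score(sample):.1%} [{status(overall_score(sample))}]")
```

With these inputs the weighted mean is 0.5638, which rounds to the 56.4% WARNING shown in the sample report.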
Four anomaly detection methods, plus time-series drift detection, in src/detectors.py:
| Method | Type | Best For |
|---|---|---|
| IQR | Statistical | Univariate outliers, simple threshold-based detection |
| Z-score | Statistical | Normally distributed data, standard deviation-based |
| Modified Z-score | Robust | Data with outliers already present (MAD-based) |
| Isolation Forest | ML-based | Multivariate anomalies: unusual combinations of values |
| Drift Detection | Time-series | Distribution shifts over time (rolling window) |
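
The three statistical methods in the table can be sketched with the standard library alone. The cutoffs used here (1.5 × IQR, |z| > 3, modified z > 3.5) are common conventions assumed for illustration, not values confirmed from `src/detectors.py`, and the function names are hypothetical. Isolation Forest is left out because it needs an ML library.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

def zscore_outliers(values, threshold=3.0):
    """Values more than `threshold` population standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []
    return [v for v in values if abs(v - mean) / std > threshold]

def modified_zscore_outliers(values, threshold=3.5):
    """Same idea, but built on the median and the median absolute deviation (MAD)."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        return []
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]

# Made-up sample with one obvious outlier.
values = [10, 11, 9, 10, 12, 10, 11, 9, 10, 95]
print(iqr_outliers(values))
print(zscore_outliers(values))
print(modified_zscore_outliers(values))
```

On this small sample the IQR and MAD-based methods both flag 95, while the plain Z-score flags nothing: the outlier inflates the standard deviation enough to hide itself, which is the situation the "data with outliers already present" row refers to.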
The Streamlit dashboard has 6 interactive pages:
| Page | Features |
|---|---|
| Overview | KPI cards, quality dimension bar chart, missing value pie chart |
| Data Quality | Full checks table, pass/fail breakdown, failures by dimension |
| Anomaly Detection | Method comparison, top IForest anomalies, outlier counts by column |
| Drift Analysis | Multi-metric quality trends over 30 days, category drift, alert detail tabs |
| Data Explorer | Null-highlighted preview, dirty vs clean comparison, per-column profiles |
| Data Catalog | Schema overview, column stats & distributions, cross-table comparison |
streamlit run dashboard.py

Configure alerts in config/alerts.yaml or via environment variables:
# Slack
export DATAGUARD_SLACK_WEBHOOK="https://hooks.slack.com/services/..."
# Email
export DATAGUARD_SMTP_HOST="smtp.gmail.com"
export DATAGUARD_SMTP_USER="your@gmail.com"
export DATAGUARD_SMTP_PASS="app-password"
export DATAGUARD_ALERT_EMAILS="team@company.com"
# Run with alerts
python run_pipeline.py --alert

Alert triggers: critical score drop, any check failures, drift events, daily summary.
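
As a rough sketch of how a dispatcher might consume these variables, the code below reads the `DATAGUARD_*` settings and builds a Slack incoming-webhook request with the standard library. The function names are hypothetical, and treating `DATAGUARD_ALERT_EMAILS` as a comma-separated list is an assumption; `src/alerts.py` may handle both differently.

```python
import json
import urllib.request

def load_alert_config(environ):
    """Read the DATAGUARD_* variables; unset channels come back as None or []."""
    emails = environ.get("DATAGUARD_ALERT_EMAILS", "")
    return {
        "slack_webhook": environ.get("DATAGUARD_SLACK_WEBHOOK"),
        "smtp_host": environ.get("DATAGUARD_SMTP_HOST"),
        "smtp_user": environ.get("DATAGUARD_SMTP_USER"),
        "smtp_pass": environ.get("DATAGUARD_SMTP_PASS"),
        # Assumption: multiple recipients are separated by commas.
        "recipients": [e.strip() for e in emails.split(",") if e.strip()],
    }

def build_slack_request(webhook_url, text):
    """Build a POST request carrying the JSON body Slack webhooks expect."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def send_slack_alert(webhook_url, text, timeout=10):
    """Send the alert; returns True when Slack answers with HTTP 200."""
    request = build_slack_request(webhook_url, text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status == 200
```

In a script, `load_alert_config(os.environ)` would supply the settings. Keeping request construction separate from sending makes the payload easy to unit test without network access.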
# Continuous loop (every hour)
python run_pipeline.py --schedule --interval 3600
# One-shot (for cron/CI)
python run_pipeline.py --alert
# Specific stages only
python run_pipeline.py --stages validate,detect

pytest tests/ -v
# 72+ tests passed (all modules, including connectors, data catalog & history)

| File | Description |
|---|---|
| `customers.csv` | 8,000 clean customer records |
| `all_orders_combined.csv` | ~50,000 dirty orders with injected issues |
| `ground_truth_orders.csv` | ~50,000 clean orders (no issues) for comparison |
| `daily_orders_01–30.csv` | 30 daily snapshots with escalating quality issue rates |
| `quality_issues_log.json` | Audit trail of all injected quality issues |
Quality issues escalate over the 30-day period: null rates increase from 3–12% to 6–24%, duplicate rates from 3% to 9%, email invalidity from 5% to 11%, and category distributions drift.
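
The escalation and its detection can be sketched together. The example below assumes the duplicate rate rises linearly from 3% to 9% (the README only gives the endpoints) and flags days whose trailing 7-day mean exceeds the first week's mean by more than two percentage points. The window, the tolerance, and the function names are all hypothetical choices for illustration, not the logic in `src/detectors.py`.

```python
import statistics

def daily_rate(day, start, end, n_days=30):
    """Linearly interpolate a rate from `start` on day 1 to `end` on day n_days."""
    return start + (end - start) * (day - 1) / (n_days - 1)

def drifted_days(series, window=7, tolerance=0.02):
    """1-based days whose trailing-window mean exceeds the reference mean by > tolerance.

    The reference is the mean of the first `window` values.
    """
    reference = statistics.fmean(series[:window])
    flagged = []
    for i in range(window - 1, len(series)):
        recent = statistics.fmean(series[i - window + 1 : i + 1])
        if recent - reference > tolerance:
            flagged.append(i + 1)
    return flagged

# Assumed linear escalation of the duplicate rate across the 30 daily snapshots.
series = [daily_rate(day, 0.03, 0.09) for day in range(1, 31)]
print(drifted_days(series))
```

Under these assumptions drift is first reported on day 17 and on every day after it. Comparing against a fixed reference window, rather than only the previous few days, is what lets a slow, steady escalation accumulate into a detectable shift.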
All thresholds are configurable via YAML:
# config/thresholds.yaml
completeness:
  max_null_rate:
    _default: 0.10
    customer_email: 0.15
    unit_price: 0.10
  min_completeness_score: 0.85
scoring:
  dimension_weights:
    completeness: 0.25
    validity: 0.25
    uniqueness: 0.20
    consistency: 0.15
    timeliness: 0.15
  severity:
    critical: 0.50
    warning: 0.75

============================================================
Quality Report: all_orders_combined
Timestamp: 2026-05-26 11:10:17
============================================================
Overall Score: 56.4% [WARNING]
============================================================
Dimension Score Status
---------------------------------------------
accuracy 43.3% FAIL
anomaly_drift 0.0% FAIL
anomaly_iforest 100.0% PASS
anomaly_iqr 44.0% FAIL
anomaly_zscore 60.7% WARN
completeness 74.7% WARN
consistency 86.3% PASS
timeliness 52.4% WARN
uniqueness 0.0% FAIL
validity 67.6% WARN
---------------------------------------------
Checks: 35 total, 24 passed, 11 failed
============================================================
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing`)
- Run tests (`pytest tests/ -v`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing`)
- Open a Pull Request
Yash Patil
- yashpatil7714@gmail.com
- LinkedIn
- GitHub