Skip to content

Yash-Patil-1/DataGuard

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

5 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

πŸ›‘οΈ DataGuard

Automated Data Quality & Anomaly Detection Framework

Python 3.9+ MIT License Production Ready 72 tests passing CI


DataGuard is an end-to-end data quality monitoring framework that automatically detects, measures, and alerts on data quality issues. It combines statistical checks, machine learning-based anomaly detection, time-series drift monitoring, and an interactive dashboard β€” all in a single, extensible Python package.

✨ Features

Feature Description
πŸ§ͺ 6 Quality Dimensions Completeness, Uniqueness, Validity, Consistency, Timeliness, Accuracy
πŸ” 4 Anomaly Methods IQR, Z-score, Modified Z-score (MAD), Isolation Forest
πŸ“ˆ Drift Detection Rolling window analysis across 30-day time series
πŸ–₯️ Interactive Dashboard 6-page Streamlit dashboard with drill-downs and live charts
πŸ“š Data Catalog Auto-generated schema documentation with column profiles & quality flags
πŸ”” Alerting Slack webhook + SMTP email + console alerts with configurable thresholds
⏰ Scheduling Built-in continuous loop + cron-style for CI/CD integration
βœ… 72+ Unit Tests Comprehensive test coverage across all modules

πŸ“¦ Quick Start

# Clone
git clone https://github.com/Yash-Patil-1/dataguard.git
cd dataguard

# Install
pip install -r requirements.txt

# Generate sample data with intentionally injected quality issues
python src/data_generator.py

# Run quality checks
python src/validators.py

# Run anomaly detection
python src/detectors.py

# Launch the dashboard
streamlit run dashboard.py

Or use the all-in-one runner:

python run_pipeline.py
python run_pipeline.py --alert          # with Slack/Email alerts
python run_pipeline.py --schedule       # every hour

πŸ—οΈ Architecture

DataGuard/
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ config.py          # Configuration & data quality issue definitions
β”‚   β”œβ”€β”€ data_generator.py  # Synthetic data generator with quality issues
β”‚   β”œβ”€β”€ validators.py      # 6 modular quality checkers + pipeline
β”‚   β”œβ”€β”€ detectors.py       # 4 anomaly/drift detection methods + pipeline
β”‚   β”œβ”€β”€ data_catalog.py    # Auto-column profiling & data catalog
β”‚   β”œβ”€β”€ alerts.py          # Slack/Email/Console alert dispatcher
β”‚   └── utils.py           # QualityReport, scoring, reporting utilities
β”œβ”€β”€ tests/
β”‚   β”œβ”€β”€ test_validators.py # 14 tests β€” completeness, uniqueness, validity, etc.
β”‚   β”œβ”€β”€ test_detectors.py  # 9 tests β€” IQR, Z-score, IForest, drift
β”‚   └── test_utils.py      # 8 tests β€” report, alerts, formatting
β”œβ”€β”€ config/
β”‚   β”œβ”€β”€ thresholds.yaml    # Configurable pass/fail thresholds
β”‚   └── alerts.yaml        # Alert channel configuration
β”œβ”€β”€ data/                  # Generated datasets (auto-created)
β”œβ”€β”€ reports/               # Quality reports (auto-created)
β”œβ”€β”€ dashboard.py           # Streamlit interactive dashboard
β”œβ”€β”€ run_pipeline.py        # CLI pipeline orchestrator
└── README.md              # You are here

Data Flow

Data Generation ──► Quality Checks ──► Anomaly Detection ──► Alerts
       β”‚                    β”‚                   β”‚                β”‚
       β–Ό                    β–Ό                   β–Ό                β–Ό
  CSV Files          QualityReport        Anomaly Scores     Slack/Email
  (30 days)          (JSON + TXT)         (per method)       + Console
       β”‚                    β”‚                   β”‚
       β–Ό                    β–Ό                   β–Ό
  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
  β”‚            Streamlit Dashboard (6 pages)            β”‚
  β”‚  Overview β”‚ Data Quality β”‚ Anomalies β”‚ Drift β”‚      β”‚
  β”‚  Data Explorer β”‚ Data Catalog                        β”‚
  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

πŸ§ͺ Quality Dimensions

Each dimension is implemented as a modular checker class in src/validators.py:

Dimension Checker What It Detects
Completeness CompletenessChecker Null rates per column, empty strings
Uniqueness UniquenessChecker Exact row duplicates, key column duplicates
Validity ValidityChecker Email format (regex), future dates, numeric range violations
Consistency ConsistencyChecker Referential integrity (orphan IDs), amount calculation accuracy, city-state mapping
Timeliness TimelinessChecker Data freshness (hours since last update), recent data ratio
Accuracy AccuracyChecker Distribution shift vs ground truth (mean/std comparison)

Scoring: Each check produces a 0.0–1.0 score. Dimension scores are weighted and aggregated into an Overall Quality Score with PASS (β‰₯75%), WARNING (50–75%), or CRITICAL (<50%) status.


πŸ” Anomaly Detection

Four detection methods in src/detectors.py:

Method Type Best For
IQR Statistical Univariate outliers, simple threshold-based detection
Z-score Statistical Normally distributed data, standard deviation-based
Modified Z-score Robust Data with outliers already present (MAD-based)
Isolation Forest ML-based Multivariate anomalies β€” unusual combinations of values
Drift Detection Time-series Distribution shifts over time (rolling window)

πŸ–₯️ Dashboard

The Streamlit dashboard has 6 interactive pages:

Page Features
πŸ“Š Overview KPI cards, quality dimension bar chart, missing value pie chart
πŸ§ͺ Data Quality Full checks table, pass/fail breakdown, failures by dimension
πŸ” Anomaly Detection Method comparison, top IForest anomalies, outlier counts by column
πŸ“ˆ Drift Analysis Multi-metric quality trends over 30 days, category drift, alert detail tabs
πŸ”Ž Data Explorer Null-highlighted preview, dirty vs clean comparison, per-column profiles
πŸ“š Data Catalog Schema overview, column stats & distributions, cross-table comparison
streamlit run dashboard.py

πŸ”” Alerting

Configure alerts in config/alerts.yaml or via environment variables:

# Slack
export DATAGUARD_SLACK_WEBHOOK="https://hooks.slack.com/services/..."

# Email
export DATAGUARD_SMTP_HOST="smtp.gmail.com"
export DATAGUARD_SMTP_USER="your@gmail.com"
export DATAGUARD_SMTP_PASS="app-password"
export DATAGUARD_ALERT_EMAILS="team@company.com"

# Run with alerts
python run_pipeline.py --alert

Alert triggers: critical score drop, any check failures, drift events, daily summary.


⏰ Scheduling

# Continuous loop (every hour)
python run_pipeline.py --schedule --interval 3600

# One-shot (for cron/CI)
python run_pipeline.py --alert

# Specific stages only
python run_pipeline.py --stages validate,detect

πŸ§ͺ Testing

pytest tests/ -v
# 72+ tests passed (all modules, including connectors, data catalog & history)

πŸ“ Generated Data

File Description
customers.csv 8,000 clean customer records
all_orders_combined.csv ~50,000 dirty orders with injected issues
ground_truth_orders.csv ~50,000 clean orders (no issues) for comparison
daily_orders_01–30.csv 30 daily snapshots with escalating quality issue rates
quality_issues_log.json Audit trail of all injected quality issues

Quality issues escalate over the 30-day period: null rates increase from 3–12% to 6–24%, duplicate rates from 3% to 9%, email invalidity from 5% to 11%, and category distributions drift.


πŸ› οΈ Configuration

All thresholds are configurable via YAML:

# config/thresholds.yaml
completeness:
  max_null_rate:
    _default: 0.10
    customer_email: 0.15
    unit_price: 0.10
  min_completeness_score: 0.85

scoring:
  dimension_weights:
    completeness: 0.25
    validity: 0.25
    uniqueness: 0.20
    consistency: 0.15
    timeliness: 0.15
  severity:
    critical: 0.50
    warning: 0.75

πŸ“Š Sample Output

============================================================
  Quality Report: all_orders_combined
  Timestamp:      2026-05-26 11:10:17
============================================================
  Overall Score:  56.4%  [WARNING]
============================================================
   Dimension                  Score     Status
  ---------------------------------------------
   accuracy                  43.3%       FAIL
   anomaly_drift              0.0%       FAIL
   anomaly_iforest          100.0%       PASS
   anomaly_iqr               44.0%       FAIL
   anomaly_zscore            60.7%       WARN
   completeness              74.7%       WARN
   consistency               86.3%       PASS
   timeliness                52.4%       WARN
   uniqueness                 0.0%       FAIL
   validity                  67.6%       WARN
  ---------------------------------------------
  Checks: 35 total, 24 passed, 11 failed
============================================================

πŸ“„ License

MIT License


🀝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing)
  3. Run tests (pytest tests/ -v)
  4. Commit your changes (git commit -m 'Add amazing feature')
  5. Push to the branch (git push origin feature/amazing)
  6. Open a Pull Request

πŸ“¬ Contact

Yash Patil


Built with ❀️ for reliable data pipelines

About

Automated Data Quality & Anomaly Detection Framework

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages