Account 360 Analytics Platform


Enterprise 5-layer Medallion data platform consolidating multi-source account data into a single analytics-ready layer — delivering an 88% reduction in ETL runtime and enabling near-real-time insights across Tableau, React dashboards, Athena, and QuickSight.


📐 Architecture

```mermaid
flowchart TD
    subgraph Sources["📥 Source Systems"]
        A1[MongoDB\nAccount Records]
        A2[Operational DBs\nTransactions · CRM]
        A3[External APIs\nEnrichment Data]
        A4[Flat Files\nCSV · JSON · Parquet]
    end

    subgraph Layer1["Layer 1 — Landing Zone (S3)"]
        B[S3 Landing\nRaw as-received files\nPartitioned by date/source]
    end

    subgraph Layer2["Layer 2 — Raw (S3 + Glue Catalog)"]
        C[AWS Glue Crawler\nSchema Discovery]
        D[S3 Raw\nValidated schema\nNo transformation]
    end

    subgraph Layer3["Layer 3 — Trusted (AWS Glue ETL + PySpark)"]
        E[AWS Glue ETL\nPySpark Transforms]
        F[S3 Trusted\nCleaned · Deduplicated\nType-cast · Validated]
        G[DynamoDB\nWatermark Store\nIncremental tracking]
    end

    subgraph Layer4["Layer 4 — Enriched (AWS Glue + PySpark)"]
        H[Enrichment Jobs\nJoins · Aggregations\nBusiness Logic]
        I[S3 Enriched\nFeature-complete records\nStar Schema ready]
    end

    subgraph Layer5["Layer 5 — Business (Redshift + Serving)"]
        J[Amazon Redshift\n15+ Dimension Tables\n8 Fact Tables · SCD Type 2]
        K[Amazon Athena\nAd-hoc SQL]
        L[Amazon QuickSight\nExecutive Dashboards]
        M[Tableau\nAnalyst Dashboards]
        N[React Dashboard\nOperational UI]
    end

    subgraph Orchestration["⚙️ Orchestration & Monitoring"]
        O[AWS Step Functions\nPipeline Orchestration]
        P[AWS Lambda\nError Handling · Alerts]
        Q[CloudWatch\nMonitoring · Alerting]
    end

    A1 & A2 & A3 & A4 --> B
    B --> C --> D
    D --> E
    G --> E
    E --> F
    F --> H --> I
    I --> J
    J --> K & L & M & N
    O --> E & H & P
    Q -.->|Metrics| O

    style E fill:#FF9900,stroke:#333,color:#000
    style J fill:#8C4FFF,stroke:#333,color:#fff
    style O fill:#FF4F00,stroke:#333,color:#fff
```

🎯 Problem Statement

Account data was fragmented across MongoDB, operational databases, external APIs, and flat file exports — with no single source of truth. Analysts spent 60%+ of their time reconciling data across systems. ETL jobs took 50+ minutes to complete and had no lineage tracking or automated error recovery.

✅ Solution

A 5-layer Medallion architecture on AWS consolidates all account data sources into a structured pipeline with automated schema discovery, PySpark transformations with broadcast-join optimisation, SCD Type 2 history tracking, and multi-channel data consumption — reducing ETL execution from 50 minutes to 6 minutes (an 88% improvement).
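The broadcast-join idea behind that speedup can be illustrated without a cluster: the small dimension table is shipped whole to every worker and joined via an in-memory lookup, so the large fact table never needs a cluster-wide shuffle. A minimal pure-Python sketch of the concept (column names are illustrative, not taken from the actual jobs):

```python
# Conceptual sketch of a broadcast (map-side) join: the small dimension
# table becomes an in-memory hash map, so each fact row is enriched with
# a local lookup instead of a shuffle across the cluster.

def broadcast_join(facts, dim_rows, key):
    """Join each fact row against a small dimension via a local dict."""
    dim_index = {row[key]: row for row in dim_rows}  # the "broadcast" copy
    joined = []
    for fact in facts:
        dim = dim_index.get(fact[key])
        if dim is not None:
            joined.append({**fact, **dim})
    return joined

facts = [
    {"account_id": 1, "amount": 250},
    {"account_id": 2, "amount": 90},
]
dims = [
    {"account_id": 1, "segment": "enterprise"},
    {"account_id": 2, "segment": "smb"},
]

rows = broadcast_join(facts, dims, "account_id")
```

In the actual Glue jobs this would be expressed as `large_df.join(broadcast(small_df), "account_id")` using `pyspark.sql.functions.broadcast`; the dict lookup above is only the shape of what each Spark executor does once the small table has been broadcast to it.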


📊 Key Metrics

| Metric | Value |
|---|---|
| ETL execution time | 50 min → 6 min (88% reduction) |
| Data sources consolidated | 4+ (MongoDB, DBs, APIs, files) |
| Dimension tables | 15+ |
| Fact tables | 8 |
| Historical tracking | SCD Type 2 across all entities |
| Consumption channels | 4 (QuickSight, Tableau, Athena, React) |
| Pipeline availability | 99.9% |

🛠️ Tech Stack

| Layer | Technology |
|---|---|
| Ingestion | AWS Glue Crawler, S3 Landing Zone |
| Metadata | AWS Glue Data Catalog, DynamoDB (watermarks) |
| Transformation | AWS Glue ETL (PySpark) — broadcast joins, caching, resource allocation |
| Orchestration | AWS Step Functions, AWS Lambda |
| Storage | Amazon S3 (5-layer: Landing → Raw → Trusted → Enriched → Business) |
| Warehouse | Amazon Redshift (star schema, SCD Type 2) |
| Query | Amazon Athena |
| Visualisation | Amazon QuickSight, Tableau, React |
| Monitoring | Amazon CloudWatch, SNS |
| CI/CD | GitHub Actions |

📁 Project Structure

```
account-360-analytics-platform/
├── ingestion/
│   ├── mongodb_extractor.py          # MongoDB → S3 Landing
│   ├── api_ingestion.py              # REST API → S3 Landing
│   └── file_ingestion.py             # Flat file → S3 Landing
├── glue/
│   ├── jobs/
│   │   ├── raw_layer.py              # Landing → Raw (schema validation)
│   │   ├── trusted_layer.py          # Raw → Trusted (cleanse, dedupe, cast)
│   │   ├── enriched_layer.py         # Trusted → Enriched (joins, business logic)
│   │   └── business_layer.py         # Enriched → Redshift (SCD Type 2 loads)
│   └── utils/
│       ├── spark_optimiser.py        # Broadcast join helpers, caching logic
│       └── watermark_manager.py      # DynamoDB watermark R/W
├── redshift/
│   ├── ddl/
│   │   ├── dimensions/               # 15+ dimension table DDLs
│   │   └── facts/                    # 8 fact table DDLs
│   ├── stored_procedures/
│   │   └── scd_type2_merge.sql       # SCD Type 2 upsert pattern
│   └── data_models/
│       ├── conceptual_model.md
│       ├── logical_model.md
│       └── physical_model.md
├── step_functions/
│   └── pipeline_definition.json      # Full 5-layer orchestration
├── quality/
│   ├── validation_rules.yaml         # Per-layer quality rules
│   └── quality_monitor.py            # Automated quality checks
├── tests/
│   ├── test_transformations.py
│   ├── test_scd_type2.py
│   └── test_quality_checks.py
├── .github/
│   └── workflows/
│       ├── ci.yml
│       └── deploy.yml
└── README.md
```
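The watermark pattern behind `watermark_manager.py` is simple to sketch: each (source, table) pair stores the highest timestamp it has processed, and the next run extracts only rows newer than that before advancing the watermark. A minimal self-contained sketch, with a plain dict standing in for the DynamoDB table (the real job would use boto3 `get_item`/`put_item`; field names here are illustrative):

```python
# Incremental-load watermark pattern: read the last-processed timestamp,
# keep only newer records, then advance the watermark. A dict stands in
# for the DynamoDB watermark table in this sketch.

class WatermarkStore:
    def __init__(self):
        self._items = {}  # stand-in for a DynamoDB table keyed by (source, table)

    def get(self, source, table):
        return self._items.get((source, table), "1970-01-01T00:00:00")

    def put(self, source, table, watermark):
        self._items[(source, table)] = watermark

def incremental_extract(records, store, source, table, ts_field="updated_at"):
    """Return only records newer than the stored watermark, then advance it."""
    watermark = store.get(source, table)
    fresh = [r for r in records if r[ts_field] > watermark]
    if fresh:
        store.put(source, table, max(r[ts_field] for r in fresh))
    return fresh

store = WatermarkStore()
batch1 = [{"id": 1, "updated_at": "2024-05-01T10:00:00"}]
batch2 = batch1 + [{"id": 2, "updated_at": "2024-05-02T09:00:00"}]

first = incremental_extract(batch1, store, "mongodb", "accounts")
second = incremental_extract(batch2, store, "mongodb", "accounts")  # id 1 is skipped
```

Using ISO-8601 timestamp strings keeps the comparison lexicographic and portable; the serverless store means no metadata database to operate alongside the pipeline.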

⚙️ Key Engineering Decisions

  • 5-layer Medallion over simple staging — separating Landing, Raw, Trusted, Enriched, and Business layers enables independent replay at any stage without reprocessing from source, critical for schema drift recovery.
  • Broadcast join optimisation — broadcasting small dimension tables to every Spark executor eliminated shuffle overhead, the primary driver of the 88% performance gain.
  • DynamoDB as watermark store — serverless, low-latency watermark tracking enables reliable incremental loads without managing a dedicated metadata database.
  • SCD Type 2 in Redshift — full historical tracking of account attribute changes enables time-travel analysis and audit compliance without separate snapshot tables.
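The SCD Type 2 decision above boils down to an expire-and-insert step: when a tracked attribute changes, the current row is closed out and a new current version is appended, so history is never overwritten. A minimal Python sketch of that logic (the production version lives in the `scd_type2_merge.sql` stored procedure; column names here are illustrative):

```python
# SCD Type 2 expire-and-insert sketch: changed records expire the old
# current row (end_date set, is_current flipped) and append a new
# current version, preserving full attribute history.

def scd2_merge(dim_rows, incoming, key, tracked, load_date):
    current = {r[key]: r for r in dim_rows if r["is_current"]}
    for rec in incoming:
        old = current.get(rec[key])
        if old and all(old[c] == rec[c] for c in tracked):
            continue  # no attribute change: keep the existing current row
        if old:
            old["is_current"] = False   # expire the old version
            old["end_date"] = load_date
        dim_rows.append({               # insert the new current version
            key: rec[key],
            **{c: rec[c] for c in tracked},
            "start_date": load_date,
            "end_date": None,
            "is_current": True,
        })
    return dim_rows

dim = [{"account_id": 1, "tier": "gold",
        "start_date": "2024-01-01", "end_date": None, "is_current": True}]
dim = scd2_merge(dim, [{"account_id": 1, "tier": "platinum"}],
                 "account_id", ["tier"], "2024-06-01")
# dim now holds two versions of account 1: gold (expired) and platinum (current)
```

Querying "state as of date D" then reduces to filtering `start_date <= D` and (`end_date` is null or `end_date > D`), which is what makes time-travel analysis possible without separate snapshot tables.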

🚀 Quick Start

```bash
# Clone the repo
git clone https://github.com/jesseantony/account-360-analytics-platform.git
cd account-360-analytics-platform

# Configure AWS credentials
aws configure

# Deploy infrastructure
aws cloudformation deploy \
  --template-file infra/stack.yaml \
  --stack-name account-360 \
  --capabilities CAPABILITY_IAM

# Deploy Glue jobs
python scripts/deploy.py --env dev

# Run tests
pip install -r requirements.txt
pytest tests/ -v
```

📌 Topics

aws aws-glue pyspark medallion-architecture redshift step-functions data-engineering etl star-schema scd-type-2 python athena quicksight
