Enterprise 5-layer Medallion data platform consolidating multi-source account data into a single analytics-ready layer — delivering 88% ETL performance improvement and enabling real-time insights across Tableau, React dashboards, Athena, and QuickSight.
```mermaid
flowchart TD
    subgraph Sources["📥 Source Systems"]
        A1[MongoDB\nAccount Records]
        A2[Operational DBs\nTransactions · CRM]
        A3[External APIs\nEnrichment Data]
        A4[Flat Files\nCSV · JSON · Parquet]
    end
    subgraph Layer1["Layer 1 — Landing Zone (S3)"]
        B[S3 Landing\nRaw as-received files\nPartitioned by date/source]
    end
    subgraph Layer2["Layer 2 — Raw (S3 + Glue Catalog)"]
        C[AWS Glue Crawler\nSchema Discovery]
        D[S3 Raw\nValidated schema\nNo transformation]
    end
    subgraph Layer3["Layer 3 — Trusted (AWS Glue ETL + PySpark)"]
        E[AWS Glue ETL\nPySpark Transforms]
        F[S3 Trusted\nCleaned · Deduplicated\nType-cast · Validated]
        G[DynamoDB\nWatermark Store\nIncremental tracking]
    end
    subgraph Layer4["Layer 4 — Enriched (AWS Glue + PySpark)"]
        H[Enrichment Jobs\nJoins · Aggregations\nBusiness Logic]
        I[S3 Enriched\nFeature-complete records\nStar Schema ready]
    end
    subgraph Layer5["Layer 5 — Business (Redshift + Serving)"]
        J[Amazon Redshift\n15+ Dimension Tables\n8 Fact Tables · SCD Type 2]
        K[Amazon Athena\nAd-hoc SQL]
        L[Amazon QuickSight\nExecutive Dashboards]
        M[Tableau\nAnalyst Dashboards]
        N[React Dashboard\nOperational UI]
    end
    subgraph Orchestration["⚙️ Orchestration & Monitoring"]
        O[AWS Step Functions\nPipeline Orchestration]
        P[AWS Lambda\nError Handling · Alerts]
        Q[CloudWatch\nMonitoring · Alerting]
    end
    A1 & A2 & A3 & A4 --> B
    B --> C --> D
    D --> E
    G --> E
    E --> F
    F --> H --> I
    I --> J
    J --> K & L & M & N
    O --> E & H & P
    Q -.->|Metrics| O
    style E fill:#FF9900,stroke:#333,color:#000
    style J fill:#8C4FFF,stroke:#333,color:#fff
    style O fill:#FF4F00,stroke:#333,color:#fff
```
Account data was fragmented across MongoDB, operational databases, external APIs, and flat file exports — with no single source of truth. Analysts spent 60%+ of their time reconciling data across systems. ETL jobs took 50+ minutes to complete and had no lineage tracking or automated error recovery.
A 5-layer Medallion architecture on AWS consolidates all account data sources into a structured pipeline with automated schema discovery, PySpark-based transformation with broadcast join optimisation, SCD Type 2 history tracking, and multi-channel data consumption, reducing ETL execution from 50 minutes to 6 minutes (an 88% improvement).
| Metric | Value |
|---|---|
| ETL execution time | 50 min → 6 min (88% reduction) |
| Data sources consolidated | 4+ (MongoDB, DBs, APIs, files) |
| Dimension tables | 15+ |
| Fact tables | 8 |
| Historical tracking | SCD Type 2 across all entities |
| Consumption channels | 4 (QuickSight, Tableau, Athena, React) |
| Pipeline availability | 99.9% |
| Layer | Technology |
|---|---|
| Ingestion | AWS Glue Crawler, S3 Landing Zone |
| Metadata | AWS Glue Data Catalog, DynamoDB (watermarks) |
| Transformation | AWS Glue ETL (PySpark) — broadcast joins, caching, resource allocation |
| Orchestration | AWS Step Functions, AWS Lambda |
| Storage | Amazon S3 (5-layer: Landing → Raw → Trusted → Enriched → Business) |
| Warehouse | Amazon Redshift (star schema, SCD Type 2) |
| Query | Amazon Athena |
| Visualisation | Amazon QuickSight, Tableau, React |
| Monitoring | Amazon CloudWatch, SNS |
| CI/CD | GitHub Actions |
```text
account-360-analytics-platform/
├── ingestion/
│   ├── mongodb_extractor.py       # MongoDB → S3 Landing
│   ├── api_ingestion.py           # REST API → S3 Landing
│   └── file_ingestion.py          # Flat file → S3 Landing
├── glue/
│   ├── jobs/
│   │   ├── raw_layer.py           # Landing → Raw (schema validation)
│   │   ├── trusted_layer.py       # Raw → Trusted (cleanse, dedupe, cast)
│   │   ├── enriched_layer.py      # Trusted → Enriched (joins, business logic)
│   │   └── business_layer.py      # Enriched → Redshift (SCD Type 2 loads)
│   └── utils/
│       ├── spark_optimiser.py     # Broadcast join helpers, caching logic
│       └── watermark_manager.py   # DynamoDB watermark R/W
├── redshift/
│   ├── ddl/
│   │   ├── dimensions/            # 15+ dimension table DDLs
│   │   └── facts/                 # 8 fact table DDLs
│   ├── stored_procedures/
│   │   └── scd_type2_merge.sql    # SCD Type 2 upsert pattern
│   └── data_models/
│       ├── conceptual_model.md
│       ├── logical_model.md
│       └── physical_model.md
├── step_functions/
│   └── pipeline_definition.json   # Full 5-layer orchestration
├── quality/
│   ├── validation_rules.yaml      # Per-layer quality rules
│   └── quality_monitor.py         # Automated quality checks
├── tests/
│   ├── test_transformations.py
│   ├── test_scd_type2.py
│   └── test_quality_checks.py
├── .github/
│   └── workflows/
│       ├── ci.yml
│       └── deploy.yml
└── README.md
```
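As a rough illustration of what `step_functions/pipeline_definition.json` orchestrates, here is a minimal Amazon States Language sketch chaining two of the Glue layer jobs with a Lambda failure alert. The job names and Lambda function name are illustrative assumptions, not the actual definition:

```json
{
  "Comment": "Illustrative sketch only; see step_functions/pipeline_definition.json for the full 5-layer flow",
  "StartAt": "TrustedLayer",
  "States": {
    "TrustedLayer": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": { "JobName": "trusted_layer" },
      "Catch": [{ "ErrorEquals": ["States.ALL"], "Next": "NotifyFailure" }],
      "Next": "EnrichedLayer"
    },
    "EnrichedLayer": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": { "JobName": "enriched_layer" },
      "Catch": [{ "ErrorEquals": ["States.ALL"], "Next": "NotifyFailure" }],
      "End": true
    },
    "NotifyFailure": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": { "FunctionName": "pipeline-alerts" },
      "End": true
    }
  }
}
```

The `.sync` integration makes each state wait for the Glue job run to finish before advancing, which is what lets a single `Catch` route any layer failure to the alerting Lambda.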
- 5-layer Medallion over simple staging — separating Landing, Raw, Trusted, Enriched, and Business layers enables independent replay at any stage without reprocessing from source, critical for schema drift recovery.
- Broadcast join optimisation — small dimension tables broadcast to all Spark executors eliminated shuffle overhead, primary driver of 88% performance gain.
- DynamoDB as watermark store — serverless, low-latency watermark tracking enables reliable incremental loads without managing a dedicated metadata database.
- SCD Type 2 in Redshift — full historical tracking of account attribute changes enables time-travel analysis and audit compliance without separate snapshot tables.
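To make the watermark decision concrete, here is a minimal pure-Python sketch of the incremental-load pattern behind `glue/utils/watermark_manager.py`. A plain dict stands in for the DynamoDB table, and the function name and timestamp field are illustrative assumptions:

```python
def incremental_load(store, source, records, ts_field="updated_at"):
    """Return only records newer than the stored watermark, then advance it.

    `store` is a dict standing in for the DynamoDB watermark table;
    ISO-8601 timestamp strings compare correctly as plain strings.
    """
    watermark = store.get(source, "")  # "" sorts before any ISO timestamp
    fresh = [r for r in records if r[ts_field] > watermark]
    if fresh:
        # Advance the watermark to the newest record we just processed
        store[source] = max(r[ts_field] for r in fresh)
    return fresh
```

Because the watermark only advances after a successful read, replaying the same batch is a no-op, which is what makes the incremental loads safe to retry.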
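The SCD Type 2 merge itself lives in `redshift/stored_procedures/scd_type2_merge.sql`; the sketch below restates the core logic in plain Python so the pattern is easy to follow. All names (`scd2_merge`, the effective-date columns) are illustrative, not the actual schema:

```python
from datetime import date

HIGH_DATE = date(9999, 12, 31)  # conventional "open" end date

def scd2_merge(dim_rows, incoming, business_key, tracked, load_date):
    """When a tracked attribute changes, close the current row and insert a new version."""
    current = {r[business_key]: r for r in dim_rows if r["is_current"]}
    for rec in incoming:
        cur = current.get(rec[business_key])
        if cur and all(cur[c] == rec[c] for c in tracked):
            continue  # no change: leave the current version open
        if cur:
            # Close out the existing version as of this load
            cur["effective_to"] = load_date
            cur["is_current"] = False
        # Insert the new version with an open-ended validity window
        dim_rows.append({**rec,
                         "effective_from": load_date,
                         "effective_to": HIGH_DATE,
                         "is_current": True})
    return dim_rows
```

Every attribute change produces a new row rather than an update in place, so a point-in-time query only needs a `WHERE effective_from <= :d AND effective_to > :d` filter.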
```shell
# Clone the repo
git clone https://github.com/jesseantony/account-360-analytics-platform.git
cd account-360-analytics-platform

# Configure AWS credentials
aws configure

# Deploy infrastructure
aws cloudformation deploy \
  --template-file infra/stack.yaml \
  --stack-name account-360 \
  --capabilities CAPABILITY_IAM

# Deploy Glue jobs
python scripts/deploy.py --env dev

# Run tests
pip install -r requirements.txt
pytest tests/ -v
```