This project implements an end-to-end ELT pipeline for NYC 311 service request data. The pipeline ingests raw data from the NYC Open Data API, transforms and enriches it using dbt, and produces BI-ready aggregates for analysis of complaint trends, agency performance, and geospatial distribution.
NYC 311 API → Polars → PostgreSQL (raw schema) → dbt staging → dbt intermediate → dbt marts → BI layer
- Raw Layer (`raw`) - stores raw API data without modifications
- Staging Layer (`staging`) - cleans and normalizes columns, enforces types, derives basic fields
- Intermediate Layer (`intermediate`) - enriches data with derived metrics and flags (e.g., `resolution_hours`, `is_instant_close`)
- Marts Layer (`marts`) - star schema fact and dimension tables, plus pre-aggregated tables for performance
- Incremental Data Ingestion - batched ingestion, query decomposition, metadata watermarking to maintain state between batches
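The batching-plus-watermark idea above can be sketched as follows. This is a minimal illustration, not the project's actual ingestion code: `erm2-nwe9` is the public dataset ID for NYC 311 on the Socrata API, while the field names, window size, and dict-based metadata store are assumptions for the example.

```python
from datetime import datetime, timedelta
from urllib.parse import urlencode

# Public NYC 311 dataset on the Socrata Open Data API.
BASE_URL = "https://data.cityofnewyork.us/resource/erm2-nwe9.json"

def build_batch_query(watermark: str, batch_hours: int = 24, limit: int = 50000) -> str:
    """Decompose the full extract into a bounded window starting at `watermark`."""
    start = datetime.fromisoformat(watermark)
    end = start + timedelta(hours=batch_hours)
    params = {
        "$where": f"created_date > '{start.isoformat()}' AND created_date <= '{end.isoformat()}'",
        "$order": "created_date",
        "$limit": limit,
    }
    return f"{BASE_URL}?{urlencode(params)}"

def advance_watermark(state: dict, last_created: str) -> dict:
    """Persisted metadata maintains state between batches; a dict stands in here."""
    state["watermark"] = max(state.get("watermark", ""), last_created)
    return state
```

Each batch fetches one window, loads it, then advances the watermark to the latest `created_date` seen, so a restart resumes from the last committed window rather than re-reading history.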
- Reliable Data Loading - bulk inserts via temporary tables + `COPY`, with `ON CONFLICT DO NOTHING` to prevent duplicates.
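The temp-table + `COPY` + upsert pattern can be illustrated by generating its statements; a real run would execute them through a Postgres driver. Table, schema, and column names below are hypothetical stand-ins, not the project's actual DDL.

```python
def build_load_sql(table: str, columns: list[str], conflict_key: str) -> dict:
    """Generate the three statements for the temp-table bulk-load pattern:
    1) create a session-scoped temp table shaped like the target,
    2) COPY the batch into it,
    3) insert into the target, skipping rows that already exist."""
    cols = ", ".join(columns)
    return {
        "create_tmp": f"CREATE TEMP TABLE tmp_{table} (LIKE raw.{table} INCLUDING DEFAULTS)",
        "copy": f"COPY tmp_{table} ({cols}) FROM STDIN WITH (FORMAT csv)",
        "insert": (
            f"INSERT INTO raw.{table} ({cols}) "
            f"SELECT {cols} FROM tmp_{table} "
            f"ON CONFLICT ({conflict_key}) DO NOTHING"
        ),
    }
```

Because `COPY` targets the temp table rather than the warehouse table, a duplicate row never aborts the bulk load; the conflict check happens only in the final insert.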
- Data Quality Checks - automated dbt tests to ensure data quality, such as `not_null`, `unique`, and `accepted_values`.
- Derived Metrics & Flags - includes `resolution_hours`, `is_instant_close` (auto-closure indicator), and other enriched fields.
- Performance Optimization - pre-aggregated tables for geospatial heatmaps, performance by agency, and daily trends.
- Schema Organization - medallion architecture; layered schemas (`staging`, `intermediate`, `marts`) maintain ground truth and provide clarity.
- Dockerized Environment - dbt and PostgreSQL fully containerized for reproducibility and easy deployment.
- Python, Polars (ingestion and data manipulation)
- PostgreSQL (data warehouse)
- dbt (transformations, modeling, testing)
- Docker (containerized environment)
- NYC Open Data API (data source)
- Connect a BI tool for dashboards (geo heatmaps, trends, agency/borough comparisons)
- Integrate additional data sources (weather, other NYC APIs) to enrich the data further and expand analytics capabilities
- Configure environment variables in `.env` for the Postgres connection.
- Build and run containers via Docker Compose:

```shell
docker compose up -d
```
- Run dbt transformations and tests:

```shell
docker compose run --rm dbt dbt build
docker compose run --rm dbt dbt test
```
- Use psql to explore and analyze the data:

```shell
docker exec -it nyc311_postgres psql -U $POSTGRES_USER -d $POSTGRES_DB
```
- Ingestion and raw layer fully implemented and verified
- Staging, intermediate, and marts models implemented with dbt tests passing
- Incremental ingestion, metadata watermarking, and batch performance optimized
- BI layer (dashboards + visualizations) still in progress; will involve tweaking the mart models to fit analysis needs