21 changes: 21 additions & 0 deletions README.md
@@ -1,5 +1,17 @@
# 🍔 Casper's Kitchens

[![Python](https://img.shields.io/badge/Python-3.10+-3776AB?style=flat&logo=python&logoColor=white)](https://www.python.org/)
[![PySpark](https://img.shields.io/badge/PySpark-3.5+-E25A1C?style=flat&logo=apache-spark&logoColor=white)](https://spark.apache.org/)
[![Databricks](https://img.shields.io/badge/Databricks-Platform-FF3621?style=flat&logo=databricks&logoColor=white)](https://www.databricks.com/)
[![Delta Lake](https://img.shields.io/badge/Delta_Lake-Latest-00ADD8?style=flat&logo=delta&logoColor=white)](https://delta.io/)
[![MLflow](https://img.shields.io/badge/MLflow-Agent_Tracking-0194E2?style=flat&logo=mlflow&logoColor=white)](https://mlflow.org/)
[![LangChain](https://img.shields.io/badge/LangChain-AI_Agent-121212?style=flat&logo=chainlink&logoColor=white)](https://www.langchain.com/)
[![FastAPI](https://img.shields.io/badge/FastAPI-Web_App-009688?style=flat&logo=fastapi&logoColor=white)](https://fastapi.tiangolo.com/)
[![PostgreSQL](https://img.shields.io/badge/PostgreSQL-Lakebase-4169E1?style=flat&logo=postgresql&logoColor=white)](https://www.postgresql.org/)
[![License](https://img.shields.io/badge/License-Databricks-00A4EF?style=flat)](https://databricks.com/db-license-source)

---

Spin up a fully working ghost-kitchen business on Databricks in minutes.

Casper's Kitchens is a simulated food-delivery platform that shows off the full power of Databricks: streaming ingestion, Lakeflow Declarative Pipelines, AI/BI Dashboards and Genie, Agent Bricks, and real-time apps backed by Lakebase Postgres — all stitched together into one narrative.
@@ -25,6 +37,8 @@ Then open Databricks and watch:

That's it! Your Casper's Kitchens environment will be up and running.

> 📖 **[View Complete Documentation](./docs/README.md)** - For detailed architecture diagrams, technical reference, and developer guides, see the [docs folder](./docs/).

## 🏗️ What is Casper's Kitchens?

Casper's Kitchens is a fully functional ghost kitchen business running entirely on the Databricks platform. As a ghost kitchen, Casper's operates multiple compact commercial kitchens in shared locations, hosting restaurant vendors as tenants who create digital brands to serve diverse cuisines from single kitchen spaces.
@@ -128,3 +142,10 @@ Run `destroy.ipynb` to remove all Casper's Kitchens resources from your workspace

| library | description | license | source |
|----------------------------------------|-------------------------|------------|-----------------------------------------------------|
| LangChain | AI agent framework | MIT | https://github.com/langchain-ai/langchain |
| FastAPI | Web framework | MIT | https://github.com/tiangolo/fastapi |
| MLflow | Model tracking | Apache 2.0 | https://github.com/mlflow/mlflow |
| SQLAlchemy | Database ORM | MIT | https://github.com/sqlalchemy/sqlalchemy |
| psycopg | PostgreSQL adapter | LGPL-3.0 | https://github.com/psycopg/psycopg |
| Databricks SDK | Databricks API client | Apache 2.0 | https://github.com/databricks/databricks-sdk-py |
| Uvicorn | ASGI server | BSD-3 | https://github.com/encode/uvicorn |
120 changes: 120 additions & 0 deletions docs/README.md
@@ -0,0 +1,120 @@
# Casper's Kitchens - Documentation

This documentation provides comprehensive guidance for understanding and working with the Casper's Kitchens ghost kitchen data platform.

## 📋 Documentation Overview

### 🎯 [Dataflow Architecture Diagram](./dataflow-diagram.md)
Complete visual overview of the data architecture showing:
- Event sources and data ingestion
- Medallion architecture layers (Bronze → Silver → Gold)
- Applications and consumption patterns
- Data lineage and dependencies

### 🔧 [Technical Reference](./technical-reference.md)
Detailed technical specifications including:
- Complete table schemas and data types
- Transformation logic and SQL implementations
- Configuration parameters and settings

### 👨‍💻 [Developer Onboarding Guide](./developer-onboarding.md)
Step-by-step guide for new developers covering:
- Architecture overview and key concepts
- Essential files and code walkthrough
- Common development tasks and patterns
- SQL queries for monitoring and validation

### 🎨 Visual Dataflow Diagrams
Complete dataflow visualization available in multiple formats:
- **[PNG Image](./images/dataflow-diagram.png)** - Standard resolution with dark theme
- **[High-Res PNG](./images/dataflow-diagram-hd.png)** - High resolution for presentations
- **[SVG Vector](./images/dataflow-diagram.svg)** - Scalable vector format
- **[Mermaid Source](./dataflow-diagram.mermaid)** - Source code for modifications

## 🚀 Quick Navigation

### For New Developers
1. Start with the [Developer Onboarding Guide](./developer-onboarding.md)
2. Review the [Dataflow Architecture](./dataflow-diagram.md)
3. Reference the [Technical Specifications](./technical-reference.md) as needed

### For Data Engineers
1. Examine the [Technical Reference](./technical-reference.md) for implementation details
2. Use the [Dataflow Diagram](./dataflow-diagram.md) to understand data lineage
3. Follow the [Developer Guide](./developer-onboarding.md) for common tasks

### For Architects
1. Review the [Dataflow Architecture](./dataflow-diagram.md) for system design
2. Check the [Technical Reference](./technical-reference.md) for scalability details
3. Use the [Mermaid Diagram](./dataflow-diagram.mermaid) for presentations

## 🏗️ Architecture Summary

Casper's Kitchens implements a modern data platform with:

- **Real-time Event Processing**: CloudFiles streaming from ghost kitchen operations
- **Medallion Architecture**: Bronze → Silver → Gold data layers with Delta Live Tables
- **Streaming Intelligence**: ML-powered refund recommendations using LLMs
- **Operational Applications**: FastAPI web apps backed by Lakebase PostgreSQL
- **Business Intelligence**: Real-time dashboards and analytics

## 📊 Key Components

| Component | Purpose | Technology |
|-----------|---------|------------|
| Event Sources | Ghost kitchen operations | JSON events, GPS tracking |
| Bronze Layer | Raw event storage | Delta Live Tables, CloudFiles |
| Silver Layer | Clean operational data | Spark streaming, schema enforcement |
| Gold Layer | Business intelligence | Aggregations, time-series data |
| Streaming ML | Real-time recommendations | LLM integration, Spark streaming |
| Lakebase | Operational database | PostgreSQL, continuous sync |
| Applications | Human interfaces | FastAPI, React, REST APIs |

## 🔄 Data Flow Pattern

```
Ghost Kitchens → Events → Volume → Bronze → Silver → Gold → Apps
Dimensional Data (Parquet) ─────────────────────────────────→ Apps
Bronze → Streaming Intelligence (ML) → Lakebase (PostgreSQL) → Apps
```

## 📈 Business Metrics

The platform tracks key business metrics including:

- **Order Performance**: Revenue, item counts, delivery times
- **Brand Analytics**: Sales by brand, menu performance
- **Location Intelligence**: Hourly performance by ghost kitchen
- **Operational Efficiency**: Refund rates, customer satisfaction
- **Real-time Monitoring**: Live order tracking, driver performance

## 🛠️ Development Workflow

1. **Understand**: Review architecture and data flow
2. **Explore**: Examine key code files and notebooks
3. **Develop**: Make changes to transformations or applications
4. **Test**: Validate using SQL queries and application UI
5. **Deploy**: Use pipeline orchestration for production changes
6. **Monitor**: Track performance and data quality

## 📚 Additional Resources

- **Main README**: `../README.md` - Project overview and quick start
- **Code Examples**: All notebooks include detailed comments
- **Configuration**: `../data/generator/configs/` - Simulation parameters
- **Applications**: `../apps/` - Web application source code
- **Pipelines**: `../pipelines/` - Data transformation logic

## 🤝 Contributing

When contributing to the documentation:

1. Keep diagrams and technical details in sync with code changes
2. Update the developer onboarding guide for new features
3. Maintain consistency in terminology and formatting
4. Test all code examples and SQL queries
5. Update the visual diagram when architecture changes
226 changes: 226 additions & 0 deletions docs/dataflow-diagram.md
@@ -0,0 +1,226 @@
# Casper's Kitchens - Dataflow Architecture Diagram

## Overview

This document provides a comprehensive view of the Casper's Kitchens data architecture, showing how data flows from ghost kitchen operations through the medallion architecture to applications and dashboards.

## Visual Dataflow Diagram

![Casper's Kitchens Dataflow Architecture](./images/dataflow-diagram.png)

*Complete dataflow architecture showing event sources, medallion layers, streaming intelligence, and applications*

> **Note**: For high-resolution version suitable for presentations, see [dataflow-diagram-hd.png](./images/dataflow-diagram-hd.png). Vector version available as [dataflow-diagram.svg](./images/dataflow-diagram.svg).

## Architecture Layers

### Event Sources Layer
The system ingests real-time events from ghost kitchen operations:

- **Order Creation**: Customer app generates order events
- **Kitchen Events**: Cooking status updates (started, finished, ready)
- **Driver Events**: Pickup, delivery, and GPS tracking events
- **GPS Tracking**: Real-time location updates during delivery

### Raw Data Ingestion Layer
Events are captured and stored in raw format:

- **Volume Storage**: `/Volumes/{CATALOG}/{SCHEMA}/{VOLUME}`
- **Event Types**: 7 distinct event types covering the full order lifecycle
- **Format**: JSON files streamed via CloudFiles
- **Frequency**: Real-time streaming with configurable batch processing

### Bronze Layer - Raw Event Store
Raw events are ingested into the lakehouse:

```sql
-- Table: all_events
-- Purpose: Raw JSON events as ingested (one file per event)
-- Source: CloudFiles streaming from volumes
-- Schema: Raw JSON with event metadata
```

**Key Fields**:
- `event_type`: Type of event (order_created, gk_started, etc.)
- `order_id`: Unique order identifier
- `ts`: Event timestamp
- `body`: JSON payload with event-specific data
- `location`: Ghost kitchen location
- `gk_id`: Ghost kitchen identifier
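As an illustration of the bronze contract, the sketch below parses a raw event shaped after the field list above. The payload values (`ord-0001`, `chicago-west`, `gk-07`) are hypothetical, and the real pipeline ingests such files via CloudFiles rather than `json.loads`:

```python
import json

# Hypothetical raw event file, shaped after the bronze fields listed above
# (event_type, order_id, ts, body, location, gk_id). Values are invented.
raw_event = '''{
  "event_type": "order_created",
  "order_id": "ord-0001",
  "ts": "2024-06-01T12:00:00Z",
  "location": "chicago-west",
  "gk_id": "gk-07",
  "body": {"items": [{"item_id": "i-1", "price": 9.5, "qty": 2}]}
}'''

event = json.loads(raw_event)

# Bronze keeps the payload raw; downstream layers parse `body` further.
print(event["event_type"], event["order_id"], len(event["body"]["items"]))
```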

### Silver Layer - Clean Operational Data
Events are processed and normalized:

```sql
-- Table: silver_order_items
-- Purpose: One row per item per order, with extended_price
-- Partitioned by: order_day
-- Processing: Explodes order items, adds calculated fields
```

**Key Transformations**:
- Explode order items from arrays
- Calculate `extended_price = price * qty`
- Parse customer location data
- Add temporal partitioning
- Enforce data types and schemas
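The explode-and-extend step can be sketched without Spark. This is a plain-Python stand-in for what the pipeline does with PySpark's `explode()`; the field names follow this document, while the order values are invented:

```python
# Illustrative (non-Spark) sketch of the silver transformation: turn the
# items array into one row per item and add extended_price = price * qty.
order = {
    "order_id": "ord-0001",
    "gk_id": "gk-07",
    "location": "chicago-west",
    "order_ts": "2024-06-01T12:00:00Z",
    "items": [
        {"item_id": "i-1", "item_name": "Burger", "price": 9.50, "qty": 2},
        {"item_id": "i-2", "item_name": "Fries", "price": 3.25, "qty": 1},
    ],
}

silver_rows = [
    {
        "order_id": order["order_id"],
        "gk_id": order["gk_id"],
        "location": order["location"],
        "order_ts": order["order_ts"],
        "order_day": order["order_ts"][:10],  # temporal partition key
        **item,
        "extended_price": round(item["price"] * item["qty"], 2),
    }
    for item in order["items"]
]

for row in silver_rows:
    print(row["item_name"], row["extended_price"])
```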

**Key Fields**:
- `order_id`, `gk_id`, `location`
- `order_ts`: Canonical event timestamp
- `item_id`, `menu_id`, `category_id`, `brand_id`
- `item_name`, `price`, `qty`, `extended_price`
- `order_day`: Partition key

### Gold Layer - Business Intelligence
Aggregated tables for analytics and reporting:

#### gold_order_header
```sql
-- Purpose: Per-order revenue & counts
-- Aggregation: Group by order
-- Metrics: Total revenue, item counts, brand diversity
```

#### gold_item_sales_day
```sql
-- Purpose: Item-level units & revenue by day
-- Partitioned by: day
-- Metrics: Units sold, gross revenue per item
```

#### gold_brand_sales_day
```sql
-- Purpose: Brand-level orders (approx), items, revenue by day
-- Partitioned by: day
-- Processing: Stream-safe with HyperLogLog for order counting
-- Watermark: 3 hours for late-arriving data
```

#### gold_location_sales_hourly
```sql
-- Purpose: Hourly orders (approx) & revenue per location
-- Partitioned by: hour_ts
-- Frequency: Real-time with 3-hour watermark
-- Metrics: Approximate order counts, revenue by location/hour
```
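A toy version of the `gold_brand_sales_day` aggregation, assuming the silver schema above. The real pipeline uses a stream-safe approximate distinct count (HyperLogLog); a plain `set` stands in for it here, and the rows are invented:

```python
from collections import defaultdict

# Aggregate silver rows by (day, brand): distinct orders, item units, revenue.
silver_rows = [
    {"day": "2024-06-01", "brand_id": "b-1", "order_id": "o-1", "qty": 2, "extended_price": 19.0},
    {"day": "2024-06-01", "brand_id": "b-1", "order_id": "o-2", "qty": 1, "extended_price": 3.25},
    {"day": "2024-06-01", "brand_id": "b-2", "order_id": "o-2", "qty": 1, "extended_price": 12.0},
]

agg = defaultdict(lambda: {"orders": set(), "items": 0, "revenue": 0.0})
for r in silver_rows:
    key = (r["day"], r["brand_id"])
    agg[key]["orders"].add(r["order_id"])   # HLL sketch in the real stream
    agg[key]["items"] += r["qty"]
    agg[key]["revenue"] += r["extended_price"]

gold = {k: {"orders": len(v["orders"]), "items": v["items"], "revenue": v["revenue"]}
        for k, v in agg.items()}
print(gold[("2024-06-01", "b-1")])
```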

### Dimensional Data
Static reference data loaded from parquet files:

- **brands.parquet** → `{CATALOG}.{SIMULATOR_SCHEMA}.brands`
- **categories.parquet** → `{CATALOG}.{SIMULATOR_SCHEMA}.categories`
- **items.parquet** → `{CATALOG}.{SIMULATOR_SCHEMA}.items`
- **menus.parquet** → `{CATALOG}.{SIMULATOR_SCHEMA}.menus`

### Real-time Streaming Intelligence

#### Refund Recommender Stream
```sql
-- Source: {CATALOG}.lakeflow.all_events
-- Filter: event_type = 'delivered'
-- Processing: ML-based refund scoring using LLM
-- Output: {CATALOG}.recommender.refund_recommendations
```

**Processing Logic**:
- Filters delivered orders
- Applies sampling (10% historical, 100% new data)
- Calls LLM agent for refund classification
- Outputs structured recommendations
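The sampling-and-classify logic above might look like the following sketch. Both function names are hypothetical, and `classify_refund` is a stub standing in for the LLM agent call that the real pipeline runs inside a stream:

```python
import random

# Hypothetical sketch of the recommender's sampling rule: score 10% of
# historical deliveries and 100% of new ones.
def should_score(event: dict, historical_sample_rate: float = 0.10) -> bool:
    if event.get("event_type") != "delivered":
        return False
    if event.get("is_historical"):
        return random.random() < historical_sample_rate
    return True

def classify_refund(event: dict) -> str:
    # Stub for the LLM agent; returns one of none/partial/full.
    return "partial" if event.get("delivery_minutes", 0) > 60 else "none"

event = {"event_type": "delivered", "is_historical": False, "delivery_minutes": 75}
if should_score(event):
    print(classify_refund(event))  # prints "partial"
```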

### Lakebase Integration
PostgreSQL instance for operational applications:

- **Instance**: `{CATALOG}refundmanager`
- **Database**: `caspers`
- **Synced Table**: `pg_recommendations`
- **Sync Policy**: Continuous from `refund_recommendations`
- **Primary Key**: `order_id`

### Applications Layer

#### Refund Manager App
FastAPI application for human review:

- **Database**: PostgreSQL via Lakebase
- **Tables**:
- `refunds.refund_decisions` (decisions made by humans)
- `recommender.pg_recommendations` (AI recommendations)
- **Features**:
- View AI recommendations
- Apply refund decisions
- Track decision history
- Order event timeline

#### AI/BI Dashboards
Real-time analytics and monitoring:

- **Data Sources**: Gold layer tables
- **Metrics**: Revenue, order volumes, delivery performance
- **Refresh**: Real-time streaming updates

#### Agent Bricks
AI-powered refund decision agent:

- **Model**: LLM-based classification
- **Input**: Order delivery performance data
- **Output**: Refund recommendations (none/partial/full)
- **Integration**: Embedded in streaming pipeline

## Data Flow Summary

```
Event Sources → Raw Volume → Bronze (all_events) → Silver (silver_order_items) → Gold Tables
Dimensional Data (Parquet) ──────────────────────────────────────────────────→ Applications
Bronze (all_events) → Streaming Intelligence (ML) → Lakebase (PostgreSQL) ───→ Applications
```

## Key Technical Details

### Streaming Configuration
- **Watermarks**: 3 hours for late-arriving data
- **Checkpointing**: Managed by Delta Live Tables
- **Partitioning**: By date/hour for optimal query performance
- **Approximate Aggregations**: HyperLogLog for stream-safe distinct counts
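A simplified illustration of the 3-hour watermark: events whose timestamp trails the maximum observed event time by more than the watermark are treated as too late to update aggregates. Spark tracks this per stream and per checkpoint; this toy version applies the same rule to an in-memory batch with invented timestamps:

```python
from datetime import datetime, timedelta

WATERMARK = timedelta(hours=3)

def within_watermark(events):
    # Cutoff trails the max observed event time by the watermark.
    max_ts = max(e["ts"] for e in events)
    cutoff = max_ts - WATERMARK
    return [e for e in events if e["ts"] >= cutoff]

events = [
    {"order_id": "o-1", "ts": datetime(2024, 6, 1, 12, 0)},
    {"order_id": "o-2", "ts": datetime(2024, 6, 1, 8, 30)},   # >3h late, dropped
    {"order_id": "o-3", "ts": datetime(2024, 6, 1, 10, 15)},
]

kept = within_watermark(events)
print([e["order_id"] for e in kept])  # → ['o-1', 'o-3']
```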

### Data Quality
- **Schema Enforcement**: Structured schemas for all silver/gold tables
- **Data Validation**: Check constraints on critical fields
- **Error Handling**: Robust JSON parsing with fallback values

### Scalability
- **Partitioning Strategy**: Time-based partitioning for all fact tables
- **Streaming**: Auto-scaling with Delta Live Tables
- **Storage**: Delta format with optimized file sizes

## Developer Onboarding

### Key Files to Understand
1. `pipelines/order_items/transformations/transformation.py` - Core data transformations
2. `stages/raw_data.ipynb` - Data generation and ingestion setup
3. `stages/lakeflow.ipynb` - Pipeline orchestration
4. `apps/refund-manager/app/main.py` - Application layer

### Getting Started
1. Review the event types and their schemas in the README
2. Understand the medallion architecture layers
3. Examine the transformation logic in `transformation.py`
4. Explore the streaming components and applications
5. Run the demo using the "Casper's Initializer" job

### Common Queries
```sql
-- View recent orders
SELECT * FROM {CATALOG}.lakeflow.silver_order_items
WHERE order_day >= CURRENT_DATE - 1;

-- Check gold layer metrics
SELECT * FROM {CATALOG}.lakeflow.gold_brand_sales_day
WHERE day = CURRENT_DATE;

-- Monitor streaming health
DESCRIBE HISTORY {CATALOG}.lakeflow.all_events;
```