21 changes: 21 additions & 0 deletions README.md
@@ -1,5 +1,17 @@
# 🍔 Casper's Kitchens

[![Python](https://img.shields.io/badge/Python-3.10+-3776AB?style=flat&logo=python&logoColor=white)](https://www.python.org/)
[![PySpark](https://img.shields.io/badge/PySpark-3.5+-E25A1C?style=flat&logo=apache-spark&logoColor=white)](https://spark.apache.org/)
[![Databricks](https://img.shields.io/badge/Databricks-Platform-FF3621?style=flat&logo=databricks&logoColor=white)](https://www.databricks.com/)
[![Delta Lake](https://img.shields.io/badge/Delta_Lake-Latest-00ADD8?style=flat&logo=delta&logoColor=white)](https://delta.io/)
[![MLflow](https://img.shields.io/badge/MLflow-Agent_Tracking-0194E2?style=flat&logo=mlflow&logoColor=white)](https://mlflow.org/)
[![LangChain](https://img.shields.io/badge/LangChain-AI_Agent-121212?style=flat&logo=chainlink&logoColor=white)](https://www.langchain.com/)
[![FastAPI](https://img.shields.io/badge/FastAPI-Web_App-009688?style=flat&logo=fastapi&logoColor=white)](https://fastapi.tiangolo.com/)
[![PostgreSQL](https://img.shields.io/badge/PostgreSQL-Lakebase-4169E1?style=flat&logo=postgresql&logoColor=white)](https://www.postgresql.org/)
[![License](https://img.shields.io/badge/License-Databricks-00A4EF?style=flat)](https://databricks.com/db-license-source)

---

Spin up a fully working ghost-kitchen business on Databricks in minutes.

Casper's Kitchens is a simulated food-delivery platform that shows off the full power of Databricks: streaming ingestion, Lakeflow Declarative Pipelines, AI/BI Dashboards and Genie, Agent Bricks, and real-time apps backed by Lakebase Postgres — all stitched together into one narrative.
@@ -25,6 +37,8 @@ Then open Databricks and watch:

That's it! Your Casper's Kitchens environment will be up and running.

> 📖 **[View Complete Documentation](./docs/README.md)** - For detailed architecture diagrams, technical reference, and developer guides, see the [docs folder](./docs/).

## 🏗️ What is Casper's Kitchens?

Casper's Kitchens is a fully functional ghost kitchen business running entirely on the Databricks platform. As a ghost kitchen, Casper's operates multiple compact commercial kitchens in shared locations, hosting restaurant vendors as tenants who create digital brands to serve diverse cuisines from single kitchen spaces.
@@ -128,3 +142,10 @@ Run `destroy.ipynb` to remove all Casper's Kitchens resources from your workspace

| library | description | license | source |
|----------------------------------------|-------------------------|------------|-----------------------------------------------------|
| LangChain | AI agent framework | MIT | https://github.com/langchain-ai/langchain |
| FastAPI | Web framework | MIT | https://github.com/tiangolo/fastapi |
| MLflow | Model tracking | Apache 2.0 | https://github.com/mlflow/mlflow |
| SQLAlchemy | Database ORM | MIT | https://github.com/sqlalchemy/sqlalchemy |
| psycopg | PostgreSQL adapter | LGPL-3.0 | https://github.com/psycopg/psycopg |
| Databricks SDK | Databricks API client | Apache 2.0 | https://github.com/databricks/databricks-sdk-py |
| Uvicorn | ASGI server | BSD-3 | https://github.com/encode/uvicorn |
120 changes: 120 additions & 0 deletions docs/README.md
@@ -0,0 +1,120 @@
# Casper's Kitchens - Documentation

This documentation provides comprehensive guidance for understanding and working with the Casper's Kitchens ghost kitchen data platform.

## 📋 Documentation Overview

### 🎯 [Dataflow Architecture Diagram](./dataflow-diagram.md)
Complete visual overview of the data architecture showing:
- Event sources and data ingestion
- Medallion architecture layers (Bronze → Silver → Gold)
- Applications and consumption patterns
- Data lineage and dependencies

### 🔧 [Technical Reference](./technical-reference.md)
Detailed technical specifications including:
- Complete table schemas and data types
- Transformation logic and SQL implementations
- Configuration parameters and settings

### 👨‍💻 [Developer Onboarding Guide](./developer-onboarding.md)
Step-by-step guide for new developers covering:
- Architecture overview and key concepts
- Essential files and code walkthrough
- Common development tasks and patterns
- SQL queries for monitoring and validation

### 🎨 Visual Dataflow Diagrams
Complete dataflow visualization available in multiple formats:
- **[PNG Image](./images/dataflow-diagram.png)** - Standard resolution with dark theme
- **[High-Res PNG](./images/dataflow-diagram-hd.png)** - High resolution for presentations
- **[SVG Vector](./images/dataflow-diagram.svg)** - Scalable vector format
- **[Mermaid Source](./dataflow-diagram.mermaid)** - Source code for modifications

## 🚀 Quick Navigation

### For New Developers
1. Start with the [Developer Onboarding Guide](./developer-onboarding.md)
2. Review the [Dataflow Architecture](./dataflow-diagram.md)
3. Reference the [Technical Specifications](./technical-reference.md) as needed

### For Data Engineers
1. Examine the [Technical Reference](./technical-reference.md) for implementation details
2. Use the [Dataflow Diagram](./dataflow-diagram.md) to understand data lineage
3. Follow the [Developer Guide](./developer-onboarding.md) for common tasks

### For Architects
1. Review the [Dataflow Architecture](./dataflow-diagram.md) for system design
2. Check the [Technical Reference](./technical-reference.md) for scalability details
3. Use the [Mermaid Diagram](./dataflow-diagram.mermaid) for presentations

## 🏗️ Architecture Summary

Casper's Kitchens implements a modern data platform with:

- **Real-time Event Processing**: CloudFiles streaming from ghost kitchen operations
- **Medallion Architecture**: Bronze → Silver → Gold data layers with Delta Live Tables
- **Streaming Intelligence**: ML-powered refund recommendations using LLMs
- **Operational Applications**: FastAPI web apps backed by Lakebase PostgreSQL
- **Business Intelligence**: Real-time dashboards and analytics

## 📊 Key Components

| Component | Purpose | Technology |
|-----------|---------|------------|
| Event Sources | Ghost kitchen operations | JSON events, GPS tracking |
| Bronze Layer | Raw event storage | Delta Live Tables, CloudFiles |
| Silver Layer | Clean operational data | Spark streaming, schema enforcement |
| Gold Layer | Business intelligence | Aggregations, time-series data |
| Streaming ML | Real-time recommendations | LLM integration, Spark streaming |
| Lakebase | Operational database | PostgreSQL, continuous sync |
| Applications | Human interfaces | FastAPI, React, REST APIs |

## 🔄 Data Flow Pattern

```
Ghost Kitchens → Events → Volume → Bronze → Silver → Gold → Apps
Dimensional Data (Parquet) ─────────────────────────────────→ Apps
Bronze → Streaming Intelligence (ML) → Lakebase (PostgreSQL) → Apps
```

## 📈 Business Metrics

The platform tracks key business metrics including:

- **Order Performance**: Revenue, item counts, delivery times
- **Brand Analytics**: Sales by brand, menu performance
- **Location Intelligence**: Hourly performance by ghost kitchen
- **Operational Efficiency**: Refund rates, customer satisfaction
- **Real-time Monitoring**: Live order tracking, driver performance

## 🛠️ Development Workflow

1. **Understand**: Review architecture and data flow
2. **Explore**: Examine key code files and notebooks
3. **Develop**: Make changes to transformations or applications
4. **Test**: Validate using SQL queries and application UI
5. **Deploy**: Use pipeline orchestration for production changes
6. **Monitor**: Track performance and data quality

## 📚 Additional Resources

- **Main README**: `../README.md` - Project overview and quick start
- **Code Examples**: All notebooks include detailed comments
- **Configuration**: `../data/generator/configs/` - Simulation parameters
- **Applications**: `../apps/` - Web application source code
- **Pipelines**: `../pipelines/` - Data transformation logic

## 🤝 Contributing

When contributing to the documentation:

1. Keep diagrams and technical details in sync with code changes
2. Update the developer onboarding guide for new features
3. Maintain consistency in terminology and formatting
4. Test all code examples and SQL queries
5. Update the visual diagram when architecture changes
226 changes: 226 additions & 0 deletions docs/dataflow-diagram.md
@@ -0,0 +1,226 @@
# Casper's Kitchens - Dataflow Architecture Diagram

## Overview

This document provides a comprehensive view of the Casper's Kitchens data architecture, showing how data flows from ghost kitchen operations through the medallion architecture to applications and dashboards.

## Visual Dataflow Diagram

![Casper's Kitchens Dataflow Architecture](./images/dataflow-diagram.png)

*Complete dataflow architecture showing event sources, medallion layers, streaming intelligence, and applications*

> **Note**: For high-resolution version suitable for presentations, see [dataflow-diagram-hd.png](./images/dataflow-diagram-hd.png). Vector version available as [dataflow-diagram.svg](./images/dataflow-diagram.svg).

## Architecture Layers

### Event Sources Layer
The system ingests real-time events from ghost kitchen operations:

- **Order Creation**: Customer app generates order events
- **Kitchen Events**: Cooking status updates (started, finished, ready)
- **Driver Events**: Pickup, delivery, and GPS tracking events
- **GPS Tracking**: Real-time location updates during delivery

### Raw Data Ingestion Layer
Events are captured and stored in raw format:

- **Volume Storage**: `/Volumes/{CATALOG}/{SCHEMA}/{VOLUME}`
- **Event Types**: 7 distinct event types covering the full order lifecycle
- **Format**: JSON files streamed via CloudFiles
- **Frequency**: Real-time streaming with configurable batch processing

### Bronze Layer - Raw Event Store
Raw events are ingested into the lakehouse:

```sql
-- Table: all_events
-- Purpose: Raw JSON events as ingested (one file per event)
-- Source: CloudFiles streaming from volumes
-- Schema: Raw JSON with event metadata
```

**Key Fields**:
- `event_type`: Type of event (order_created, gk_started, etc.)
- `order_id`: Unique order identifier
- `ts`: Event timestamp
- `body`: JSON payload with event-specific data
- `location`: Ghost kitchen location
- `gk_id`: Ghost kitchen identifier
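As an illustration of the bronze contract, the sketch below parses a raw event shaped after the field list above. The payload values (`ord-0001`, `chicago-west`, `gk-07`) are hypothetical, and the real pipeline ingests such files via CloudFiles rather than `json.loads`:

```python
import json

# Hypothetical raw event file, shaped after the bronze fields listed above
# (event_type, order_id, ts, body, location, gk_id). Values are invented.
raw_event = '''{
  "event_type": "order_created",
  "order_id": "ord-0001",
  "ts": "2024-06-01T12:00:00Z",
  "location": "chicago-west",
  "gk_id": "gk-07",
  "body": {"items": [{"item_id": "i-1", "price": 9.5, "qty": 2}]}
}'''

event = json.loads(raw_event)

# Bronze keeps the payload raw; downstream layers parse `body` further.
print(event["event_type"], event["order_id"], len(event["body"]["items"]))
```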

### Silver Layer - Clean Operational Data
Events are processed and normalized:

```sql
-- Table: silver_order_items
-- Purpose: One row per item per order, with extended_price
-- Partitioned by: order_day
-- Processing: Explodes order items, adds calculated fields
```

**Key Transformations**:
- Explode order items from arrays
- Calculate `extended_price = price * qty`
- Parse customer location data
- Add temporal partitioning
- Enforce data types and schemas
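The explode-and-extend step can be sketched without Spark. This is a plain-Python stand-in for what the pipeline does with PySpark's `explode()`; the field names follow this document, while the order values are invented:

```python
# Illustrative (non-Spark) sketch of the silver transformation: turn the
# items array into one row per item and add extended_price = price * qty.
order = {
    "order_id": "ord-0001",
    "gk_id": "gk-07",
    "location": "chicago-west",
    "order_ts": "2024-06-01T12:00:00Z",
    "items": [
        {"item_id": "i-1", "item_name": "Burger", "price": 9.50, "qty": 2},
        {"item_id": "i-2", "item_name": "Fries", "price": 3.25, "qty": 1},
    ],
}

silver_rows = [
    {
        "order_id": order["order_id"],
        "gk_id": order["gk_id"],
        "location": order["location"],
        "order_ts": order["order_ts"],
        "order_day": order["order_ts"][:10],  # temporal partition key
        **item,
        "extended_price": round(item["price"] * item["qty"], 2),
    }
    for item in order["items"]
]

for row in silver_rows:
    print(row["item_name"], row["extended_price"])
```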

**Key Fields**:
- `order_id`, `gk_id`, `location`
- `order_ts`: Canonical event timestamp
- `item_id`, `menu_id`, `category_id`, `brand_id`
- `item_name`, `price`, `qty`, `extended_price`
- `order_day`: Partition key

### Gold Layer - Business Intelligence
Aggregated tables for analytics and reporting:

#### gold_order_header
```sql
-- Purpose: Per-order revenue & counts
-- Aggregation: Group by order
-- Metrics: Total revenue, item counts, brand diversity
```

#### gold_item_sales_day
```sql
-- Purpose: Item-level units & revenue by day
-- Partitioned by: day
-- Metrics: Units sold, gross revenue per item
```

#### gold_brand_sales_day
```sql
-- Purpose: Brand-level orders (approx), items, revenue by day
-- Partitioned by: day
-- Processing: Stream-safe with HyperLogLog for order counting
-- Watermark: 3 hours for late-arriving data
```

#### gold_location_sales_hourly
```sql
-- Purpose: Hourly orders (approx) & revenue per location
-- Partitioned by: hour_ts
-- Frequency: Real-time with 3-hour watermark
-- Metrics: Approximate order counts, revenue by location/hour
```
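A toy version of the `gold_brand_sales_day` aggregation, assuming the silver schema above. The real pipeline uses a stream-safe approximate distinct count (HyperLogLog); a plain `set` stands in for it here, and the rows are invented:

```python
from collections import defaultdict

# Aggregate silver rows by (day, brand): distinct orders, item units, revenue.
silver_rows = [
    {"day": "2024-06-01", "brand_id": "b-1", "order_id": "o-1", "qty": 2, "extended_price": 19.0},
    {"day": "2024-06-01", "brand_id": "b-1", "order_id": "o-2", "qty": 1, "extended_price": 3.25},
    {"day": "2024-06-01", "brand_id": "b-2", "order_id": "o-2", "qty": 1, "extended_price": 12.0},
]

agg = defaultdict(lambda: {"orders": set(), "items": 0, "revenue": 0.0})
for r in silver_rows:
    key = (r["day"], r["brand_id"])
    agg[key]["orders"].add(r["order_id"])   # HLL sketch in the real stream
    agg[key]["items"] += r["qty"]
    agg[key]["revenue"] += r["extended_price"]

gold = {k: {"orders": len(v["orders"]), "items": v["items"], "revenue": v["revenue"]}
        for k, v in agg.items()}
print(gold[("2024-06-01", "b-1")])
```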

### Dimensional Data
Static reference data loaded from parquet files:

- **brands.parquet** → `{CATALOG}.{SIMULATOR_SCHEMA}.brands`
- **categories.parquet** → `{CATALOG}.{SIMULATOR_SCHEMA}.categories`
- **items.parquet** → `{CATALOG}.{SIMULATOR_SCHEMA}.items`
- **menus.parquet** → `{CATALOG}.{SIMULATOR_SCHEMA}.menus`

### Real-time Streaming Intelligence

#### Refund Recommender Stream
```sql
-- Source: {CATALOG}.lakeflow.all_events
-- Filter: event_type = 'delivered'
-- Processing: ML-based refund scoring using LLM
-- Output: {CATALOG}.recommender.refund_recommendations
```

**Processing Logic**:
- Filters delivered orders
- Applies sampling (10% historical, 100% new data)
- Calls LLM agent for refund classification
- Outputs structured recommendations
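The sampling-and-classify logic above might look like the following sketch. Both function names are hypothetical, and `classify_refund` is a stub standing in for the LLM agent call that the real pipeline runs inside a stream:

```python
import random

# Hypothetical sketch of the recommender's sampling rule: score 10% of
# historical deliveries and 100% of new ones.
def should_score(event: dict, historical_sample_rate: float = 0.10) -> bool:
    if event.get("event_type") != "delivered":
        return False
    if event.get("is_historical"):
        return random.random() < historical_sample_rate
    return True

def classify_refund(event: dict) -> str:
    # Stub for the LLM agent; returns one of none/partial/full.
    return "partial" if event.get("delivery_minutes", 0) > 60 else "none"

event = {"event_type": "delivered", "is_historical": False, "delivery_minutes": 75}
if should_score(event):
    print(classify_refund(event))  # prints "partial"
```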

### Lakebase Integration
PostgreSQL instance for operational applications:

- **Instance**: `{CATALOG}refundmanager`
- **Database**: `caspers`
- **Synced Table**: `pg_recommendations`
- **Sync Policy**: Continuous from `refund_recommendations`
- **Primary Key**: `order_id`

### Applications Layer

#### Refund Manager App
FastAPI application for human review:

- **Database**: PostgreSQL via Lakebase
- **Tables**:
- `refunds.refund_decisions` (decisions made by humans)
- `recommender.pg_recommendations` (AI recommendations)
- **Features**:
- View AI recommendations
- Apply refund decisions
- Track decision history
- Order event timeline

#### AI/BI Dashboards
Real-time analytics and monitoring:

- **Data Sources**: Gold layer tables
- **Metrics**: Revenue, order volumes, delivery performance
- **Refresh**: Real-time streaming updates

#### Agent Bricks
AI-powered refund decision agent:

- **Model**: LLM-based classification
- **Input**: Order delivery performance data
- **Output**: Refund recommendations (none/partial/full)
- **Integration**: Embedded in streaming pipeline

## Data Flow Summary

```
Event Sources → Raw Volume → Bronze (all_events) → Silver (silver_order_items) → Gold Tables
Dimensional Data (Parquet) ──────────────────────────────────────────────────→ Applications
Bronze (all_events) → Streaming Intelligence (ML) → Lakebase (PostgreSQL) ───→ Applications
```

## Key Technical Details

### Streaming Configuration
- **Watermarks**: 3 hours for late-arriving data
- **Checkpointing**: Managed by Delta Live Tables
- **Partitioning**: By date/hour for optimal query performance
- **Approximate Aggregations**: HyperLogLog for stream-safe distinct counts
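A simplified illustration of the 3-hour watermark: events whose timestamp trails the maximum observed event time by more than the watermark are treated as too late to update aggregates. Spark tracks this per stream and per checkpoint; this toy version applies the same rule to an in-memory batch with invented timestamps:

```python
from datetime import datetime, timedelta

WATERMARK = timedelta(hours=3)

def within_watermark(events):
    # Cutoff trails the max observed event time by the watermark.
    max_ts = max(e["ts"] for e in events)
    cutoff = max_ts - WATERMARK
    return [e for e in events if e["ts"] >= cutoff]

events = [
    {"order_id": "o-1", "ts": datetime(2024, 6, 1, 12, 0)},
    {"order_id": "o-2", "ts": datetime(2024, 6, 1, 8, 30)},   # >3h late, dropped
    {"order_id": "o-3", "ts": datetime(2024, 6, 1, 10, 15)},
]

kept = within_watermark(events)
print([e["order_id"] for e in kept])  # → ['o-1', 'o-3']
```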

### Data Quality
- **Schema Enforcement**: Structured schemas for all silver/gold tables
- **Data Validation**: Check constraints on critical fields
- **Error Handling**: Robust JSON parsing with fallback values

### Scalability
- **Partitioning Strategy**: Time-based partitioning for all fact tables
- **Streaming**: Auto-scaling with Delta Live Tables
- **Storage**: Delta format with optimized file sizes

## Developer Onboarding

### Key Files to Understand
1. `pipelines/order_items/transformations/transformation.py` - Core data transformations
2. `stages/raw_data.ipynb` - Data generation and ingestion setup
3. `stages/lakeflow.ipynb` - Pipeline orchestration
4. `apps/refund-manager/app/main.py` - Application layer

### Getting Started
1. Review the event types and their schemas in the README
2. Understand the medallion architecture layers
3. Examine the transformation logic in `transformation.py`
4. Explore the streaming components and applications
5. Run the demo using the "Casper's Initializer" job

### Common Queries
```sql
-- View recent orders
SELECT * FROM {CATALOG}.lakeflow.silver_order_items
WHERE order_day >= CURRENT_DATE - 1;

-- Check gold layer metrics
SELECT * FROM {CATALOG}.lakeflow.gold_brand_sales_day
WHERE day = CURRENT_DATE;

-- Monitor streaming health
DESCRIBE HISTORY {CATALOG}.lakeflow.all_events;
```