⚠️ Assumption Requiring Validation with @FarhaanMohideen:
This epic assumes a single-plate, staff-triggered capture model: the staff member selects an order from the KDS/tablet interface to associate the plate, and that selection action triggers the image capture. This eliminates the need for CV-based frame selection or automated plate detection.

Future consideration (not in scope for this revision): multi-plate scenarios where multiple plates for a single table order are placed on a larger plating surface simultaneously. That scenario might require CV to detect and crop individual plates before sending each to the VLM for classification, and would be a separate pipeline variant.
Overview
We build these workloads to demonstrate real-world scenarios and enable performance measurements of those workloads on different hardware platforms. They are released as open source so that partners can review the code and metrics instrumentation, run the measurements themselves to replicate results, and test their own components (models, pipelines, business logic) on the solution. This epic focuses on extending the Order Accuracy pipeline to support dine-in restaurant scenarios where food is plated and validated before being served to customers, using a simplified image-based input model triggered by staff order selection.
Use Case Description
As a restaurant operations manager, I want an AI-powered order validation system that verifies plated food items against customer orders before servers deliver them to tables, so that I can reduce order errors, minimize food waste from incorrect plates, and improve customer satisfaction without requiring staff to manually cross-reference every dish against tickets.
The system should handle the dine-in validation workflow where staff selects an order from the KDS/tablet interface, triggering image capture of the plated food for validation against the order ticket. The system leverages Vision Language Models (e.g., Qwen2.5-VL-7B or Qwen2.5-VL-3B) for whole-plate item classification, enabling rapid deployment across diverse full-service restaurant environments without custom model training.
Technical Assumptions
Hardware Configuration (Plating Station)
- Validation Platform Specifications:
- Intel-based processing unit (integrated or external)
- Single overhead camera positioned for comprehensive plate coverage
- Designated plating/expo area for image capture
- Optional: Tablet or display for order ticket visibility
- Integration capabilities with existing POS/kitchen display systems
- Local storage for model artifacts and image caching
- Network Connectivity:
- Primary operation without network dependencies
- Optional POS/KDS integration for order manifest retrieval
- Optional cloud connectivity for model updates
Real-World Latency Considerations
In dine-in restaurant environments, expo staff validate plates in rapid succession during service rushes. The physical workflow for each plate involves:
- Tap order on KDS/tablet (triggers capture)
- Pick up plate (~0.5 sec)
- Carry plate to staging area (~1-1.5 sec)
- Set plate down (~0.5 sec)
- Turn back to plating area (~0.5 sec)
- Glance at screen for result
This natural physical motion takes approximately 2-3 seconds per plate. If AI inference completes within this window, the result appears by the time staff glances at the screen - creating zero perceived wait time. If inference exceeds this window, staff stand idle waiting for results, creating operational friction.
Operational Assumption: End-to-end reconciliation (image capture to match/mismatch result) should complete within approximately 2 seconds on modern Intel processors (MTL, LNL, ARL) to align with natural plate handling cycle time. This target is grounded in operational workflow, not arbitrary performance goals.
Performance benchmarking should measure and report latency against this operational assumption. Results significantly exceeding this threshold should be flagged as potentially unsuitable for production deployment, even if other metrics (throughput, resource utilization) appear favorable.
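As a rough illustration, a benchmark harness could compare measured stage timings against this ~2 second budget as sketched below; the stage breakdown, the budget constant, and the example numbers are hypothetical and not part of any existing tool.

```python
# Sketch: flag end-to-end validation latency against the ~2 s operational budget.
# Stage names and the budget value are illustrative assumptions.
from dataclasses import dataclass

OPERATIONAL_BUDGET_S = 2.0  # ~2 s plate-handling window described above

@dataclass
class ValidationTiming:
    capture_s: float
    vlm_inference_s: float
    reconciliation_s: float
    display_s: float

    @property
    def end_to_end_s(self) -> float:
        return self.capture_s + self.vlm_inference_s + self.reconciliation_s + self.display_s

def check_against_budget(timing: ValidationTiming) -> dict:
    """Report whether a single validation fits the plate-handling window."""
    e2e = timing.end_to_end_s
    return {
        "end_to_end_s": round(e2e, 3),
        "within_budget": e2e <= OPERATIONAL_BUDGET_S,
        "headroom_s": round(OPERATIONAL_BUDGET_S - e2e, 3),
    }

# Example: 0.05 s capture + 1.4 s VLM + 0.03 s reconciliation + 0.02 s display
print(check_against_budget(ValidationTiming(0.05, 1.4, 0.03, 0.02)))
# -> {'end_to_end_s': 1.5, 'within_budget': True, 'headroom_s': 0.5}
```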
Software Components
- Image Capture Module: Single-frame capture triggered by staff order selection, with basic preprocessing for inference
- Vision Language Model Pipeline: Pre-trained VLM (e.g., Qwen2.5-VL-7B, Qwen2.5-VL-3B) for whole-plate food item classification without custom training
- AI Agent Framework: Reconciliation agent comparing detected items against order ticket data using embeddings-based semantic matching, not string comparison (see the sketch after this list)
- POS/KDS Integration Module: Communication interface for order manifest and table/ticket data
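The reconciliation agent's embeddings-based semantic matching could look roughly like the sketch below, which also derives the missing/extra items and accuracy score shown in the reconciliation output later in this description. The sentence-transformers model, the 0.75 similarity threshold, and the greedy one-to-one matching are illustrative assumptions, not a prescribed implementation.

```python
# Sketch: embeddings-based semantic matching between VLM-detected items and the
# order manifest. Embedding model, threshold, and greedy matching are assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed embedding backend

_model = SentenceTransformer("all-MiniLM-L6-v2")
MATCH_THRESHOLD = 0.75  # assumption; tune per menu

def _embed(texts: list[str]) -> np.ndarray:
    return np.asarray(_model.encode(texts, normalize_embeddings=True))

def reconcile(ordered_items: list[str], detected_items: list[str]) -> dict:
    """Greedy one-to-one matching on cosine similarity of item descriptions."""
    if not ordered_items:
        return {"missing_items": [], "extra_items": detected_items,
                "order_complete": not detected_items, "accuracy_score": 1.0}
    ordered_vecs = _embed(ordered_items)
    detected_vecs = (_embed(detected_items) if detected_items
                     else np.zeros((0, ordered_vecs.shape[1])))
    sims = ordered_vecs @ detected_vecs.T  # cosine similarity (vectors are normalized)
    matched_detected: set[int] = set()
    missing = []
    for i, item in enumerate(ordered_items):
        best_j, best_sim = -1, MATCH_THRESHOLD
        for j in range(len(detected_items)):
            if j not in matched_detected and sims[i, j] >= best_sim:
                best_j, best_sim = j, sims[i, j]
        if best_j == -1:
            missing.append(item)
        else:
            matched_detected.add(best_j)
    extra = [d for j, d in enumerate(detected_items) if j not in matched_detected]
    matched = len(ordered_items) - len(missing)
    return {
        "missing_items": missing,
        "extra_items": extra,
        "order_complete": not missing and not extra,
        "accuracy_score": matched / len(ordered_items),
    }

# Example: naming variations still match ("Mashed Potatoes" vs "Creamy mashed potatoes")
print(reconcile(
    ["Grilled Salmon with Lemon Butter", "Roasted Asparagus", "Mashed Potatoes"],
    ["Grilled salmon fillet, lemon butter", "Grilled asparagus spears", "Creamy mashed potatoes"],
))
```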
Key Architectural Difference from Packing Station Pipeline
The dine-in pipeline uses a staff-triggered capture model rather than continuous video processing:
- Order selection from KDS/tablet serves as both order association AND capture trigger
- No CV-based frame selection, plate detection, or OCR required
- No continuous frame scoring or hand detection filtering
- Simplified pipeline: Order Select → Capture → VLM Classify → Agent Validate → Display Result (sketched after this list)
- Significantly lower computational overhead per validation event
- Higher throughput potential (validate next plate while previous result displays)
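A minimal sketch of that simplified flow is below; every callable is a hypothetical placeholder standing in for the actual KDS integration, camera stack, VLM runtime, and display layer.

```python
# Sketch of the staff-triggered validation flow; every callable here is a
# hypothetical placeholder to show the control flow, not a real API.
from typing import Callable

def handle_order_selected(
    ticket_id: str,
    fetch_manifest: Callable[[str], dict],    # POS/KDS lookup
    capture_frame: Callable[[], bytes],       # single-frame grab from the expo camera
    classify_plate: Callable[[bytes], list],  # VLM whole-plate classification
    reconcile: Callable[[dict, list], dict],  # embeddings-based agent comparison
    display_result: Callable[[dict], None],   # expo screen feedback
) -> dict:
    """One validation event: order selection is both association and capture trigger."""
    manifest = fetch_manifest(ticket_id)    # 1. order association
    image = capture_frame()                 # 2. capture (no frame scoring / detection)
    detected = classify_plate(image)        # 3. single VLM inference per plate
    result = reconcile(manifest, detected)  # 4. match/mismatch determination
    display_result(result)                  # 5. show result while the next plate is staged
    return result
```

In this model, throughput comes from overlapping steps 3-5 for one plate with the staff's physical handling of the next.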
Performance Metrics
- VLM Performance:
- Time to First Token (TTFT) for classification
- Total classification inference latency
- Token throughput (tokens/second)
- AI Agent Performance:
- Reconciliation processing time (time for semantic comparison and match/mismatch determination)
- System-wide Metrics:
- End-to-end validation latency (see Real-World Latency Considerations for operational context)
- Plates validated per minute throughput
- Resource utilization (CPU/GPU/NPU) across pipeline components
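The VLM metrics listed above (TTFT, total inference latency, token throughput) can be captured with simple timers around a streaming generation call; `stream_generate` below is a hypothetical stand-in for whatever serving interface exposes the token stream.

```python
# Sketch: timing TTFT, total inference latency, and token throughput around a
# streaming VLM call. `stream_generate` is a hypothetical stand-in generator.
import time
from typing import Iterable

def measure_vlm_metrics(stream_generate: Iterable[str]) -> dict:
    """Consume a token stream and report TTFT, total latency, and tokens/second."""
    start = time.perf_counter()
    first_token_at = None
    n_tokens = 0
    for _token in stream_generate:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        n_tokens += 1
    total_s = time.perf_counter() - start
    return {
        "time_to_first_token_ms": round((first_token_at - start) * 1000, 1) if first_token_at else None,
        "total_inference_time_ms": round(total_s * 1000, 1),
        "tokens_per_second": round(n_tokens / total_s, 1) if total_s > 0 else None,
    }
```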
AI Pipeline Components
Illustrative examples of the AI pipelines needed for this type of solution:
- Image Capture Pipeline: Single-frame acquisition from plating station camera triggered by staff order selection, with basic preprocessing for optimal inference
- VLM Classification Pipeline: Whole-plate image analysis using pre-trained Vision Language Models (e.g., Qwen2.5-VL-7B) to identify all plated food items in context
- Order Reconciliation Pipeline: AI agent comparing VLM-detected items against POS/KDS order manifest using embeddings-based semantic matching
- Result Display Pipeline: Real-time feedback to expo staff indicating match/mismatch status with specific discrepancy details
- Performance Measurement Instrumentation: Comprehensive metrics collection across all pipeline components
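The Image Capture Pipeline could be as small as the OpenCV sketch below; the camera index, resize target, and JPEG hand-off are assumptions about a particular deployment rather than requirements.

```python
# Sketch: single-frame capture on order selection with basic preprocessing.
# Camera index, resize target, and JPEG encoding are illustrative assumptions.
import cv2

def capture_plate_image(camera_index: int = 0, target_size: tuple = (1280, 720)) -> bytes:
    """Grab one frame from the overhead expo camera and return JPEG bytes for the VLM."""
    cap = cv2.VideoCapture(camera_index)
    try:
        ok, frame = cap.read()
        if not ok:
            raise RuntimeError("Camera frame capture failed")
        frame = cv2.resize(frame, target_size)     # normalize resolution for inference
        ok, encoded = cv2.imencode(".jpg", frame)  # hand off as JPEG to the VLM pipeline
        if not ok:
            raise RuntimeError("JPEG encoding failed")
        return encoded.tobytes()
    finally:
        cap.release()
```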
Data Output Requirements
The system will generate structured data outputs for each plate validation. Illustrative examples:
VLM Classification Output
{
  "classification_id": "unique_identifier",
  "timestamp": "ISO8601_timestamp",
  "station_id": "expo_station_01",
  "table_number": "12",
  "ticket_id": "TKT-2025-001234",
  "image_metadata": {
    "resolution": "1920x1080",
    "capture_trigger": "order_selection"
  },
  "classified_items": [
    {
      "item_description": "Grilled Salmon with Lemon Butter",
      "item_type": "entree",
      "visual_observations": ["appears well-done", "garnish present", "lemon wedge on side"],
      "confidence": 0.92
    },
    {
      "item_description": "Roasted Asparagus",
      "item_type": "side",
      "visual_observations": ["grilled marks visible", "approximately 6 spears"],
      "confidence": 0.89
    },
    {
      "item_description": "Mashed Potatoes",
      "item_type": "side",
      "visual_observations": ["butter pat on top"],
      "confidence": 0.91
    }
  ],
  "processing_metrics": {
    "time_to_first_token_ms": 120,
    "total_inference_time_ms": 380,
    "tokens_per_second": 25.5,
    "items_classified": 3
  }
}

Order Reconciliation Output
{
  "reconciliation_id": "unique_identifier",
  "timestamp": "ISO8601_timestamp",
  "table_number": "12",
  "ticket_id": "TKT-2025-001234",
  "order_manifest": {
    "items_ordered": [
      {"item": "Grilled Salmon with Lemon Butter", "quantity": 1, "modifiers": ["well-done"]},
      {"item": "Roasted Asparagus", "quantity": 1, "modifiers": []},
      {"item": "Mashed Potatoes", "quantity": 1, "modifiers": ["extra butter"]}
    ],
    "total_items": 3
  },
  "detected_items": [
    {"item": "Grilled Salmon with Lemon Butter", "quantity": 1},
    {"item": "Roasted Asparagus", "quantity": 1},
    {"item": "Mashed Potatoes", "quantity": 1}
  ],
  "validation_result": {
    "order_complete": true,
    "missing_items": [],
    "extra_items": [],
    "modifier_validation": {
      "validated": ["well-done preparation detected"],
      "unable_to_verify": ["extra butter - not visually detectable"]
    },
    "accuracy_score": 1.0
  },
  "processing_metrics": {
    "reconciliation_time_ms": 25,
    "confidence_score": 0.94
  }
}

Performance Metrics (Use Existing Performance Tools)
- System Resource Metrics:
- CPU/GPU/NPU Utilization across AI pipelines
- Memory footprint for edge model deployment
- Power consumption patterns during inference
- Plates per minute throughput per hardware configuration
Acceptance Criteria
- Staff-Triggered Capture Model: System captures plate images when staff selects an order from the KDS/tablet interface, demonstrating the simplified input model distinct from continuous video processing
- Zero-Training Deployment: System utilizes pre-trained VLM models (e.g., Qwen2.5-VL-7B) for food item classification without requiring custom training, demonstrating immediate deployment capability across diverse restaurant menus
- Whole-Plate VLM Analysis: System processes entire plate images through the VLM to classify all items in context, leveraging the model's scene understanding capabilities
- Semantic Order Matching: AI agent performs embeddings-based semantic comparison between VLM-detected items and order manifest, handling variations in item naming and descriptions
- Latency-Aware Benchmarking: Performance measurements include end-to-end latency (capture to result) as a primary reported metric, with clear documentation of latency characteristics across hardware configurations. Results are contextualized against the ~2 second operational assumption for restaurant deployments on modern Intel processors (MTL, LNL, ARL, PTL). See Real-World Latency Considerations for operational context.
- POS/KDS Integration Capability: Implementation showcases integration patterns for retrieving order manifests from kitchen display systems or POS platforms
- Performance Benchmarking Integration: Code instrumented to enable performance measurements on different Intel hardware platforms using existing performance tools framework
- Full-Service Restaurant Applicability: System handles diverse plated presentations from casual dining to fine dining environments, demonstrating broad applicability beyond QSR scenarios
- Edge Processing Optimization: All AI processing runs locally on Intel hardware with configuration options for different platforms (CPU-only, iGPU/dGPU acceleration, NPU utilization)
- Hardware Sizing Guidance: Implementation provides clear performance metrics (plates/minute throughput, latency) enabling hardware selection and capacity planning for various restaurant volumes
- Rapid Deployment Capability: Repository designed for rapid onboarding - clone, build, deploy within 15 minutes including model artifacts
- Comprehensive Documentation: GitHub Pages documentation covering dine-in validation workflow, AI pipeline architecture, hardware requirements, and performance benchmarking methodology
- Demo Scenarios: Functional demonstration via live setup or pre-recorded scenarios showing complete plate validation workflow from order selection to verification result
Business Logic Assumed (and intentionally omitted)
- Menu Management Integration: Dynamic menu updates, daily specials, and seasonal item changes
- Staff Training and Workflow Integration: Server notification systems, training protocols, and operational procedure modifications
- Customer Communication: Table-side communication of order issues and service recovery procedures
- Expeditor Workflow Optimization: Queue management, order prioritization, and multi-plate coordination
- Quality Control Analytics: Long-term accuracy trending, kitchen performance monitoring, and plating consistency reporting
- Multi-Station Coordination: Coordinating validation across multiple expo stations in high-volume environments
- Modifier Verification Limitations: Visual detection cannot verify all preparation modifiers (e.g., "no salt", "dressing on side" when plated together)
- Allergy Alert Integration: Critical allergy flagging requires integration with order management systems beyond visual validation
- Temperature Verification: Food temperature compliance requires additional sensor integration beyond visual AI
- Portion Control Validation: Precise portion weight verification requires scale integration
Technical Requirements
- Edge-Optimized VLM Deployment: Pre-trained Vision Language Models (e.g., Qwen2.5-VL-7B, Qwen2.5-VL-3B) optimized using Intel tools (OpenVINO, Neural Compressor) for maximum performance on restaurant hardware
- Local Processing Architecture: Complete AI inference stack running on Intel platforms without cloud dependencies for reliable operation
- POS/KDS Integration Patterns: Flexible integration examples for major restaurant management systems for order manifest retrieval
- Image Capture Configuration: Camera positioning and lighting guidance for optimal inference accuracy in kitchen environments
- Hardware Configuration Support: Scalable deployment across various Intel hardware specifications (MTL, LNL, PTL) with performance characterization
- Model Artifact Management: Efficient VLM loading optimized for single-image inference patterns
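For the Edge-Optimized VLM Deployment and Hardware Configuration Support items above, device selection with OpenVINO's Python API might look like the sketch below; the IR model path is a placeholder, and VLM-specific loading (e.g., via optimum-intel) or Neural Compressor quantization would happen upstream of this step.

```python
# Sketch: selecting an Intel device (CPU-only, iGPU/dGPU, NPU) for an
# OpenVINO-optimized model. The model path is a placeholder; VLM-specific
# loading may differ from this generic pattern.
import openvino as ov

def compile_for_device(model_xml: str, preferred: str = "AUTO"):
    """Compile an OpenVINO IR model on the requested device, falling back to AUTO."""
    core = ov.Core()
    device = preferred if (preferred == "AUTO" or preferred in core.available_devices) else "AUTO"
    model = core.read_model(model_xml)
    # Latency hint aligns with the ~2 s per-plate operational target discussed above.
    return core.compile_model(model, device_name=device, config={"PERFORMANCE_HINT": "LATENCY"})

# Example: request NPU where present, otherwise let AUTO pick CPU/iGPU.
# compiled = compile_for_device("vlm_int8/openvino_model.xml", preferred="NPU")
```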
Notes
- Staff-Triggered vs Continuous Processing: This pipeline specifically addresses the dine-in use case where staff interaction (order selection) naturally provides both order association and capture trigger, eliminating need for CV-based frame selection or automated detection
- Architectural Simplification: The staff-triggered model removes the need for OCR, frame scoring, hand detection, and continuous tracking components present in the packing station pipeline
- Whole-Plate VLM Approach: Sending the entire plate image to the VLM preserves context (e.g., understanding "salmon WITH asparagus and potatoes") and requires only a single inference call per plate
- Full-Service Focus: Targets full-service restaurants where plates are prepared in kitchen and validated at expo before server delivery, distinct from QSR packing workflows
- Performance Benchmarking Priority: Primary goal is measuring real-world AI workload performance on Intel hardware platforms with reproducible results
- Partner Enablement: Repository designed for ISVs to benchmark their own VLM configurations, menu setups, and hardware platforms
- Complementary to Packing Station: This pipeline addresses a distinct workflow and can be deployed alongside or independently of the video-based packing station pipeline
- Market Validation: Based on industry demand for order accuracy solutions extending beyond QSR into casual and fine dining segments
This epic represents an extension of the Order Accuracy portfolio to address full-service restaurant environments, demonstrating Intel edge computing capabilities for simplified image-based AI validation workflows.