Architecture Documentation

Streaming Architecture Overview

This backend implements a bidirectional streaming architecture for real-time voice interaction with ElevenLabs agents.

Architecture Diagram

┌─────────────┐
│  Frontend   │
│  (Browser)  │
└──────┬──────┘
       │ WebSocket (bidirectional)
       │ Audio chunks (streaming)
       ▼
┌─────────────────────────────────┐
│      Backend (FastAPI)          │
│  ┌───────────────────────────┐  │
│  │  WebSocket Handler        │  │
│  │  - Receives audio chunks  │  │
│  │  - Manages connections    │  │
│  └───────────┬───────────────┘  │
│              │                  │
│  ┌───────────▼───────────────┐  │
│  │  ElevenLabs Service       │  │
│  │  - Streams audio chunks   │  │
│  │  - Processes responses    │  │
│  └───────────┬───────────────┘  │
└──────────────┼──────────────────┘
               │ HTTP Streaming
               │ (audio chunks)
               ▼
┌─────────────────────────────────┐
│    ElevenLabs Agent API         │
│  ┌───────────────────────────┐  │
│  │  1. Speech-to-Text (STT)  │  │
│  │  2. LLM Processing        │  │
│  │  3. Text-to-Speech (TTS)  │  │
│  └───────────────────────────┘  │
└─────────────────────────────────┘

Data Flow

1. Connection Establishment

Frontend → Backend: WebSocket connection request
Backend → Frontend: {"type": "connection_established"}
Backend → ElevenLabs: Initialize conversation
Backend → Frontend: {"type": "conversation_started", "conversation_id": "..."}

2. Audio Streaming (Frontend → ElevenLabs)

Frontend → Backend: {"type": "audio_chunk", "data": "<base64>", "format": "audio/webm"}
Backend → ElevenLabs: Stream audio chunk (HTTP streaming)
ElevenLabs: Processes audio (STT → LLM → TTS)

3. Response Streaming (ElevenLabs → Frontend)

ElevenLabs → Backend: Stream response chunks (audio/text)
Backend → Frontend: {"type": "audio_response", "data": "<base64>", "text": "...", "is_final": false}

4. Connection Cleanup

Frontend → Backend: {"type": "end_conversation"}
Backend → ElevenLabs: End conversation
Backend → Frontend: {"type": "conversation_ended"}
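
The four phases above imply an ordering on client messages: audio is only meaningful once a conversation has started, and nothing is expected after it ends. A minimal sketch of that ordering as a small state machine follows. The names `Phase`, `ALLOWED`, and `SessionState` are hypothetical and not taken from the codebase, and accepting `heartbeat` before the conversation starts is an assumption.

```python
from enum import Enum, auto


class Phase(Enum):
    CONNECTED = auto()  # after connection_established
    ACTIVE = auto()     # after conversation_started
    ENDED = auto()      # after end_conversation


# Client message types accepted in each phase (assumed from the flow above).
ALLOWED = {
    Phase.CONNECTED: {"heartbeat"},
    Phase.ACTIVE: {"audio_chunk", "end_conversation", "heartbeat"},
    Phase.ENDED: set(),
}


class SessionState:
    """Tracks which phase of the data flow a single connection is in."""

    def __init__(self) -> None:
        self.phase = Phase.CONNECTED

    def conversation_started(self) -> None:
        # Called once the backend has initialized the ElevenLabs conversation.
        self.phase = Phase.ACTIVE

    def accept(self, message_type: str) -> bool:
        """Return True if the client message is legal in the current phase."""
        if message_type not in ALLOWED[self.phase]:
            return False
        if message_type == "end_conversation":
            self.phase = Phase.ENDED
        return True
```

A handler could call `accept()` on every incoming message and reply with an `error` message when it returns False.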

Key Components

1. WebSocket Handler (src/routes/voice_router.py)

  • Manages WebSocket connections
  • Handles bidirectional message passing
  • Processes audio chunks and forwards them to ElevenLabs
  • Streams responses back to frontend

2. ElevenLabs Service (src/services/elevenlabs_service.py)

  • Manages conversation lifecycle
  • Streams audio chunks to ElevenLabs API
  • Processes streaming responses
  • Handles errors and reconnection

3. WebSocket Manager (src/services/websocket_manager.py)

  • Tracks active connections
  • Handles connection lifecycle
  • Provides broadcast capabilities
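
The three responsibilities listed for the WebSocket Manager can be sketched without any framework dependency. This is an illustration only, not the contents of `websocket_manager.py`. The `ConnectionManager` and `JsonSocket` names are hypothetical, and the sketch assumes each socket exposes an async `send_json` method, as FastAPI's `WebSocket` does.

```python
from typing import Any, Dict, Protocol


class JsonSocket(Protocol):
    """Anything with an async send_json method, e.g. a FastAPI WebSocket."""

    async def send_json(self, data: Any) -> None: ...


class ConnectionManager:
    """Tracks active connections and broadcasts messages to all of them."""

    def __init__(self) -> None:
        self._active: Dict[str, JsonSocket] = {}

    def connect(self, connection_id: str, socket: JsonSocket) -> None:
        self._active[connection_id] = socket

    def disconnect(self, connection_id: str) -> None:
        # pop with a default so disconnecting twice is harmless.
        self._active.pop(connection_id, None)

    @property
    def count(self) -> int:
        return len(self._active)

    async def broadcast(self, message: dict) -> None:
        # Iterate over a snapshot so removals during sending are safe.
        for connection_id, socket in list(self._active.items()):
            try:
                await socket.send_json(message)
            except Exception:
                # Drop connections that can no longer be written to.
                self.disconnect(connection_id)
```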

Message Protocol

Client → Server Messages

Audio Chunk

{
  "type": "audio_chunk",
  "data": "<base64_encoded_audio_data>",
  "format": "audio/webm"
}

End Conversation

{
  "type": "end_conversation"
}

Heartbeat

{
  "type": "heartbeat"
}
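
One way the three client messages above might be parsed and validated on the server is sketched below. The `ClientMessage` dataclass and `parse_client_message` function are hypothetical names. Raising `ValueError` on malformed input is an assumed convention that a handler could translate into an `error` message.

```python
import base64
import binascii
import json
from dataclasses import dataclass
from typing import Optional

CLIENT_TYPES = {"audio_chunk", "end_conversation", "heartbeat"}


@dataclass
class ClientMessage:
    type: str
    audio: Optional[bytes] = None   # decoded payload, audio_chunk only
    format: Optional[str] = None    # MIME type, audio_chunk only


def parse_client_message(raw: str) -> ClientMessage:
    """Parse one JSON text frame; raise ValueError if it is malformed."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError("message is not valid JSON") from exc
    if not isinstance(payload, dict):
        raise ValueError("message must be a JSON object")

    msg_type = payload.get("type")
    if not isinstance(msg_type, str) or msg_type not in CLIENT_TYPES:
        raise ValueError(f"unknown message type: {msg_type!r}")
    if msg_type != "audio_chunk":
        return ClientMessage(type=msg_type)

    data = payload.get("data")
    if not isinstance(data, str):
        raise ValueError("audio_chunk requires a base64 string in 'data'")
    try:
        # validate=True rejects characters outside the base64 alphabet.
        audio = base64.b64decode(data, validate=True)
    except binascii.Error as exc:
        raise ValueError("audio_chunk 'data' is not valid base64") from exc
    return ClientMessage(type=msg_type, audio=audio, format=payload.get("format"))
```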

Server → Client Messages

Connection Established

{
  "type": "connection_established",
  "message": "Connected to voice streaming service"
}

Conversation Started

{
  "type": "conversation_started",
  "conversation_id": "conv_abc123"
}

Audio Response

{
  "type": "audio_response",
  "data": "<base64_encoded_audio>",
  "text": "Transcribed text from STT",
  "is_final": false
}

Text Response (if text-only)

{
  "type": "text_response",
  "text": "LLM response text",
  "is_final": true
}

Error

{
  "type": "error",
  "message": "Error description"
}
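
The server messages above could be produced by small builder functions, so that field names and base64 encoding live in one place. This is a sketch under that assumption. `audio_response` and `error_message` are hypothetical helper names, not functions known to exist in the codebase.

```python
import base64
import json


def audio_response(audio: bytes, text: str = "", is_final: bool = False) -> str:
    """Serialize an audio_response message with the audio base64-encoded."""
    return json.dumps({
        "type": "audio_response",
        "data": base64.b64encode(audio).decode("ascii"),
        "text": text,
        "is_final": is_final,
    })


def error_message(description: str) -> str:
    """Serialize an error message."""
    return json.dumps({"type": "error", "message": description})
```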

Streaming Strategy

Audio Chunking

  • The frontend sends audio in small chunks (e.g., 100 ms each)
  • The backend buffers chunks only when needed
  • Otherwise, chunks are streamed to ElevenLabs as they are received
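
A bounded queue is one way to buffer chunks between the WebSocket reader and the upstream sender. The sketch below uses `asyncio.Queue`. The `ChunkBuffer` class, the 50-chunk limit, and the `None` end-of-stream sentinel are all assumptions made for illustration.

```python
import asyncio
from typing import AsyncIterator


class ChunkBuffer:
    """Bounded FIFO between the WebSocket reader and the upstream sender."""

    def __init__(self, max_chunks: int = 50) -> None:
        # At roughly 100 ms per chunk, 50 chunks is about 5 s of audio (assumed limit).
        self._queue = asyncio.Queue(maxsize=max_chunks)

    async def put(self, chunk: bytes) -> None:
        # Waits while the queue is full, which applies backpressure to the reader.
        await self._queue.put(chunk)

    async def close(self) -> None:
        # None marks the end of the stream.
        await self._queue.put(None)

    async def chunks(self) -> AsyncIterator[bytes]:
        """Yield buffered chunks in arrival order until close() is called."""
        while True:
            chunk = await self._queue.get()
            if chunk is None:
                return
            yield chunk
```

Because `put` waits when the queue is full, a slow upstream slows the reader down instead of letting memory grow without bound.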

Response Handling

  • ElevenLabs streams responses incrementally
  • The backend forwards each chunk to the frontend immediately
  • The is_final flag indicates when a response is complete
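
The forwarding behavior above can be sketched as a relay loop over an async iterator. `forward_response` is a hypothetical name, and `send` stands in for whatever coroutine writes a message to the client. Signalling completion with an empty `audio_response` that has `is_final` set to true is an assumption. The protocol above does not say how the final chunk is marked.

```python
import base64
from typing import AsyncIterator, Awaitable, Callable


async def forward_response(
    upstream: AsyncIterator[bytes],
    send: Callable[[dict], Awaitable[None]],
) -> int:
    """Relay upstream audio chunks without buffering; return chunks relayed."""
    relayed = 0
    async for chunk in upstream:
        # Each chunk is sent as soon as it arrives.
        await send({
            "type": "audio_response",
            "data": base64.b64encode(chunk).decode("ascii"),
            "text": "",
            "is_final": False,
        })
        relayed += 1
    # Assumption: an empty message with is_final=True closes the response.
    await send({"type": "audio_response", "data": "", "text": "", "is_final": True})
    return relayed
```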

Error Handling

  • Connection errors: Attempt reconnection
  • API errors: Forward error to frontend
  • Timeout handling: Close connection gracefully
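
Reconnection and timeout handling are often combined in a retry helper. The sketch below retries with a per-attempt timeout and exponential backoff. `with_retries` and its default values are hypothetical, and which exceptions count as retryable is an assumption.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def with_retries(
    operation: Callable[[], Awaitable[T]],
    attempts: int = 3,
    base_delay: float = 0.5,
    timeout: float = 10.0,
) -> T:
    """Run operation with a per-attempt timeout and exponential backoff."""
    for attempt in range(attempts):
        try:
            return await asyncio.wait_for(operation(), timeout=timeout)
        except (ConnectionError, asyncio.TimeoutError):
            if attempt == attempts - 1:
                # Out of attempts: let the caller send an error to the frontend.
                raise
            # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
            await asyncio.sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

Errors other than connection failures and timeouts propagate immediately, which matches forwarding API errors to the frontend instead of retrying them.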

Performance Considerations

  1. Low Latency: Streaming chunks as they arrive, instead of waiting for a complete recording or response, reduces end-to-end latency
  2. Memory Efficiency: Process chunks instead of buffering entire audio
  3. Scalability: Each WebSocket connection is independent
  4. Error Recovery: Graceful degradation on errors

Security Considerations

  1. API Key Protection: Store in environment variables
  2. CORS Configuration: Restrict origins
  3. Rate Limiting: Implement per-connection limits
  4. Input Validation: Validate audio format and size
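
Three of the items above (API key protection, rate limiting, input validation) can be sketched with the standard library alone. The environment variable name `ELEVENLABS_API_KEY`, the 64 KiB chunk limit, and the names `load_api_key`, `validate_chunk`, and `TokenBucket` are all assumptions, not values taken from the codebase.

```python
import os
import time

ALLOWED_FORMATS = {"audio/webm"}   # the only format shown in the message protocol
MAX_CHUNK_BYTES = 64 * 1024        # assumed upper bound per chunk


def load_api_key(var: str = "ELEVENLABS_API_KEY") -> str:
    """Read the API key from the environment; the variable name is assumed."""
    key = os.environ.get(var)
    if not key:
        raise RuntimeError(f"{var} is not set")
    return key


def validate_chunk(audio: bytes, fmt: str) -> None:
    """Raise ValueError if the chunk has a disallowed format or size."""
    if fmt not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported audio format: {fmt}")
    if not 0 < len(audio) <= MAX_CHUNK_BYTES:
        raise ValueError("audio chunk is empty or too large")


class TokenBucket:
    """Per-connection limiter: `rate` messages per second, bursts up to `burst`."""

    def __init__(self, rate: float, burst: int, clock=time.monotonic) -> None:
        self.rate = rate
        self.burst = burst
        self._clock = clock
        self._tokens = float(burst)
        self._last = clock()

    def allow(self) -> bool:
        """Consume one token if available; return False when rate-limited."""
        now = self._clock()
        # Refill in proportion to elapsed time, capped at the burst size.
        self._tokens = min(self.burst, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False
```

With 100 ms chunks, a connection sends about 10 messages per second, so a bucket such as `TokenBucket(rate=20, burst=40)` would leave headroom for normal traffic. These numbers are illustrative only.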

Future Enhancements

  1. Connection Pooling: Reuse ElevenLabs connections
  2. Audio Compression: Compress audio before sending
  3. Caching: Cache common LLM responses
  4. Monitoring: Add metrics and logging
  5. Load Balancing: Distribute connections across instances