This backend implements a bidirectional streaming architecture for real-time voice interaction with ElevenLabs agents.
┌─────────────┐
│  Frontend   │
│  (Browser)  │
└──────┬──────┘
       │ WebSocket (bidirectional)
       │ Audio chunks (streaming)
       ▼
┌─────────────────────────────────┐
│        Backend (FastAPI)        │
│  ┌───────────────────────────┐  │
│  │  WebSocket Handler        │  │
│  │  - Receives audio chunks  │  │
│  │  - Manages connections    │  │
│  └───────────┬───────────────┘  │
│              │                  │
│  ┌───────────▼───────────────┐  │
│  │  ElevenLabs Service       │  │
│  │  - Streams audio chunks   │  │
│  │  - Processes responses    │  │
│  └───────────┬───────────────┘  │
└──────────────┼──────────────────┘
               │ HTTP Streaming
               │ (audio chunks)
               ▼
┌─────────────────────────────────┐
│      ElevenLabs Agent API       │
│  ┌───────────────────────────┐  │
│  │  1. Speech-to-Text (STT)  │  │
│  │  2. LLM Processing        │  │
│  │  3. Text-to-Speech (TTS)  │  │
│  └───────────────────────────┘  │
└─────────────────────────────────┘
Frontend → Backend: WebSocket connection request
Backend → Frontend: {"type": "connection_established"}
Backend → ElevenLabs: Initialize conversation
Backend → Frontend: {"type": "conversation_started", "conversation_id": "..."}
Frontend → Backend: {"type": "audio_chunk", "data": "<base64>", "format": "audio/webm"}
Backend → ElevenLabs: Stream audio chunk (HTTP streaming)
ElevenLabs: Processes audio (STT → LLM → TTS)
ElevenLabs → Backend: Stream response chunks (audio/text)
Backend → Frontend: {"type": "audio_response", "data": "<base64>", "text": "...", "is_final": false}
Frontend → Backend: {"type": "end_conversation"}
Backend → ElevenLabs: End conversation
Backend → Frontend: {"type": "conversation_ended"}
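
The exchange above can be read as a small dispatch table on the backend. The following is a minimal sketch under stated assumptions, not the backend's actual handler: `Session`, `error`, and `handle_client_message` are hypothetical names, forwarding to ElevenLabs is replaced by appending to a list, and returning no reply for `heartbeat` is an assumption, since the flow does not define one.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Session:
    """Per-connection state (hypothetical helper, not from the original design)."""
    conversation_id: str
    forwarded_chunks: List[str] = field(default_factory=list)
    ended: bool = False


def error(message: str) -> dict:
    """Build an error message in the shape used by this protocol."""
    return {"type": "error", "message": message}


def handle_client_message(session: Session, raw: str) -> Optional[dict]:
    """Return the immediate reply for one client message, or None if there is none."""
    try:
        message = json.loads(raw)
    except json.JSONDecodeError:
        return error("Invalid JSON")
    if not isinstance(message, dict):
        return error("Message must be a JSON object")

    msg_type = message.get("type")
    if msg_type == "audio_chunk":
        if session.ended:
            return error("Conversation already ended")
        # Stand-in for streaming the chunk to ElevenLabs. The audio_response
        # messages arrive later, so there is no immediate reply here.
        session.forwarded_chunks.append(message.get("data", ""))
        return None
    if msg_type == "heartbeat":
        return None  # assumption: keep-alive only, no reply defined
    if msg_type == "end_conversation":
        session.ended = True
        return {"type": "conversation_ended"}
    return error(f"Unknown message type: {msg_type!r}")
```

Keeping the dispatch a pure function of the session and the raw message makes it testable without a live WebSocket; the real handler would wrap it in the receive loop.
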
- Manages WebSocket connections
- Handles bidirectional message passing
- Processes audio chunks and forwards to ElevenLabs
- Streams responses back to frontend
- Manages conversation lifecycle
- Streams audio chunks to ElevenLabs API
- Processes streaming responses
- Handles errors and reconnection
- Tracks active connections
- Handles connection lifecycle
- Provides broadcast capabilities
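
The connection-tracking and broadcast responsibilities in the last three bullets might look roughly like the sketch below. It is an illustration only: `ConnectionManager` and its method names are hypothetical, and each client is represented by a plain async "sender" callable (with FastAPI, this would typically be a WebSocket's `send_text` method). Treating `ConnectionError` as the signal for a dead client is also an assumption; a real framework may raise a different exception.

```python
import asyncio
import json
from typing import Awaitable, Callable, Dict

# A sender is any async callable that delivers one text frame to a client.
Sender = Callable[[str], Awaitable[None]]


class ConnectionManager:
    """Tracks active clients and fans messages out to them (hypothetical sketch)."""

    def __init__(self) -> None:
        self._senders: Dict[str, Sender] = {}

    def connect(self, client_id: str, sender: Sender) -> None:
        self._senders[client_id] = sender

    def disconnect(self, client_id: str) -> None:
        # pop with a default so a double disconnect is harmless
        self._senders.pop(client_id, None)

    @property
    def active_count(self) -> int:
        return len(self._senders)

    async def send(self, client_id: str, message: dict) -> bool:
        """Send to one client; returns False if the client is not connected."""
        sender = self._senders.get(client_id)
        if sender is None:
            return False
        await sender(json.dumps(message))
        return True

    async def broadcast(self, message: dict) -> int:
        """Send to every client, dropping those whose send fails."""
        payload = json.dumps(message)
        delivered = 0
        # Iterate over a copy because failed clients are removed in the loop.
        for client_id, sender in list(self._senders.items()):
            try:
                await sender(payload)
                delivered += 1
            except ConnectionError:
                self.disconnect(client_id)
        return delivered
```
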
{
  "type": "audio_chunk",
  "data": "<base64_encoded_audio_data>",
  "format": "audio/webm"
}

{
  "type": "end_conversation"
}

{
  "type": "heartbeat"
}

{
  "type": "connection_established",
  "message": "Connected to voice streaming service"
}

{
  "type": "conversation_started",
  "conversation_id": "conv_abc123"
}

{
  "type": "audio_response",
  "data": "<base64_encoded_audio>",
  "text": "Transcribed text from STT",
  "is_final": false
}

{
  "type": "text_response",
  "text": "LLM response text",
  "is_final": true
}

{
  "type": "error",
  "message": "Error description"
}

- Frontend sends audio in small chunks (e.g., 100 ms chunks)
- Backend buffers chunks if needed
- Backend streams each chunk to ElevenLabs as it is received
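
To make the chunking step concrete, here is a small sketch of wrapping audio bytes into `audio_chunk` messages and unpacking them again. The helper names are hypothetical. Slicing a byte string is only a stand-in: in a browser, chunk boundaries would normally come from the recorder rather than from byte offsets. The 3200-byte size in the usage note assumes raw 16 kHz, 16-bit mono PCM purely for the arithmetic (16000 samples/s × 2 bytes × 0.1 s); the messages in this document use `audio/webm`.

```python
import base64
import json
from typing import Iterator


def split_audio(audio: bytes, chunk_size: int) -> Iterator[bytes]:
    """Yield consecutive slices of at most chunk_size bytes."""
    for start in range(0, len(audio), chunk_size):
        yield audio[start:start + chunk_size]


def encode_chunk(chunk: bytes, audio_format: str = "audio/webm") -> str:
    """Serialize one chunk as an audio_chunk message (JSON text frame)."""
    return json.dumps({
        "type": "audio_chunk",
        "data": base64.b64encode(chunk).decode("ascii"),
        "format": audio_format,
    })


def decode_chunk(raw: str) -> bytes:
    """Recover the raw audio bytes from an audio_chunk message."""
    message = json.loads(raw)
    # validate=True rejects characters outside the base64 alphabet
    return base64.b64decode(message["data"], validate=True)
```

For example, `list(split_audio(audio, 3200))` would produce 100 ms pieces under the PCM assumption above, and joining `decode_chunk(encode_chunk(c))` over those pieces gives back the original bytes.
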
- ElevenLabs streams responses incrementally
- Backend forwards chunks immediately to frontend
- `is_final` flag indicates when the response is complete
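
A sketch of the forwarding loop, under assumptions: `fake_upstream` is a hypothetical stand-in for the ElevenLabs response stream, and `send` is any async callable that delivers one message to the frontend. The source does not say which message carries the final flag, so this sketch forwards every chunk immediately with `is_final` set to false and then sends one empty `audio_response` with `is_final` set to true once the upstream is exhausted.

```python
import asyncio
import base64
from typing import AsyncIterator, Awaitable, Callable


async def fake_upstream() -> AsyncIterator[bytes]:
    """Hypothetical stand-in for the ElevenLabs response stream."""
    for chunk in (b"one", b"two", b"three"):
        await asyncio.sleep(0)  # yield control, as a real network read would
        yield chunk


async def forward_responses(
    upstream: AsyncIterator[bytes],
    send: Callable[[dict], Awaitable[None]],
) -> int:
    """Forward each chunk as it arrives; return the number of chunks forwarded."""
    forwarded = 0
    async for chunk in upstream:
        await send({
            "type": "audio_response",
            "data": base64.b64encode(chunk).decode("ascii"),
            "text": "",
            "is_final": False,
        })
        forwarded += 1
    # Assumption: an empty closing message marks the end of the response.
    await send({"type": "audio_response", "data": "", "text": "", "is_final": True})
    return forwarded
```

Sending a separate closing message avoids holding back the last chunk just to learn that it was the last one, which keeps forwarding immediate.
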
- Connection errors: Attempt reconnection
- API errors: Forward error to frontend
- Timeout handling: Close connection gracefully
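
The three policies above could be combined along these lines. This is a hedged sketch rather than the backend's actual logic: `call_with_retry` and `error_message` are hypothetical helpers, and the attempt count, timeout, and exponential backoff schedule are illustrative defaults that the source does not specify.

```python
import asyncio
from typing import Awaitable, Callable, Optional, TypeVar

T = TypeVar("T")


async def call_with_retry(
    operation: Callable[[], Awaitable[T]],
    *,
    attempts: int = 3,
    timeout: float = 5.0,
    base_delay: float = 0.5,
) -> T:
    """Run operation with a per-attempt timeout, retrying on connection errors."""
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            # wait_for cancels the operation if it exceeds the timeout
            return await asyncio.wait_for(operation(), timeout=timeout)
        except (ConnectionError, asyncio.TimeoutError) as exc:
            last_error = exc
            if attempt < attempts - 1:
                # exponential backoff: base_delay, 2x, 4x, ...
                await asyncio.sleep(base_delay * (2 ** attempt))
    raise ConnectionError(f"gave up after {attempts} attempts") from last_error


def error_message(exc: Exception) -> dict:
    """Shape an exception as the error message forwarded to the frontend."""
    return {"type": "error", "message": str(exc)}
```

A caller would wrap each upstream request in `call_with_retry`, and on the final `ConnectionError` send `error_message(exc)` to the frontend before closing the WebSocket.
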
- Low Latency: Streaming reduces end-to-end latency because each stage can start on the first chunk instead of waiting for the complete recording
- Memory Efficiency: Process chunks instead of buffering entire audio
- Scalability: Each WebSocket connection is independent
- Error Recovery: Graceful degradation on errors
- API Key Protection: Store in environment variables
- CORS Configuration: Restrict origins
- Rate Limiting: Implement per-connection limits
- Input Validation: Validate audio format and size
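
Three of these measures are easy to sketch with the standard library alone. Everything named here is an assumption for illustration: the environment variable name `ELEVENLABS_API_KEY`, the 64 KiB chunk limit, the allowed-format set (only `audio/webm` appears in this document), and the token-bucket rate limiter, which is one common way to implement per-connection limits and not necessarily the one the backend uses.

```python
import base64
import binascii
import os
import time
from typing import Callable

ALLOWED_FORMATS = {"audio/webm"}  # the only format shown in this document
MAX_CHUNK_BYTES = 64 * 1024       # hypothetical per-chunk limit


def load_api_key(env_var: str = "ELEVENLABS_API_KEY") -> str:
    """Read the API key from the environment instead of hard-coding it."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"{env_var} is not set")
    return key


def validate_audio_chunk(message: dict) -> bytes:
    """Check format and size of an audio_chunk message; return the decoded bytes."""
    if message.get("format") not in ALLOWED_FORMATS:
        raise ValueError("Unsupported audio format")
    data = message.get("data")
    if not isinstance(data, str):
        raise ValueError("Missing audio data")
    try:
        audio = base64.b64decode(data, validate=True)
    except (binascii.Error, ValueError):
        raise ValueError("Audio data is not valid base64") from None
    if not audio or len(audio) > MAX_CHUNK_BYTES:
        raise ValueError("Audio chunk is empty or too large")
    return audio


class TokenBucket:
    """Per-connection limiter: `rate` tokens per second, up to `capacity`."""

    def __init__(self, rate: float, capacity: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def allow(self) -> bool:
        """Consume one token if available; refill based on elapsed time."""
        now = self._clock()
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False
```

One `TokenBucket` per WebSocket connection would be consulted before each chunk is forwarded; a rejected chunk can be answered with an `error` message.
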
- Connection Pooling: Reuse ElevenLabs connections
- Audio Compression: Compress audio before sending
- Caching: Cache common LLM responses
- Monitoring: Add metrics and logging
- Load Balancing: Distribute connections across instances