Upload a photo · Clone a voice · Talk to any face in real time
Quick Start · Features · Architecture · GPU / AWS Deploy · API · Roadmap
The most complete open-source AI talking-avatar system. Real-time lip-sync · Zero-shot voice cloning · Multi-LLM · Runs 100% locally or on AWS.
AvatarAI is an open-source, production-ready platform for building photorealistic AI avatar conversations. Upload any face photo, clone a voice from a 5-second audio clip, and hold a real-time conversation, with lip-sync video generated for every response.
```
[mic input] → Whisper STT → Claude / GPT-4 → XTTS v2 TTS → MuseTalk lip-sync → [video]
                    < 2–4 s to first chunk on AWS GPU >
```
What makes AvatarAI different:

- Zero-shot voice cloning – 5 seconds of audio is all you need (XTTS v2)
- Any face, any language – upload a JPEG, pick from 18 languages, start talking
- Sentence-chunk streaming – the first video chunk plays while the rest is still being generated
- Idle animation – the avatar breathes and glows while waiting, no blank screens
- 100% local mode – nothing leaves your machine
- Multi-LLM – Claude, GPT-4, or Llama 3 (free, local via Ollama)
- AWS GPU deployment – one-command deploy to `g5.xlarge` for true real-time (~30 FPS)
- Production-ready – JWT auth, rate limiting, S3 storage, Terraform IaC
| Category | Details |
|---|---|
| LLM Backends | Claude (Anthropic) · GPT-4o (OpenAI) · Llama 3 (Ollama, local) |
| Voice Cloning | Record 5–30 s → XTTS v2 zero-shot speaker embedding |
| Speech-to-Text | OpenAI Whisper (faster-whisper, CUDA-accelerated), 18+ languages |
| Lip-Sync Video | MuseTalk V1.5 (30 FPS on GPU) · FFmpeg fallback (CPU) |
| Streaming Pipeline | Sentence chunks stream over WebSocket as they complete |
| Idle Animation | CSS breathing animation while the avatar waits – no blank screens |
| Emotion Detection | Live emotion badges per message |
| 18+ Languages | Whisper multilingual STT + XTTS v2 multilingual TTS |
| Local-First Storage | `USE_LOCAL_STORAGE=true` – no AWS needed for dev |
| Auth & Sessions | JWT authentication, conversation history, persistent sessions |
| Observability | Prometheus · Celery Flower · Sentry · structured logging |
| Tested | Full pytest suite – users, avatars, sessions, health checks |
| AWS GPU Deploy | One-command `g5.xlarge` deploy with CUDA 11.8 + float16 |
```
┌──────────────────────────────────────────────────────────────┐
│                       Browser / Client                       │
│  ┌─────────────┐  ┌──────────────┐  ┌──────────────────────┐ │
│  │Avatar Studio│  │ Voice Studio │  │   Chat Interface     │ │
│  │  (upload)   │  │  (cloning)   │  │  Idle anim + chunks  │ │
│  └──────┬──────┘  └──────┬───────┘  └──────────┬───────────┘ │
└─────────┼────────────────┼─────────────────────┼─────────────┘
          │ REST           │ REST                │ WebSocket
          ▼                ▼                     ▼
┌──────────────────────────────────────────────────────────────┐
│                       FastAPI Backend                        │
│  ┌────────────────────────────────────────────────────────┐  │
│  │                   WebSocket Manager                    │  │
│  │   split sentences → TTS → MuseTalk → stream chunks     │  │
│  └────────────────────────────────────────────────────────┘  │
│  ┌──────────┐ ┌───────────┐ ┌──────────┐ ┌───────────────┐   │
│  │ Whisper  │ │Claude/GPT │ │ XTTS v2  │ │   MuseTalk    │   │
│  │   STT    │ │  / Llama  │ │   TTS    │ │   (GPU/CPU)   │   │
│  └──────────┘ └───────────┘ └──────────┘ └───────────────┘   │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌─────────────────┐  │
│  │PostgreSQL│ │  Redis   │ │  Celery  │ │  Local FS / S3  │  │
│  └──────────┘ └──────────┘ └──────────┘ └─────────────────┘  │
└──────────────────────────────────────────────────────────────┘
```
```
[User types / speaks]
         │
         ▼
Whisper STT ──────────────────▶ transcript
         │
         ▼
Claude / GPT / Llama ─────────▶ full response text
         │
         ▼
Split into sentences ─────────▶ ["Hello!", "How are you?", ...]
         │
         ├── sentence 1 → XTTS → MuseTalk → video_chunk WS → browser plays
         ├── sentence 2 → XTTS → MuseTalk → video_chunk WS → queued
         └── sentence N → XTTS → MuseTalk → video_chunk WS → queued
```
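The per-sentence loop above can be sketched in a few lines of Python. This is a sketch only: the real handler lives in `backend/app/websocket.py`, and `tts`/`lipsync` below are stubs standing in for the XTTS v2 and MuseTalk calls.

```python
import asyncio
import re

def split_sentences(text: str) -> list[str]:
    # Naive splitter: break after ., ! or ? followed by whitespace
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

async def tts(sentence: str) -> bytes:
    return b"wav-bytes"          # stub: XTTS v2 would synthesize audio here

async def lipsync(audio: bytes) -> str:
    return "/videos/chunk.mp4"   # stub: MuseTalk would render a video here

async def stream_response(text: str, send) -> None:
    sentences = split_sentences(text)
    await send({"type": "video_chunk_start", "total_chunks": len(sentences)})
    for i, sentence in enumerate(sentences):
        audio = await tts(sentence)
        video_url = await lipsync(audio)
        # Pushed as soon as it renders: the browser plays chunk 0
        # while later chunks are still being generated
        await send({"type": "video_chunk", "chunk_index": i,
                    "video_url": video_url, "text": sentence})
    await send({"type": "video_chunk_end"})
```

Because each chunk is pushed as soon as it is ready, perceived latency is the time to the first sentence, not to the whole reply.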
```
ai-avatar-system/
├── backend/                    # FastAPI application
│   ├── app/
│   │   ├── api/v1/             # REST endpoints (users, avatars, sessions, messages)
│   │   ├── services/           # Core services (LLM, TTS, STT, animator, storage)
│   │   ├── models/             # SQLAlchemy DB models
│   │   └── websocket.py        # Real-time WebSocket handler + sentence streaming
│   ├── alembic/                # Database migrations
│   ├── models/MuseTalk/        # MuseTalk V1.5 (lip-sync engine)
│   │   └── scripts/
│   │       └── musetalk_worker.py  # Persistent worker (models loaded once)
│   ├── tests/                  # pytest suite
│   ├── Dockerfile              # CUDA 11.8 base image
│   └── requirements.txt
├── frontend/                   # Next.js 14 application
│   ├── app/                    # App Router pages
│   ├── components/             # React components (ChatInterface, IdleAvatar, etc.)
│   ├── lib/api.ts              # Axios API client
│   └── store/                  # Zustand global state
├── nginx/
│   └── nginx.conf              # Reverse proxy (HTTP → backend/frontend, WebSocket)
├── infrastructure/
│   ├── main.tf                 # AWS Terraform (ECS, RDS, ElastiCache, S3, CloudFront)
│   └── variables.tf
├── scripts/
│   ├── setup_musetalk.sh       # Download MuseTalk models (~9 GB)
│   └── deploy-aws.sh           # One-command EC2 GPU deployment
├── docker-compose.yml          # Development (CPU) – all services
├── docker-compose.prod.yml     # Production overrides (GPU, no bind mounts, logging)
├── deploy.sh                   # ECR push + Terraform deploy (ECS path)
├── .env.example                # Development env template
└── .env.prod.example           # Production env template
```
- Docker & Docker Compose v2+ (recommended)
- OR: Python 3.10+, Node.js 18+, FFmpeg, PostgreSQL, Redis
```bash
git clone https://github.com/PunithVT/ai-avatar-system.git
cd ai-avatar-system
cp .env.example .env   # add your ANTHROPIC_API_KEY (or OPENAI_API_KEY)
docker compose up -d
```

| Service | URL |
|---|---|
| Frontend | http://localhost:3000 |
| Backend API | http://localhost:8000 |
| Swagger Docs | http://localhost:8000/docs |
| Celery Flower | http://localhost:5555 |
No AWS required. `USE_LOCAL_STORAGE=true` (the default) saves uploads to `backend/uploads/`.
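The local/S3 switch can be pictured like this (illustrative only: the real storage service lives in `backend/app/services/` and its function names will differ):

```python
import os
from pathlib import Path

def save_upload(name: str, data: bytes, root: str = "backend/uploads") -> str:
    """Save an upload locally when USE_LOCAL_STORAGE=true, else go to S3."""
    if os.getenv("USE_LOCAL_STORAGE", "true").lower() == "true":
        path = Path(root) / name
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)
        return str(path)
    # S3 branch (boto3 put_object + object URL) omitted in this sketch
    raise NotImplementedError("set USE_LOCAL_STORAGE=true for local dev")
```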
```bash
# Backend
cd backend
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
cp ../.env.example ../.env
alembic upgrade head
uvicorn main:app --reload --port 8000
```

```bash
# Frontend (new terminal)
cd frontend
npm install
npm run dev
```

```bash
# Download models (~9 GB, one-time)
bash scripts/setup_musetalk.sh

# Set in .env
AVATAR_ENGINE=musetalk

# Restart
docker compose restart backend
```

MuseTalk achieves 30 FPS at 256×256 on a V100-class GPU (source: MuseTalk paper). On CPU it is 30–50× slower. Deploying on an AWS GPU instance gets you genuine real-time performance.
| Instance | GPU | VRAM | Spot $/hr | MuseTalk FPS |
|---|---|---|---|---|
| `g4dn.xlarge` | T4 | 16 GB | ~$0.16 | ~15–20 FPS |
| `g5.xlarge` | A10G | 24 GB | ~$0.30 | ~30 FPS |
| `g6.xlarge` | L4 | 24 GB | ~$0.24 | ~30 FPS |
Recommended: g5.xlarge Spot (~$72/mo at 8 hrs/day).
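The monthly figure is straightforward arithmetic (the $0.30/hr spot price is an estimate and varies by region):

```python
# Back-of-envelope for the g5.xlarge Spot estimate above
spot_per_hr = 0.30       # assumed A10G spot price, $/hr
hours_per_day = 8
days_per_month = 30

monthly_cost = spot_per_hr * hours_per_day * days_per_month
print(f"${monthly_cost:.0f}/mo")  # → $72/mo
```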
```bash
# 1. Launch g5.xlarge with Ubuntu 22.04 LTS, SSH in, then:
bash <(curl -fsSL https://raw.githubusercontent.com/PunithVT/ai-avatar-system/main/scripts/deploy-aws.sh)

# 2. Fill in API keys:
nano /opt/ai-avatar-system/.env.prod

# 3. Redeploy with your keys:
bash /opt/ai-avatar-system/scripts/deploy-aws.sh --update
```

The script automatically:
- Installs Docker + nvidia-docker2
- Verifies GPU is accessible
- Downloads MuseTalk models (~9 GB)
- Starts all services with GPU passthrough + float16 (2× faster via Tensor Cores)
```bash
cp .env.prod.example .env.prod   # fill in your values
docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d
```

What `docker-compose.prod.yml` adds over development:

- GPU reservation (`nvidia` driver, `count=1`) for backend + celery-worker
- `float16` inference enabled automatically on CUDA – ~2× speedup
- Persistent `musetalk_models` volume (survives container restarts)
- No source-code bind mounts (runs from the built image)
- Log rotation (100 MB max, 5 files)
- Flower disabled (security)
```bash
# Check GPU is visible in container
docker exec avatar-backend python -c "
import torch
print('CUDA:', torch.cuda.is_available())
print('GPU:', torch.cuda.get_device_name(0))
print('VRAM:', round(torch.cuda.get_device_properties(0).total_memory/1024**3, 1), 'GB')
"

# Expected on g5.xlarge:
# CUDA: True
# GPU: NVIDIA A10G
# VRAM: 24.0 GB

# Live GPU utilisation
docker exec avatar-backend nvidia-smi
```

For a fully managed ECS deployment with RDS + ElastiCache + CloudFront:
```bash
cd infrastructure
terraform init
terraform apply -var="environment=production"
bash deploy.sh production
```

Powered by XTTS v2 – zero-shot voice cloning from 5 seconds of audio.
- Go to the Voice tab → Clone Voice
- Record 5–30 s of clear speech (or upload a WAV/MP3)
- Name it → Clone → select it for your session
Every TTS response then uses your cloned voice.
```bash
# REST API
curl -X POST http://localhost:8000/api/v1/voices/clone \
  -F "audio=@my_voice.wav" -F "name=My Voice" -F "language=en"
```

```
POST   /api/v1/users/register   { "email": "...", "username": "...", "password": "..." }
POST   /api/v1/users/login      form: username=... password=...  →  { "access_token": "..." }

# All protected routes:
Authorization: Bearer <access_token>
```

```
POST   /api/v1/avatars/upload       Upload photo (multipart: file + name)
GET    /api/v1/avatars/             List avatars
DELETE /api/v1/avatars/{id}         Delete avatar
PUT    /api/v1/avatars/{id}/voice   Assign voice to avatar

POST   /api/v1/sessions/create      { "avatar_id": "..." }
POST   /api/v1/sessions/{id}/end
GET    /api/v1/messages/session/{id}

WS     /ws/session/{session_id}
```
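The login-then-authenticated-call flow above can be driven from the Python standard library. This is a sketch: `start_session` and the helper names are ours, and it assumes the server is running on `localhost:8000`; only the endpoint shapes come from the list above.

```python
import json
import urllib.parse
import urllib.request

BASE = "http://localhost:8000/api/v1"

def login_form(username: str, password: str) -> bytes:
    # /users/login takes a classic form body, not JSON
    return urllib.parse.urlencode(
        {"username": username, "password": password}).encode()

def bearer(token: str) -> dict:
    # Every protected route expects this header
    return {"Authorization": f"Bearer {token}"}

def post(path: str, data: bytes, headers: dict) -> dict:
    req = urllib.request.Request(BASE + path, data=data,
                                 headers=headers, method="POST")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def start_session(username: str, password: str, avatar_id: str) -> dict:
    token = post("/users/login", login_form(username, password),
                 {"Content-Type": "application/x-www-form-urlencoded"})["access_token"]
    return post("/sessions/create",
                json.dumps({"avatar_id": avatar_id}).encode(),
                {**bearer(token), "Content-Type": "application/json"})
```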
Client → Server:

```json
{ "type": "text",      "text": "Hello!" }
{ "type": "audio",     "audio": "<base64-webm>" }
{ "type": "set_voice", "voice_wav_path": "/path/to/speaker.wav" }
```

Server → Client:

```json
{ "type": "transcription",     "text": "Hello!" }
{ "type": "message",           "content": "Hi!", "role": "assistant" }
{ "type": "video_chunk_start", "total_chunks": 3 }
{ "type": "video_chunk",       "chunk_index": 0, "video_url": "...", "text": "Hi!" }
{ "type": "video_chunk_end" }
{ "type": "status",            "message": "Animating part 1 of 3…" }
{ "type": "error",             "message": "Something went wrong" }
```

Key `.env` variables:
```bash
# LLM
LLM_PROVIDER=anthropic          # anthropic | openai | ollama
LLM_MODEL=claude-sonnet-4-20250514
ANTHROPIC_API_KEY=sk-ant-...

# Avatar engine
AVATAR_ENGINE=musetalk          # musetalk (GPU recommended) | simple (CPU fallback)
MUSETALK_PATH=models/MuseTalk

# TTS
TTS_PROVIDER=coqui              # coqui (XTTS v2) | elevenlabs
ELEVENLABS_API_KEY=...

# STT
WHISPER_MODEL=base              # tiny | base | small | medium | large-v3

# Storage
USE_LOCAL_STORAGE=true          # false → AWS S3
S3_BUCKET_NAME=...

# Auth
SECRET_KEY=change-me-in-production
JWT_EXPIRATION_HOURS=24
```

| Library | Purpose |
|---|---|
| Next.js 14 + React 18 | App framework |
| TypeScript 5 | Type safety |
| Tailwind CSS | Styling |
| Zustand | Global state |
| Library | Purpose |
|---|---|
| FastAPI | Async REST API + WebSocket |
| SQLAlchemy 2 (async) | ORM with asyncpg |
| PostgreSQL 15 | Primary database |
| Alembic | Migrations |
| Redis 7 | Cache + Celery broker |
| Celery | Background tasks |
| Model | Purpose |
|---|---|
| Claude / GPT-4o / Llama 3 | LLM conversation |
| Whisper (faster-whisper) | Speech-to-text |
| XTTS v2 (Coqui TTS) | TTS + zero-shot voice cloning |
| MuseTalk V1.5 | Photorealistic lip-sync (30 FPS on GPU) |
```bash
cd backend
pytest -v                             # all tests
pytest tests/test_health.py           # single module
pytest --cov=app --cov-report=html    # HTML coverage
```

- Streaming LLM – start TTS before the LLM finishes (token-by-token)
- Emotion-driven animation – detected emotion changes the facial expression
- Embeddable widget – drop a talking avatar into any website with 3 lines of JS
- Multi-avatar conversations – two avatars talking to each other
- Long-term memory – RAG + vector DB for persistent context
- Mobile app – React Native iOS/Android client
- Video call integration – replace your face in Zoom/Meet
Q: Do I need a GPU?
A: No – everything runs on CPU, but MuseTalk takes 30–90 s per sentence there. For real-time performance, use an AWS g5.xlarge (~$0.30/hr spot).
Q: Can I run it with no API key?
A: Yes – set `LLM_PROVIDER=ollama` and run Ollama locally. Fully offline and free.
Q: How do I get the MuseTalk models?
A: Run `bash scripts/setup_musetalk.sh` – it downloads ~9 GB of models automatically.
Q: Why does the first response take longer?
A: The MuseTalk persistent worker loads all models into GPU VRAM on the first request (~60 s on GPU, ~5 min on CPU). Subsequent requests are fast.
Q: What avatar photo works best?
A: A clear, well-lit frontal face photo (JPEG/PNG/WebP). Avoid sunglasses or heavy occlusion.
Contributions welcome! Read CONTRIBUTING.md before opening a PR.
```bash
git clone https://github.com/PunithVT/ai-avatar-system.git
git checkout -b feat/my-feature
# make changes + tests
git commit -m "feat(backend): add my feature"
git push origin feat/my-feature
```

MIT © 2025 – see LICENSE for details.