
🎭 AvatarAI — Real-Time AI Avatar Platform

Upload a photo · Clone a voice · Talk to any face in real time


Quick Start · Features · Architecture · GPU / AWS Deploy · API · Roadmap

The most complete open-source AI talking avatar system. Real-time lip-sync · Zero-shot voice cloning · Multi-LLM · Runs 100% locally or on AWS.


🎬 What is AvatarAI?

AvatarAI is an open-source, production-ready platform for building photorealistic AI avatar conversations. Upload any face photo, clone a voice from a 5-second audio clip, and have a real-time conversation — with lip-sync video generated on every response.

[mic input]  →  Whisper STT  →  Claude / GPT-4  →  XTTS v2 TTS  →  MuseTalk lip-sync  →  [video]
                                 < 2–4 s first chunk on AWS GPU >

What makes AvatarAI different:

  • 🎤 Zero-shot voice cloning — 5 seconds of audio is all you need (XTTS v2)
  • 🎭 Any face, any language — upload a JPEG, pick from 18 languages, start talking
  • ⚡ Sentence-chunk streaming — first video chunk plays while the rest is still being generated
  • 😴 Idle animation — avatar breathes and glows while waiting, no blank screens
  • 🔒 100% local mode — nothing leaves your machine
  • 🔌 Multi-LLM — Claude, GPT-4, or Llama 3 (free, local via Ollama)
  • 🚀 AWS GPU deployment — one-command deploy to g5.xlarge for true real-time (~30 FPS)
  • 🏗️ Production-ready — JWT auth, rate limiting, S3 storage, Terraform IaC

✨ Features

Category Details
🤖 LLM Backends Claude (Anthropic) · GPT-4o (OpenAI) · Llama 3 (Ollama, local)
🎤 Voice Cloning Record 5–30 s → XTTS v2 zero-shot speaker embedding
🗣️ Speech-to-Text OpenAI Whisper (faster-whisper, CUDA-accelerated), 18+ languages
🎬 Lip-Sync Video MuseTalk V1.5 (30 FPS on GPU) · FFmpeg fallback (CPU)
⚡ Streaming Pipeline Sentence chunks stream over WebSocket as they complete
😴 Idle Animation CSS breathing animation while avatar waits — no blank screens
😊 Emotion Detection Live emotion badges per message
🌍 18+ Languages Whisper multilingual STT + XTTS v2 multilingual TTS
🏠 Local-First Storage USE_LOCAL_STORAGE=true — no AWS needed for dev
🔐 Auth & Sessions JWT authentication, conversation history, persistent sessions
📊 Observability Prometheus · Celery Flower · Sentry · structured logging
🧪 Tested Full pytest suite — users, avatars, sessions, health checks
🚀 AWS GPU Deploy One-command g5.xlarge deploy with CUDA 11.8 + float16

πŸ—οΈ Architecture

┌──────────────────────────────────────────────────────────────┐
│                       Browser / Client                       │
│  ┌─────────────┐  ┌──────────────┐  ┌──────────────────────┐ │
│  │Avatar Studio│  │ Voice Studio │  │   Chat Interface     │ │
│  │  (upload)   │  │  (cloning)   │  │ Idle anim + chunks   │ │
│  └──────┬──────┘  └──────┬───────┘  └──────────┬───────────┘ │
└─────────┼────────────────┼─────────────────────┼─────────────┘
          │ REST           │ REST                │ WebSocket
          ▼                ▼                     ▼
┌──────────────────────────────────────────────────────────────┐
│                       FastAPI Backend                        │
│  ┌────────────────────────────────────────────────────────┐  │
│  │                   WebSocket Manager                    │  │
│  │    split sentences → TTS → MuseTalk → stream chunks    │  │
│  └────────────────────────────────────────────────────────┘  │
│  ┌──────────┐ ┌───────────┐ ┌──────────┐ ┌───────────────┐   │
│  │ Whisper  │ │Claude/GPT │ │ XTTS v2  │ │   MuseTalk    │   │
│  │   STT    │ │  / Llama  │ │   TTS    │ │   (GPU/CPU)   │   │
│  └──────────┘ └───────────┘ └──────────┘ └───────────────┘   │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌───────────────┐    │
│  │PostgreSQL│ │  Redis   │ │  Celery  │ │ Local FS / S3 │    │
│  └──────────┘ └──────────┘ └──────────┘ └───────────────┘    │
└──────────────────────────────────────────────────────────────┘

Real-Time Data Flow (one conversation turn)

[User types / speaks]
        │
        ▼
  Whisper STT ─────────────────► transcript
        │
        ▼
  Claude / GPT / Llama ────────► full response text
        │
        ▼
  Split into sentences ────────► ["Hello!", "How are you?", ...]
        │
        ├── sentence 1 → XTTS → MuseTalk → video_chunk WS → browser plays
        ├── sentence 2 → XTTS → MuseTalk → video_chunk WS → queued
        └── sentence N → XTTS → MuseTalk → video_chunk WS → queued
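
The per-sentence loop above can be sketched in Python. This is an illustrative sketch only: `synthesize` and `animate` are hypothetical stand-ins for the XTTS and MuseTalk calls, not the repo's actual service API; the message shapes mirror the WebSocket reference later in this README.

```python
import asyncio
import re

def split_sentences(text: str) -> list[str]:
    # Naive split on sentence-final punctuation; the real backend may
    # use a smarter tokenizer.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

async def synthesize(sentence: str) -> bytes:
    # Stand-in for the XTTS v2 call; returns fake audio bytes here.
    return sentence.encode()

async def animate(audio: bytes) -> str:
    # Stand-in for the MuseTalk call; returns a fake chunk URL here.
    return f"/media/chunk_{len(audio)}.mp4"

async def stream_turn(response_text: str, send_json) -> None:
    """Ship one video chunk per sentence as soon as it is ready."""
    sentences = split_sentences(response_text)
    await send_json({"type": "video_chunk_start", "total_chunks": len(sentences)})
    for i, sentence in enumerate(sentences):
        audio = await synthesize(sentence)
        video_url = await animate(audio)
        await send_json({"type": "video_chunk", "chunk_index": i,
                         "video_url": video_url, "text": sentence})
    await send_json({"type": "video_chunk_end"})
```

Because each chunk is sent the moment its video exists, the browser can start playback after the first sentence instead of waiting for the whole response.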

πŸ“ Project Structure

ai-avatar-system/
├── backend/                    # FastAPI application
│   ├── app/
│   │   ├── api/v1/             # REST endpoints (users, avatars, sessions, messages)
│   │   ├── services/           # Core services (LLM, TTS, STT, animator, storage)
│   │   ├── models/             # SQLAlchemy DB models
│   │   └── websocket.py        # Real-time WebSocket handler + sentence streaming
│   ├── alembic/                # Database migrations
│   ├── models/MuseTalk/        # MuseTalk V1.5 (lip-sync engine)
│   │   └── scripts/
│   │       └── musetalk_worker.py  # Persistent worker (models loaded once)
│   ├── tests/                  # pytest suite
│   ├── Dockerfile              # CUDA 11.8 base image
│   └── requirements.txt
├── frontend/                   # Next.js 14 application
│   ├── app/                    # App Router pages
│   ├── components/             # React components (ChatInterface, IdleAvatar, etc.)
│   ├── lib/api.ts              # Axios API client
│   └── store/                  # Zustand global state
├── nginx/
│   └── nginx.conf              # Reverse proxy (HTTP → backend/frontend, WebSocket)
├── infrastructure/
│   ├── main.tf                 # AWS Terraform (ECS, RDS, ElastiCache, S3, CloudFront)
│   └── variables.tf
├── scripts/
│   ├── setup_musetalk.sh       # Download MuseTalk models (~9 GB)
│   └── deploy-aws.sh           # One-command EC2 GPU deployment
├── docker-compose.yml          # Development (CPU) — all services
├── docker-compose.prod.yml     # Production overrides (GPU, no bind mounts, logging)
├── deploy.sh                   # ECR push + Terraform deploy (ECS path)
├── .env.example                # Development env template
└── .env.prod.example           # Production env template

🚀 Quick Start

Prerequisites

  • Docker & Docker Compose v2+ (recommended)
  • OR: Python 3.10+, Node.js 18+, FFmpeg, PostgreSQL, Redis

Option A — Docker / CPU (development)

git clone https://github.com/PunithVT/ai-avatar-system.git
cd ai-avatar-system
cp .env.example .env          # add your ANTHROPIC_API_KEY (or OPENAI_API_KEY)
docker compose up -d
Service URL
🖥️ Frontend http://localhost:3000
⚙️ Backend API http://localhost:8000
📖 Swagger Docs http://localhost:8000/docs
🌸 Celery Flower http://localhost:5555

No AWS required. Set USE_LOCAL_STORAGE=true (default) — uploads are saved to backend/uploads/.

Option B — Manual (development)

# Backend
cd backend
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
cp ../.env.example ../.env
alembic upgrade head
uvicorn main:app --reload --port 8000

# Frontend (new terminal)
cd frontend
npm install
npm run dev

Option C — Enable MuseTalk Lip-Sync

# Download models (~9 GB, one-time)
bash scripts/setup_musetalk.sh

# Set in .env
AVATAR_ENGINE=musetalk

# Restart
docker compose restart backend

🚀 GPU & AWS Deployment

MuseTalk achieves 30 FPS at 256×256 on a V100-class GPU (source: MuseTalk paper). On CPU it is 30–50× slower. Deploying on AWS gets you genuine real-time performance.

Recommended Instance

Instance GPU VRAM Spot $/hr MuseTalk FPS
g4dn.xlarge T4 16 GB ~$0.16 ~15–20 FPS
g5.xlarge A10G 24 GB ~$0.30 ~30 FPS ✓
g6.xlarge L4 24 GB ~$0.24 ~30 FPS ✓

Recommended: g5.xlarge Spot (~$72/mo at 8 hrs/day).

One-Command EC2 Deploy

# 1. Launch g5.xlarge with Ubuntu 22.04 LTS, SSH in, then:
bash <(curl -fsSL https://raw.githubusercontent.com/PunithVT/ai-avatar-system/main/scripts/deploy-aws.sh)

# 2. Fill in API keys:
nano /opt/ai-avatar-system/.env.prod

# 3. Redeploy with your keys:
bash /opt/ai-avatar-system/scripts/deploy-aws.sh --update

The script automatically:

  • Installs Docker + nvidia-docker2
  • Verifies GPU is accessible
  • Downloads MuseTalk models (~9 GB)
  • Starts all services with GPU passthrough + float16 (2× faster via Tensor Cores)

Manual Production Docker

cp .env.prod.example .env.prod   # fill in your values
docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d

What docker-compose.prod.yml adds over development:

  • GPU reservation (nvidia driver, count=1) for backend + celery-worker
  • float16 inference enabled automatically on CUDA → ~2× speedup
  • Persistent musetalk_models volume (survives container restarts)
  • No source-code bind mounts (runs from the built image)
  • Log rotation (100 MB max, 5 files)
  • Flower disabled (security)
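
The GPU reservation uses the standard Compose device-reservation syntax. The excerpt below is a hypothetical sketch of how such an override might look, not a copy of the repo's docker-compose.prod.yml:

```yaml
# Hypothetical excerpt — the real docker-compose.prod.yml may differ.
services:
  backend:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```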

Verify GPU is Working

# Check GPU is visible in container
docker exec avatar-backend python -c "
import torch
print('CUDA:', torch.cuda.is_available())
print('GPU:', torch.cuda.get_device_name(0))
print('VRAM:', round(torch.cuda.get_device_properties(0).total_memory/1024**3,1), 'GB')
"

# Expected on g5.xlarge:
# CUDA: True
# GPU: NVIDIA A10G
# VRAM: 24.0 GB

# Live GPU utilisation
docker exec avatar-backend nvidia-smi

AWS Terraform (ECS Path)

For a fully managed ECS deployment with RDS + ElastiCache + CloudFront:

cd infrastructure
terraform init
terraform apply -var="environment=production"
bash deploy.sh production

🎤 Voice Cloning

Powered by XTTS v2 — zero-shot voice cloning from 5 seconds of audio.

  1. Go to the Voice tab → Clone Voice
  2. Record 5–30 s of clear speech (or upload a WAV/MP3)
  3. Name it → Clone → select it for your session

Every TTS response then uses your cloned voice.

# REST API
curl -X POST http://localhost:8000/api/v1/voices/clone \
  -F "audio=@my_voice.wav" -F "name=My Voice" -F "language=en"

📑 API Reference

Authentication

POST /api/v1/users/register   { "email": "...", "username": "...", "password": "..." }
POST /api/v1/users/login      form: username=... password=...   → { "access_token": "..." }

# All protected routes:
Authorization: Bearer <access_token>
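
A minimal stdlib client sketch of this login flow follows. The endpoints come from the reference above; the credentials are hypothetical placeholders.

```python
import json
import urllib.parse
import urllib.request

BASE = "http://localhost:8000/api/v1"

def auth_header(token: str) -> dict:
    # All protected routes expect this header.
    return {"Authorization": f"Bearer {token}"}

def login(username: str, password: str) -> str:
    # /users/login takes form-encoded credentials and returns a JWT.
    body = urllib.parse.urlencode({"username": username, "password": password}).encode()
    req = urllib.request.Request(f"{BASE}/users/login", data=body)
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["access_token"]

# Example usage (requires the stack running locally):
#   token = login("demo_user", "demo_password")
#   req = urllib.request.Request(f"{BASE}/avatars/", headers=auth_header(token))
#   print(urllib.request.urlopen(req).read().decode())
```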

Avatars

POST   /api/v1/avatars/upload        Upload photo (multipart: file + name)
GET    /api/v1/avatars/              List avatars
DELETE /api/v1/avatars/{id}          Delete avatar
PUT    /api/v1/avatars/{id}/voice    Assign voice to avatar

Sessions & Messages

POST   /api/v1/sessions/create       { "avatar_id": "..." }
POST   /api/v1/sessions/{id}/end
GET    /api/v1/messages/session/{id}

WebSocket

WS  /ws/session/{session_id}

Client β†’ Server:

{ "type": "text",      "text": "Hello!" }
{ "type": "audio",     "audio": "<base64-webm>" }
{ "type": "set_voice", "voice_wav_path": "/path/to/speaker.wav" }

Server β†’ Client:

{ "type": "transcription",   "text": "Hello!" }
{ "type": "message",         "content": "Hi!", "role": "assistant" }
{ "type": "video_chunk_start", "total_chunks": 3 }
{ "type": "video_chunk",     "chunk_index": 0, "video_url": "...", "text": "Hi!" }
{ "type": "video_chunk_end" }
{ "type": "status",          "message": "Animating part 1 of 3…" }
{ "type": "error",           "message": "Something went wrong" }
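
On the client, video chunks are played as they arrive: the first starts immediately, the rest wait in a queue. A Python stand-in for that browser logic (the `ChunkPlayer` class and its method names are illustrative, not part of the repo, whose frontend is TypeScript):

```python
import json
from collections import deque

class ChunkPlayer:
    """Mimics the browser behaviour: play the first video chunk
    immediately, queue the rest, advance when playback ends."""

    def __init__(self) -> None:
        self.queue: deque[str] = deque()
        self.playing: str | None = None

    def handle(self, raw: str) -> None:
        msg = json.loads(raw)
        if msg["type"] == "video_chunk":
            if self.playing is None:
                self.playing = msg["video_url"]   # would start <video> playback
            else:
                self.queue.append(msg["video_url"])
        elif msg["type"] == "video_chunk_end":
            pass  # server is done sending; the queue drains as playback ends

    def on_playback_ended(self) -> None:
        # Advance to the next queued chunk, or go idle.
        self.playing = self.queue.popleft() if self.queue else None
```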

βš™οΈ Configuration

Key .env variables:

# LLM
LLM_PROVIDER=anthropic            # anthropic | openai | ollama
LLM_MODEL=claude-sonnet-4-20250514
ANTHROPIC_API_KEY=sk-ant-...

# Avatar engine
AVATAR_ENGINE=musetalk            # musetalk (GPU recommended) | simple (CPU fallback)
MUSETALK_PATH=models/MuseTalk

# TTS
TTS_PROVIDER=coqui                # coqui (XTTS v2) | elevenlabs
ELEVENLABS_API_KEY=...

# STT
WHISPER_MODEL=base                # tiny | base | small | medium | large-v3

# Storage
USE_LOCAL_STORAGE=true            # false → AWS S3
S3_BUCKET_NAME=...

# Auth
SECRET_KEY=change-me-in-production
JWT_EXPIRATION_HOURS=24
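
For reference, here is a minimal stdlib sketch of how a backend might read these flags from the environment (the repo likely uses a proper settings library; `env_bool` is an illustrative helper, not the project's actual code):

```python
import os

def env_bool(name: str, default: bool) -> bool:
    # "true"/"1"/"yes" (any case) count as true, mirroring typical .env parsing.
    return os.getenv(name, str(default)).strip().lower() in ("1", "true", "yes")

LLM_PROVIDER = os.getenv("LLM_PROVIDER", "anthropic")
AVATAR_ENGINE = os.getenv("AVATAR_ENGINE", "musetalk")
USE_LOCAL_STORAGE = env_bool("USE_LOCAL_STORAGE", True)
```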

πŸ› οΈ Tech Stack

Frontend

Library Purpose
Next.js 14 + React 18 App framework
TypeScript 5 Type safety
Tailwind CSS Styling
Zustand Global state

Backend

Library Purpose
FastAPI Async REST API + WebSocket
SQLAlchemy 2 (async) ORM with asyncpg
PostgreSQL 15 Primary database
Alembic Migrations
Redis 7 Cache + Celery broker
Celery Background tasks

AI / ML

Model Purpose
Claude / GPT-4o / Llama 3 LLM conversation
Whisper (faster-whisper) Speech-to-text
XTTS v2 (Coqui TTS) TTS + zero-shot voice cloning
MuseTalk V1.5 Photorealistic lip-sync (30 FPS on GPU)

🧪 Running Tests

cd backend
pytest -v                           # all tests
pytest tests/test_health.py         # single module
pytest --cov=app --cov-report=html  # HTML coverage

πŸ—ΊοΈ Roadmap

  • Streaming LLM — start TTS before the LLM finishes (token-by-token)
  • Emotion-driven animation — detected emotion changes facial expression
  • Embeddable widget — drop a talking avatar into any website with 3 lines of JS
  • Multi-avatar conversations — two avatars talking to each other
  • Long-term memory — RAG + vector DB for persistent context
  • Mobile app — React Native iOS/Android client
  • Video call integration — replace your face in Zoom/Meet

❓ FAQ

Q: Do I need a GPU? A: No — everything runs on CPU. MuseTalk takes 30–90 s/sentence on CPU. For real-time performance, use an AWS g5.xlarge (~$0.30/hr spot).

Q: Can I run it with no API key? A: Yes — set LLM_PROVIDER=ollama and run Ollama locally. Fully offline and free.

Q: How do I get MuseTalk models? A: Run bash scripts/setup_musetalk.sh — downloads ~9 GB of models automatically.

Q: Why does the first response take longer? A: The MuseTalk persistent worker loads all models into GPU VRAM on the first request (~60 s on GPU, ~5 min on CPU). Subsequent requests are fast.

Q: What avatar photo works best? A: A clear, well-lit frontal face photo (JPEG/PNG/WebP). Avoid sunglasses or heavy occlusion.


🤝 Contributing

Contributions welcome! Read CONTRIBUTING.md before opening a PR.

git clone https://github.com/PunithVT/ai-avatar-system.git
git checkout -b feat/my-feature
# make changes + tests
git commit -m "feat(backend): add my feature"
git push origin feat/my-feature

πŸ“„ License

MIT © 2025 — see LICENSE for details.


If AvatarAI saves you time or inspires your project, please ⭐ star the repo.



Built with FastAPI · Next.js · MuseTalk V1.5 · XTTS v2 · Whisper · Claude AI
