Beyond emotion detection — an agentic AI system that listens, understands, and responds like a companion.
Most emotion AI systems detect and stop. EmotionEcho detects and responds.
It reads your emotional state in real time through your face and voice, fuses both signals with a trained multimodal MLP, then engages you in a genuinely empathetic conversation via a Gemini-powered agent, adapting its tone and strategy dynamically until your emotional state actually improves.
If you're sad → it comforts. If you're anxious → it grounds you. If you're stressed → it reframes. If you're happy → it celebrates with you.
This is not a chatbot with emotion labels. This is a closed-loop agentic system where perception, reasoning, and response work together continuously.
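The closed loop above can be sketched in a few lines of Python. Everything here is an illustrative placeholder, not the project's actual API: `detect_emotion`, `generate_response`, the 0.5 confidence gate, and the comfort/reframe alternation are all assumptions made for the sketch.

```python
# Illustrative sketch of the closed perception-response loop.
# All function names and thresholds are hypothetical placeholders.

NEGATIVE = {"sad", "angry", "fear", "disgust"}

def run_session(detect_emotion, generate_response, max_turns=10):
    """Detect, respond, re-detect, and adapt until the user's state improves."""
    strategy = "comfort"
    emotion = "neutral"
    for _ in range(max_turns):
        emotion, confidence = detect_emotion()   # fused face + audio prediction
        if confidence < 0.5:                     # signal validation gate
            continue                             # don't respond to noisy signals
        if emotion not in NEGATIVE:
            generate_response(emotion, "celebrate")
            return emotion                       # state improved: loop closes
        generate_response(emotion, strategy)
        strategy = "reframe" if strategy == "comfort" else "comfort"
    return emotion
```

The key property is that the agent's output is re-fed into perception: each response is followed by a fresh detection, so the strategy changes when the user's state does.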
┌─────────────────────────────────────────────────────────────────┐
│ EmotionEcho System │
│ │
│ ┌──────────────────┐ ┌──────────────────┐ │
│ │ Webcam Input │ │ Microphone Input │ │
│ └────────┬─────────┘ └────────┬──────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌──────────────────┐ ┌──────────────────┐ │
│ │ Face Emotion │ │ Audio Emotion │ │
│ │ ViT Model │ │ Wav2Vec2 Model │ │
│ │ (google/vit- │ │ (facebook/ │ │
│ │ base-patch16) │ │ wav2vec2-base) │ │
│ │ │ │ │ │
│ │ Trained on │ │ Trained on │ │
│ │ FER2013 │ │ RAVDESS │ │
│ │ 7 emotions │ │ 8 emotions │ │
│ │ → probs [7] │ │ → probs [7] │ │
│ └────────┬─────────┘ └────────┬──────────┘ │
│ │ │ │
│ └──────────────┬─────────────┘ │
│ ▼ │
│ ┌───────────────────────┐ │
│ │ Fusion MLP Layer │ │
│ │ Input: [7] + [7] = 14│ │
│ │ Hidden: 64 → 32 │ │
│ │ Output: 7 emotions │ │
│ │ + confidence score │ │
│ └───────────┬───────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────┐ │
│ │ Gemini Agent │ │
│ │ Conversational │ │
│ │ Reasoning Backbone │ │
│ │ + Session Memory │ │
│ │ + Strategy Adaptation│ │
│ └───────────┬───────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────┐ │
│ │ FastAPI Backend │ │
│ │ REST Endpoints │ │
│ │ Real-time inference │ │
│ └───────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
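The Fusion MLP in the diagram (14 → 64 → 32 → 7) can be sketched in PyTorch. The layer sizes follow the diagram; the ReLU activations and deriving the confidence score from the top-class probability are assumptions, not details confirmed by the source.

```python
import torch
import torch.nn as nn

class FusionMLP(nn.Module):
    """Fuses face and audio emotion probabilities: [7] + [7] -> 7 classes."""
    def __init__(self, n_face=7, n_audio=7, n_out=7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_face + n_audio, 64),  # 14 -> 64
            nn.ReLU(),
            nn.Linear(64, 32),                # 64 -> 32
            nn.ReLU(),
            nn.Linear(32, n_out),             # 32 -> 7 logits
        )

    def forward(self, face_probs, audio_probs):
        logits = self.net(torch.cat([face_probs, audio_probs], dim=-1))
        probs = torch.softmax(logits, dim=-1)
        confidence, _ = probs.max(dim=-1)     # top-class probability as confidence
        return probs, confidence

# Usage: fuse one pair of per-modality predictions
model = FusionMLP()
face = torch.softmax(torch.randn(1, 7), dim=-1)
audio = torch.softmax(torch.randn(1, 7), dim=-1)
probs, conf = model(face, audio)
```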
| Feature | Description |
|---|---|
| 🎭 Vision Transformer | ViT fine-tuned on FER2013 — 7 emotion classes, ~35K images |
| 🎙️ Wav2Vec2 Audio | Speech emotion recognition fine-tuned on RAVDESS — 8 emotional states |
| 🔀 Fusion MLP | Custom neural network combining face + audio probabilities |
| 🤖 Gemini Agent | Conversational backbone with session memory and dynamic strategy |
| 🔄 Closed-Loop | Re-detects emotion after each response, adjusts accordingly |
| 🛡️ Responsible AI | Signal validation before triggering responses |
| ⚡ Low Latency | Sub-300ms end-to-end response time |
| Modality | Dataset | Size | Format | Emotions |
|---|---|---|---|---|
| Face | FER2013 | ~35,000 images | 48×48 grayscale | 7 |
| Audio | RAVDESS | 1,440 files · 24 actors | .wav | 8 |
Face → FER2013 : angry · disgust · fear · happy · sad · surprise · neutral
Audio → RAVDESS : neutral · calm · happy · sad · angry · fearful · disgust · surprised
Fused → 7 unified emotion classes via FusionMLP
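Since RAVDESS has eight labels but the fused head has seven, the audio distribution has to be folded into the unified space before fusion. One plausible mapping, shown below as an assumption rather than the project's confirmed scheme, merges `calm` into `neutral` and aligns the remaining labels by name:

```python
# Hypothetical RAVDESS (8 classes) -> unified FER-style (7 classes) fold.
# Merging "calm" into "neutral" is an assumption made for illustration.
RAVDESS_TO_UNIFIED = {
    "neutral": "neutral", "calm": "neutral",
    "happy": "happy", "sad": "sad", "angry": "angry",
    "fearful": "fear", "disgust": "disgust", "surprised": "surprise",
}

UNIFIED = ["angry", "disgust", "fear", "happy", "sad", "surprise", "neutral"]

def fold_audio_probs(audio_probs: dict) -> list:
    """Sum 8-class audio probabilities into the 7 unified classes."""
    folded = {label: 0.0 for label in UNIFIED}
    for ravdess_label, p in audio_probs.items():
        folded[RAVDESS_TO_UNIFIED[ravdess_label]] += p
    return [folded[label] for label in UNIFIED]
```

Because probabilities are summed rather than dropped, the folded vector still sums to 1 and can be concatenated directly with the face probabilities.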
Emotion Classification Accuracy : 87%
Emotional States Detected : 7
→ Angry · Disgust · Fear · Happy · Sad · Surprise · Neutral
End-to-End Response Latency : < 300ms
Fusion Architecture : FusionMLP (14 → 64 → 32 → 7)
Face Model trained on : FER2013 (~35,000 images)
Audio Model trained on : RAVDESS (1,440 files, 24 actors)
| Layer | Technology | Detail |
|---|---|---|
| Face Model | ViT (google/vit-base-patch16-224) | Fine-tuned on FER2013, 7-class |
| Audio Model | Wav2Vec2 (facebook/wav2vec2-base) | Fine-tuned on RAVDESS, 8-class |
| Fusion | Custom FusionMLP (PyTorch) | Concatenates face + audio probs |
| Agent | Gemini API | Conversational reasoning + memory |
| Backend | FastAPI | REST inference endpoints |
| ML Libs | HuggingFace Transformers, PyTorch | Model loading + inference |
| Datasets | FER2013 + RAVDESS | Face (35K imgs) + Audio (1440 files) |
| Environment | Python 3.9+, Conda | Dependency management |
emotion-echo/
│
├── backend/ # FastAPI application
├── frontend/ # User interface
├── ml/
│ ├── face_model/ # ViT fine-tuned on FER2013
│ ├── audio_model/ # Wav2Vec2 fine-tuned on RAVDESS
│ └── fusion_model/
│ └── fusion_model.py # FusionMLP — combines both modalities
├── deployment/ # Deployment configs
├── docs/ # Documentation
├── scripts/
│ ├── train_face.py # ViT training on FER2013
│ ├── train_audio_simple.py # Wav2Vec2 training on RAVDESS
│ ├── train_audio_continue.py
│ └── trai