Beyond emotion detection — an agentic AI system that listens, understands, and responds like a companion.
Most emotion AI systems detect and stop. EmotionEcho detects and responds.
It reads your emotional state in real time through your face and voice, fuses both signals with a trained multimodal MLP, then engages you in a genuinely empathetic conversation via a Gemini-powered agent, adapting its tone and strategy dynamically until your emotional state actually improves.
If you're sad → it comforts. If you're anxious → it grounds you. If you're stressed → it reframes. If you're happy → it celebrates with you.
This is not a chatbot with emotion labels. This is a closed-loop agentic system where perception, reasoning, and response work together continuously.
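The closed loop above can be sketched in a few lines of Python. Everything here is an illustrative placeholder, not the project's actual API: `detect_emotion`, `generate_response`, the 0.5 confidence gate, and the comfort/reframe alternation are all assumptions made for the sketch.

```python
# Illustrative sketch of the closed perception-response loop.
# All function names and thresholds are hypothetical placeholders.

NEGATIVE = {"sad", "angry", "fear", "disgust"}

def run_session(detect_emotion, generate_response, max_turns=10):
    """Detect, respond, re-detect, and adapt until the user's state improves."""
    strategy = "comfort"
    emotion = "neutral"
    for _ in range(max_turns):
        emotion, confidence = detect_emotion()   # fused face + audio prediction
        if confidence < 0.5:                     # signal validation gate
            continue                             # don't respond to noisy signals
        if emotion not in NEGATIVE:
            generate_response(emotion, "celebrate")
            return emotion                       # state improved: loop closes
        generate_response(emotion, strategy)
        strategy = "reframe" if strategy == "comfort" else "comfort"
    return emotion
```

The key property is that the agent's output is re-fed into perception: each response is followed by a fresh detection, so the strategy changes when the user's state does.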
┌─────────────────────────────────────────────────────────────────┐
│ EmotionEcho System │
│ │
│ ┌──────────────────┐ ┌──────────────────┐ │
│ │ Webcam Input │ │ Microphone Input │ │
│ └────────┬─────────┘ └────────┬──────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌──────────────────┐ ┌──────────────────┐ │
│ │ Face Emotion │ │ Audio Emotion │ │
│ │ ViT Model │ │ Wav2Vec2 Model │ │
│ │ (google/vit- │ │ (facebook/ │ │
│ │ base-patch16) │ │ wav2vec2-base) │ │
│ │ │ │ │ │
│ │ Trained on │ │ Trained on │ │
│ │ FER2013 │ │ RAVDESS │ │
│ │ 7 emotions │ │ 8 emotions │ │
│ │ → probs [7] │ │ → probs [7] │ │
│ └────────┬─────────┘ └────────┬──────────┘ │
│ │ │ │
│ └──────────────┬─────────────┘ │
│ ▼ │
│ ┌───────────────────────┐ │
│ │ Fusion MLP Layer │ │
│ │ Input: [7] + [7] = 14│ │
│ │ Hidden: 64 → 32 │ │
│ │ Output: 7 emotions │ │
│ │ + confidence score │ │
│ └───────────┬───────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────┐ │
│ │ Gemini Agent │ │
│ │ Conversational │ │
│ │ Reasoning Backbone │ │
│ │ + Session Memory │ │
│ │ + Strategy Adaptation│ │
│ └───────────┬───────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────┐ │
│ │ FastAPI Backend │ │
│ │ REST Endpoints │ │
│ │ Real-time inference │ │
│ └───────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
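The Fusion MLP in the diagram (14 → 64 → 32 → 7) can be sketched in PyTorch. The layer sizes follow the diagram; the ReLU activations and deriving the confidence score from the top-class probability are assumptions, not details confirmed by the source.

```python
import torch
import torch.nn as nn

class FusionMLP(nn.Module):
    """Fuses face and audio emotion probabilities: [7] + [7] -> 7 classes."""
    def __init__(self, n_face=7, n_audio=7, n_out=7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_face + n_audio, 64),  # 14 -> 64
            nn.ReLU(),
            nn.Linear(64, 32),                # 64 -> 32
            nn.ReLU(),
            nn.Linear(32, n_out),             # 32 -> 7 logits
        )

    def forward(self, face_probs, audio_probs):
        logits = self.net(torch.cat([face_probs, audio_probs], dim=-1))
        probs = torch.softmax(logits, dim=-1)
        confidence, _ = probs.max(dim=-1)     # top-class probability as confidence
        return probs, confidence

# Usage: fuse one pair of per-modality predictions
model = FusionMLP()
face = torch.softmax(torch.randn(1, 7), dim=-1)
audio = torch.softmax(torch.randn(1, 7), dim=-1)
probs, conf = model(face, audio)
```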
| Feature | Description |
|---|---|
| 🎭 Vision Transformer | ViT fine-tuned on FER2013 — 7 emotion classes, ~35K images |
| 🎙️ Wav2Vec2 Audio | Speech emotion recognition fine-tuned on RAVDESS — 8 emotional states |
| 🔀 Fusion MLP | Custom neural network combining face + audio probabilities |
| 🤖 Gemini Agent | Conversational backbone with session memory and dynamic strategy |
| 🔄 Closed-Loop | Re-detects emotion after each response, adjusts accordingly |
| 🛡️ Responsible AI | Signal validation before triggering responses |
| ⚡ Low Latency | Sub-300ms end-to-end response time |
| Modality | Dataset | Size | Format | Emotions |
|---|---|---|---|---|
| Face | FER2013 | ~35,000 images | 48×48 grayscale | 7 |
| Audio | RAVDESS | 1,440 files · 24 actors | .wav | 8 |
Face → FER2013 : angry · disgust · fear · happy · sad · surprise · neutral
Audio → RAVDESS : neutral · calm · happy · sad · angry · fearful · disgust · surprised
Fused → 7 unified emotion classes via FusionMLP
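Since RAVDESS has eight labels but the fused head has seven, the audio distribution has to be folded into the unified space before fusion. One plausible mapping, shown below as an assumption rather than the project's confirmed scheme, merges `calm` into `neutral` and aligns the remaining labels by name:

```python
# Hypothetical RAVDESS (8 classes) -> unified FER-style (7 classes) fold.
# Merging "calm" into "neutral" is an assumption made for illustration.
RAVDESS_TO_UNIFIED = {
    "neutral": "neutral", "calm": "neutral",
    "happy": "happy", "sad": "sad", "angry": "angry",
    "fearful": "fear", "disgust": "disgust", "surprised": "surprise",
}

UNIFIED = ["angry", "disgust", "fear", "happy", "sad", "surprise", "neutral"]

def fold_audio_probs(audio_probs: dict) -> list:
    """Sum 8-class audio probabilities into the 7 unified classes."""
    folded = {label: 0.0 for label in UNIFIED}
    for ravdess_label, p in audio_probs.items():
        folded[RAVDESS_TO_UNIFIED[ravdess_label]] += p
    return [folded[label] for label in UNIFIED]
```

Because probabilities are summed rather than dropped, the folded vector still sums to 1 and can be concatenated directly with the face probabilities.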
Emotion Classification Accuracy : 87%
Emotional States Detected : 7
→ Angry · Disgust · Fear · Happy · Sad · Surprise · Neutral
End-to-End Response Latency : < 300ms
Fusion Architecture : FusionMLP (14 → 64 → 32 → 7)
Face Model trained on : FER2013 (~35,000 images)
Audio Model trained on : RAVDESS (1,440 files, 24 actors)
| Layer | Technology | Detail |
|---|---|---|
| Face Model | ViT (google/vit-base-patch16-224) | Fine-tuned on FER2013, 7-class |
| Audio Model | Wav2Vec2 (facebook/wav2vec2-base) | Fine-tuned on RAVDESS, 8-class |
| Fusion | Custom FusionMLP (PyTorch) | Concatenates face + audio probs |
| Agent | Gemini API | Conversational reasoning + memory |
| Backend | FastAPI | REST inference endpoints |
| ML Libs | HuggingFace Transformers, PyTorch | Model loading + inference |
| Datasets | FER2013 + RAVDESS | Face (35K imgs) + Audio (1440 files) |
| Environment | Python 3.9+, Conda | Dependency management |
emotion-echo/
│
├── backend/ # FastAPI application
├── frontend/ # User interface
├── ml/
│ ├── face_model/ # ViT fine-tuned on FER2013
│ ├── audio_model/ # Wav2Vec2 fine-tuned on RAVDESS
│ └── fusion_model/
│ └── fusion_model.py # FusionMLP — combines both modalities
├── deployment/ # Deployment configs
├── docs/ # Documentation
├── scripts/
│ ├── train_face.py # ViT training on FER2013
│ ├── train_audio_simple.py # Wav2Vec2 training on RAVDESS
│ ├── train_audio_continue.py
│ └── trai