# EmoSense

Speech Emotion Recognition using Machine Learning and Deep Learning

EmoSense is a comprehensive speech emotion recognition system that explores both traditional machine learning and deep learning approaches to classify emotions from audio recordings. The project demonstrates the strengths, limitations, and trade-offs of different modeling techniques for this challenging audio classification task.

## Overview

Emotion recognition from speech is a complex problem that sits at the intersection of audio signal processing, machine learning, and human psychology. EmoSense implements and compares two distinct approaches to tackle this challenge:

1. **Multi-Layer Perceptron (MLP)** - a traditional neural network with handcrafted features
2. **Convolutional Neural Network (CNN)** - a deep learning approach with automatic feature learning

## Dataset

The project uses the RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song) dataset, which contains emotional speech recordings from professional actors.

**Dataset Characteristics:**

- 24 professional actors (12 male, 12 female)
- 8 emotions: neutral, calm, happy, sad, angry, fearful, disgust, surprised
- Multiple intensities and statement variations
- High-quality audio recordings
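RAVDESS encodes all of this metadata directly in each filename as seven hyphen-separated numeric fields, with the third field holding the emotion code. A minimal label parser might look like the following (the helper name `emotion_from_filename` is ours for illustration, not from this repository):

```python
# RAVDESS filename convention: seven hyphen-separated fields, e.g.
# "03-01-06-01-02-01-12.wav", where the third field is the emotion code.
EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

def emotion_from_filename(filename: str) -> str:
    """Return the emotion label encoded in a RAVDESS filename."""
    code = filename.split("-")[2]
    return EMOTIONS[code]

print(emotion_from_filename("03-01-06-01-02-01-12.wav"))  # fearful
```

Parsing labels from filenames this way avoids maintaining a separate annotation file, since the dataset ships fully labeled by convention.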

## Approaches

### 1. Multi-Layer Perceptron (MLP)

**Methodology:**

- Extracts handcrafted audio features: MFCC, Chroma, and Mel Spectrogram
- Averages features over time to create fixed-size feature vectors
- Trains a shallow neural network (300 hidden units) using scikit-learn
- Focuses on 4 emotions: calm, happy, fearful, disgust
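The "average over time" step can be sketched with numpy alone. The feature dimensions below (40 MFCCs, 12 chroma bins, 128 mel bands) are common defaults in librosa-based pipelines and are an assumption, not values confirmed by this repository; the random arrays stand in for real frame-level features:

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames = 130  # frame count varies per clip; 130 is arbitrary here

mfcc = rng.normal(size=(40, n_frames))    # stand-in for frame-level MFCCs
chroma = rng.random(size=(12, n_frames))  # stand-in for chroma features
mel = rng.random(size=(128, n_frames))    # stand-in for a mel spectrogram

# Taking the mean over the time axis collapses variable-length clips
# into one fixed-size vector per recording.
feature_vector = np.hstack([f.mean(axis=1) for f in (mfcc, chroma, mel)])
print(feature_vector.shape)  # (180,)
```

The key property is that every clip, regardless of duration, yields a vector of the same length, which is what a scikit-learn classifier expects.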

**Key Findings:**

- Achieves moderate accuracy with interpretable features
- Performs well on low-arousal emotions (calm)
- Struggles with high-arousal emotion discrimination (disgust, fearful, happy)
- Feature overlap between similar emotions limits performance
- Lightweight and fast to train
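The shallow network described above can be sketched with scikit-learn's `MLPClassifier`. The synthetic data is a stand-in for real extracted features, and every hyperparameter other than the 300-unit hidden layer is an assumption for illustration:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
# 200 fake clips, each a 180-dimensional averaged feature vector,
# labeled with the four target emotions.
X = rng.normal(size=(200, 180))
y = rng.choice(["calm", "happy", "fearful", "disgust"], size=200)

# One hidden layer of 300 units, as described in the methodology.
clf = MLPClassifier(hidden_layer_sizes=(300,), max_iter=200, random_state=0)
clf.fit(X, y)
pred = clf.predict(X[:5])
print(len(pred))  # 5
```

Because the model is a single hidden layer over small fixed-size vectors, it trains in seconds on CPU, which is what makes this approach lightweight compared to the CNN below.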

**Performance Insights:**

- Strong performance on acoustically distinct emotions
- Confusion between emotions with similar arousal levels
- Model confidence correlates with prediction accuracy
- Feature distributions reveal fundamental classification challenges

### 2. Convolutional Neural Network (CNN)

**Methodology:**

- Preserves temporal structure of MFCC features (13 coefficients, 3-second segments)
- Uses 1D convolutional layers to automatically learn temporal patterns
- Multi-layer architecture: Conv1D → Dropout → MaxPooling → Conv1D → Dense
- Trained on all 8 RAVDESS emotions with RMSprop optimizer
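To see how the Conv1D stack transforms the time axis, the output lengths can be walked through by hand. The frame count (130 frames for a 3-second clip at a 512-sample hop and 22.05 kHz) and the kernel/pool sizes are assumptions for illustration, not values confirmed by this repository:

```python
def conv1d_valid_length(n, kernel, stride=1):
    """Output length of a 'valid' 1-D convolution or pooling window."""
    return (n - kernel) // stride + 1

# Walk the time dimension through Conv1D -> MaxPooling -> Conv1D.
t = 130                                          # frames in a 3 s MFCC segment
t = conv1d_valid_length(t, kernel=5)             # Conv1D, kernel 5 -> 126
t = conv1d_valid_length(t, kernel=2, stride=2)   # MaxPooling1D(2)  -> 63
t = conv1d_valid_length(t, kernel=5)             # second Conv1D    -> 59
print(t)  # 59
```

Each of the 13 MFCC coefficients acts as an input channel, so the convolutions slide only along time; dropout and the final dense layer leave this length arithmetic unchanged.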

**Key Findings:**

- Demonstrates the data requirements of deep learning approaches
- Severe overfitting with limited training samples (~800 total)
- Training accuracy reaches 60% while validation plateaus at 35%
- Architecture capable of learning but constrained by dataset size
- Requires data augmentation or transfer learning for production use
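One common remedy for the overfitting described above is waveform augmentation, such as injecting low-level noise into each training clip. A minimal sketch (the 0.005 noise factor is an assumed hyperparameter, not taken from this repository):

```python
import numpy as np

def add_noise(signal, noise_factor=0.005, rng=None):
    """Return a copy of `signal` with white noise scaled to its peak amplitude."""
    rng = rng or np.random.default_rng()
    noise = rng.standard_normal(signal.shape)
    return signal + noise_factor * np.max(np.abs(signal)) * noise

# A 1-second dummy tone standing in for a real recording.
clip = np.sin(np.linspace(0, 2 * np.pi * 440, 22050))
augmented = add_noise(clip, rng=np.random.default_rng(0))
print(augmented.shape == clip.shape)  # True
```

Each pass with a fresh random seed yields a slightly different training example, effectively multiplying the ~800 available samples; pitch shifting and time stretching are other common variants of the same idea.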

**Performance Insights:**

- CNNs need substantially more data than traditional ML (thousands vs hundreds of samples)
- Small batch sizes and conservative learning rates hinder convergence
- Low recall across all classes indicates feature learning failure
- Minimal confidence separation between correct/incorrect predictions
- Dataset insufficient for training CNN from scratch

## Installation

```bash
# Clone the repository
git clone https://github.com/ariel-salgado/emosense.git
cd emosense

# Install dependencies
pip install -r requirements.txt

# Download the RAVDESS dataset and place the audio files
# in the samples/Actor_*/ directories
```
