Speech Emotion Recognition using Machine Learning and Deep Learning
EmoSense is a comprehensive speech emotion recognition system that explores both traditional machine learning and deep learning approaches to classify emotions from audio recordings. The project demonstrates the strengths, limitations, and trade-offs of different modeling techniques for this challenging audio classification task.
Emotion recognition from speech is a complex problem that sits at the intersection of audio signal processing, machine learning, and human psychology. EmoSense implements and compares two distinct approaches to tackle this challenge:
- Multi-Layer Perceptron (MLP) - A traditional neural network with handcrafted features
- Convolutional Neural Network (CNN) - A deep learning approach with automatic feature learning
The project uses the RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song) dataset, which contains emotional speech recordings from professional actors.
Dataset Characteristics:
- 24 professional actors (12 male, 12 female)
- 8 emotions: neutral, calm, happy, sad, angry, fearful, disgust, surprised
- Multiple intensities and statement variations
- High-quality audio recordings
MLP Methodology:
- Extracts handcrafted audio features: MFCC, Chroma, and Mel Spectrogram
- Averages features over time to create fixed-size feature vectors
- Trains a shallow neural network (300 hidden units) using scikit-learn
- Focuses on 4 emotions: calm, happy, fearful, disgust
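The pipeline above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the random frames stand in for features that the real pipeline would extract with librosa (`librosa.feature.mfcc`, `chroma_stft`, `melspectrogram`), and the feature sizes (40 MFCCs, 12 chroma bins, 128 mel bands) are assumed, not taken from the source.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

def fixed_size_vector(frames):
    """Average time-varying features (n_coeffs, n_frames) into one fixed vector."""
    return frames.mean(axis=1)

# Assumed feature layout: 40 MFCCs + 12 chroma + 128 mel bands = 180 dims
n_features = 40 + 12 + 128
X = np.stack([fixed_size_vector(rng.normal(size=(n_features, 130)))
              for _ in range(32)])
y = rng.integers(0, 4, size=32)   # 4 emotion labels: calm/happy/fearful/disgust

# Shallow network with a single 300-unit hidden layer, as described above
clf = MLPClassifier(hidden_layer_sizes=(300,), max_iter=200)
clf.fit(X, y)
print(X.shape)
```

Averaging over time is what keeps the model lightweight, but it also discards the temporal dynamics that the CNN approach below tries to exploit.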
Key Findings:
- Achieves moderate accuracy with interpretable features
- Performs well on low-arousal emotions (calm)
- Struggles with high-arousal emotion discrimination (disgust, fearful, happy)
- Feature overlap between similar emotions limits performance
- Lightweight and fast to train
Performance Insights:
- Strong performance on acoustically distinct emotions
- Confusion between emotions with similar arousal levels
- Model confidence correlates with prediction accuracy
- Feature distributions reveal fundamental classification challenges
CNN Methodology:
- Preserves temporal structure of MFCC features (13 coefficients, 3-second segments)
- Uses 1D convolutional layers to automatically learn temporal patterns
- Multi-layer architecture: Conv1D → Dropout → MaxPooling → Conv1D → Dense
- Trained on all 8 RAVDESS emotions with RMSprop optimizer
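To make the Conv1D → MaxPooling → Conv1D → Dense shape flow concrete, here is a pure-numpy sketch of the forward pass (the real model would be built in a deep learning framework and trained with RMSprop). The kernel size (5), filter count (64), and pooling width (4) are illustrative assumptions; only the 13 MFCC coefficients, ~3-second input, and 8 emotion classes come from the description above.

```python
import numpy as np

def conv1d(x, w):
    """Valid 1D convolution over time. x: (T, C_in), w: (K, C_in, C_out)."""
    T, _ = x.shape
    K, _, C_out = w.shape
    out = np.empty((T - K + 1, C_out))
    for t in range(T - K + 1):
        out[t] = np.einsum("kc,kcf->f", x[t:t + K], w)
    return out

def maxpool1d(x, pool=4):
    """Non-overlapping max pooling over the time axis."""
    T = (x.shape[0] // pool) * pool
    return x[:T].reshape(-1, pool, x.shape[1]).max(axis=1)

rng = np.random.default_rng(0)
x = rng.normal(size=(130, 13))                                # ~3 s of 13 MFCCs
h = np.maximum(conv1d(x, rng.normal(size=(5, 13, 64))), 0)    # Conv1D + ReLU
h = maxpool1d(h, pool=4)                                      # MaxPooling
h = np.maximum(conv1d(h, rng.normal(size=(5, 64, 64))), 0)    # Conv1D + ReLU
logits = h.mean(axis=0) @ rng.normal(size=(64, 8))            # Dense → 8 emotions
print(logits.shape)
```

Because the convolution slides over the frame axis, the model sees the MFCC sequence rather than a time-averaged summary, which is exactly what the MLP gives up.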
Key Findings:
- Demonstrates the data requirements of deep learning approaches
- Severe overfitting with limited training samples (~800 total)
- Training accuracy reaches 60% while validation plateaus at 35%
- Architecture capable of learning but constrained by dataset size
- Requires data augmentation or transfer learning for production use
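One cheap way to ease the data bottleneck noted above is waveform-level augmentation. The sketch below shows two common transforms (noise injection and random time shift) with purely illustrative parameters; pitch shifting and time stretching would need an audio library such as librosa.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(wave, scale=0.005):
    """Inject low-level Gaussian noise into a waveform."""
    return wave + scale * rng.normal(size=wave.shape)

def time_shift(wave, max_frac=0.1):
    """Roll the waveform by a random offset up to max_frac of its length."""
    limit = int(max_frac * len(wave))
    return np.roll(wave, rng.integers(-limit, limit + 1))

wave = np.sin(np.linspace(0, 100, 48000))   # stand-in for a 3 s, 16 kHz clip
augmented = [f(wave) for f in (add_noise, time_shift) for _ in range(2)]
print(len(augmented), augmented[0].shape)
```

Each transform yields a new training example with the same label, multiplying the ~800-sample dataset several times over before any transfer learning is considered.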
Performance Insights:
- CNNs need substantially more data than traditional ML (thousands vs hundreds of samples)
- Small batch sizes and conservative learning rates hinder convergence
- Low recall across all classes indicates feature learning failure
- Minimal confidence separation between correct/incorrect predictions
- Dataset insufficient for training CNN from scratch
```bash
# Clone the repository
git clone https://github.com/yourusername/emosense.git
cd emosense

# Install dependencies
pip install -r requirements.txt

# Download RAVDESS dataset
# Place audio files in samples/Actor_*/ directories
```
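Once the files are in place, labels can be recovered directly from RAVDESS filenames, which encode metadata in seven hyphen-separated fields; the third field is the emotion code. The helper names below are hypothetical, and the `samples/Actor_*` layout follows the setup step above.

```python
from pathlib import Path

# RAVDESS emotion codes (third field of the filename)
EMOTIONS = {"01": "neutral", "02": "calm", "03": "happy", "04": "sad",
            "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised"}

def emotion_from_filename(path):
    """Map a RAVDESS name like 03-01-06-01-02-01-12.wav to its emotion label."""
    return EMOTIONS[Path(path).stem.split("-")[2]]

def load_labels(root="samples"):
    """Pair every .wav under samples/Actor_*/ with its emotion label."""
    return [(p, emotion_from_filename(p))
            for p in sorted(Path(root).glob("Actor_*/*.wav"))]

print(emotion_from_filename("03-01-06-01-02-01-12.wav"))   # fearful
```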