Project submitted for Practical Assignment #1 - Speech in IPFL, at the University of Aveiro.
Miguel Neto | NºMec 119302
Real-time keyword spotting + emotion classification for instruction-critical environments. Built for a driving school context: detects when an instructor delivers a command keyword and scores their emotional state relative to a neutral baseline.
Pipeline: mic -> sliding window -> Whisper KWS -> keyword hit -> emotion + speaker analysis -> score
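A minimal sketch of the KWS stage under the pipeline above, assuming 16 kHz mono float32 windows; the keyword list, threshold, and helper names are hypothetical (the actual matching logic lives in the project code):

```python
# Sketch: transcribe one sliding window with Whisper and fuzzy-match keywords.
import difflib
import numpy as np
import whisper

KEYWORDS = ["travar", "acelerar", "parar"]  # hypothetical command keywords
model = whisper.load_model("base")

def fuzzy_hit(transcript: str, threshold: float = 0.8) -> str | None:
    """Return the first keyword similar enough to any transcribed word."""
    for word in transcript.lower().split():
        for kw in KEYWORDS:
            if difflib.SequenceMatcher(None, word, kw).ratio() >= threshold:
                return kw
    return None

def scan_window(window: np.ndarray) -> str | None:
    """Transcribe one 16 kHz float32 audio window in Portuguese, check for a hit."""
    result = model.transcribe(window, language="pt", fp16=False)
    return fuzzy_hit(result["text"])
```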
| Component | Model / Library |
|---|---|
| Keyword spotting | Whisper base (pt) + fuzzy match |
| Emotion classifier | wav2vec2-large-robust (audeering, MSP-Podcast) |
| Speaker deviation | w2v-bert-2.0 (Facebook, x-vectors) |
| Scratch classifier | MelCNN (3-layer CNN on log-mel, trained on EmoProsodyPort) |
| Backend | FastAPI + WebSocket (uvicorn) |
| Frontend | Plain HTML/JS |
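The speaker-deviation row above compares an utterance embedding against a neutral-baseline embedding. A minimal sketch, assuming mean-pooled w2v-bert-2.0 hidden states stand in for the x-vector-style embedding and cosine distance measures the shift; `embed` and `speaker_deviation` are hypothetical names, and the project's actual extraction may differ:

```python
# Sketch: utterance embedding via w2v-bert-2.0 + cosine distance from baseline.
# Requires a recent transformers release that ships Wav2Vec2BertModel.
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2BertModel

extractor = AutoFeatureExtractor.from_pretrained("facebook/w2v-bert-2.0")
encoder = Wav2Vec2BertModel.from_pretrained("facebook/w2v-bert-2.0").eval()

@torch.no_grad()
def embed(audio_16k):
    """Return one utterance-level embedding (mean over time frames)."""
    inputs = extractor(audio_16k, sampling_rate=16000, return_tensors="pt")
    hidden = encoder(**inputs).last_hidden_state  # (1, frames, dim)
    return hidden.mean(dim=1).squeeze(0)

def speaker_deviation(utterance, baseline_embedding):
    """Cosine distance from the neutral baseline; higher = more deviation."""
    sim = torch.nn.functional.cosine_similarity(
        embed(utterance), baseline_embedding, dim=0
    )
    return float(1.0 - sim)
```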
Configured score = 30% wav2vec2 emotion shift + 40% MelCNN emotion shift + 30% speaker deviation from neutral baseline.
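A guess at the corresponding `compute_score()` in `scoring.py`; only the weights come from the configuration above, and the argument names are hypothetical:

```python
# Default weights per the configured score above.
W_WAV2VEC2, W_MELCNN, W_SPEAKER = 0.30, 0.40, 0.30

def compute_score(wav2vec2_shift, melcnn_shift, speaker_deviation):
    """Weighted combination of the three per-utterance signals, each in [0, 1]."""
    return (W_WAV2VEC2 * wav2vec2_shift
            + W_MELCNN * melcnn_shift
            + W_SPEAKER * speaker_deviation)
```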
```bash
pip install torch transformers fastapi uvicorn openai-whisper numpy scipy
```

ffmpeg must be on PATH (used by Whisper and `infer.py`).
```bash
python server.py
# open http://localhost:8000
```

Pretrained models download automatically on first run. To train the scratch MelCNN:
```bash
# prepare the dataset (extract EmoProsodyPort audio into sents/ and pseudosents/)
python prepare_emoprosodyport.py

# train from scratch
python train_scratch.py

# fine-tune the wav2vec2 head on EmoProsodyPort
python train.py

# stage-2 fine-tune (unfreeze the last N encoder layers)
python finetune.py
```

Checkpoints are saved to `checkpoints/` and `checkpoints_scratch/`.
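For reference, a plausible shape for the scratch MelCNN: three conv blocks over a log-mel spectrogram, globally pooled into a 7-way head (matching the component table and the 7 EmoProsodyPort emotions). Exact layer sizes here are assumptions, not the trained architecture:

```python
# Sketch: 3-layer CNN emotion classifier over log-mel spectrograms.
import torch
import torch.nn as nn

class MelCNN(nn.Module):
    def __init__(self, n_classes: int = 7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # global average pool over (mel, time)
        )
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, log_mel: torch.Tensor) -> torch.Tensor:
        # log_mel: (batch, 1, n_mels, frames)
        x = self.features(log_mel).flatten(1)
        return self.classifier(x)
```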
Edit `scoring.py` to customize `compute_score()`. Default weights are in the file header.
EmoProsodyPort - Castro & Lima (2010). 368 clips, 7 emotions, 2 native European Portuguese speakers.