
Project submitted for Practical Assignment #1 (Speech in IPFL) at the University of Aveiro.

Miguel Neto | NºMec 119302

Adaptive Speech Monitoring

Real-time keyword spotting + emotion classification for instruction-critical environments. Built for a driving-school context: it detects when an instructor delivers a command keyword and scores their emotional state relative to a neutral baseline.

Pipeline: mic -> sliding window -> Whisper KWS -> keyword hit -> emotion + speaker analysis -> score
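
As a rough illustration of the window-to-keyword stage, the sketch below transcribes one mic window with Whisper and fuzzy-matches the output against a keyword list. The keyword list and the 0.8 threshold are placeholders, not the repo's actual values:

# Minimal sketch of the sliding-window -> Whisper -> fuzzy-match stage.
# KEYWORDS and THRESHOLD are hypothetical; the repo defines its own.
import difflib

import numpy as np
import whisper

KEYWORDS = ["acelera", "trava", "esquerda"]  # hypothetical command keywords
THRESHOLD = 0.8                              # hypothetical fuzzy-match cutoff

model = whisper.load_model("base")

def keyword_hits(window: np.ndarray) -> list[str]:
    """Transcribe one mic window (16 kHz float32 mono) and fuzzy-match keywords."""
    text = model.transcribe(window, language="pt", fp16=False)["text"].lower()
    hits = []
    for word in text.split():
        for kw in KEYWORDS:
            if difflib.SequenceMatcher(None, word, kw).ratio() >= THRESHOLD:
                hits.append(kw)
    return hits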

Stack

Component          | Model / Library
-------------------|--------------------------------------------------------------------------
Keyword spotting   | Whisper base (pt) + fuzzy match
Emotion classifier | wav2vec2-large-robust (audeering, MSP-Podcast)
Speaker deviation  | w2v-bert-2.0 (Facebook, x-vectors)
Scratch classifier | MelCNN (3-layer CNN on log-mel, trained on EmoProsodyPort; sketch below)
Backend            | FastAPI + WebSocket (uvicorn)
Frontend           | Plain HTML/JS
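
For orientation, a 3-layer CNN over log-mel input in the spirit of the scratch MelCNN could look like the sketch below; the channel counts and the 7-class head are assumptions, not the exact architecture in train_scratch.py.

import torch
import torch.nn as nn

class MelCNN(nn.Module):
    """3-layer CNN over log-mel spectrograms, input shape [batch, 1, n_mels, frames]."""
    def __init__(self, n_classes: int = 7):  # 7 EmoProsodyPort emotions
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # global pooling -> fixed-size embedding
        )
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x).flatten(1))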

The configured score is 30% wav2vec2 emotion shift + 40% MelCNN emotion shift + 30% speaker deviation from the neutral baseline.

Setup

pip install torch transformers fastapi uvicorn openai-whisper numpy scipy

ffmpeg must be on PATH (used by Whisper and infer.py).

Run

python server.py
# open http://localhost:8000
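
server.py streams mic audio from the browser over a WebSocket. A minimal sketch of that plumbing, assuming a /ws endpoint and raw PCM frames (both are guesses; the real endpoint and message format live in server.py):

# Hypothetical sketch of the WebSocket plumbing; server.py defines the
# actual endpoint path and message format.
from fastapi import FastAPI, WebSocket

app = FastAPI()

def process(chunk: bytes) -> float:
    """Placeholder for the KWS + emotion + scoring pipeline."""
    return 0.0

@app.websocket("/ws")
async def audio_stream(ws: WebSocket):
    await ws.accept()
    while True:
        chunk = await ws.receive_bytes()   # raw PCM from the browser mic
        await ws.send_json({"score": process(chunk)})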

Training (optional)

Pretrained models download automatically on first run. To train the scratch MelCNN:

# prepare dataset (first extract the EmoProsodyPort audio into sents/ and pseudosents/)
python prepare_emoprosodyport.py

# train from scratch
python train_scratch.py

# fine-tune wav2vec2 head on EmoProsodyPort
python train.py

# stage-2 fine-tune (unfreeze last N encoder layers)
python finetune.py
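
Stage-2 fine-tuning of a wav2vec2-style model typically freezes everything, then re-enables gradients for the top encoder layers. A sketch with transformers, where the checkpoint name and N are placeholders for whatever finetune.py actually uses:

import torch
from transformers import Wav2Vec2Model

N_UNFROZEN = 2  # hypothetical: number of top encoder layers to leave trainable

model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-large-robust")

# Freeze all parameters first, ...
for p in model.parameters():
    p.requires_grad = False

# ... then unfreeze the last N transformer encoder layers.
for layer in model.encoder.layers[-N_UNFROZEN:]:
    for p in layer.parameters():
        p.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable params: {trainable:,}")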

Checkpoints are saved to checkpoints/ and checkpoints_scratch/.

Scoring

Edit scoring.py to customize compute_score(). Default weights are in the file header.
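
As a starting point, a compute_score() that reproduces the default 30/40/30 weighting might look like this (the argument names are illustrative; check scoring.py for the real signature):

# Illustrative only: mirrors the default 0.3 / 0.4 / 0.3 weighting,
# not the repo's actual compute_score() signature.
def compute_score(w2v2_shift: float, melcnn_shift: float, speaker_dev: float) -> float:
    """Each input is a [0, 1] deviation from the neutral baseline."""
    return 0.3 * w2v2_shift + 0.4 * melcnn_shift + 0.3 * speaker_dev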

Dataset

EmoProsodyPort - Castro & Lima (2010). 368 clips, 7 emotions, 2 native European Portuguese speakers.

Available at https://portulanclarin.net/repository/browse/emoprosodyport/994c2a9ab70711ea8dc202420a000403ae73f66c224b46419406f34218bb76e9/