MSc Robotics Dissertation Project | University of Sheffield | 2024
End-to-end deep learning system achieving 96.98% gesture recognition and 95.03% speech recognition accuracy for real-time control of an 11-servo Youbionic Handy Lite robotic hand. Custom CNN architectures integrated with Arduino serial communication enable intuitive human-robot interaction through hand gestures and voice commands.
- Gesture Recognition: 33-layer CNN with 2.8M parameters, 5×5 filters, 128 channels
- Speech Recognition: 33-layer CNN with 1M parameters, Bark-spectrum features, temporal pooling
- Data Collection: Automated webcam (100 images/digit) and microphone (40 recordings/digit) acquisition
- Arduino Control: MATLAB-Arduino serial communication at 57,600 baud with confidence thresholding (0.3)
- Dataset Augmentation: Added 300 gesture and 250 speech samples per class to public datasets
1. Data Acquisition & Preprocessing
- Gesture: Webcam captures 200×200 RGB images → resize to 98×50 → grayscale conversion
- Speech: Microphone records 3-second audio (16 kHz) → energy-based trimming to 1 second → Bark-spectrum spectrograms (99×50, 50 frequency bands, 512-point FFT)
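The energy-based trimming step can be illustrated with a short sketch. It is one common way to cut a 3-second recording down to its most energetic 1-second window, not necessarily the method used in Mic_DataAcquisition.m. The synthetic signal, the burst position, and all variable names are assumptions made for illustration; only base MATLAB is used.

```matlab
fs = 16000;                          % sample rate stated in the pipeline (16 kHz)
rng(0);                              % reproducible synthetic example
x = 0.01 * randn(3 * fs, 1);         % 3 s of low-level noise standing in for a recording
burstIdx = (20001:28000)';           % hypothetical 0.5 s "spoken digit"
x(burstIdx) = x(burstIdx) + 0.5 * sin(2 * pi * 440 * (0:7999)' / fs);

win = fs;                            % target length: 1 s
cumEnergy = cumsum([0; x .^ 2]);     % cumEnergy(k+1) = energy of x(1:k)
winEnergy = cumEnergy(win + 1:end) - cumEnergy(1:end - win);  % energy of every 1 s window
[~, startIdx] = max(winEnergy);      % start of the most energetic window
trimmed = x(startIdx:startIdx + win - 1);
fprintf('Kept samples %d to %d\n', startIdx, startIdx + win - 1);
```

The cumulative sum gives the energy of every candidate window in a single pass, which avoids an explicit loop over start positions.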
2. Deep Learning Models
Gesture CNN:
- Input: 98×50×3 (RGB converted from grayscale)
- Architecture: 4 blocks × (2 conv layers + batch norm + ReLU + max pool) + dropout (0.2)
- Filters: 128 (5×5), padding: same
- Training: Adam optimiser, piecewise learning-rate schedule (initial rate 0.0001, multiplied by 0.1 every 3 epochs), L2 regularisation (0.0001), mini-batch size 64, 10 epochs
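The gesture network described above can be sketched as a layer array. This is one layout consistent with the stated 4 blocks and the 33-layer total, not necessarily the exact contents of Gesture.m. Placing batch normalisation and ReLU after each convolution and using 2×2 pooling with stride 2 are assumptions; the 10 output classes follow from the digit labels 0–9. Requires Deep Learning Toolbox.

```matlab
numFilters = 128;    % channels per convolution (from the list above)
filterSize = 5;      % 5x5 filters
numClasses = 10;     % digits 0-9

layers = imageInputLayer([98 50 3]);
for block = 1:4
    % Assumed block layout: (conv + batch norm + ReLU) twice, then max pooling
    layers = [layers
        convolution2dLayer(filterSize, numFilters, 'Padding', 'same')
        batchNormalizationLayer
        reluLayer
        convolution2dLayer(filterSize, numFilters, 'Padding', 'same')
        batchNormalizationLayer
        reluLayer
        maxPooling2dLayer(2, 'Stride', 2)]; %#ok<AGROW>
end
layers = [layers
    dropoutLayer(0.2)
    fullyConnectedLayer(numClasses)
    softmaxLayer
    classificationLayer];

fprintf('Layer count: %d\n', numel(layers));
```

Under these assumptions the count works out as 1 input layer, 4 × 7 block layers, and 4 output-side layers.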
Speech CNN:
- Input: 99×50×1 (Bark-spectrum)
- Architecture: 4 blocks × (2 conv layers + batch norm + ReLU + max pool) + temporal pooling (12×1) + dropout (0.2)
- Filters: 128 (3×3), padding: same
- Training: Adam optimiser, piecewise learning-rate schedule (initial rate 0.0001, multiplied by 0.1 every 5 epochs), L2 regularisation (0.0001), mini-batch size 64, 20 epochs
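The training settings listed for both models map directly onto `trainingOptions` in Deep Learning Toolbox. The sketch below uses only the values stated above; options the list does not mention (shuffling, validation data, execution environment) are left at their defaults, which may differ from the actual scripts. The variable names `gestureOpts` and `speechOpts` are illustrative.

```matlab
% Gesture CNN: rate drops by a factor of 0.1 every 3 epochs, 10 epochs in total
gestureOpts = trainingOptions('adam', ...
    'InitialLearnRate', 1e-4, ...
    'LearnRateSchedule', 'piecewise', ...
    'LearnRateDropFactor', 0.1, ...
    'LearnRateDropPeriod', 3, ...
    'L2Regularization', 1e-4, ...
    'MiniBatchSize', 64, ...
    'MaxEpochs', 10);

% Speech CNN: same settings, but dropping every 5 epochs over 20 epochs
speechOpts = trainingOptions('adam', ...
    'InitialLearnRate', 1e-4, ...
    'LearnRateSchedule', 'piecewise', ...
    'LearnRateDropFactor', 0.1, ...
    'LearnRateDropPeriod', 5, ...
    'L2Regularization', 1e-4, ...
    'MiniBatchSize', 64, ...
    'MaxEpochs', 20);
```

With these values the gesture model trains at 1e-4 for epochs 1–3, 1e-5 for epochs 4–6, 1e-6 for epochs 7–9, and 1e-7 for epoch 10. The speech model steps through the same four rates in blocks of 5 epochs.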
3. Robotic Control
- Serial communication (COM4, 57,600 baud) to Arduino Nano
- PWM control of 11× SG90 servos via PCA9685 driver (pulse width 150–600 μs)
- Gesture debouncing to prevent duplicate commands
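Confidence thresholding and debouncing can be sketched without any hardware attached. The logic below is an assumption about how the two checks might combine, not a copy of Cam_testing.m: a frame is ignored if its top score is below 0.3, and a label is only sent when it differs from the last one sent. The per-frame predictions are made-up values.

```matlab
confThreshold = 0.3;   % threshold stated for the Arduino control path

% Hypothetical per-frame classifier output: predicted digit and its top score
predLabels = [3   3   3   7   7    3   3   5    5   5];
predScores = [0.9 0.8 0.2 0.1 0.95 0.6 0.7 0.25 0.5 0.9];

lastSent = -1;         % sentinel: nothing sent yet
sent = [];             % commands that would go to the Arduino
for k = 1:numel(predLabels)
    if predScores(k) < confThreshold
        continue       % low-confidence frame: ignore
    end
    if predLabels(k) == lastSent
        continue       % debounce: same gesture as the last command
    end
    % On hardware this is where the label would be written to the serial
    % port, e.g. write(arduinoSerial, predLabels(k), "uint8") (payload format assumed).
    sent(end + 1) = predLabels(k); %#ok<SAGROW>
    lastSent = predLabels(k);
end
disp(sent)
```

Ten frames reduce to four commands here, so the servos are not re-driven while a gesture is held steady.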
| Category | Technologies |
|---|---|
| Software | MATLAB R2023a, Deep Learning Toolbox, Image Processing Toolbox, Audio Toolbox, Arduino IDE |
| Hardware | Youbionic Handy Lite, Arduino Nano, PCA9685 16-channel PWM driver, webcam, microphone |
| ML Architecture | CNN (33 layers), batch normalisation, dropout, ReLU, max pooling, Adam optimiser |
| Datasets | ASL Sign Language (16,500 images), English Spoken Digits (17,000 audio files) |

| Model | Accuracy | Precision | Recall | F1 Score | Parameters |
|---|---|---|---|---|---|
| Gesture Recognition | 96.98% | 0.97 | 0.97 | 0.97 | 2.8M |
| Speech Recognition | 95.03% | 0.95 | 0.95 | 0.95 | 1.0M |

Gesture Recognition
Image Size: [98, 50, 3]
Filters: 128 (5×5)
Dropout: 0.2
Learning Rate: 0.0001 (piecewise, drop 0.1 every 3 epochs)
Mini-batch: 64
Max Epochs: 10
Train/Val Split: 80/20
L2 Regularisation: 0.0001

Speech Recognition
Spectrogram Size: [99, 50, 1]
Filters: 128 (3×3)
Dropout: 0.2
Temporal Pooling: 12×1
Learning Rate: 0.0001 (piecewise, drop 0.1 every 5 epochs)
Mini-batch: 64
Max Epochs: 20
Train/Val Split: 60/40

Required MATLAB Toolboxes:
- Deep Learning Toolbox
- Image Processing Toolbox
- Audio Toolbox
- MATLAB Support Package for Arduino Hardware
- MATLAB Support Package for USB Webcams
Hardware:
- USB webcam or built-in camera
- Microphone (USB or built-in)
- Arduino Nano (optional, for full robotic control demo)
- Youbionic Handy Lite with PCA9685 driver (optional)
Collect Training Data:
cd gesture_recognition
run('Webcam_DataAcquisition.m')
Train Model:
run('Gesture.m')
Real-Time Testing:
run('Cam_testing.m')

Collect Audio Data:
cd speech_recognition
run('Mic_DataAcquisition.m')
Preprocess Audio:
run('dataPreProcessing.m')
Train Model:
run('Speech.m')

| Script | Purpose | Key Features |
|---|---|---|
| Gesture.m | Train gesture CNN | 80/20 split, random scaling (0.85–1.35), confusion matrix, F1 score |
| Webcam_DataAcquisition.m | Capture hand gestures | 200×200 ROI, 98×50 resize, BMP output, 100 images/digit |
| Cam_testing.m | Real-time gesture classification | Serial to Arduino, confidence threshold (0.3), debouncing |
| Speech.m | Train speech CNN | Bark-spectrum input, temporal pooling (12×1), 20 epochs |
| Mic_DataAcquisition.m | Record spoken digits | 16 kHz, 3-second recording, energy-based trimming to 1s |
| dataPreProcessing.m | Audio feature extraction | Bark-spectrum (50 bands), 512 FFT, 60/40 train/val split |
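The precision, recall, and F1 figures reported for the models can all be derived from a confusion matrix. The sketch below shows that arithmetic on a small made-up 3-class matrix. It assumes rows are true classes and columns are predicted classes (the convention used by `confusionmat`) and uses macro averaging, which may or may not match how Gesture.m aggregates its scores.

```matlab
% Hypothetical 3-class confusion matrix: rows = true class, columns = predicted
C = [48  1  1
      2 45  3
      0  4 46];

tp        = diag(C);                    % correct predictions per class
precision = tp ./ sum(C, 1)';           % tp / everything predicted as that class
recall    = tp ./ sum(C, 2);            % tp / everything truly in that class
f1        = 2 * precision .* recall ./ (precision + recall);

accuracy = sum(tp) / sum(C(:));         % overall fraction correct
macroF1  = mean(f1);                    % unweighted mean over classes
fprintf('Accuracy %.4f, macro F1 %.4f\n', accuracy, macroF1);
```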
% From Cam_testing.m
load('trainedGestureModel.mat');
arduinoSerial = serialport('COM4', 57600);
...

% From dataPreProcessing.m
afe = audioFeatureExtractor('SampleRate',16e3,'FFTLength',512,'barkSpectrum',true);

% From Gesture.m
layers = [imageInputLayer([98 50 3]) ... dropoutLayer(0.2) ... classificationLayer];

Gesture dataset:
- Source: ASL Sign Language Numbers + custom webcam data
- Size: 16,500 images (1,650 per digit)
- Format: Grayscale BMP, 98×50 pixels
- Augmentation: Random scaling (0.85–1.35)
- Train/Val Split: 80/20

Speech dataset:
- Source: Free Spoken Digit Dataset + custom recordings
- Size: 17,000 audio files (1,700 per digit)
- Format: WAV, 16 kHz mono, 1-second duration
- Preprocessing: Bark-spectrum spectrograms (99×50)
- Train/Val Split: 60/40

Serial communication sends digit labels (0–9) to the Arduino Nano. See dissertation Appendix 6.4 for the full servo control code.

Pre-trained models are not included due to size limits. Run the training scripts to generate:
- trainedGestureModel.mat
- trainedSpeechModel.mat

Check connected devices:
- Webcam: webcamlist
- Arduino: serialportlist("available")
- GPU: gpuDeviceCount

Future work:
- Speech-to-servo control integration pending
- Explore MFCC + RNNs for speech
- Transfer learning to speed training
- Python port for cross-platform support
- Docker container for reproducibility
Full Dissertation: docs/SID230118016_Dissertation.pdf
Supervisor: Dr Payam Soulatiantork
Institution: University of Sheffield, Department of Automatic Control and Systems Engineering (now Dept. of EEE)
Academic Year: 2023–2024
- Dr Payam Soulatiantork for supervision
- Public datasets: ASL Sign Language, Free Spoken Digit Dataset
- Tools: MATLAB, Arduino IDE, draw.io, ChatGPT
This project is licensed under the MIT License – see LICENSE.
Sai Karthik Kagolanu
MSc Robotics, University of Sheffield
GitHub: @SkullKrak7
LinkedIn: linkedin.com/in/sai-karthik-kagolanu
Built with MATLAB • Trained on GPUs • Deployed on Arduino