SignDecode is a computer vision application that performs real-time American Sign Language (ASL) recognition using skeletal hand tracking and neural network classification. The system processes webcam input to detect hand landmarks and classify gestures into alphanumeric characters (A-Z, 0-9).
Key Technical Achievement: By using skeletal coordinate data instead of raw pixel analysis, the system achieves 99% reduction in input dimensionality (63 features vs 4,096 pixels for 64x64 images) while maintaining 92% classification accuracy.
Camera Input → MediaPipe Hand Detection → Landmark Extraction (21 points × 3 coords)
→ MLP Classifier → Temporal Smoothing → Output Display
- Computer Vision: OpenCV, Google MediaPipe
- Machine Learning: TensorFlow/Keras (Multi-Layer Perceptron)
- Backend: Flask (REST API)
- Frontend: Vanilla JavaScript, HTML5, CSS3
- Data Processing: NumPy, Pandas
Input Layer: 63 features (21 hand landmarks × xyz coordinates)
Hidden Layer 1: 128 neurons, ReLU activation, 30% dropout
Hidden Layer 2: 64 neurons, ReLU activation, 30% dropout
Hidden Layer 3: 64 neurons, ReLU activation
Output Layer: 36 neurons (A-Z, 0-9), Softmax activation
Training Configuration:
- Optimizer: Adam
- Loss Function: Sparse Categorical Crossentropy
- Epochs: 50
- Batch Size: 16
- Train/Test Split: 80/20
Performance Metrics:
- Training Accuracy: 92%
- Validation Accuracy: 88%
- Inference Latency: <50ms per frame
- Model Size: ~500KB
SignDecode/
├── src/ # Application source code
│ ├── app.py # Flask server, API endpoints
│ ├── model.py # Model wrapper class
│ ├── utils.py # Keypoint extraction utilities
│ ├── labels.py # Class label mappings
│ ├── text_to_speech.py # Audio output module
│ ├── static/ # Frontend assets (CSS, JS)
│ └── templates/ # HTML templates
├── training/ # Model training pipeline
│ ├── collect_data.py # Data collection utility
│ ├── train_model.py # Model training script
│ └── dataset/ # Training data storage
├── models/ # Trained model artifacts
│ └── sign_language_model.h5
├── run.py # Application entry point
├── requirements.txt # Python dependencies
└── README.md
- Python 3.8+
- Webcam
- Modern web browser (Chrome/Firefox recommended)
# Clone repository
git clone https://github.com/thesakshidigg/SignDecode-Sign-Language-Recognition-.git
cd SignDecode-Sign-Language-Recognition-
# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Run application
python run.pyNavigate to http://localhost:5000 in your browser.
python run.pyThe Flask server starts on port 5000. The web interface provides:
- Real-time hand tracking visualization
- ASL character recognition
- Text-to-speech output
- Interactive sign language game
Step 1: Collect Training Data
cd training
python collect_data.pyFollow prompts to record hand gestures for each character. Data is saved to training/dataset/sign_data.csv.
Step 2: Train Model
python train_model.pyTrains the neural network and saves the model to models/sign_language_model.h5.
MediaPipe detects 21 anatomical landmarks per hand:
- Wrist (1 point)
- Thumb (4 points)
- Index, Middle, Ring, Pinky fingers (4 points each)
Each landmark provides (x, y, z) coordinates, normalized to [0, 1] range.
The system uses a Multi-Layer Perceptron (MLP) rather than a Convolutional Neural Network (CNN) because:
- Input Type: Structured coordinate data (not spatial image data)
- Efficiency: 63 input features vs 4,096+ for image-based approaches
- Invariance: Skeletal data is inherently invariant to lighting, background, and skin tone
A 15-frame consistency filter prevents prediction flickering:
if predicted_label == last_predicted_label:
frame_count += 1
if frame_count >= THRESHOLD_FRAMES:
output_text += predicted_characterProcesses a single video frame for hand detection and classification.
Request:
{
"image": "data:image/jpeg;base64,..."
}Response:
{
"prediction": "A",
"image": "data:image/jpeg;base64,..."
}Returns current recognized text.
Clears the output text buffer.
- Latency: Sub-50ms inference time enables real-time processing at 30 FPS
- Accuracy Trade-off: Temporal smoothing reduces false positives at the cost of ~500ms recognition delay
- Scalability: Lightweight model enables deployment on edge devices without GPU
- Support for dynamic gestures (word-level recognition using LSTM/GRU)
- Multi-language sign language support (BSL, ISL, etc.)
- Mobile deployment using TensorFlow Lite
- Data augmentation for improved robustness
- REST API for third-party integration
Contributions are welcome. Please follow standard Git workflow:
- Fork the repository
- Create a feature branch (
git checkout -b feature/improvement) - Commit changes (
git commit -m 'Add feature') - Push to branch (
git push origin feature/improvement) - Open a Pull Request
MIT License - see LICENSE file for details.
Sakshi Diggikar
GitHub: @thesakshidigg
- Google MediaPipe for hand tracking framework
- TensorFlow team for ML infrastructure
- ASL dataset contributors