A real-time sign language interpretation system that converts sign language gestures into natural speech using computer vision, deep learning, and natural language processing.
- Real-time Sign Language Detection: Custom YOLO12 model trained for 22 sign language gestures
- Multi-gesture Tracking: DeepSORT algorithm for consistent gesture identification
- Natural Language Processing: Converts gesture sequences into grammatically correct sentences
- Voice Synthesis: Text-to-speech conversion with audio file generation
- Professional GUI: Modern PyQt5 interface with dark theme
- Image & Video Support: Process both static images and video files
- Audio Playback: Integrated audio player for generated speech
The system recognizes 22 different sign language gestures:
- Basic: school, sorry, help, easy, work, age, effort, respect
- Location: near, home, village, washroom
- Social: friend, teacher, message, good
- Actions: eating, drinking, pass, fail
- Settings: preset, dress
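For reference, the 22 classes above can be held in a simple lookup list. Note that the index order below is an assumption for illustration — the real order comes from the model's training configuration:

```python
# Hypothetical class map for the 22 gestures; the actual index order
# depends on how the YOLO model was trained (see the training data.yaml).
GESTURE_CLASSES = [
    "school", "sorry", "help", "easy", "work", "age", "effort", "respect",
    "near", "home", "village", "washroom",
    "friend", "teacher", "message", "good",
    "eating", "drinking", "pass", "fail",
    "preset", "dress",
]

def class_name(class_id: int) -> str:
    """Map a raw YOLO class index to its gesture label."""
    return GESTURE_CLASSES[class_id]
```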
- YOLO12Detector (`components/yolo_inference.py`)
  - Custom-trained YOLO model for sign language detection
  - Confidence-based filtering
  - Bounding box generation
- DeepSORTTracker (`components/deep_sort_tracker.py`)
  - Object tracking for temporal consistency
  - Multi-object tracking capabilities
  - Track ID management
- SentenceBuilder (`components/sentence_builder.py`)
  - Grammar templates and phrase patterns
  - Context-aware sentence construction
  - NLTK integration for advanced NLP
- TTSEngine (`components/tts_engine.py`)
  - Text-to-speech conversion
  - Audio file generation and management
- Voice Pipeline (`components/yolo2voice_pipeline.py`)
  - Ollama LLM integration for natural sentence generation
  - Context-aware prompts for different gesture types
- DrawingUtils (`utils/draw.py`): Visualization and annotation tools
- SystemLogger (`utils/system_logger.py`): Logging and monitoring
- VideoProcessor (`utils/video_utils.py`): Video frame handling
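The confidence-based filtering step in YOLO12Detector can be sketched as follows. The detection-tuple layout here is illustrative, not the module's actual data structure:

```python
from typing import List, Tuple

# (label, confidence, xyxy bounding box) — an illustrative layout
Detection = Tuple[str, float, Tuple[int, int, int, int]]

def filter_detections(detections: List[Detection],
                      confidence_threshold: float = 0.5) -> List[Detection]:
    """Keep only detections at or above the confidence threshold,
    mirroring the filtering step in components/yolo_inference.py."""
    return [d for d in detections if d[1] >= confidence_threshold]
```

For example, with `confidence_threshold = 0.5`, a detection scored 0.91 survives while one scored 0.32 is discarded.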
```
Input (Image/Video) → YOLO Detection → DeepSORT Tracking →
Sentence Building → Ollama LLM → Text-to-Speech → Audio Output
```
- Input Processing: Accept image or video input through GUI
- Gesture Detection: YOLO12 model detects sign language gestures
- Object Tracking: DeepSORT maintains consistent gesture tracking
- Sequence Analysis: Build gesture sequences from tracked objects
- Language Generation: Convert sequences to natural language using Ollama
- Speech Synthesis: Generate audio output using TTS engine
- User Interface: Display results with visual annotations and audio playback
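The steps above can be condensed into a sketch, with each stage passed in as a callable (the function names below are placeholders, not the project's actual API):

```python
def run_pipeline(frame, detect, track, build_sentence, generate, speak):
    """Sketch of the end-to-end flow; each argument is one pipeline stage."""
    detections = detect(frame)          # YOLO12 gesture detection
    tracks = track(detections)          # DeepSORT temporal tracking
    sentence = build_sentence(tracks)   # grammar-template construction
    text = generate(sentence)           # Ollama LLM rewording
    return speak(text)                  # TTS -> audio output
```

Each stage consumes the previous stage's output, so swapping in a different tracker or LLM only changes one callable.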
- Python 3.8+
- CUDA-compatible GPU (recommended)
- Ollama server with llama3 model
```
pip install -r requirements.txt
```

Key dependencies:
- `ultralytics` - YOLO implementation
- `PyQt5` - GUI framework
- `opencv-python` - Computer vision
- `numpy` - Numerical computing
- `pyttsx3` - Text-to-speech
- `requests` - HTTP client for Ollama
- `nltk` - Natural language processing (optional)
- Install Ollama: https://ollama.ai
- Pull the llama3 model:
  ```
  ollama pull llama3
  ```
- Start the Ollama server:
  ```
  ollama serve
  ```
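Once the server is running, a gesture sequence can be turned into a sentence with a plain HTTP request against Ollama's generate endpoint, roughly as the pipeline does. The prompt wording and helper names here are illustrative:

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama endpoint

def build_prompt(gestures):
    """Turn a gesture sequence into an LLM prompt (wording is illustrative)."""
    return ("Turn these sign language gestures into one natural sentence: "
            + ", ".join(gestures))

def gestures_to_sentence(gestures, model="llama3", timeout=30):
    """POST to Ollama's /api/generate; with stream=False the full
    completion comes back in the JSON 'response' field."""
    payload = {"model": model, "prompt": build_prompt(gestures), "stream": False}
    resp = requests.post(OLLAMA_URL, json=payload, timeout=timeout)
    resp.raise_for_status()
    return resp.json()["response"].strip()
```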
Launch the GUI:

```
python qt_gui.py
```

or run the main application directly:

```
python main.py
```

Adjust detection sensitivity in the code:

```python
confidence_threshold = 0.7  # Adjust between 0.1 and 0.9
```

```
sign2speech/
├── components/                 # Core system components
│   ├── deep_sort/              # DeepSORT tracking implementation
│   ├── yolo_inference.py       # YOLO detection module
│   ├── deep_sort_tracker.py    # Object tracking
│   ├── sentence_builder.py     # NLP and grammar
│   ├── tts_engine.py           # Text-to-speech
│   └── yolo2voice_pipeline.py  # Voice generation pipeline
├── utils/                      # Utility functions
│   ├── draw.py                 # Visualization tools
│   ├── system_logger.py        # Logging system
│   └── video_utils.py          # Video processing
├── models/                     # ML model files
│   └── sign.pt                 # Custom YOLO model
├── logs/                       # System logs and outputs
│   ├── audio_outputs/          # Generated audio files
│   └── *.jpg                   # Processed images
├── voices/                     # Pre-generated voice samples
├── main.py                     # Main application
├── qt_gui.py                   # GUI interface
└── requirements.txt            # Dependencies
```
- Frame-by-frame gesture detection
- Temporal consistency through tracking
- Non-blocking TTS processing
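Non-blocking TTS can be achieved with a worker thread draining a queue, so detection frames are never held up waiting for audio. This is a minimal sketch, independent of the project's actual TTSEngine implementation:

```python
import queue
import threading

def start_tts_worker(speak_fn):
    """Run text-to-speech on a background thread so detection never blocks.
    speak_fn is whatever actually produces audio (e.g. a pyttsx3 call)."""
    q = queue.Queue()

    def worker():
        while True:
            text = q.get()
            if text is None:   # sentinel: shut the worker down
                break
            speak_fn(text)
            q.task_done()

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    return q, t
```

Usage: `q, t = start_tts_worker(engine_say)` then `q.put("I need help.")`; enqueue `None` and `t.join()` to shut down cleanly.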
- Modern dark theme interface
- Image preview with bounding box visualization
- Integrated audio player
- Real-time status updates
- Context-aware sentence generation
- Grammar templates for natural speech
- Ollama LLM integration for enhanced language quality
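Grammar-template lookup of the kind SentenceBuilder performs might look like the sketch below. The templates shown are invented examples, not the project's actual patterns:

```python
# Hypothetical templates; the real patterns live in components/sentence_builder.py
TEMPLATES = {
    ("help",): "I need help.",
    ("home", "eating"): "I am eating at home.",
}

def build_sentence(gestures):
    """Use a matching template when one exists; otherwise fall back to
    joining the gesture words into a plain sentence."""
    key = tuple(gestures)
    if key in TEMPLATES:
        return TEMPLATES[key]
    return " ".join(gestures).capitalize() + "."
```

Sequences without a template still produce readable output, which the Ollama step can then reword into more natural speech.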
```python
# In components/yolo_inference.py
confidence_threshold = 0.5  # Minimum detection confidence
```

```python
# In main.py
buffer_timeout = 3.0            # Gesture sequence timeout (seconds)
min_signs_for_sentence = 2      # Minimum gestures for a sentence
```

```python
# In components/yolo2voice_pipeline.py
OLLAMA_URL = "http://localhost:11434/api/generate"
model = "llama3"  # Ollama model name
```

- Detection Speed: ~30 FPS on GPU
- Accuracy: 85%+ on trained gestures
- Latency: <2 seconds from gesture to speech
- Memory Usage: ~2GB GPU memory
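The `buffer_timeout` and `min_signs_for_sentence` settings control when a tracked gesture sequence is flushed into a sentence. One way that logic might look — a sketch only; the exact rules in `main.py` may differ:

```python
import time

class GestureBuffer:
    """Collect tracked gestures and flush them once the signer pauses."""

    def __init__(self, buffer_timeout=3.0, min_signs_for_sentence=2,
                 clock=time.monotonic):
        self.timeout = buffer_timeout
        self.min_signs = min_signs_for_sentence
        self.clock = clock          # injectable for testing
        self.signs = []
        self.last_seen = None

    def add(self, sign):
        if not self.signs or self.signs[-1] != sign:  # drop immediate repeats
            self.signs.append(sign)
        self.last_seen = self.clock()

    def flush_if_ready(self):
        """Return the buffered sequence after `timeout` idle seconds
        (and at least `min_signs` gestures); otherwise return None."""
        if (self.signs and self.last_seen is not None
                and self.clock() - self.last_seen >= self.timeout
                and len(self.signs) >= self.min_signs):
            seq, self.signs = self.signs, []
            return seq
        return None
```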
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
Note: This system requires a trained YOLO model (sign.pt) and a running Ollama server for full functionality.
