Voice-activated object detection for the visually impaired.
- 4. Shared Conceptual Architecture
- 5. Execution Flow (Conceptual)
- 6. VisionX Variants
- 7. Variant Architectures
- 8. Execution Flow per Variant
- 9. Subsystem Breakdown
VisionX is an assistive perception system designed to help visually impaired users understand their surroundings through structured sensory processing and calm feedback.
VisionX is not a single application. It is a perception philosophy implemented across multiple platforms (Desktop, Mobile, and future embedded systems), each respecting its own physical and system constraints.
- Local-first (privacy-preserving)
- Deterministic and explainable flow
- Human-centered feedback
World → Sensors → Signals → Meaning → Feedback
VisionX transforms raw physical signals (light, sound, vibration) into contextual meaning and communicates that meaning back to the user in the least cognitively demanding way possible.
All perception and reasoning happen on-device. No cloud inference, no hidden data transfer.
The system prioritizes understandable pipelines over opaque abstractions.
Latency, predictability, and calm output matter more than raw accuracy.
VisionX is meant to be read, understood, and extended.
This architecture represents the idea, not the repository layout.
```mermaid
flowchart TD
A[Sensor Input] --> B[Signal Processing]
B --> C[Perception Models]
C --> D[Context Reasoning]
D --> E[User Feedback]
```
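The five stages above can be sketched as plain function composition. Everything below is illustrative: the stage functions and the dict payloads are placeholders, not actual VisionX APIs.

```python
from typing import Any, Callable

# Each stage is a plain function, composed in a fixed, inspectable order,
# matching the "explicit pipelines" principle.
Stage = Callable[[Any], Any]

def make_pipeline(*stages: Stage) -> Stage:
    """Compose stages left to right: sensor input flows to user feedback."""
    def run(data: Any) -> Any:
        for stage in stages:
            data = stage(data)
        return data
    return run

# Hypothetical stage implementations, purely for demonstration.
capture  = lambda _: {"frame": "raw-pixels"}
process  = lambda s: {**s, "normalized": True}
perceive = lambda s: {**s, "detections": ["person"]}
reason   = lambda s: {**s, "message": "I see a person"}
feedback = lambda s: s["message"]

pipeline = make_pipeline(capture, process, perceive, reason, feedback)
print(pipeline(None))  # -> I see a person
```

Because every stage is an ordinary function, the whole flow can be read top to bottom and each layer tested in isolation.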
```mermaid
sequenceDiagram
participant User
participant System
participant Sensors
participant Models
participant Feedback
User->>System: Trigger Perception
System->>Sensors: Capture Signals
Sensors->>Models: Processed Data
Models->>Feedback: Contextual Meaning
Feedback->>User: Audio / Haptics
```
VisionX exists as multiple variants, each adapting the same conceptual pipeline to different constraints.
| Variant | Platform | Constraints | Interaction |
|---|---|---|---|
| Desktop | Linux / Windows | Power-rich, static | Keyboard / Voice |
| Mobile | Android | Battery, thermal | Touch / Voice |
| Future | Wearables | Ultra-low power | Audio / Haptics |
```mermaid
flowchart TD
Cam[USB Camera] --> OpenCV
OpenCV --> YOLO
YOLO --> Reasoner
Reasoner --> TTS
Reasoner --> CLI
```
Characteristics:
- Continuous capture
- Long-running processes
- Higher memory budget
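The continuous-capture characteristic can be sketched as a generator that owns the camera for its whole lifetime. The `Camera` interface and `FakeCamera` below are illustrative; the desktop build would back them with something like `cv2.VideoCapture`, which is an assumption, not the actual VisionX code.

```python
from typing import Iterator, Optional, Protocol

class Camera(Protocol):
    """Minimal camera interface; a real desktop build might wrap
    cv2.VideoCapture behind this shape (an assumption)."""
    def read(self) -> Optional[bytes]: ...
    def release(self) -> None: ...

def continuous_frames(camera: Camera, max_frames: int) -> Iterator[bytes]:
    """Long-running capture loop: yield frames until the stream ends.
    max_frames keeps the illustration bounded; a real loop runs until quit."""
    try:
        for _ in range(max_frames):
            frame = camera.read()
            if frame is None:        # camera disconnected or stream ended
                break
            yield frame
    finally:
        camera.release()             # always free the device handle

# Fake camera used purely for demonstration.
class FakeCamera:
    def __init__(self, frames):
        self._frames = list(frames)
        self.released = False
    def read(self):
        return self._frames.pop(0) if self._frames else None
    def release(self):
        self.released = True

cam = FakeCamera([b"f1", b"f2", b"f3"])
frames = list(continuous_frames(cam, max_frames=10))
print(frames)  # [b'f1', b'f2', b'f3']
```

The `try`/`finally` guarantees the long-running process releases the device even if a downstream stage raises.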
```mermaid
flowchart TD
Cam[Android Camera] --> FrameLimiter
FrameLimiter --> TFLite
TFLite --> ContextFilter
ContextFilter --> AndroidTTS
ContextFilter --> Haptics
```
Characteristics:
- On-demand execution
- Battery-aware processing
- Aggressive resource cleanup
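The `FrameLimiter` node in the mobile diagram suggests battery-aware throttling: dropping frames before they ever reach inference. The sketch below is a guess at that role, not the actual implementation; the interval and clock injection are illustrative choices.

```python
import time

class FrameLimiter:
    """Battery-aware throttle: pass at most one frame per interval.
    Dropping a frame here means skipping TFLite inference entirely,
    which is where the battery saving comes from."""
    def __init__(self, min_interval_s: float, clock=time.monotonic):
        self.min_interval_s = min_interval_s
        self._clock = clock          # injectable for testing
        self._last = float("-inf")

    def allow(self) -> bool:
        now = self._clock()
        if now - self._last >= self.min_interval_s:
            self._last = now
            return True
        return False                 # drop the frame

# Simulated clock: frames arrive every 0.1 s, limiter allows one per 0.5 s.
t = 0.0
limiter = FrameLimiter(0.5, clock=lambda: t)
accepted = []
for i in range(10):
    if limiter.allow():
        accepted.append(i)
    t += 0.1
print(accepted)  # [0, 5]: frames 0 and 5 pass, the rest are dropped
```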
```mermaid
graph TD
Start([App Starts]) --> Init[Initialize VisionX App]
Init --> LoadConfig[Load Config<br/>Camera ID, YOLO Model, etc.]
LoadConfig --> CreateSupervisor[Create VisionSupervisor]
CreateSupervisor --> InitServices[Initialize Services]
InitServices --> InitCamera[Camera Service<br/>OpenCVCamera]
InitServices --> InitYOLO[Vision Service<br/>YOLOVisionService]
InitServices --> InitTTS[Audio Service<br/>Pyttsx3TTS]
InitServices --> InitVosk[Speech Recognition<br/>VoskSpeechRecognition]
InitCamera --> WarmUp[Warm Up YOLO Model]
InitYOLO --> WarmUp
InitTTS --> WarmUp
InitVosk --> WarmUp
WarmUp --> TTSReady["TTS: 'Ready, Say start'"]
TTSReady --> StartListening[Start Voice Recognition Thread]
StartListening --> ListenLoop{Listening for<br/>Trigger Words}
ListenLoop -->|Microphone Input| VoskProcessing[Vosk Processes Audio]
VoskProcessing --> RecognizeText{Text Recognized?}
RecognizeText -->|No| ListenLoop
RecognizeText -->|Yes| CheckTrigger{Contains Trigger Word?<br/>start, detect, see, look}
CheckTrigger -->|No| ListenLoop
CheckTrigger -->|Yes| CheckCooldown{Cooldown Active?<br/>3 seconds}
CheckCooldown -->|Yes| ListenLoop
CheckCooldown -->|No| VoiceCommand[Voice Command Detected]
VoiceCommand --> UpdateCooldown[Update Last Detection Time]
UpdateCooldown --> StartDetection[Start Detection Thread]
StartDetection --> TTSDetecting["TTS: 'Detecting'"]
TTSDetecting --> OpenCamera[Open Camera]
OpenCamera --> CaptureFrame[Capture Single Frame]
CaptureFrame --> YOLOProcess[YOLO Processes Frame]
YOLOProcess --> DetectObjects[Detect Objects<br/>Confidence > 0.5]
DetectObjects --> CreateDetections[Create Detection Objects<br/>class_name, confidence, bbox]
CreateDetections --> GroupObjects[Group by Class Name<br/>Count Objects]
GroupObjects --> GenerateDesc[Generate Natural Language<br/>Description UseCase]
GenerateDesc --> CheckCount{How Many Objects?}
CheckCount -->|0| NoObjects["Description: 'I don't see<br/>anything clearly'"]
CheckCount -->|1| SingleObject["Description: 'I see a person'"]
CheckCount -->|2| TwoObjects["Description: 'I see a person<br/>and a laptop'"]
CheckCount -->|3+| MultiObjects["Description: 'I see a person,<br/>a laptop, and 2 cups'"]
NoObjects --> Speak[TTS Speaks Description]
SingleObject --> Speak
TwoObjects --> Speak
MultiObjects --> Speak
Speak --> ReleaseCamera[Release Camera]
ReleaseCamera --> UpdateUI[Update UI Status]
UpdateUI --> BackToListening[Back to Listening Mode]
BackToListening --> ListenLoop
style Start fill:#4CAF50
style ListenLoop fill:#2196F3
style VoiceCommand fill:#FF9800
style YOLOProcess fill:#9C27B0
style Speak fill:#F44336
style BackToListening fill:#4CAF50
```
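The 0 / 1 / 2 / 3+ branches of the description step can be sketched as one small function. This is an illustrative sketch of the use case, not the actual `DescriptionUseCase` code; the article handling is deliberately naive (always "a", never "an").

```python
from collections import Counter

def describe(class_names: list[str]) -> str:
    """Turn grouped detections into a calm natural-language sentence,
    mirroring the 0 / 1 / 2 / 3+ branches in the flow above."""
    counts = Counter(class_names)                       # group by class, count
    parts = [name if n == 1 else f"{n} {name}s" for name, n in counts.items()]
    # Singular items get a (naive) article; counted items keep their number.
    parts = [p if p[0].isdigit() else f"a {p}" for p in parts]
    if not parts:
        return "I don't see anything clearly"
    if len(parts) == 1:
        return f"I see {parts[0]}"
    if len(parts) == 2:
        return f"I see {parts[0]} and {parts[1]}"
    return f"I see {', '.join(parts[:-1])}, and {parts[-1]}"

print(describe([]))                                  # I don't see anything clearly
print(describe(["person"]))                          # I see a person
print(describe(["person", "laptop"]))                # I see a person and a laptop
print(describe(["person", "laptop", "cup", "cup"]))  # I see a person, a laptop, and 2 cups
```

Keeping this as a pure function of the detection list makes the branch behavior trivial to unit-test, which suits the "deterministic and explainable flow" principle.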
```mermaid
sequenceDiagram
User->>App: Tap / Voice
App->>Camera: Single Capture
Camera->>Model: Frame
Model->>Feedback: Result
App->>Camera: Release
```
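The capture-then-release pattern above maps naturally onto a context manager, so the camera cannot be left open by accident. The `open`/`read`/`release` camera shape below is an assumption for illustration, not the actual Android API surface.

```python
from contextlib import contextmanager

@contextmanager
def borrowed_camera(camera):
    """On-demand access: open just long enough for one capture, then
    release. This is the 'aggressive resource cleanup' the mobile
    variant needs to stay battery-friendly."""
    camera.open()
    try:
        yield camera
    finally:
        camera.release()    # runs even if capture or inference raises

# Fake camera used purely for demonstration.
class FakeCamera:
    def __init__(self):
        self.state = "closed"
    def open(self):
        self.state = "open"
    def read(self):
        return b"frame"
    def release(self):
        self.state = "released"

cam = FakeCamera()
with borrowed_camera(cam) as c:
    frame = c.read()        # single capture, no continuous stream
print(frame, cam.state)     # b'frame' released
```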
```mermaid
flowchart LR
Mic --> ADC --> DSP --> Features
```
Handles analog-to-digital conversion, windowing, and feature extraction.
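The windowing step between DSP and feature extraction can be sketched in pure Python. This is a toy illustration: a real pipeline would use numpy, apply a window function such as Hann, and extract richer features than mean energy.

```python
def frame_signal(samples, window_size, hop):
    """Split a 1-D signal into overlapping windows (the 'windowing'
    step). hop < window_size gives overlapping frames."""
    return [samples[i:i + window_size]
            for i in range(0, len(samples) - window_size + 1, hop)]

def energy(window):
    """A trivial per-window feature: mean signal energy."""
    return sum(x * x for x in window) / len(window)

signal = [0, 1, 0, -1] * 4                  # 16 samples of a toy waveform
windows = frame_signal(signal, window_size=8, hop=4)
features = [energy(w) for w in windows]
print(len(windows), features)               # 3 [0.5, 0.5, 0.5]
```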
```mermaid
flowchart LR
Input --> Model --> Inference --> Output
```
Includes vision models (YOLO/TFLite) and speech recognition.
```mermaid
flowchart TD
A[Detections] --> B{Relevant?}
B -- Yes --> C[Prioritize]
B -- No --> D[Discard]
```
Responsible for filtering, prioritization, and narrative construction.
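The Relevant? / Prioritize / Discard branch above can be sketched as a filter followed by a sort. The confidence threshold and the priority table are assumptions for illustration; only the structure matters.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    class_name: str
    confidence: float

# Illustrative priority table: safety-relevant classes first (lower = sooner).
PRIORITY = {"person": 0, "car": 1, "chair": 2}

def reason(detections, min_confidence=0.5):
    """Filter out low-confidence or unknown classes (Discard), then
    order what remains by user relevance (Prioritize)."""
    relevant = [d for d in detections
                if d.confidence >= min_confidence and d.class_name in PRIORITY]
    return sorted(relevant, key=lambda d: (PRIORITY[d.class_name], -d.confidence))

dets = [Detection("chair", 0.9), Detection("person", 0.7),
        Detection("plant", 0.8), Detection("car", 0.3)]
ordered = [d.class_name for d in reason(dets)]
print(ordered)  # ['person', 'chair']: plant is unknown, car is low-confidence
```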
```mermaid
flowchart LR
Context --> TTS
Context --> Vibration
```
Converts meaning into calm, understandable feedback.
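Fanning one message out to whichever channels exist keeps the feedback layer platform-agnostic. The channel names and callable shape below are illustrative assumptions, not VisionX APIs.

```python
def deliver(message: str, channels) -> list[str]:
    """Send contextual meaning to every available feedback channel,
    degrading gracefully when a channel is absent (e.g. no vibration
    motor on desktop)."""
    delivered = []
    for name, channel in channels.items():
        if channel is not None:
            channel(message)         # e.g. speak it, or pulse the motor
            delivered.append(name)
    return delivered

spoken = []
channels = {
    "tts": spoken.append,            # stand-in for a real text-to-speech call
    "haptics": None,                 # unavailable on this platform
}
delivered = deliver("I see a person", channels)
print(delivered, spoken)             # ['tts'] ['I see a person']
```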
```
visionx/
├─ app/ # UI entry point
│ └─ main.py # KivyMD application
├─ orchestrator/ # Main coordination logic
│ └─ supervisor.py # Orchestrates all services
├─ domain/ # Business logic
│ ├─ entities.py # Data models
│ └─ usecases.py # Use cases
├─ services/ # External service wrappers
│ ├─ camera.py # Camera interface
│ ├─ vision.py # YOLO wrapper
│ ├─ audio.py # Text-to-speech
│ └─ speech_recognition.py # Vosk wrapper
├─ infra/ # Infrastructure
│ └─ config.py # Configuration
└─ requirements.txt
```
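A plausible shape for `infra/config.py` follows; every field name and default here is a guess reconstructed from the execution flow (camera ID, YOLO model, trigger words, 3-second cooldown), not the actual file.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Config:
    """Hypothetical configuration object; field names and defaults are
    assumptions based on the execution-flow diagram, not VisionX code."""
    camera_id: int = 0
    model_path: str = "yolo_model.pt"          # placeholder path
    confidence_threshold: float = 0.5
    trigger_words: tuple = ("start", "detect", "see", "look")
    cooldown_seconds: float = 3.0

cfg = Config(camera_id=1)
print(cfg.camera_id, cfg.confidence_threshold)  # 1 0.5
```

A frozen dataclass keeps configuration immutable after startup, which fits the "no hidden state" principle.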
```
visionx-mobile/
├── app/
├── system/
├── perception/
└── ui/
```
Structural differences reflect platform constraints, not conceptual divergence.
- Curious engineers
- Accessibility advocates
- System thinkers
- Read this document fully
- Understand the conceptual flow
- Choose a variant
- Improve one layer
- One concern per PR
- Respect architectural boundaries
- Document before optimizing
Preferred:
- Explicit pipelines
- Clear state transitions
- Deterministic behavior

Avoided:
- Hidden state
- Magic abstractions
- Platform leakage
VisionX variants are not forks.
They are contextual embodiments of the same perception philosophy.

What varies between variants:
- Constraints
- Execution rhythm
- Interfaces

What stays the same:
- Perception pipeline
- User safety
- Local-first commitment
```mermaid
flowchart LR
A[Current] --> B[Next Iteration]
B --> C[Future Vision]
```
Planned Features:
- Enhanced haptic feedback patterns
- Multi-language support
- Wearable device integration
- Advanced context reasoning
- Improved energy efficiency
VisionX is not built to impress machines.
It is built to serve humans with engineering discipline.
If you understand this document, you are ready to contribute.
- Documentation
- Community (coming soon)
- GitHub Issues
- GitHub Discussions
This project was developed as a graduation project and is intended for educational purposes.
Built with care for the visually impaired community.