Voice-activated object detection for the visually impaired.
- 4. Shared Conceptual Architecture
- 5. Execution Flow (Conceptual)
- 6. VisionX Variants
- 7. Variant Architectures
- 8. Execution Flow per Variant
- 9. Subsystem Breakdown
VisionX is an assistive perception system designed to help visually impaired users understand their surroundings through structured sensory processing and calm feedback.
VisionX is not a single application. It is a perception philosophy implemented across multiple platforms (Desktop, Mobile, and future embedded systems), each respecting its own physical and system constraints.
- Local-first (privacy-preserving)
- Deterministic and explainable flow
- Human-centered feedback
World → Sensors → Signals → Meaning → Feedback
VisionX transforms raw physical signals (light, sound, vibration) into contextual meaning and communicates that meaning back to the user in the least cognitively demanding way possible.
All perception and reasoning happen on-device. No cloud inference, no hidden data transfer.
The system prioritizes understandable pipelines over opaque abstractions.
Latency, predictability, and calm output matter more than raw accuracy.
VisionX is meant to be read, understood, and extended.
This architecture represents the idea, not the repository layout.
```mermaid
flowchart TD
A[Sensor Input] --> B[Signal Processing]
B --> C[Perception Models]
C --> D[Context Reasoning]
D --> E[User Feedback]
```
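The five stages above can be sketched as plain function composition. Everything below is illustrative: the stage functions and the dict payloads are placeholders, not actual VisionX APIs.

```python
from typing import Any, Callable

# Each stage is a plain function, composed in a fixed, inspectable order,
# matching the "explicit pipelines" principle.
Stage = Callable[[Any], Any]

def make_pipeline(*stages: Stage) -> Stage:
    """Compose stages left to right: sensor input flows to user feedback."""
    def run(data: Any) -> Any:
        for stage in stages:
            data = stage(data)
        return data
    return run

# Hypothetical stage implementations, purely for demonstration.
capture  = lambda _: {"frame": "raw-pixels"}
process  = lambda s: {**s, "normalized": True}
perceive = lambda s: {**s, "detections": ["person"]}
reason   = lambda s: {**s, "message": "I see a person"}
feedback = lambda s: s["message"]

pipeline = make_pipeline(capture, process, perceive, reason, feedback)
print(pipeline(None))  # -> I see a person
```

Because every stage is an ordinary function, the whole flow can be read top to bottom and each layer tested in isolation.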
```mermaid
sequenceDiagram
participant User
participant System
participant Sensors
participant Models
participant Feedback
User->>System: Trigger Perception
System->>Sensors: Capture Signals
Sensors->>Models: Processed Data
Models->>Feedback: Contextual Meaning
Feedback->>User: Audio / Haptics
```
VisionX exists as multiple variants, each adapting the same conceptual pipeline to different constraints.
| Variant | Platform | Constraints | Interaction |
|---|---|---|---|
| Desktop | Linux / Windows | Power-rich, static | Keyboard / Voice |
| Mobile | Android | Battery, thermal | Touch / Voice |
| Future | Wearables | Ultra-low power | Audio / Haptics |
```mermaid
flowchart TD
Cam[USB Camera] --> OpenCV
OpenCV --> YOLO
YOLO --> Reasoner
Reasoner --> TTS
Reasoner --> CLI
```
Characteristics:
- Continuous capture
- Long-running processes
- Higher memory budget
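The continuous-capture characteristic can be sketched as a generator that owns the camera for its whole lifetime. The `Camera` interface and `FakeCamera` below are illustrative; the desktop build would back them with something like `cv2.VideoCapture`, which is an assumption, not the actual VisionX code.

```python
from typing import Iterator, Optional, Protocol

class Camera(Protocol):
    """Minimal camera interface; a real desktop build might wrap
    cv2.VideoCapture behind this shape (an assumption)."""
    def read(self) -> Optional[bytes]: ...
    def release(self) -> None: ...

def continuous_frames(camera: Camera, max_frames: int) -> Iterator[bytes]:
    """Long-running capture loop: yield frames until the stream ends.
    max_frames keeps the illustration bounded; a real loop runs until quit."""
    try:
        for _ in range(max_frames):
            frame = camera.read()
            if frame is None:        # camera disconnected or stream ended
                break
            yield frame
    finally:
        camera.release()             # always free the device handle

# Fake camera used purely for demonstration.
class FakeCamera:
    def __init__(self, frames):
        self._frames = list(frames)
        self.released = False
    def read(self):
        return self._frames.pop(0) if self._frames else None
    def release(self):
        self.released = True

cam = FakeCamera([b"f1", b"f2", b"f3"])
frames = list(continuous_frames(cam, max_frames=10))
print(frames)  # [b'f1', b'f2', b'f3']
```

The `try`/`finally` guarantees the long-running process releases the device even if a downstream stage raises.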
```mermaid
flowchart TD
Cam[Android Camera] --> FrameLimiter
FrameLimiter --> TFLite
TFLite --> ContextFilter
ContextFilter --> AndroidTTS
ContextFilter --> Haptics
```
Characteristics:
- On-demand execution
- Battery-aware processing
- Aggressive resource cleanup
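The `FrameLimiter` node in the mobile diagram suggests battery-aware throttling: dropping frames before they ever reach inference. The sketch below is a guess at that role, not the actual implementation; the interval and clock injection are illustrative choices.

```python
import time

class FrameLimiter:
    """Battery-aware throttle: pass at most one frame per interval.
    Dropping a frame here means skipping TFLite inference entirely,
    which is where the battery saving comes from."""
    def __init__(self, min_interval_s: float, clock=time.monotonic):
        self.min_interval_s = min_interval_s
        self._clock = clock          # injectable for testing
        self._last = float("-inf")

    def allow(self) -> bool:
        now = self._clock()
        if now - self._last >= self.min_interval_s:
            self._last = now
            return True
        return False                 # drop the frame

# Simulated clock: frames arrive every 0.1 s, limiter allows one per 0.5 s.
t = 0.0
limiter = FrameLimiter(0.5, clock=lambda: t)
accepted = []
for i in range(10):
    if limiter.allow():
        accepted.append(i)
    t += 0.1
print(accepted)  # [0, 5]: frames 0 and 5 pass, the rest are dropped
```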
```mermaid
graph TD
Start([App Starts]) --> Init[Initialize VisionX App]
Init --> LoadConfig[Load Config<br/>Camera ID, YOLO Model, etc.]
LoadConfig --> CreateSupervisor[Create VisionSupervisor]
CreateSupervisor --> InitServices[Initialize Services]
InitServices --> InitCamera[Camera Service<br/>OpenCVCamera]
InitServices --> InitYOLO[Vision Service<br/>YOLOVisionService]
InitServices --> InitTTS[Audio Service<br/>Pyttsx3TTS]
InitServices --> InitVosk[Speech Recognition<br/>VoskSpeechRecognition]
InitCamera --> WarmUp[Warm Up YOLO Model]
InitYOLO --> WarmUp
InitTTS --> WarmUp
InitVosk --> WarmUp
WarmUp --> TTSReady["TTS: 'Ready, Say start'"]
TTSReady --> StartListening[Start Voice Recognition Thread]
StartListening --> ListenLoop{Listening for<br/>Trigger Words}
ListenLoop -->|Microphone Input| VoskProcessing[Vosk Processes Audio]
VoskProcessing --> RecognizeText{Text Recognized?}
RecognizeText -->|No| ListenLoop
RecognizeText -->|Yes| CheckTrigger{Contains Trigger Word?<br/>start, detect, see, look}
CheckTrigger -->|No| ListenLoop
CheckTrigger -->|Yes| CheckCooldown{Cooldown Active?<br/>3 seconds}
CheckCooldown -->|Yes| ListenLoop
CheckCooldown -->|No| VoiceCommand[Voice Command Detected]
VoiceCommand --> UpdateCooldown[Update Last Detection Time]
UpdateCooldown --> StartDetection[Start Detection Thread]
StartDetection --> TTSDetecting["TTS: 'Detecting'"]
TTSDetecting --> OpenCamera[Open Camera]
OpenCamera --> CaptureFrame[Capture Single Frame]
CaptureFrame --> YOLOProcess[YOLO Processes Frame]
YOLOProcess --> DetectObjects[Detect Objects<br/>Confidence > 0.5]
DetectObjects --> CreateDetections[Create Detection Objects<br/>class_name, confidence, bbox]
CreateDetections --> GroupObjects[Group by Class Name<br/>Count Objects]
GroupObjects --> GenerateDesc[Generate Natural Language<br/>Description UseCase]
GenerateDesc --> CheckCount{How Many Objects?}
CheckCount -->|0| NoObjects["Description: 'I don't see<br/>anything clearly'"]
CheckCount -->|1| SingleObject["Description: 'I see a person'"]
CheckCount -->|2| TwoObjects["Description: 'I see a person<br/>and a laptop'"]
CheckCount -->|3+| MultiObjects["Description: 'I see a person,<br/>a laptop, and 2 cups'"]
NoObjects --> Speak[TTS Speaks Description]
SingleObject --> Speak
TwoObjects --> Speak
MultiObjects --> Speak
Speak --> ReleaseCamera[Release Camera]
ReleaseCamera --> UpdateUI[Update UI Status]
UpdateUI --> BackToListening[Back to Listening Mode]
BackToListening --> ListenLoop
style Start fill:#4CAF50
style ListenLoop fill:#2196F3
style VoiceCommand fill:#FF9800
style YOLOProcess fill:#9C27B0
style Speak fill:#F44336
style BackToListening fill:#4CAF50
```
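The 0 / 1 / 2 / 3+ branches of the description step can be sketched as one small function. This is an illustrative sketch of the use case, not the actual `DescriptionUseCase` code; the article handling is deliberately naive (always "a", never "an").

```python
from collections import Counter

def describe(class_names: list[str]) -> str:
    """Turn grouped detections into a calm natural-language sentence,
    mirroring the 0 / 1 / 2 / 3+ branches in the flow above."""
    counts = Counter(class_names)                       # group by class, count
    parts = [name if n == 1 else f"{n} {name}s" for name, n in counts.items()]
    # Singular items get a (naive) article; counted items keep their number.
    parts = [p if p[0].isdigit() else f"a {p}" for p in parts]
    if not parts:
        return "I don't see anything clearly"
    if len(parts) == 1:
        return f"I see {parts[0]}"
    if len(parts) == 2:
        return f"I see {parts[0]} and {parts[1]}"
    return f"I see {', '.join(parts[:-1])}, and {parts[-1]}"

print(describe([]))                                  # I don't see anything clearly
print(describe(["person"]))                          # I see a person
print(describe(["person", "laptop"]))                # I see a person and a laptop
print(describe(["person", "laptop", "cup", "cup"]))  # I see a person, a laptop, and 2 cups
```

Keeping this as a pure function of the detection list makes the branch behavior trivial to unit-test, which suits the "deterministic and explainable flow" principle.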
```mermaid
sequenceDiagram
User->>App: Tap / Voice
App->>Camera: Single Capture
Camera->>Model: Frame
Model->>Feedback: Result
App->>Camera: Release
```
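The capture-then-release pattern above maps naturally onto a context manager, so the camera cannot be left open by accident. The `open`/`read`/`release` camera shape below is an assumption for illustration, not the actual Android API surface.

```python
from contextlib import contextmanager

@contextmanager
def borrowed_camera(camera):
    """On-demand access: open just long enough for one capture, then
    release. This is the 'aggressive resource cleanup' the mobile
    variant needs to stay battery-friendly."""
    camera.open()
    try:
        yield camera
    finally:
        camera.release()    # runs even if capture or inference raises

# Fake camera used purely for demonstration.
class FakeCamera:
    def __init__(self):
        self.state = "closed"
    def open(self):
        self.state = "open"
    def read(self):
        return b"frame"
    def release(self):
        self.state = "released"

cam = FakeCamera()
with borrowed_camera(cam) as c:
    frame = c.read()        # single capture, no continuous stream
print(frame, cam.state)     # b'frame' released
```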
```mermaid
flowchart LR
Mic --> ADC --> DSP --> Features
```
Handles analog-to-digital conversion, windowing, and feature extraction.
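The windowing step between DSP and feature extraction can be sketched in pure Python. This is a toy illustration: a real pipeline would use numpy, apply a window function such as Hann, and extract richer features than mean energy.

```python
def frame_signal(samples, window_size, hop):
    """Split a 1-D signal into overlapping windows (the 'windowing'
    step). hop < window_size gives overlapping frames."""
    return [samples[i:i + window_size]
            for i in range(0, len(samples) - window_size + 1, hop)]

def energy(window):
    """A trivial per-window feature: mean signal energy."""
    return sum(x * x for x in window) / len(window)

signal = [0, 1, 0, -1] * 4                  # 16 samples of a toy waveform
windows = frame_signal(signal, window_size=8, hop=4)
features = [energy(w) for w in windows]
print(len(windows), features)               # 3 [0.5, 0.5, 0.5]
```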
```mermaid
flowchart LR
Input --> Model --> Inference --> Output
```
Includes vision models (YOLO/TFLite) and speech recognition.
```mermaid
flowchart TD
A[Detections] --> B{Relevant?}
B -- Yes --> C[Prioritize]
B -- No --> D[Discard]
```
Responsible for filtering, prioritization, and narrative construction.
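The Relevant? / Prioritize / Discard branch above can be sketched as a filter followed by a sort. The confidence threshold and the priority table are assumptions for illustration; only the structure matters.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    class_name: str
    confidence: float

# Illustrative priority table: safety-relevant classes first (lower = sooner).
PRIORITY = {"person": 0, "car": 1, "chair": 2}

def reason(detections, min_confidence=0.5):
    """Filter out low-confidence or unknown classes (Discard), then
    order what remains by user relevance (Prioritize)."""
    relevant = [d for d in detections
                if d.confidence >= min_confidence and d.class_name in PRIORITY]
    return sorted(relevant, key=lambda d: (PRIORITY[d.class_name], -d.confidence))

dets = [Detection("chair", 0.9), Detection("person", 0.7),
        Detection("plant", 0.8), Detection("car", 0.3)]
ordered = [d.class_name for d in reason(dets)]
print(ordered)  # ['person', 'chair']: plant is unknown, car is low-confidence
```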
```mermaid
flowchart LR
Context --> TTS
Context --> Vibration
```
Converts meaning into calm, understandable feedback.
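Fanning one message out to whichever channels exist keeps the feedback layer platform-agnostic. The channel names and callable shape below are illustrative assumptions, not VisionX APIs.

```python
def deliver(message: str, channels) -> list[str]:
    """Send contextual meaning to every available feedback channel,
    degrading gracefully when a channel is absent (e.g. no vibration
    motor on desktop)."""
    delivered = []
    for name, channel in channels.items():
        if channel is not None:
            channel(message)         # e.g. speak it, or pulse the motor
            delivered.append(name)
    return delivered

spoken = []
channels = {
    "tts": spoken.append,            # stand-in for a real text-to-speech call
    "haptics": None,                 # unavailable on this platform
}
delivered = deliver("I see a person", channels)
print(delivered, spoken)             # ['tts'] ['I see a person']
```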
```
visionx/
├─ app/ # UI entry point
│ └─ main.py # KivyMD application
├─ orchestrator/ # Main coordination logic
│ └─ supervisor.py # Orchestrates all services
├─ domain/ # Business logic
│ ├─ entities.py # Data models
│ └─ usecases.py # Use cases
├─ services/ # External service wrappers
│ ├─ camera.py # Camera interface
│ ├─ vision.py # YOLO wrapper
│ ├─ audio.py # Text-to-speech
│ └─ speech_recognition.py # Vosk wrapper
├─ infra/ # Infrastructure
│ └─ config.py # Configuration
└─ requirements.txt
```
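A plausible shape for `infra/config.py` follows; every field name and default here is a guess reconstructed from the execution flow (camera ID, YOLO model, trigger words, 3-second cooldown), not the actual file.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Config:
    """Hypothetical configuration object; field names and defaults are
    assumptions based on the execution-flow diagram, not VisionX code."""
    camera_id: int = 0
    model_path: str = "yolo_model.pt"          # placeholder path
    confidence_threshold: float = 0.5
    trigger_words: tuple = ("start", "detect", "see", "look")
    cooldown_seconds: float = 3.0

cfg = Config(camera_id=1)
print(cfg.camera_id, cfg.confidence_threshold)  # 1 0.5
```

A frozen dataclass keeps configuration immutable after startup, which fits the "no hidden state" principle.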
```
visionx-mobile/
├── app/
├── system/
├── perception/
└── ui/
```
Structural differences reflect platform constraints, not conceptual divergence.
- Curious engineers
- Accessibility advocates
- System thinkers
- Read this document fully
- Understand the conceptual flow
- Choose a variant
- Improve one layer
- One concern per PR
- Respect architectural boundaries
- Document before optimizing
Preferred:
- Explicit pipelines
- Clear state transitions
- Deterministic behavior

Avoided:
- Hidden state
- Magic abstractions
- Platform leakage
VisionX variants are not forks.
They are contextual embodiments of the same perception philosophy.

What varies between variants:
- Constraints
- Execution rhythm
- Interfaces

What stays the same:
- Perception pipeline
- User safety
- Local-first commitment
```mermaid
flowchart LR
A[Current] --> B[Next Iteration]
B --> C[Future Vision]
```
Planned Features:
- Enhanced haptic feedback patterns
- Multi-language support
- Wearable device integration
- Advanced context reasoning
- Improved energy efficiency
VisionX is not built to impress machines.
It is built to serve humans with engineering discipline.
If you understand this document, you are ready to contribute.
- Documentation
- Community (coming soon)
- GitHub Issues
- GitHub Discussions
This project was developed as a graduation project and is intended for educational purposes.
Built with care for the visually impaired community.