This project provides a screen-seeing AI pipeline for GitHub Codespaces without using mss.
Instead of desktop capture libraries, it uses the browser's native `getDisplayMedia()` API:
- You open a local web page in your Codespace.
- You choose exactly which screen/window/tab to share.
- The page streams frames to a Python backend.
- The backend analyzes each frame and can respond with text that the browser speaks aloud.
This works well in Codespaces because the browser is already your UI surface.
- Capture is performed in-browser with `navigator.mediaDevices.getDisplayMedia()`.
- The backend only receives JPEG frame bytes over a WebSocket.
- No `mss` package or OS framebuffer scraping is used.
- User-selectable screen/window/tab capture.
- Adjustable FPS and JPEG quality.
- Real-time frame analysis endpoint over WebSocket.
- Two-way voice/text interaction:
- Talk to the AI with the browser microphone (Web Speech API), or type a question.
- AI answers with text and can read answers aloud via speech synthesis.
- Built-in analysis:
- brightness estimate
- edge density (scene complexity proxy)
- motion score versus previous frame
- optional OCR (`pytesseract`) if installed
```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
uvicorn app:app --host 0.0.0.0 --port 8000
```

Open forwarded port 8000 in your browser and click Start sharing.
- Browser sends JSON messages to `ws://<host>/ws`.
- Message types:
  - `frame`: base64 JPEG + metadata
  - `user_text`: user question from typed input or a speech recognition transcript
  - `control`: start/stop and settings updates
- Server responds with JSON containing:
  - the frame analysis payload
  - `assistant_text`: a response suitable for TTS playback
- `getDisplayMedia` requires an HTTPS or localhost context; forwarded Codespaces URLs satisfy this.
- The screen chooser is controlled by the browser for security; apps cannot bypass it.
- For good performance, keep FPS between 1 and 8 unless you have ample CPU headroom.
- Browser microphone/speech features depend on browser support (`SpeechRecognition`/`webkitSpeechRecognition`).
You can route sampled frames to a multimodal model for higher-level reasoning.
A helper stub is included in `vision_llm.py` for sending a frame to an OpenAI-compatible endpoint.
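A hedged sketch of what such a helper might look like; `build_payload`, `describe_frame`, the default model name, and the `OPENAI_API_KEY` environment variable are assumptions for illustration, not the repo's actual interface:

```python
# Illustrative vision-LLM helper: package a JPEG frame as an OpenAI-compatible
# chat-completions request with an inline base64 data URL. Names and defaults
# here are assumptions, not the actual vision_llm.py API.
import base64
import json
import os
import urllib.request

def build_payload(jpeg_bytes: bytes, model: str = "gpt-4o-mini",
                  prompt: str = "Describe what is on this screen.") -> dict:
    """Build an OpenAI-style chat request carrying one inline image."""
    data_url = ("data:image/jpeg;base64,"
                + base64.b64encode(jpeg_bytes).decode())
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }],
    }

def describe_frame(jpeg_bytes: bytes,
                   base_url: str = "https://api.openai.com/v1") -> str:
    """POST the frame to <base_url>/chat/completions; return the answer text."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_payload(jpeg_bytes)).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY', '')}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Because the endpoint shape is the standard chat-completions format, pointing `base_url` at any OpenAI-compatible server (local or hosted) should work without code changes.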