HidekiAI/lenzu

lenzu 「レンズ」 (LINUX ONLY)

Linux only (X11, GTK3). No Windows or macOS support.

Desktop OCR lens — a transparent floating window that follows the mouse cursor, captures the region under it on demand, and sends it to a local or remote LLM for OCR and translation. Results appear in a separate transparent overlay HUD (lenzu_server).

The key difference from browser extensions like Yomitan/Rikaichan: this operates on images (GPU-rendered video, game windows, PDFs, anything on screen), not on UTF-8 text.

beta demo

Japanese OCR result

English translation result

Architecture note: The Windows/winit/GTK4 experiments are archived in prototypes/. The active implementation uses GTK3 (gtk-rs 0.18) on Linux/X11. GTK4 was evaluated and abandoned due to integration complexity — GTK3 provides everything needed and is simpler to build against. See Technical Design for current architecture.

Architecture (Current)

lenzu (GTK3 client)               lenzu_server (Electron)
  floating lens window     UDP     transparent overlay HUD
  X11 root capture       ──────►  renders translated text
  multi-tier OCR backend           ArrowUp/Down moves position
  manages server lifecycle
  1. Capture: x11rb captures the X11 root window directly, so GPU-accelerated and hardware-rendered windows are captured correctly, without per-window capture APIs that often miss them.

  2. OCR/Translation: Confidence-gated local-first pipeline, then multi-tier LLM fallback:

    • Local OCR (jp_detect + manga-ocr-rs) — if detection confidence >= 71% AND OCR confidence >= 71%, returns immediately. No LLM, no network. Both Shift+Click and Ctrl+Shift+Click paths try this first.
    • Local LLM primary (e.g. gemma4:e2b via ollama, 3 s) — fully on-device, no API key needed
    • Local LLM fallbacks (e.g. glm-ocr, qwen2.5vl, 3 s each) — smaller OCR-specialist models
    • Remote fallback (OpenRouter/Gemini 2.0 Flash, 15 s) — cloud fallback when all local paths fail

    All LLM backends use the same production code path (single source of truth in client.rs).
    Streaming ("stream": true) keeps each request's TCP connection alive, preventing ollama's
    server-side write timeout from firing during slow CPU/partial-GPU inference.

  3. Overlay: Formatted text sent via UDP loopback to lenzu_server, an Electron transparent window pinned to screen edge.
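The UDP hand-off in step 3 can be sketched with the standard library. This is a minimal sketch only: the in-process "server" socket stands in for the Electron overlay, and the ephemeral port and plain-text payload are assumptions, not lenzu_server's actual wire format.

```rust
use std::net::UdpSocket;

/// Sketch of the UDP loopback hand-off. The "server" socket below
/// stands in for the lenzu_server overlay; the real port and payload
/// format are defined by lenzu_server and are not shown here.
fn send_to_overlay(payload: &str) -> std::io::Result<String> {
    // Stand-in overlay listener on an ephemeral loopback port.
    let server = UdpSocket::bind("127.0.0.1:0")?;
    let addr = server.local_addr()?;

    // Client side: one fire-and-forget datagram with the formatted text.
    let client = UdpSocket::bind("127.0.0.1:0")?;
    client.send_to(payload.as_bytes(), addr)?;

    let mut buf = [0u8; 1500];
    let (n, _from) = server.recv_from(&mut buf)?;
    Ok(String::from_utf8_lossy(&buf[..n]).into_owned())
}

fn main() {
    let echoed = send_to_overlay("世界 (sekai) = world").unwrap();
    assert_eq!(echoed, "世界 (sekai) = world");
    println!("ok");
}
```

A datagram socket fits here: the overlay is display-only, so a lost packet costs at most one stale HUD update and no connection state has to be managed across server restarts.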

Privacy modes

| Mode | Config | API key needed? | Images leave device? |
| --- | --- | --- | --- |
| Fully local | `OPENROUTER_API_KEY` unset | No | No |
| Local-first | default | No (local) / Yes (remote) | Only on fallback |
| Remote-only (Ctrl+Shift+Click) | any | Yes | Yes |
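Read as code, the privacy modes above reduce to a small decision function. This is an illustrative sketch, not lenzu's actual types: `remote_click` models Ctrl+Shift+Click, and `api_key_set` models whether `OPENROUTER_API_KEY` is present.

```rust
/// Illustrative privacy-mode selection following the table above.
/// Names are hypothetical, not taken from the lenzu source.
#[derive(Debug, PartialEq)]
enum Mode {
    FullyLocal, // no network, ever
    LocalFirst, // images leave the device only on fallback
    RemoteOnly, // images always leave the device (key required)
}

fn privacy_mode(api_key_set: bool, remote_click: bool) -> Mode {
    if remote_click {
        Mode::RemoteOnly
    } else if api_key_set {
        Mode::LocalFirst
    } else {
        Mode::FullyLocal
    }
}

fn main() {
    assert_eq!(privacy_mode(false, false), Mode::FullyLocal);
    assert_eq!(privacy_mode(true, false), Mode::LocalFirst);
    assert_eq!(privacy_mode(true, true), Mode::RemoteOnly);
    println!("ok");
}
```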

Inference speed on typical hardware

| Backend | VRAM | Typical latency | Confidence scoring |
| --- | --- | --- | --- |
| jp_detect + manga-ocr-rs (local, no LLM) | ~150 MB models | ~0.8–2 s per high-confidence crop (CPU) | Det 0–100%, OCR 0–100%; >= 71% both = pass |
| gemma4:e2b — full GPU (8 GB+) | ~7.4 GB | ~15–30 s | N/A (LLM fallback) |
| gemma4:e2b — partial GPU | ~2 GB GPU + CPU | 60–120 s | N/A (LLM fallback) |
| glm-ocr — full GPU (4 GB) | ~2.2 GB | ~5–15 s | N/A (LLM fallback) |
| Gemini 2.0 Flash (remote) | n/a | ~3–5 s | N/A (LLM fallback) |

The local OCR path (jp_detect + manga-ocr-rs) is tried first for all capture modes. When both confidence scores pass the 71% gate, no LLM or network call is needed. For 4 GB VRAM cards, this means most clean text regions are handled in under 2 seconds without touching Ollama.
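The 71% gate and fallback order described above can be sketched as follows. The threshold and tier order come from this README; the type names, function names, and the exact model list are illustrative assumptions, not the actual dispatch code in client.rs.

```rust
/// Hypothetical sketch of the confidence-gated, local-first dispatch.
/// The 71% threshold and tier order follow the README; everything
/// else (names, model strings) is illustrative only.
#[derive(Debug, PartialEq)]
enum Backend {
    LocalOcr,               // jp_detect + manga-ocr-rs: no LLM, no network
    LocalLlm(&'static str), // ollama model, tried in order with timeouts
    Remote,                 // OpenRouter / Gemini 2.0 Flash, last resort
}

const GATE: f32 = 0.71;

/// Ordered list of backends to try for one captured crop.
fn pipeline(det_conf: f32, ocr_conf: f32) -> Vec<Backend> {
    if det_conf >= GATE && ocr_conf >= GATE {
        // Both scores pass the gate: the local OCR result is final.
        return vec![Backend::LocalOcr];
    }
    // Otherwise fall through the LLM tiers, local first, remote last.
    vec![
        Backend::LocalLlm("gemma4:e2b"),
        Backend::LocalLlm("glm-ocr"),
        Backend::LocalLlm("qwen2.5vl"),
        Backend::Remote,
    ]
}

fn main() {
    // Clean crop: handled locally, no LLM call at all.
    assert_eq!(pipeline(0.95, 0.88), vec![Backend::LocalOcr]);
    // Low OCR confidence: fall through the tiers, ending at remote.
    assert_eq!(pipeline(0.95, 0.40).last(), Some(&Backend::Remote));
    println!("ok");
}
```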

Hardware and Privacy

  • Local-first by default: ollama runs on the same machine; no data leaves the device unless the local models fail and you have OPENROUTER_API_KEY set.
  • Cloud OCR: Automatically falls back to OpenRouter (Gemini 2.0 Flash) when local inference times out. Disable by leaving OPENROUTER_API_KEY unset.
  • Local-first OCR: jp_detect (DBNet) detects text regions with per-box confidence scores; manga-ocr-rs recognizes text with per-result confidence. When both scores pass the 71% gate, no LLM or network is needed.

Related Crates

These companion crates were developed as part of this project and are available on crates.io:

  • jp_detect — real-time scene text detection using DBNet (ONNX). Locates text bounding boxes in manga panels and screenshots.
  • manga-ocr-rs — Japanese manga OCR via ViT encoder + BERT decoder (ONNX). Converts image crops to Japanese text.
  • mecab-furigana-rs — MeCab-based furigana and romaji annotation. Dictionary-accurate readings at ~5 ms per call, with word segmentation and morpheme data.

See OCR Accuracy Scores for unified benchmark results across all engines and prototypes.

Libraries & Dependencies

  • gtk 0.18 — GTK3 bindings (gtk-rs). GTK3, not GTK4.
  • x11rb — X11 protocol (screen capture)
  • cairo-rs — 2D drawing
  • pango / pangocairo — text layout and CJK rendering
  • reqwest — HTTP client (OpenRouter API)
  • isolang — ISO 639-3 language codes
  • Electron (lenzu_server) — transparent overlay window

Build & Run

# 1. Install system dependencies
./scripts/setup.sh

# 2. Set API key
export OPENROUTER_API_KEY=sk-your-key-here

# 3. Build and run (builds lenzu_server on first run)
./scripts/run.sh

Lenzu help screen (Shift+H)

See lenzu/README.md for full configuration reference and controls.

TODO

  • GPU acceleration for jp_detect + manga-ocr-rs (CUDA EP) — would reduce per-crop latency from seconds to milliseconds
  • Wayland support via xdg-desktop-portal
  • Multi-monitor capture at non-zero offsets
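The multi-monitor TODO largely amounts to clamping the cursor-centered lens rectangle against a monitor whose origin is non-zero in root-window coordinates. A minimal sketch, with hypothetical names rather than lenzu's actual geometry code:

```rust
/// Hypothetical sketch for the multi-monitor TODO: center the lens
/// rectangle on the cursor, then clamp it to a monitor whose origin
/// may be non-zero in the X11 root coordinate space.
fn lens_rect(
    cursor: (i32, i32),
    lens: (i32, i32),              // lens width/height in pixels
    monitor: (i32, i32, i32, i32), // x, y, width, height in root coords
) -> (i32, i32, i32, i32) {
    let (cx, cy) = cursor;
    let (lw, lh) = lens;
    let (mx, my, mw, mh) = monitor;
    // Center on the cursor, then clamp to this monitor's edges.
    let x = (cx - lw / 2).clamp(mx, mx + (mw - lw).max(0));
    let y = (cy - lh / 2).clamp(my, my + (mh - lh).max(0));
    (x, y, lw.min(mw), lh.min(mh))
}

fn main() {
    // Secondary monitor at root offset (1920, 0): the rectangle must
    // clamp to that monitor's edge, not back to the primary's origin.
    assert_eq!(
        lens_rect((1930, 10), (400, 300), (1920, 0, 1920, 1080)),
        (1920, 0, 400, 300)
    );
    println!("ok");
}
```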

About

A desktop lens (similar to a desktop magnifier) that analyzes, in real time (via OpenCV), the small region under the mouse cursor. It started as an OCR tool; for example, it can add furigana to all kanji in the captured region.
