✨ feat(ml): built-in local OCR via the stella-ml sidecar #600

Draft
vaayne wants to merge 3 commits into poc/onnx-runtime from poc/onnx-runtime-ocr

vaayne (Collaborator) commented Jun 26, 2026

What

Built-in local OCR on the document-extraction seam, served by the stella-ml
sidecar. Promotes the paddleocr-onnx POC (PP-OCRv5 mobile det+rec, ONNX) into
the sidecar and wires a real POST /v1/extract, then plugs it into
internal/document.Extractor as an image fallback.

Stacked on #598 (the embedding half). Base will retarget to main once #598 merges.

Why

stellad is CGO_ENABLED=0 and can't host onnxruntime, but the sidecar already
exists for embeddings. OCR rides the same process: offline, no API key, no network
egress. The read tool can now turn an unreadable image into text for non-vision
models without a remote OCR dependency.

How

  • Sidecar (cmd/stella-ml): long-lived ocrEngine (det+rec sessions reused),
    real /v1/extract — octet-stream image body + X-Stella-Mime → JSON
    {content, mime_type}. Image decode covers JPEG/PNG/GIF/WebP/BMP/TIFF. OCR is
    optional and all-or-nothing across det/rec/keys, so embedding-only bundles still
    boot (extract → 503).
  • Resolver (internal/mlruntime): resolves det.onnx/rec.onnx/rec_keys.txt
    independently from the embed model; Resolved.HasOCR().
  • Seam (internal/document): single NewExtractor() wraps the build-tagged
    base extractor in a composite that falls back to sidecar OCR for image inputs the
    text layer can't read. Backend injected process-wide via SetLocalOCR (one
    factory, many construction sites). Adapter + STELLA_LOCAL_OCR toggle live in
    setupMLSidecar.
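
The seam described above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the `Extractor` interface shape, `ExtractorFunc`, and `newExtractor(base)` (which takes the base extractor as a parameter so the example is self-contained) are assumptions, and only the names `SetLocalOCR` and `NewExtractor` come from the description.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"sync"
)

// Extractor is a hypothetical shape for the document-extraction seam.
type Extractor interface {
	Extract(data []byte, mime string) (string, error)
}

// ExtractorFunc adapts a plain function to Extractor (illustrative helper).
type ExtractorFunc func(data []byte, mime string) (string, error)

func (f ExtractorFunc) Extract(data []byte, mime string) (string, error) {
	return f(data, mime)
}

var (
	ocrMu    sync.RWMutex
	localOCR Extractor // nil means no local OCR backend is installed
)

// SetLocalOCR installs the process-wide OCR backend (signature assumed).
func SetLocalOCR(e Extractor) {
	ocrMu.Lock()
	defer ocrMu.Unlock()
	localOCR = e
}

// composite tries the base extractor first and falls back to OCR for images.
type composite struct{ base Extractor }

func (c composite) Extract(data []byte, mime string) (string, error) {
	text, err := c.base.Extract(data, mime)
	if err == nil && strings.TrimSpace(text) != "" {
		return text, nil // the text layer was readable: skip OCR
	}
	ocrMu.RLock()
	ocr := localOCR
	ocrMu.RUnlock()
	if ocr == nil || !strings.HasPrefix(mime, "image/") {
		return text, err // no backend, or not an image: pass the base result through
	}
	return ocr.Extract(data, mime)
}

// newExtractor wraps a base extractor in the composite (hypothetical helper).
func newExtractor(base Extractor) Extractor { return composite{base: base} }

func main() {
	// A base extractor that finds no text layer, as for a scanned image.
	base := ExtractorFunc(func([]byte, string) (string, error) {
		return "", errors.New("no text layer")
	})
	SetLocalOCR(ExtractorFunc(func([]byte, string) (string, error) {
		return "text recovered by OCR", nil
	}))
	text, err := newExtractor(base).Extract([]byte("fake image"), "image/png")
	fmt.Println(text, err)
}
```

Keeping the backend in a package-level variable means every construction site picks up OCR without signature changes, at the cost of hidden global state that tests have to reset.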

Verified end-to-end (darwin/arm64): composition root → SetLocalOCR →
NewExtractor() → sidecar OCR extracts a zh/en/digit page (124 chars). Unit tests
cover the composite (skip-when-text, image-fallback, non-image-skip, no-OCR
passthrough) and optional/partial OCR resolution. format && build && test green.

Known MVP simplifications (→ Phase 4b)

  • Axis-aligned bounding-rect detection; rotated/skewed crop (minAreaRect +
    perspective warp) deferred.
  • No angle classifier, so 180°-flipped lines are not corrected.
  • Image-only; PDF rasterization is a separate adapter.
  • Toggle is an env var; moves to deployment config + settings UI in a later phase.
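
To illustrate the first simplification, here is a small sketch of an axis-aligned bounding rectangle computed from the corner points of a detected text region. The function names and the sample quad are hypothetical; the actual detection post-processing in the sidecar may differ.

```go
package main

import (
	"fmt"
	"image"
)

// pointsFromXY builds points from a flat x,y,x,y,... list (illustrative helper).
// A trailing unpaired value is ignored.
func pointsFromXY(xy ...int) []image.Point {
	pts := make([]image.Point, 0, len(xy)/2)
	for i := 0; i+1 < len(xy); i += 2 {
		pts = append(pts, image.Pt(xy[i], xy[i+1]))
	}
	return pts
}

// boundingRect returns the smallest axis-aligned rectangle covering every
// pixel in pts. Each point is treated as a 1x1 pixel, so Max is exclusive.
func boundingRect(pts []image.Point) image.Rectangle {
	var r image.Rectangle // the zero Rectangle is empty
	for _, p := range pts {
		pixel := image.Rectangle{Min: p, Max: p.Add(image.Pt(1, 1))}
		r = r.Union(pixel) // Union with an empty rectangle returns the other one
	}
	return r
}

func main() {
	// Corners of a slightly skewed text line.
	quad := pointsFromXY(3, 10, 40, 12, 38, 25, 5, 22)
	fmt.Println(boundingRect(quad)) // prints (3,10)-(41,26)
}
```

For a skewed line the axis-aligned rectangle also captures background around the text, which is the gap the deferred minAreaRect + perspective warp step is meant to close.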

Refs

vaayne added 3 commits June 26, 2026 19:10
Promote the paddleocr-onnx POC engine into the sidecar as a long-lived
det+rec ocrEngine and wire the real POST /v1/extract: octet-stream image
body + X-Stella-Mime -> JSON {content, mime_type}. OCR is optional and
all-or-nothing across its three assets, so an embedding-only bundle still
boots and the endpoint reports 503 until OCR models are installed.
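
A client-side sketch of the request/response contract described above. The endpoint path, the `X-Stella-Mime` header, the JSON field names, and the 503 behaviour come from this commit message; the `extract` function, the error value, the stub server, and the `text/plain` value in the stub response are assumptions made so the example runs on its own.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// extractResponse mirrors the JSON body {content, mime_type}.
type extractResponse struct {
	Content  string `json:"content"`
	MimeType string `json:"mime_type"`
}

// errOCRUnavailable is a hypothetical sentinel for the 503 case.
var errOCRUnavailable = errors.New("sidecar OCR models not installed")

// extract posts raw image bytes to /v1/extract and decodes the JSON reply.
func extract(c *http.Client, baseURL string, img []byte, mime string) (extractResponse, error) {
	var out extractResponse
	req, err := http.NewRequest(http.MethodPost, baseURL+"/v1/extract", bytes.NewReader(img))
	if err != nil {
		return out, err
	}
	req.Header.Set("Content-Type", "application/octet-stream")
	req.Header.Set("X-Stella-Mime", mime)
	resp, err := c.Do(req)
	if err != nil {
		return out, err
	}
	defer resp.Body.Close()
	switch resp.StatusCode {
	case http.StatusOK:
		err = json.NewDecoder(resp.Body).Decode(&out)
		return out, err
	case http.StatusServiceUnavailable:
		return out, errOCRUnavailable
	default:
		return out, fmt.Errorf("extract: unexpected status %s", resp.Status)
	}
}

// newStubSidecar stands in for stella-ml so the example needs no real models.
func newStubSidecar(ocrInstalled bool) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path != "/v1/extract" || r.Header.Get("X-Stella-Mime") == "" {
			http.Error(w, "bad request", http.StatusBadRequest)
			return
		}
		if !ocrInstalled {
			http.Error(w, "ocr unavailable", http.StatusServiceUnavailable)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		// The mime_type value here is a placeholder, not the sidecar's real output.
		json.NewEncoder(w).Encode(extractResponse{Content: "hello", MimeType: "text/plain"})
	}))
}

func main() {
	srv := newStubSidecar(true)
	defer srv.Close()
	res, err := extract(srv.Client(), srv.URL, []byte("fake image bytes"), "image/png")
	fmt.Println(res.Content, res.MimeType, err)
}
```

Mapping 503 to a distinct error lets a caller treat an embedding-only bundle as "OCR not available" instead of a hard failure.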

Axis-aligned bounding-rect detection (MVP); rotated/skewed crop via
minAreaRect + perspective warp is deferred to Phase 4b.

Wrap the platform base extractor in a composite that falls back to local
sidecar OCR for image inputs the text layer can't read. Inject the backend
process-wide via document.SetLocalOCR (single extractor factory, many
construction sites). mlruntime resolves det/rec/keys independently from the
embed model; setupMLSidecar passes the OCR flags and installs the adapter,
gated by STELLA_LOCAL_OCR.
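
The all-or-nothing resolution of the three OCR assets might look like the sketch below. Only the file names and `Resolved.HasOCR()` come from this PR; the struct fields, `resolveOCR`, and the `demoBundle` helper are hypothetical, and the real resolver in internal/mlruntime may locate files differently.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// Resolved holds OCR asset paths; the field names are assumptions.
type Resolved struct {
	DetModel string
	RecModel string
	RecKeys  string
}

// HasOCR reports whether all three OCR assets were found.
func (r Resolved) HasOCR() bool {
	return r.DetModel != "" && r.RecModel != "" && r.RecKeys != ""
}

// resolveOCR looks for det.onnx, rec.onnx, and rec_keys.txt in dir.
// If any one is missing, it returns an empty Resolved: partial sets count as no OCR.
func resolveOCR(dir string) Resolved {
	names := []string{"det.onnx", "rec.onnx", "rec_keys.txt"}
	paths := make([]string, 0, len(names))
	for _, name := range names {
		p := filepath.Join(dir, name)
		info, err := os.Stat(p)
		if err != nil || info.IsDir() {
			return Resolved{}
		}
		paths = append(paths, p)
	}
	return Resolved{DetModel: paths[0], RecModel: paths[1], RecKeys: paths[2]}
}

// demoBundle creates a temp dir containing empty files with the given names.
func demoBundle(names ...string) (string, error) {
	dir, err := os.MkdirTemp("", "ocr-bundle-")
	if err != nil {
		return "", err
	}
	for _, name := range names {
		if err := os.WriteFile(filepath.Join(dir, name), nil, 0o644); err != nil {
			return "", err
		}
	}
	return dir, nil
}

func main() {
	partial, err := demoBundle("det.onnx", "rec.onnx") // rec_keys.txt is missing
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(partial)
	fmt.Println("partial bundle has OCR:", resolveOCR(partial).HasOCR())
}
```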

Verified end-to-end: composition root -> SetLocalOCR -> NewExtractor ->
sidecar OCR extracts a zh/en/digit page.