✨ feat(ml): built-in local OCR + embedding via a native stella-ml sidecar by vaayne · Pull Request #598 · CherryHQ/stella

vaayne · 2026-06-26T10:42:48Z

Draft / stacked on #596. Base is xberg-document-extraction-plan so the diff
shows only the ML-sidecar work; the OCR phase builds on #596's document.Extractor
seam. Rebases onto main once #596 merges.

What

Built-in, offline, private OCR (PaddleOCR PP-OCRv5 ONNX) and embedding
(intfloat/multilingual-e5-small ONNX), hosted in a single native CGO sidecar
(stella-ml) that the CGO_ENABLED=0 main binary drives over a unix-socket
contract. Native runtime + models are runtime-downloaded, mirroring the embedded
Postgres runtime.

Why

- internal/embedding has a local/privacy lane but no local provider, so privacy embedding fails instead of running offline.
- internal/document (✨ feat: add native document extraction seam #596) routes the read tool through an Extractor seam, but the only non-vision backend shells out to kreuzberg.
- Goal: offline + private capability without breaking the pure-Go release model.

How

One CGO sidecar hosts onnxruntime + HF tokenizers + models; stellad stays pure-Go
and is a UDS client + supervisor. Phased (per the native-ml-sidecar spec plan):

- 1a: sidecar protocol + supervisor contract (✅ landed)
- 1b: embed engine, bit-equivalent to Python ref (✅ landed)
- 2: signed runtime + model distribution (mirror pgruntime)
- 3: embedding config/space refactor + local provider
- 4a/4b: OCR MVP behind a toggle, then quality hardening
- 5: config, docs (EN+ZH), ops polish

Constraints: darwin + linux only (no prebuilt libtokenizers for windows); main
binary stays CGO_ENABLED=0; the sidecar is the only CGO artifact (a nested module,
so the parent's go build ./... skips it).

Landed so far

- cmd/stella-ml/: nested CGO module with versioned HTTP-over-UDS (/healthz, /v1/embed, /v1/extract stub), per-tenant fairness lanes, request caps, pinned ORT threads, 0700/0600 socket with stale-socket probe, and graceful shutdown. e5 batch encoder; output bit-equivalent to the Python reference (cosine 1.0000003).
- internal/ml/: pure-Go client (decodes the f32 blob, validates the protocol handshake) + supervisor (lazy spawn, health-gate, backoff restart, reap; Pdeathsig on linux). Client unit tests + an env-gated real-binary integration test.
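
For reviewers who want the wire format spelled out, here is a minimal sketch of the blob decoding the client side has to do. It assumes the /v1/embed body is exactly count*dim*4 little-endian float32 bytes, as described in this PR. The function name `decodeF32` and the way count/dim reach it are illustrative only, not the actual internal/ml code.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// decodeF32 splits a little-endian float32 blob into count vectors of dim
// values each. It rejects a body whose length is not exactly count*dim*4,
// which is the sanity check a client would want before trusting the payload.
// (Illustrative helper; not the real internal/ml function.)
func decodeF32(body []byte, count, dim int) ([][]float32, error) {
	if count <= 0 || dim <= 0 {
		return nil, fmt.Errorf("invalid shape %dx%d", count, dim)
	}
	if want := count * dim * 4; len(body) != want {
		return nil, fmt.Errorf("body is %d bytes, want %d", len(body), want)
	}
	out := make([][]float32, count)
	for i := range out {
		vec := make([]float32, dim)
		for j := range vec {
			off := (i*dim + j) * 4
			vec[j] = math.Float32frombits(binary.LittleEndian.Uint32(body[off:]))
		}
		out[i] = vec
	}
	return out, nil
}

func main() {
	// Build a fake response body holding two 2-dim vectors.
	var body []byte
	for _, f := range []float32{1, 2, 3, 4} {
		body = binary.LittleEndian.AppendUint32(body, math.Float32bits(f))
	}
	vecs, err := decodeF32(body, 2, 2)
	fmt.Println(vecs, err)
}
```

Checking the length against the advertised shape before decoding means a truncated or mismatched response surfaces as an error instead of silently producing short vectors.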

Refs

Closes ✨ feat(ml): built-in local OCR + embedding via a native stella-ml sidecar #597
Builds on ✨ feat: add native document extraction seam #596 (document.Extractor seam)
Related: ✨ feat: native document extraction with xberg #595

Standalone POC module proving a Go-native OCR engine: PP-OCRv5 mobile det/rec ONNX models via onnxruntime_go, with pure-Go preprocessing, DB postprocess (connected components), and CTC decode. End-to-end image to text works on darwin/arm64. Isolated module (own go.mod) so it does not touch the stella build. onnxruntime lib, models, and binary are gitignored; README documents sources and reproduction. Refs: design notes in this branch; not yet tied to an issue/PR.
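
The CTC decode step mentioned above can be sketched as a greedy decoder: take the argmax class per timestep, collapse consecutive repeats, and drop the blank. This is a generic illustration, not the POC's code. It assumes class 0 is the blank and that `charset[i-1]` is the character for class i; both the function name and that layout are assumptions.

```go
package main

import "fmt"

// ctcGreedyDecode turns per-timestep class scores (logits[t][c]) into text:
// argmax per step, collapse consecutive repeats, skip the blank class.
// Assumption: class 0 is the CTC blank, so class i maps to charset[i-1].
func ctcGreedyDecode(logits [][]float32, charset []rune) string {
	const blank = 0
	var out []rune
	prev := blank
	for _, step := range logits {
		best := 0
		for c := 1; c < len(step); c++ {
			if step[c] > step[best] {
				best = c
			}
		}
		// Emit only on a change of class, and never for the blank.
		if best != blank && best != prev && best-1 < len(charset) {
			out = append(out, charset[best-1])
		}
		prev = best
	}
	return string(out)
}

func main() {
	charset := []rune{'a', 'b'}
	logits := [][]float32{
		{0.1, 0.8, 0.1},   // a
		{0.1, 0.7, 0.2},   // a again, collapsed
		{0.9, 0.05, 0.05}, // blank separates the next a
		{0.1, 0.8, 0.1},   // a
		{0.1, 0.2, 0.7},   // b
	}
	fmt.Println(ctcGreedyDecode(logits, charset))
}
```

The blank between the second and third "a" frames is what lets a genuinely doubled character survive the repeat collapse.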

Go-native text embeddings: intfloat/multilingual-e5-small via onnxruntime_go, with exact XLM-RoBERTa tokenization through daulet/tokenizers (HF Rust lib, statically linked). Mean-pool + L2 normalize, e5 query/passage prefixes. Verified bit-equivalent to a Python reference (same ONNX + HF tokenizer): cosine 0.99999963, max diff 5e-7. int8 quant keeps cosine 0.99897 vs fp32 at 113M (vs 448M fp32) -> recommended built-in variant. Separate module; models, libtokenizers.a, onnxruntime, and binary gitignored. Refs: not yet tied to an issue/PR.
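
To make the pooling step concrete, here is a small sketch of masked mean-pooling followed by L2 normalization, plus the e5 "query: " / "passage: " input prefixes. It is an illustration under assumptions, not the POC's code: the function names are made up, and the hidden states are assumed to arrive as a rectangular [tokens][dim] slice with a 0/1 attention mask.

```go
package main

import (
	"fmt"
	"math"
)

// e5Prefix prepends the prefix the e5 models expect on their inputs.
func e5Prefix(isQuery bool, text string) string {
	if isQuery {
		return "query: " + text
	}
	return "passage: " + text
}

// meanPoolL2 averages the token vectors whose mask entry is non-zero, then
// scales the result to unit length. hidden is assumed to be rectangular.
func meanPoolL2(hidden [][]float32, mask []int64) []float32 {
	if len(hidden) == 0 {
		return nil
	}
	dim := len(hidden[0])
	sum := make([]float64, dim)
	var n float64
	for i, tok := range hidden {
		if i >= len(mask) || mask[i] == 0 {
			continue // padding token, excluded from the mean
		}
		for j, v := range tok {
			sum[j] += float64(v)
		}
		n++
	}
	out := make([]float32, dim)
	if n == 0 {
		return out
	}
	var sq float64
	for j := range sum {
		sum[j] /= n
		sq += sum[j] * sum[j]
	}
	norm := math.Sqrt(sq)
	if norm == 0 {
		return out
	}
	for j := range sum {
		out[j] = float32(sum[j] / norm)
	}
	return out
}

func main() {
	hidden := [][]float32{{3, 0}, {0, 4}, {100, 100}} // last row is padding
	mask := []int64{1, 1, 0}
	fmt.Println(e5Prefix(true, "hello"), meanPoolL2(hidden, mask))
}
```

Accumulating in float64 before casting back is a choice made for this sketch; the real engine may pool in float32.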

Proves the CGO sidecar (onnxruntime_go + daulet/tokenizers) builds for darwin+linux from a single macOS host.

Findings:
- onnxruntime_go is dlopen-only -> no lib needed at build time.
- darwin: native clang cross-compiles both arches (zig can't find macOS system libs without a sysroot).
- linux: zig cc + a 3-line no-op shim for two GNU libstdc++ iostream-init symbols that esaxx.cpp pulls in but never uses. linux/amd64 binary runs the real tokenizer in docker (ok 1 false).
- windows dropped (no prebuilt libtokenizers; decision).

Production note: build the linux sidecar natively in CI rather than rely on the shim. Spike stays the local-dev cross-build path.

Refs #595 #596

Phase 1a/1b of the native ML sidecar. A separate CGO module hosts onnxruntime + the HF tokenizer + models and serves them to the pure-Go stellad over an HTTP-on-unix-socket contract. stellad stays CGO_ENABLED=0; the parent module's `go build ./...` skips this nested module automatically.

Phase 1a (contract): versioned protocol (/healthz, /v1/embed, /v1/extract), per-request identity headers (tenant/request-id/deadline) fixed up front so the fairness layer never retrofits identity; separate per-endpoint lanes + per-tenant in-flight cap so one tenant's load can't starve another; request caps; pinned ORT thread counts; 0700 socket dir / 0600 socket with stale-socket probe; graceful shutdown on signal. /v1/extract returns 501 until the OCR engine lands (4a).

Phase 1b (embed): multilingual-e5-small ONNX engine promoted from the POC, with a padded-batch encoder. Output is bit-equivalent to the Python reference (cosine 1.0000003, max elementwise diff 5e-7) and vectors are L2-normalized.

Smoke-tested on darwin/arm64 against the POC model: /healthz reports version + digest; /v1/embed returns count*dim*4 LE-f32 bytes; oversized batch -> 413; /v1/extract -> 501.

Refs: #597, #595, #596
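
One way to picture the per-tenant in-flight cap is a counter per tenant guarded by a mutex. The sketch below is only an illustration of the idea: the type and function names are hypothetical, and it assumes an over-cap request is rejected immediately rather than queued, which the commit message does not specify.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errTenantBusy = errors.New("tenant in-flight cap reached")

// tenantLimiter caps the number of concurrent requests per tenant so one
// tenant's burst cannot occupy every slot. (Hypothetical type.)
type tenantLimiter struct {
	mu       sync.Mutex
	inFlight map[string]int
	max      int
}

func newTenantLimiter(max int) *tenantLimiter {
	return &tenantLimiter{inFlight: make(map[string]int), max: max}
}

// acquire reserves a slot for tenant and returns a release func, or
// errTenantBusy when the tenant is already at its cap.
func (l *tenantLimiter) acquire(tenant string) (func(), error) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.inFlight[tenant] >= l.max {
		return nil, errTenantBusy
	}
	l.inFlight[tenant]++
	var once sync.Once // makes release safe to call more than once
	return func() {
		once.Do(func() {
			l.mu.Lock()
			defer l.mu.Unlock()
			l.inFlight[tenant]--
			if l.inFlight[tenant] == 0 {
				delete(l.inFlight, tenant)
			}
		})
	}, nil
}

func main() {
	lim := newTenantLimiter(1)
	release, _ := lim.acquire("tenant-a")
	_, err := lim.acquire("tenant-a") // over the cap
	fmt.Println("second tenant-a request:", err)
	_, err = lim.acquire("tenant-b") // a different tenant is unaffected
	fmt.Println("tenant-b request:", err)
	release()
}
```

In an HTTP handler the release func would typically be deferred right after a successful acquire, keyed on the tenant identity header.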

The pure-Go (CGO_ENABLED=0) half of the Phase 1a contract.

Client: UDS http.Client speaking protocol v1: Health(), Embed() (decodes the LE-f32 blob, validates the protocol/count/dim response headers and refuses a sidecar on an unsupported protocol version), Extract() (ready for Phase 4a). Every request carries tenant + a generated request-id + a deadline derived from the context.

Supervisor: lazy spawn, background health-gate with a Ready() barrier, exponential-backoff restart that resets once a process stays up past StableFor, and reap on context cancel (SIGTERM then SIGKILL via WaitDelay). Linux adds Pdeathsig=SIGKILL so a hard-killed stellad never orphans the sidecar; darwin relies on the graceful path + the next start's stale-socket probe.

Tests: client decode/validation over a unix httptest server (pure Go, runs in CI). An env-gated integration test drives spawn -> health-gate -> embed -> reap against a real stella-ml binary; verified locally end to end (384-dim vector, clean shutdown, socket removed).

Refs: #597
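
The restart policy described for the supervisor (exponential backoff that resets once a process has stayed up past StableFor) might look roughly like the sketch below. The `backoff` type, its fields, and the doubling factor are assumptions for illustration; only the StableFor reset behavior comes from the commit message.

```go
package main

import (
	"fmt"
	"time"
)

// backoff tracks the delay before the next restart. The delay doubles after
// each short-lived run, is capped at max, and returns to base once a run
// lasted at least stableFor. (Hypothetical type, not the real supervisor.)
type backoff struct {
	base, max, stableFor time.Duration
	cur                  time.Duration
}

// next returns the delay to wait before restarting, given how long the
// previous process stayed up.
func (b *backoff) next(uptime time.Duration) time.Duration {
	if b.cur == 0 || uptime >= b.stableFor {
		b.cur = b.base // first failure, or the last run was stable
		return b.cur
	}
	b.cur *= 2
	if b.cur > b.max {
		b.cur = b.max
	}
	return b.cur
}

func main() {
	b := &backoff{
		base:      200 * time.Millisecond,
		max:       5 * time.Second,
		stableFor: time.Minute,
	}
	// Three quick crashes, then a crash after a long healthy run.
	for _, up := range []time.Duration{time.Second, time.Second, time.Second, 2 * time.Minute} {
		fmt.Printf("ran %v, restart after %v\n", up, b.next(up))
	}
}
```

Resetting on a stable run keeps a rare crash after hours of uptime from inheriting the long delay built up during an earlier crash loop.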

Phase 3. The embedding lane was hardwired to a remote API: river.go gated the whole lane on a non-empty APIKey, and the canonical vector space was always api.SpaceKey(), so an offline, key-less deployment could not run embeddings at all, and any non-API provider's vectors were rejected by the indexer's canonical-space guard.

Decouple the enable-gate from the API key and let the canonical space follow the selected provider:

- config.EmbeddingSettings gains Provider ("api" default | "local"); a pre-Provider config loads as "api" (backward compatible). LoadEmbeddingSettings defaults it.
- river.go resolve() branches on provider. "api" keeps the exact prior behavior (gate on APIKey, space = model@dim). "local" gates on a wired sidecar embedder instead of a key, with space = the e5-small key.
- localProvider wraps a LocalEmbedder (sidecar adapter; injected via BootConfig so the embedding package keeps no internal/ml dependency), Kind=local, mapping the query/document mode through and stamping the e5 space. A vector-count mismatch is Terminal (no failover).
- indexer.go is unchanged: because the space now follows the primary provider, the canonical-space guard accepts the local vectors instead of rejecting them.

Deferred (intentionally): "auto" cross-space failover (the risky query-degrade piece) and the UI/API toggle, which is only meaningful once the sidecar is wired (Phase 2); until then selecting "local" simply disables the lane.

Tests: local provider mode/space mapping, count-mismatch-is-Terminal, resolve for the local path, local-without-sidecar => ErrDisabled, sensitive request served locally, provider default backward-compat. Existing API-path tests unchanged.

Refs: #597
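
The gating change can be summarized in a few lines. The sketch below is a simplified stand-in for the resolve() branch, not the actual river.go code: the `Model`/`Dim` field names, the `resolveSpace` function, and the `localSpace` value are placeholders chosen for illustration (the real e5-small space key is not shown in this PR text).

```go
package main

import (
	"errors"
	"fmt"
)

// ErrDisabled signals that the embedding lane cannot run as configured.
var ErrDisabled = errors.New("embedding lane disabled")

// EmbeddingSettings is a trimmed-down stand-in for config.EmbeddingSettings.
type EmbeddingSettings struct {
	Provider string // "api" (default) or "local"
	APIKey   string
	Model    string // assumed field name
	Dim      int    // assumed field name
}

// localSpace is a placeholder for the e5-small space key.
const localSpace = "local-e5-small-placeholder"

// resolveSpace returns the canonical vector space for the selected provider,
// or ErrDisabled when that provider's precondition is not met.
func resolveSpace(s EmbeddingSettings, sidecarWired bool) (string, error) {
	provider := s.Provider
	if provider == "" {
		provider = "api" // configs written before Provider existed
	}
	switch provider {
	case "api":
		if s.APIKey == "" {
			return "", ErrDisabled
		}
		return fmt.Sprintf("%s@%d", s.Model, s.Dim), nil
	case "local":
		if !sidecarWired {
			return "", ErrDisabled
		}
		return localSpace, nil
	default:
		return "", fmt.Errorf("unknown embedding provider %q", provider)
	}
}

func main() {
	fmt.Println(resolveSpace(EmbeddingSettings{APIKey: "k", Model: "m", Dim: 1024}, false))
	fmt.Println(resolveSpace(EmbeddingSettings{Provider: "local"}, true))
	fmt.Println(resolveSpace(EmbeddingSettings{Provider: "local"}, false))
}
```

Note that the "local" branch never looks at APIKey, which is the decoupling this commit is about.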

Phase 2 foundation, mirroring internal/pgruntime. Resolves the sidecar paths (stella-ml binary, libonnxruntime dir, e5 model, tokenizer) from two independently versioned install roots under STELLA_HOME: a runtime bundle (binary + onnxruntime) and a model bundle, so swapping models never re-downloads the libraries.

A dev override (STELLA_ML_RUNTIME_DIR / STELLA_ML_MODEL_DIR) points the resolver at a locally built sidecar + models, so local development and the supervisor wiring need no published release. A not-found result is a clean "local ML unavailable" signal (no error), so the feature degrades to disabled exactly like an unconfigured lane. darwin + linux only.

The download/verify/atomic-install subcommand (mirrors postgres.go, with the signed-manifest check from D9) is deferred until the stella-ml-runtime release pipeline exists; the resolver already supports the installed layout it will produce.

Tests: dev-override resolution and the not-installed path.

Refs: #597
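
As a rough picture of the resolution order, the sketch below prefers the dev-override variables, falls back to install roots under STELLA_HOME, and reports "not found" as a plain boolean instead of an error. Everything about the on-disk layout here is a placeholder: the `ml-runtime` / `ml-models` directory names, the `lib` subdirectory, and the `model.onnx` / `tokenizer.json` file names are assumptions, as are the type and function names.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// mlPaths holds the four locations the sidecar needs. (Hypothetical type.)
type mlPaths struct {
	Binary    string // stella-ml executable
	OrtDir    string // directory containing libonnxruntime
	Model     string // e5 ONNX model
	Tokenizer string // tokenizer file
}

// resolve returns the sidecar paths and true when everything is present.
// A missing install yields (zero value, false) with no error, so callers can
// treat it as "local ML unavailable". getenv and exists are injected so the
// logic can be exercised without touching the real environment or disk.
func resolve(getenv func(string) string, exists func(string) bool) (mlPaths, bool) {
	runtimeDir := getenv("STELLA_ML_RUNTIME_DIR") // dev override
	modelDir := getenv("STELLA_ML_MODEL_DIR")     // dev override
	home := getenv("STELLA_HOME")
	if runtimeDir == "" && home != "" {
		runtimeDir = filepath.Join(home, "ml-runtime") // placeholder root
	}
	if modelDir == "" && home != "" {
		modelDir = filepath.Join(home, "ml-models") // placeholder root
	}
	if runtimeDir == "" || modelDir == "" {
		return mlPaths{}, false
	}
	p := mlPaths{
		Binary:    filepath.Join(runtimeDir, "stella-ml"),
		OrtDir:    filepath.Join(runtimeDir, "lib"),
		Model:     filepath.Join(modelDir, "model.onnx"),
		Tokenizer: filepath.Join(modelDir, "tokenizer.json"),
	}
	for _, f := range []string{p.Binary, p.OrtDir, p.Model, p.Tokenizer} {
		if !exists(f) {
			return mlPaths{}, false
		}
	}
	return p, true
}

func main() {
	exists := func(p string) bool { _, err := os.Stat(p); return err == nil }
	if p, ok := resolve(os.Getenv, exists); ok {
		fmt.Println("sidecar binary:", p.Binary)
	} else {
		fmt.Println("local ML unavailable")
	}
}
```

Keeping the runtime and model roots separate in the resolver is what allows one to be replaced without touching the other.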

Connects the pieces so local embedding actually runs. setupMLSidecar resolves the ML runtime (mlruntime) and, when present, builds the supervisor + a LocalEmbedder adapter (mapping the embedding query/document mode to the sidecar's query/passage prefix), which is injected into the embedding lane. The supervisor is started on the app context via the background task group, so shutdown reaps the sidecar by cancelling parent and waiting. When no runtime is installed, both are nil and local ML stays disabled with no error. With STELLA_ML_RUNTIME_DIR/STELLA_ML_MODEL_DIR pointing at a locally built sidecar + models and the embedding provider set to "local", the full stack runs offline, key-less. Verified end to end (env-gated integration test, skipped in CI): resolve -> supervisor spawn -> health-gate -> LocalEmbedder.EmbedLocal(document) returns a 384-dim e5 vector -> clean reap on cancel. Refs: #597

vaayne added 10 commits June 26, 2026 17:34

💄 style(poc): dprint table alignment in POC READMEs

6b19b96

🎨 style(stella-ml): dprint table reflow in README

71d8c85

This was referenced Jun 26, 2026

✨ feat(ml): built-in local OCR via the stella-ml sidecar #600

Draft

✨ feat(settings): local embedding provider + OCR toggle in the admin UI #601

Open

vaayne wants to merge 10 commits into xberg-document-extraction-plan from poc/onnx-runtime
