✨ feat(ml): built-in local OCR + embedding via a native stella-ml sidecar #598

Draft
vaayne wants to merge 10 commits into xberg-document-extraction-plan from poc/onnx-runtime

vaayne (Collaborator) commented Jun 26, 2026

Draft / stacked on #596. Base is xberg-document-extraction-plan so the diff
shows only the ML-sidecar work; the OCR phase builds on #596's document.Extractor
seam. Rebases onto main once #596 merges.

What

Built-in, offline, private OCR (PaddleOCR PP-OCRv5 ONNX) and embedding
(intfloat/multilingual-e5-small ONNX), hosted in a single native CGO sidecar
(stella-ml) that the CGO_ENABLED=0 main binary drives over a unix-socket
contract. Native runtime + models are runtime-downloaded, mirroring the embedded
Postgres runtime.

Why

  • internal/embedding has a local/privacy lane but no local provider — privacy
    embedding fails instead of running offline.
  • internal/document (#596) routes the read tool through an Extractor seam, but
    the only non-vision backend shells out to kreuzberg.
  • Goal: offline + private capability without breaking the pure-Go release model.

How

One CGO sidecar hosts onnxruntime + HF tokenizers + models; stellad stays pure-Go
and is a UDS client + supervisor. Phased (per the native-ml-sidecar spec plan):

  • 1a sidecar protocol + supervisor contract — ✅ landed
  • 1b embed engine (bit-equivalent to Python ref) — ✅ landed
  • 2 signed runtime + model distribution (mirror pgruntime)
  • 3 embedding config/space refactor + local provider
  • 4a/4b OCR MVP behind a toggle, then quality hardening
  • 5 config, docs (EN+ZH), ops polish

Constraints: darwin + linux only (no prebuilt libtokenizers for windows); main
binary stays CGO_ENABLED=0; the sidecar is the only CGO artifact (a nested module,
so the parent's go build ./... skips it).

Landed so far

  • cmd/stella-ml/ — nested CGO module: versioned HTTP-over-UDS (/healthz,
    /v1/embed, /v1/extract stub), per-tenant fairness lanes, request caps, pinned
    ORT threads, 0700/0600 socket with stale-socket probe, graceful shutdown. e5
    batch encoder; output bit-equivalent to the Python reference (cosine 1.0000003).
  • internal/ml/ — pure-Go client (decodes the f32 blob, validates the protocol
    handshake) + supervisor (lazy spawn, health-gate, backoff restart, reap;
    Pdeathsig on linux). Client unit tests + an env-gated real-binary integration
    test.
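
A minimal sketch of the HTTP-over-UDS shape described above, assuming a throwaway server in the same process. The handler body ("ok"), the socket file name, and the helper names `newUDSClient` and `healthOverUDS` are hypothetical; the real /healthz reports version + digest, and the real client lives in internal/ml.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net"
	"net/http"
	"os"
	"path/filepath"
)

// newUDSClient returns an http.Client whose connections all go to one unix
// socket; the host part of the request URL is ignored by the dialer.
func newUDSClient(socketPath string) *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
				var d net.Dialer
				return d.DialContext(ctx, "unix", socketPath)
			},
		},
	}
}

// healthOverUDS starts a stand-in server on a private socket and queries it.
func healthOverUDS() (string, error) {
	dir, err := os.MkdirTemp("", "ml") // MkdirTemp creates the dir with mode 0700
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	sock := filepath.Join(dir, "s.sock")
	ln, err := net.Listen("unix", sock)
	if err != nil {
		return "", err
	}
	if err := os.Chmod(sock, 0o600); err != nil { // owner-only socket
		ln.Close()
		return "", err
	}

	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, _ *http.Request) {
		io.WriteString(w, "ok") // placeholder body, not the real payload
	})
	srv := &http.Server{Handler: mux}
	go srv.Serve(ln)
	defer srv.Close()

	resp, err := newUDSClient(sock).Get("http://stella-ml/healthz")
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	body, err := healthOverUDS()
	fmt.Println(body, err)
}
```

Because only the dialer knows about the socket, the rest of the client can stay ordinary net/http code, which is what lets the main binary remain pure Go.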

Refs

vaayne added 10 commits June 26, 2026 17:34
Standalone POC module proving a Go-native OCR engine: PP-OCRv5 mobile
det/rec ONNX models via onnxruntime_go, with pure-Go preprocessing, DB
postprocess (connected components), and CTC decode. End-to-end image to
text works on darwin/arm64.
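
For readers unfamiliar with the CTC decode step, here is a hedged sketch of greedy CTC decoding: take the argmax class per time step, collapse consecutive repeats, and drop the blank. The function name, the toy vocabulary, and the blank index of 0 are assumptions for illustration, not the POC's actual code.

```go
package main

import (
	"fmt"
	"strings"
)

// argmax returns the index of the largest value in one time step's scores.
func argmax(scores []float32) int {
	best := 0
	for i, s := range scores {
		if s > scores[best] {
			best = i
		}
	}
	return best
}

// ctcGreedyDecode collapses consecutive repeats and removes blanks.
// indices holds one class index per time step; vocab maps index to text.
func ctcGreedyDecode(indices []int, blank int, vocab []string) string {
	var sb strings.Builder
	prev := -1 // no previous step yet
	for _, idx := range indices {
		if idx != prev && idx != blank && idx >= 0 && idx < len(vocab) {
			sb.WriteString(vocab[idx])
		}
		prev = idx
	}
	return sb.String()
}

func main() {
	vocab := []string{"", "a", "b"} // index 0 is the blank (assumed)
	logits := [][]float32{
		{0.1, 0.8, 0.1}, // a
		{0.1, 0.8, 0.1}, // a (repeat, collapsed)
		{0.9, 0.0, 0.1}, // blank
		{0.1, 0.8, 0.1}, // a (new, separated by blank)
		{0.1, 0.1, 0.8}, // b
	}
	indices := make([]int, len(logits))
	for t, step := range logits {
		indices[t] = argmax(step)
	}
	fmt.Println(ctcGreedyDecode(indices, 0, vocab))
}
```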

Isolated module (own go.mod) so it does not touch the stella build.
onnxruntime lib, models, and binary are gitignored; README documents
sources and reproduction.

Refs: design notes in this branch; not yet tied to an issue/PR.

Go-native text embeddings: intfloat/multilingual-e5-small via onnxruntime_go,
with exact XLM-RoBERTa tokenization through daulet/tokenizers (HF Rust lib,
statically linked). Mean-pool + L2 normalize, e5 query/passage prefixes.
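
The pooling step can be sketched as follows, assuming the model output is a [tokens][hidden] matrix and the attention mask marks padding with 0. The function names are hypothetical and the tensor plumbing around onnxruntime_go is omitted.

```go
package main

import (
	"fmt"
	"math"
)

// meanPool averages the hidden states of non-padding tokens.
func meanPool(hidden [][]float32, mask []int64) []float32 {
	if len(hidden) == 0 {
		return nil
	}
	out := make([]float32, len(hidden[0]))
	var n float32
	for t, tok := range hidden {
		if t >= len(mask) || mask[t] == 0 {
			continue // skip padding positions
		}
		for i, v := range tok {
			out[i] += v
		}
		n++
	}
	if n == 0 {
		return out
	}
	for i := range out {
		out[i] /= n
	}
	return out
}

// l2Normalize scales v to unit length; a zero vector stays zero.
func l2Normalize(v []float32) []float32 {
	var sum float64
	for _, x := range v {
		sum += float64(x) * float64(x)
	}
	out := make([]float32, len(v))
	norm := math.Sqrt(sum)
	if norm == 0 {
		return out
	}
	for i, x := range v {
		out[i] = float32(float64(x) / norm)
	}
	return out
}

func main() {
	hidden := [][]float32{{1, 2}, {3, 4}, {100, 100}} // last row is padding
	mask := []int64{1, 1, 0}
	fmt.Println(l2Normalize(meanPool(hidden, mask)))
}
```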

Verified equivalent to a Python reference (same ONNX model + HF tokenizer) up to
float32 rounding: cosine 0.99999963, max elementwise diff 5e-7. int8 quantization
keeps cosine 0.99897 vs fp32 at 113M on disk (vs 448M for fp32), so int8 is the
recommended built-in variant.
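
A sketch of how such a comparison could be computed, assuming both vectors are already loaded as float32 slices of equal length. The helper names are hypothetical; the sums are accumulated in float64 because float32 accumulation can push the cosine of two near-identical unit vectors slightly above 1.

```go
package main

import (
	"fmt"
	"math"
)

// cosine returns the cosine similarity of a and b (equal length assumed).
func cosine(a, b []float32) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// maxAbsDiff returns the largest elementwise absolute difference.
func maxAbsDiff(a, b []float32) float64 {
	var worst float64
	for i := range a {
		if d := math.Abs(float64(a[i]) - float64(b[i])); d > worst {
			worst = d
		}
	}
	return worst
}

func main() {
	goVec := []float32{0.6, 0.8}        // stand-in for the Go engine output
	pyVec := []float32{0.6000001, 0.8}  // stand-in for the Python reference
	fmt.Println(cosine(goVec, pyVec), maxAbsDiff(goVec, pyVec))
}
```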

Separate module; models, libtokenizers.a, onnxruntime, and binary gitignored.

Refs: not yet tied to an issue/PR.

Proves the CGO sidecar (onnxruntime_go + daulet/tokenizers) builds for
darwin+linux from a single macOS host. Findings:
- onnxruntime_go is dlopen-only -> no lib needed at build time.
- darwin: native clang cross-compiles both arches (zig can't find macOS
  system libs without a sysroot).
- linux: zig cc + a 3-line no-op shim for two GNU libstdc++ iostream-init
  symbols that esaxx.cpp pulls in but never uses. linux/amd64 binary runs
  the real tokenizer in docker (ok 1 false).
- windows dropped (no prebuilt libtokenizers; decision).

Production note: build linux sidecar natively in CI rather than rely on the
shim. Spike stays the local-dev cross-build path.

Refs #595 #596

Phase 1a/1b of the native ML sidecar. A separate CGO module hosts onnxruntime +
the HF tokenizer + models and serves them to the pure-Go stellad over an
HTTP-on-unix-socket contract. stellad stays CGO_ENABLED=0; the parent module's
`go build ./...` skips this nested module automatically.

Phase 1a (contract): versioned protocol (/healthz, /v1/embed, /v1/extract),
per-request identity headers (tenant/request-id/deadline) fixed up front so the
fairness layer never retrofits identity; separate per-endpoint lanes + per-tenant
in-flight cap so one tenant's load can't starve another; request caps; pinned ORT
thread counts; 0700 socket dir / 0600 socket with stale-socket probe; graceful
shutdown on signal. /v1/extract returns 501 until the OCR engine lands (4a).
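
One way a per-tenant in-flight cap can work is a counter per tenant guarded by a mutex, rejecting a request when its tenant is already at the limit. This is a sketch under that assumption; the type and method names are hypothetical and the sidecar's actual lane implementation may differ.

```go
package main

import (
	"fmt"
	"sync"
)

// tenantLimiter caps concurrent requests per tenant so one tenant's burst
// cannot occupy every slot.
type tenantLimiter struct {
	mu       sync.Mutex
	max      int
	inFlight map[string]int
}

func newTenantLimiter(max int) *tenantLimiter {
	return &tenantLimiter{max: max, inFlight: make(map[string]int)}
}

// tryAcquire reserves a slot for tenant, or reports false if it is at the cap.
func (l *tenantLimiter) tryAcquire(tenant string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.inFlight[tenant] >= l.max {
		return false
	}
	l.inFlight[tenant]++
	return true
}

// release frees one slot; entries are deleted at zero so the map stays small.
func (l *tenantLimiter) release(tenant string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.inFlight[tenant] > 1 {
		l.inFlight[tenant]--
	} else {
		delete(l.inFlight, tenant)
	}
}

func main() {
	lim := newTenantLimiter(2)
	fmt.Println(lim.tryAcquire("a"), lim.tryAcquire("a"), lim.tryAcquire("a"))
	fmt.Println(lim.tryAcquire("b")) // another tenant is unaffected
	lim.release("a")
	fmt.Println(lim.tryAcquire("a"))
}
```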

Phase 1b (embed): multilingual-e5-small ONNX engine promoted from the POC, with a
padded-batch encoder. Output is bit-equivalent to the Python reference (cosine
1.0000003, max elementwise diff 5e-7) and vectors are L2-normalized.

Smoke-tested on darwin/arm64 against the POC model: /healthz reports version +
digest; /v1/embed returns count*dim*4 LE-f32 bytes; oversized batch -> 413;
/v1/extract -> 501.

Refs: #597, #595, #596

The pure-Go (CGO_ENABLED=0) half of the Phase 1a contract.

Client: UDS http.Client speaking protocol v1 — Health(), Embed() (decodes the
LE-f32 blob, validates the protocol/count/dim response headers and refuses a
sidecar on an unsupported protocol version), Extract() (ready for Phase 4a).
Every request carries tenant + a generated request-id + a deadline derived from
the context.
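
Decoding the LE-f32 blob can be sketched like this, assuming the count and dim values have already been parsed from the response headers. The function name is hypothetical and header parsing is left out.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// decodeEmbeddings splits a little-endian float32 blob into count vectors of
// dim values each, rejecting any blob whose length is not count*dim*4.
func decodeEmbeddings(blob []byte, count, dim int) ([][]float32, error) {
	if count < 0 || dim <= 0 {
		return nil, fmt.Errorf("invalid shape %dx%d", count, dim)
	}
	if want := count * dim * 4; len(blob) != want {
		return nil, fmt.Errorf("blob is %d bytes, want %d", len(blob), want)
	}
	out := make([][]float32, count)
	for i := range out {
		vec := make([]float32, dim)
		for j := range vec {
			off := (i*dim + j) * 4
			bits := binary.LittleEndian.Uint32(blob[off : off+4])
			vec[j] = math.Float32frombits(bits)
		}
		out[i] = vec
	}
	return out, nil
}

func main() {
	// Build a 1x2 blob holding 1.0 and 2.0 for demonstration.
	blob := make([]byte, 8)
	binary.LittleEndian.PutUint32(blob[0:4], math.Float32bits(1))
	binary.LittleEndian.PutUint32(blob[4:8], math.Float32bits(2))
	vecs, err := decodeEmbeddings(blob, 1, 2)
	fmt.Println(vecs, err)
}
```

Checking the length against the declared shape before decoding means a truncated or mis-sized response fails loudly instead of yielding short vectors.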

Supervisor: lazy spawn, background health-gate with a Ready() barrier,
exponential-backoff restart that resets once a process stays up past StableFor,
and reap on context cancel (SIGTERM then SIGKILL via WaitDelay). Linux adds
Pdeathsig=SIGKILL so a hard-killed stellad never orphans the sidecar; darwin
relies on the graceful path + the next start's stale-socket probe.
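
The restart policy can be illustrated with a small pure function: double the delay after each quick crash, cap it, and fall back to the base delay once a process has stayed up for StableFor. Only the StableFor name comes from the description above; the struct, the other field names, and the durations are assumptions.

```go
package main

import (
	"fmt"
	"time"
)

// backoffPolicy describes an exponential restart delay with a stability reset.
type backoffPolicy struct {
	Base      time.Duration // first delay after a crash
	Max       time.Duration // upper bound for the delay
	StableFor time.Duration // uptime after which the delay resets
}

// defaultPolicy uses illustrative values, not the supervisor's real ones.
var defaultPolicy = backoffPolicy{
	Base:      500 * time.Millisecond,
	Max:       30 * time.Second,
	StableFor: time.Minute,
}

// next returns the delay before the next restart, given the previous delay
// and how long the process that just exited had been running.
func (p backoffPolicy) next(prev, uptime time.Duration) time.Duration {
	if prev <= 0 || uptime >= p.StableFor {
		return p.Base // first failure, or the process was stable: start over
	}
	d := prev * 2
	if d > p.Max {
		d = p.Max
	}
	return d
}

func main() {
	var delay time.Duration
	for i := 0; i < 8; i++ {
		delay = defaultPolicy.next(delay, time.Second) // crashes after 1s each time
		fmt.Println(delay)
	}
	fmt.Println(defaultPolicy.next(delay, 2*time.Minute)) // stable run resets
}
```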

Tests: client decode/validation over a unix httptest server (pure Go, runs in
CI). An env-gated integration test drives spawn -> health-gate -> embed -> reap
against a real stella-ml binary; verified locally end to end (384-dim vector,
clean shutdown, socket removed).

Refs: #597

Phase 3. The embedding lane was hardwired to a remote API: river.go gated the
whole lane on a non-empty APIKey, and the canonical vector space was always
api.SpaceKey() — so an offline, key-less deployment could not run embeddings at
all, and any non-API provider's vectors were rejected by the indexer's
canonical-space guard.

Decouple the enable-gate from the API key and let the canonical space follow the
selected provider:

- config.EmbeddingSettings gains Provider ("api" default | "local"); a pre-Provider
  config loads as "api" (backward compatible). LoadEmbeddingSettings defaults it.
- river.go resolve() branches on provider. "api" keeps the exact prior behavior
  (gate on APIKey, space = model@dim). "local" gates on a wired sidecar embedder
  instead of a key, with space = the e5-small key.
- localProvider wraps a LocalEmbedder (sidecar adapter; injected via BootConfig so
  the embedding package keeps no internal/ml dependency), Kind=local, mapping the
  query/document mode through and stamping the e5 space. A vector-count mismatch is
  Terminal (no failover).
- indexer.go is unchanged: because the space now follows the primary provider, the
  canonical-space guard accepts the local vectors instead of rejecting them.
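
The gating logic in the list above can be sketched roughly as follows. The struct, function, and error names are stand-ins for config.EmbeddingSettings, resolve(), and ErrDisabled, and the local space key string is hypothetical; only the branching is taken from the description.

```go
package main

import (
	"errors"
	"fmt"
)

// errDisabled stands in for the lane's "disabled" sentinel.
var errDisabled = errors.New("embedding lane disabled")

// embeddingSettings is a trimmed-down stand-in for the real settings type.
type embeddingSettings struct {
	Provider string // "api" (default) or "local"
	APIKey   string
	Model    string
	Dim      int
}

// localSpaceKey is a hypothetical key for the e5-small vector space.
const localSpaceKey = "multilingual-e5-small@384"

// resolveSpace returns the canonical vector space for the selected provider,
// or errDisabled when that provider's prerequisite is missing.
func resolveSpace(s embeddingSettings, localWired bool) (string, error) {
	provider := s.Provider
	if provider == "" {
		provider = "api" // configs written before Provider existed
	}
	switch provider {
	case "api":
		if s.APIKey == "" {
			return "", errDisabled
		}
		return fmt.Sprintf("%s@%d", s.Model, s.Dim), nil
	case "local":
		if !localWired {
			return "", errDisabled // no sidecar embedder injected
		}
		return localSpaceKey, nil
	default:
		return "", fmt.Errorf("unknown embedding provider %q", provider)
	}
}

func main() {
	fmt.Println(resolveSpace(embeddingSettings{APIKey: "k", Model: "m", Dim: 8}, false))
	fmt.Println(resolveSpace(embeddingSettings{Provider: "local"}, false))
	fmt.Println(resolveSpace(embeddingSettings{Provider: "local"}, true))
}
```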

Deferred (intentionally): "auto" cross-space failover (the risky query-degrade
piece) and the UI/API toggle, which is only meaningful once the sidecar is wired
(Phase 2) — until then selecting "local" simply disables the lane.

Tests: local provider mode/space mapping, count-mismatch-is-Terminal, resolve for
the local path, local-without-sidecar => ErrDisabled, sensitive request served
locally, provider default backward-compat. Existing API-path tests unchanged.

Refs: #597

Phase 2 foundation, mirroring internal/pgruntime. Resolves the sidecar paths
(stella-ml binary, libonnxruntime dir, e5 model, tokenizer) from two
independently versioned install roots under STELLA_HOME — a runtime bundle
(binary + onnxruntime) and a model bundle — so swapping models never re-downloads
the libraries.

A dev override (STELLA_ML_RUNTIME_DIR / STELLA_ML_MODEL_DIR) points the resolver
at a locally built sidecar + models, so local development and the supervisor
wiring need no published release. A not-found result is a clean "local ML
unavailable" signal (no error), so the feature degrades to disabled exactly like
an unconfigured lane. darwin + linux only.
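
A rough sketch of the resolution order, assuming a made-up on-disk layout: the directory names under STELLA_HOME and the file names model.onnx, tokenizer.json, and lib are hypothetical. Only the two override variables and the "not found is not an error" behavior come from the description above.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// mlPaths lists the four artifacts the sidecar needs.
type mlPaths struct {
	Binary, OrtDir, Model, Tokenizer string
}

// resolveML prefers the dev overrides, falls back to install roots under
// home, and reports ok=false (no error) when anything is missing.
func resolveML(getenv func(string) string, exists func(string) bool, home string) (mlPaths, bool) {
	runtimeDir := getenv("STELLA_ML_RUNTIME_DIR")
	if runtimeDir == "" {
		runtimeDir = filepath.Join(home, "ml-runtime", "current") // assumed layout
	}
	modelDir := getenv("STELLA_ML_MODEL_DIR")
	if modelDir == "" {
		modelDir = filepath.Join(home, "ml-models", "current") // assumed layout
	}
	p := mlPaths{
		Binary:    filepath.Join(runtimeDir, "stella-ml"),
		OrtDir:    filepath.Join(runtimeDir, "lib"),
		Model:     filepath.Join(modelDir, "model.onnx"),
		Tokenizer: filepath.Join(modelDir, "tokenizer.json"),
	}
	for _, path := range []string{p.Binary, p.OrtDir, p.Model, p.Tokenizer} {
		if !exists(path) {
			return mlPaths{}, false // local ML unavailable, feature stays off
		}
	}
	return p, true
}

// fileExists reports whether path can be stat'ed.
func fileExists(path string) bool {
	_, err := os.Stat(path)
	return err == nil
}

func main() {
	p, ok := resolveML(os.Getenv, fileExists, os.Getenv("STELLA_HOME"))
	fmt.Println(p, ok)
}
```

Passing getenv and exists as functions keeps the resolver testable without touching the real environment or filesystem.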

The download/verify/atomic-install subcommand (mirrors postgres.go, with the
signed-manifest check from D9) is deferred until the stella-ml-runtime release
pipeline exists; the resolver already supports the installed layout it will
produce.

Tests: dev-override resolution and the not-installed path.

Refs: #597

Connects the pieces so local embedding actually runs. setupMLSidecar resolves the
ML runtime (mlruntime) and, when present, builds the supervisor + a LocalEmbedder
adapter (mapping the embedding query/document mode to the sidecar's query/passage
prefix), which is injected into the embedding lane. The supervisor is started on
the app context via the background task group, so shutdown reaps the sidecar by
cancelling the parent context and waiting. When no runtime is installed, both are nil and
local ML stays disabled with no error.
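
The mode mapping in the adapter can be pictured like this. The e5 models expect inputs prefixed with "query: " or "passage: "; the function name and the mode strings are assumptions, and whether the prefix is applied in the adapter or inside the sidecar is not specified here.

```go
package main

import "fmt"

// e5Input maps the embedding lane's mode to the e5 input prefix.
// "query" -> "query: ", "document" -> "passage: ".
func e5Input(mode, text string) (string, error) {
	switch mode {
	case "query":
		return "query: " + text, nil
	case "document":
		return "passage: " + text, nil
	default:
		return "", fmt.Errorf("unknown embedding mode %q", mode)
	}
}

func main() {
	q, _ := e5Input("query", "how do I rotate keys?")
	d, _ := e5Input("document", "Keys are rotated every 90 days.")
	fmt.Println(q)
	fmt.Println(d)
}
```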

With STELLA_ML_RUNTIME_DIR/STELLA_ML_MODEL_DIR pointing at a locally built sidecar
+ models and the embedding provider set to "local", the full stack runs offline,
key-less.

Verified end to end (env-gated integration test, skipped in CI): resolve ->
supervisor spawn -> health-gate -> LocalEmbedder.EmbedLocal(document) returns a
384-dim e5 vector -> clean reap on cancel.

Refs: #597