✨ feat(ml): built-in local OCR + embedding via a native stella-ml sidecar #598
Draft
vaayne wants to merge 10 commits into
Standalone POC module proving a Go-native OCR engine: PP-OCRv5 mobile det/rec ONNX models via onnxruntime_go, with pure-Go preprocessing, DB postprocess (connected components), and CTC decode. End-to-end image to text works on darwin/arm64. Isolated module (own go.mod) so it does not touch the stella build. onnxruntime lib, models, and binary are gitignored; README documents sources and reproduction. Refs: design notes in this branch; not yet tied to an issue/PR.
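The CTC decode step mentioned above can be sketched as a greedy decoder: take the argmax class at each timestep, merge consecutive repeats, and drop blanks. This is only an illustration, not the POC's code. It assumes class 0 is the blank and that `vocab[i]` holds the character for class `i+1`; the function name `ctcGreedyDecode` is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// ctcGreedyDecode turns per-timestep class scores into text.
// Assumptions (not taken from the POC): class 0 is the CTC blank and
// vocab[i] is the character for class i+1.
func ctcGreedyDecode(scores [][]float32, vocab []string) string {
	const blank = 0
	var sb strings.Builder
	prev := blank
	for _, step := range scores {
		// Argmax over the classes of this timestep.
		best := 0
		for c := 1; c < len(step); c++ {
			if step[c] > step[best] {
				best = c
			}
		}
		// Emit only on a change to a non-blank class.
		if best != blank && best != prev && best-1 < len(vocab) {
			sb.WriteString(vocab[best-1])
		}
		prev = best
	}
	return sb.String()
}

func main() {
	// Timesteps: a, a (repeat, merged), blank, a, b.
	scores := [][]float32{
		{0.1, 0.8, 0.1},
		{0.1, 0.7, 0.2},
		{0.9, 0.05, 0.05},
		{0.1, 0.8, 0.1},
		{0.1, 0.1, 0.8},
	}
	fmt.Println(ctcGreedyDecode(scores, []string{"a", "b"}))
}
```

The blank between the second and third "a" is what lets a doubled character survive the repeat-merge.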
Go-native text embeddings: intfloat/multilingual-e5-small via onnxruntime_go, with exact XLM-RoBERTa tokenization through daulet/tokenizers (HF Rust lib, statically linked). Mean-pool + L2 normalize, e5 query/passage prefixes.
Verified numerically equivalent to a Python reference (same ONNX + HF tokenizer), within float32 rounding: cosine 0.99999963, max diff 5e-7. int8 quant keeps cosine 0.99897 vs fp32 at 113M (vs 448M fp32) -> recommended built-in variant.
Separate module; models, libtokenizers.a, onnxruntime, and binary gitignored.
Refs: not yet tied to an issue/PR.
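As a rough illustration of the pooling step described above, the sketch below averages the token vectors selected by the attention mask and L2-normalizes the result, and shows the e5 `query: ` / `passage: ` prefix convention. The names `meanPoolL2` and `withPrefix` are hypothetical, and the real engine's tensor layout may differ.

```go
package main

import (
	"fmt"
	"math"
)

// meanPoolL2 averages the token vectors whose mask entry is non-zero and
// L2-normalizes the result. hidden is [tokens][dim]; names are hypothetical.
func meanPoolL2(hidden [][]float32, mask []int64) []float32 {
	if len(hidden) == 0 {
		return nil
	}
	dim := len(hidden[0])
	sum := make([]float64, dim)
	var n float64
	for t, vec := range hidden {
		if t >= len(mask) || mask[t] == 0 {
			continue // padding token: excluded from the mean
		}
		for i := 0; i < dim && i < len(vec); i++ {
			sum[i] += float64(vec[i])
		}
		n++
	}
	out := make([]float32, dim)
	if n == 0 {
		return out
	}
	var norm float64
	for i := range sum {
		sum[i] /= n
		norm += sum[i] * sum[i]
	}
	norm = math.Sqrt(norm)
	if norm == 0 {
		return out
	}
	for i := range sum {
		out[i] = float32(sum[i] / norm)
	}
	return out
}

// withPrefix applies the e5 input convention before tokenization.
func withPrefix(text string, isQuery bool) string {
	if isQuery {
		return "query: " + text
	}
	return "passage: " + text
}

func main() {
	hidden := [][]float32{{3, 0}, {0, 4}, {100, 100}}
	mask := []int64{1, 1, 0} // third token is padding
	fmt.Println(meanPoolL2(hidden, mask), withPrefix("hello", true))
}
```

Because the output is unit-length, cosine similarity between two such vectors reduces to a plain dot product.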
Proves the CGO sidecar (onnxruntime_go + daulet/tokenizers) builds for darwin+linux from a single macOS host.
Findings:
- onnxruntime_go is dlopen-only -> no lib needed at build time.
- darwin: native clang cross-compiles both arches (zig can't find macOS system libs without a sysroot).
- linux: zig cc + a 3-line no-op shim for two GNU libstdc++ iostream-init symbols that esaxx.cpp pulls in but never uses. linux/amd64 binary runs the real tokenizer in docker (ok 1 false).
- windows dropped (no prebuilt libtokenizers; decision).
Production note: build the linux sidecar natively in CI rather than rely on the shim. The spike stays the local-dev cross-build path.
Refs #595 #596
Phase 1a/1b of the native ML sidecar. A separate CGO module hosts onnxruntime + the HF tokenizer + models and serves them to the pure-Go stellad over an HTTP-on-unix-socket contract. stellad stays CGO_ENABLED=0; the parent module's `go build ./...` skips this nested module automatically.
Phase 1a (contract): versioned protocol (/healthz, /v1/embed, /v1/extract); per-request identity headers (tenant/request-id/deadline) fixed up front so the fairness layer never retrofits identity; separate per-endpoint lanes + per-tenant in-flight cap so one tenant's load can't starve another; request caps; pinned ORT thread counts; 0700 socket dir / 0600 socket with stale-socket probe; graceful shutdown on signal. /v1/extract returns 501 until the OCR engine lands (4a).
Phase 1b (embed): multilingual-e5-small ONNX engine promoted from the POC, with a padded-batch encoder. Output matches the Python reference to within float32 rounding (cosine 1.0000003, max elementwise diff 5e-7) and vectors are L2-normalized.
Smoke-tested on darwin/arm64 against the POC model: /healthz reports version + digest; /v1/embed returns count*dim*4 LE-f32 bytes; oversized batch -> 413; /v1/extract -> 501.
Refs: #597, #595, #596
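To make the `/v1/embed` wire format concrete, here is a minimal sketch of decoding a `count*dim*4`-byte little-endian float32 body. The function name `decodeVectors` and the row-major layout (vector 0 first, then vector 1, and so on) are assumptions, not taken from the sidecar code.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// decodeVectors splits a little-endian float32 blob into count vectors of
// length dim. Row-major layout is an assumption of this sketch.
func decodeVectors(blob []byte, count, dim int) ([][]float32, error) {
	if count <= 0 || dim <= 0 {
		return nil, fmt.Errorf("invalid shape %dx%d", count, dim)
	}
	if len(blob) != count*dim*4 {
		return nil, fmt.Errorf("body is %d bytes, want %d", len(blob), count*dim*4)
	}
	out := make([][]float32, count)
	for i := range out {
		vec := make([]float32, dim)
		for j := range vec {
			off := (i*dim + j) * 4
			bits := binary.LittleEndian.Uint32(blob[off : off+4])
			vec[j] = math.Float32frombits(bits)
		}
		out[i] = vec
	}
	return out, nil
}

func main() {
	// 1.0 and 2.0 as little-endian float32.
	blob := []byte{0x00, 0x00, 0x80, 0x3f, 0x00, 0x00, 0x00, 0x40}
	vecs, err := decodeVectors(blob, 1, 2)
	fmt.Println(vecs, err)
}
```

Checking the byte length against the advertised count and dim before decoding rejects a truncated or mismatched response early.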
The pure-Go (CGO_ENABLED=0) half of the Phase 1a contract.
Client: UDS http.Client speaking protocol v1: Health(), Embed() (decodes the LE-f32 blob, validates the protocol/count/dim response headers and refuses a sidecar on an unsupported protocol version), Extract() (ready for Phase 4a). Every request carries tenant + a generated request-id + a deadline derived from the context.
Supervisor: lazy spawn, background health-gate with a Ready() barrier, exponential-backoff restart that resets once a process stays up past StableFor, and reap on context cancel (SIGTERM then SIGKILL via WaitDelay). Linux adds Pdeathsig=SIGKILL so a hard-killed stellad never orphans the sidecar; darwin relies on the graceful path + the next start's stale-socket probe.
Tests: client decode/validation over a unix httptest server (pure Go, runs in CI). An env-gated integration test drives spawn -> health-gate -> embed -> reap against a real stella-ml binary; verified locally end to end (384-dim vector, clean shutdown, socket removed).
Refs: #597
Phase 3. The embedding lane was hardwired to a remote API: river.go gated the
whole lane on a non-empty APIKey, and the canonical vector space was always
api.SpaceKey() — so an offline, key-less deployment could not run embeddings at
all, and any non-API provider's vectors were rejected by the indexer's
canonical-space guard.
Decouple the enable-gate from the API key and let the canonical space follow the
selected provider:
- config.EmbeddingSettings gains Provider ("api" default | "local"); a pre-Provider
config loads as "api" (backward compatible). LoadEmbeddingSettings defaults it.
- river.go resolve() branches on provider. "api" keeps the exact prior behavior
(gate on APIKey, space = model@dim). "local" gates on a wired sidecar embedder
instead of a key, with space = the e5-small key.
- localProvider wraps a LocalEmbedder (sidecar adapter; injected via BootConfig so
the embedding package keeps no internal/ml dependency), Kind=local, mapping the
query/document mode through and stamping the e5 space. A vector-count mismatch is
Terminal (no failover).
- indexer.go is unchanged: because the space now follows the primary provider, the
canonical-space guard accepts the local vectors instead of rejecting them.
Deferred (intentionally): "auto" cross-space failover (the risky query-degrade
piece) and the UI/API toggle, which is only meaningful once the sidecar is wired
(Phase 2) — until then selecting "local" simply disables the lane.
Tests: local provider mode/space mapping, count-mismatch-is-Terminal, resolve for
the local path, local-without-sidecar => ErrDisabled, sensitive request served
locally, provider default backward-compat. Existing API-path tests unchanged.
Refs: #597
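The gating rules in this commit can be summarized in a small sketch: an empty provider defaults to "api", "api" gates on the key with space = model@dim, and "local" gates on a wired embedder. The type shapes, the `resolveSpace` name, the `EmbedLocal` signature, and the `localSpaceKey` value below are all hypothetical stand-ins, not the actual `river.go` code.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrDisabled mirrors the "lane disabled" outcome named in the tests above.
var ErrDisabled = errors.New("embedding lane disabled")

// localSpaceKey is a placeholder; the real e5-small space key is not shown here.
const localSpaceKey = "local:multilingual-e5-small@384"

// EmbeddingSettings is a simplified, hypothetical shape of the config.
type EmbeddingSettings struct {
	Provider string // "" (treated as "api"), "api", or "local"
	APIKey   string
	Model    string
	Dim      int
}

// LocalEmbedder stands in for the injected sidecar adapter (signature assumed).
type LocalEmbedder interface {
	EmbedLocal(mode string, texts []string) ([][]float32, error)
}

// resolveSpace returns the canonical vector space for the selected provider,
// or ErrDisabled when that provider's gate is not satisfied.
func resolveSpace(s EmbeddingSettings, local LocalEmbedder) (string, error) {
	provider := s.Provider
	if provider == "" {
		provider = "api" // pre-Provider configs load as "api"
	}
	switch provider {
	case "api":
		if s.APIKey == "" {
			return "", ErrDisabled
		}
		return fmt.Sprintf("%s@%d", s.Model, s.Dim), nil
	case "local":
		if local == nil {
			return "", ErrDisabled // no sidecar wired
		}
		return localSpaceKey, nil
	default:
		return "", fmt.Errorf("unknown embedding provider %q", provider)
	}
}

// stubEmbedder is a test double for the sidecar adapter.
type stubEmbedder struct{}

func (stubEmbedder) EmbedLocal(mode string, texts []string) ([][]float32, error) {
	return make([][]float32, len(texts)), nil
}

func main() {
	fmt.Println(resolveSpace(EmbeddingSettings{Provider: "local"}, stubEmbedder{}))
	fmt.Println(resolveSpace(EmbeddingSettings{Provider: "local"}, nil))
}
```

Since the space is derived from the selected provider, an indexer guard that compares against it accepts local vectors without any change of its own.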
Phase 2 foundation, mirroring internal/pgruntime. Resolves the sidecar paths (stella-ml binary, libonnxruntime dir, e5 model, tokenizer) from two independently versioned install roots under STELLA_HOME (a runtime bundle with the binary + onnxruntime, and a model bundle), so swapping models never re-downloads the libraries.
A dev override (STELLA_ML_RUNTIME_DIR / STELLA_ML_MODEL_DIR) points the resolver at a locally built sidecar + models, so local development and the supervisor wiring need no published release. A not-found result is a clean "local ML unavailable" signal (no error), so the feature degrades to disabled exactly like an unconfigured lane. darwin + linux only.
The download/verify/atomic-install subcommand (mirrors postgres.go, with the signed-manifest check from D9) is deferred until the stella-ml-runtime release pipeline exists; the resolver already supports the installed layout it will produce.
Tests: dev-override resolution and the not-installed path.
Refs: #597
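A minimal sketch of the dev-override path follows, showing the "not found is not an error" contract. The environment variable names come from the commit message; the `Paths` struct, the `resolveDev` name, and every file name inside the directories (`stella-ml`, `lib`, `model.onnx`, `tokenizer.json`) are assumptions. The installed layout under STELLA_HOME is omitted.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// Paths is a hypothetical result type for the resolver.
type Paths struct {
	Binary, ORTDir, Model, Tokenizer string
}

// resolveDev handles only the dev override. It returns ok=false, with no
// error, whenever local ML is unavailable. File names are assumptions.
func resolveDev(getenv func(string) string) (Paths, bool) {
	rt := getenv("STELLA_ML_RUNTIME_DIR")
	md := getenv("STELLA_ML_MODEL_DIR")
	if rt == "" || md == "" {
		return Paths{}, false
	}
	p := Paths{
		Binary:    filepath.Join(rt, "stella-ml"),
		ORTDir:    filepath.Join(rt, "lib"),
		Model:     filepath.Join(md, "model.onnx"),
		Tokenizer: filepath.Join(md, "tokenizer.json"),
	}
	// Any missing file means "not installed", which is a clean signal.
	for _, f := range []string{p.Binary, p.Model, p.Tokenizer} {
		if _, err := os.Stat(f); err != nil {
			return Paths{}, false
		}
	}
	return p, true
}

func main() {
	if p, ok := resolveDev(os.Getenv); ok {
		fmt.Println("local ML available:", p.Binary)
	} else {
		fmt.Println("local ML unavailable; lane stays disabled")
	}
}
```

Taking `getenv` as a parameter keeps the resolver testable without mutating the process environment.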
Connects the pieces so local embedding actually runs. setupMLSidecar resolves the ML runtime (mlruntime) and, when present, builds the supervisor + a LocalEmbedder adapter (mapping the embedding query/document mode to the sidecar's query/passage prefix), which is injected into the embedding lane. The supervisor is started on the app context via the background task group, so shutdown reaps the sidecar by cancelling the parent context and waiting. When no runtime is installed, both are nil and local ML stays disabled with no error.
With STELLA_ML_RUNTIME_DIR/STELLA_ML_MODEL_DIR pointing at a locally built sidecar + models and the embedding provider set to "local", the full stack runs offline and key-less.
Verified end to end (env-gated integration test, skipped in CI): resolve -> supervisor spawn -> health-gate -> LocalEmbedder.EmbedLocal(document) returns a 384-dim e5 vector -> clean reap on cancel.
Refs: #597
This was referenced Jun 26, 2026
What
Built-in, offline, private OCR (PaddleOCR PP-OCRv5 ONNX) and embedding
(`intfloat/multilingual-e5-small` ONNX), hosted in a single native CGO sidecar
(`stella-ml`) that the `CGO_ENABLED=0` main binary drives over a unix-socket
contract. Native runtime + models are runtime-downloaded, mirroring the embedded
Postgres runtime.
Why
- `internal/embedding` has a local/privacy lane but no local provider; privacy
embedding fails instead of running offline.
- `internal/document` (✨ feat: add native document extraction seam #596) routes the
`read` tool through an `Extractor` seam, but the only non-vision backend shells out to
`kreuzberg`.
How
One CGO sidecar hosts onnxruntime + HF tokenizers + models; stellad stays pure-Go
and is a UDS client + supervisor. Phased (per the `native-ml-sidecar` spec plan):
… `pgruntime`)
Constraints: darwin + linux only (no prebuilt `libtokenizers` for windows); main
binary stays `CGO_ENABLED=0`; the sidecar is the only CGO artifact (a nested module,
so the parent's `go build ./...` skips it).
Landed so far
- `cmd/stella-ml/`: nested CGO module. Versioned HTTP-over-UDS (`/healthz`,
`/v1/embed`, `/v1/extract` stub), per-tenant fairness lanes, request caps, pinned
ORT threads, 0700/0600 socket with stale-socket probe, graceful shutdown. e5
batch encoder; output bit-equivalent to the Python reference (cosine 1.0000003).
- `internal/ml/`: pure-Go client (decodes the f32 blob, validates the protocol
handshake) + supervisor (lazy spawn, health-gate, backoff restart, reap;
Pdeathsig on linux). Client unit tests + an env-gated real-binary integration
test.
Refs
`document.Extractor` seam)