perf(serving): lazy encoder loading - models load on first use, [hardware] lazy_encoders knob (#219 slice 2) by mbachaud · Pull Request #220 · mbachaud/helix-context

mbachaud · 2026-06-12T22:15:37Z

Epic #219 slice 2 (council Option-A rider). A serving process at 829K genes ran 20.3 GB RSS with the encoder stack loaded eagerly in every process (tray backend, bench server, and workers each paid for it: the #176/#191 3-CUDA-context incident class).

Two eager sites were found on the serving path and replaced with lazy proxies: SemaCodec in HelixContextManager.__init__ (LazySemaCodec: find_spec probe, double-checked-lock construction on first use, failure degrades to SEMA-disabled exactly like the legacy except path) and DeBERTaRibosome (LazyRibosome, built with byte-identical ctor args on the first re_rank/splice). Verified already lazy and left alone: SPLADE, the BGE-M3 shared codec, the spaCy tagger, NLI.

[hardware] lazy_encoders = true enables the proxies (false restores eager boot warmup). /admin/components now reports "idle (not loaded)" vs loaded via non-forcing probes and includes the dense codec.

14 new tests using counting-fake ctors (init builds nothing, first use builds once, 8-thread race builds once, knob-off is eager, panel is non-forcing); 103 passed locally across lazy+server+config; ~2,240 sandbox sweep clean (4 pre-existing env failures).
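The LazySemaCodec behaviour summarized above (availability probe without import, double-checked-lock construction on first use, cached failure) can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from this PR: the class name `LazyEncoder`, the `factory` argument, and returning `None` when the encoder is disabled are all hypothetical.

```python
import importlib.util
import threading
from typing import Any, Callable, Optional


class LazyEncoder:
    """Hypothetical stand-in for a proxy like LazySemaCodec (not the project's code)."""

    def __init__(self, factory: Callable[[], Any], required_module: str) -> None:
        self._factory = factory
        # find_spec locates a top-level module without importing it.
        self._available = importlib.util.find_spec(required_module) is not None
        self._lock = threading.Lock()
        self._instance: Optional[Any] = None
        self._failed = False

    @property
    def loaded(self) -> bool:
        """True once the real encoder exists; never triggers a load."""
        return self._instance is not None

    def peek(self) -> Optional[Any]:
        """Return the real encoder if already built, else None; never builds."""
        return self._instance

    def _get(self) -> Optional[Any]:
        # First check, lock-free: the common path after the first use.
        if self._instance is not None or self._failed or not self._available:
            return self._instance
        with self._lock:
            # Second check, under the lock: another thread may have won the race.
            if self._instance is None and not self._failed:
                try:
                    self._instance = self._factory()
                except Exception:
                    # Cache the failure so later calls do not retry the load.
                    self._failed = True
        return self._instance

    def encode(self, text: str) -> Optional[Any]:
        impl = self._get()
        # Assumption: a disabled encoder is signalled by returning None.
        return None if impl is None else impl.encode(text)
```

Constructing `LazyEncoder` costs one `find_spec` lookup and a lock; the factory runs only when `encode` is first called, and at most once even if it raises.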

…ware] lazy_encoders knob (#219 slice 2)

A serving process at 829K genes ran 20.3 GB RSS because the encoder stack loaded eagerly on the import/__init__ path: every process (tray backend, bench server, each build worker) paid the full stack whether or not the workload touched it. This is the #176/#191 3-CUDA-context WDDM spill incident class.

Eager sites found and deferred (council Option-A rider):

- SemaCodec in HelixContextManager.__init__: constructed the MiniLM sentence-transformer + 20-anchor projection at boot. Now a LazySemaCodec proxy (backends/sema.py): availability is probed via find_spec (no import); construction happens on the first encode under a double-checked lock; pure-math statics (similarity/nearest) pass through without forcing a load; a construction failure is cached and degrades to "SEMA disabled" exactly like the old eager except-path.
- DeBERTaRibosome in HelixContextManager.__init__ (backend=deberta): two DeBERTa-v3 model loads at boot. Now a LazyRibosome proxy that builds on the first re_rank/splice/classify access with byte-identical ctor args; factory failure permanently falls back to the disabled ribosome (the old except-branch end state); private/dunder lookups never materialize, so introspection stays load-free.

Verified already-lazy and left untouched:

- SPLADE (module globals load inside _ensure_loaded on first encode), BGE-M3 (BGEM3Codec._load + get_shared_codec are first-use behind the instance lock), spaCy (tagger._get_nlp on first pack; CpuTagger ctor is light), NLI (lazy inside DeBERTaRibosome).
- scripts/backfill_* and build_fixture_matrix workers keep their intentional eager loads; scope is the serving path only.

[hardware] lazy_encoders = true (default) arms the lazy proxies; false restores the pre-slice eager warmup at manager init for operators who want first-query latency paid at boot.
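The LazyRibosome rules described above (build on first public attribute access, never materialize on private/dunder lookups, fall back permanently to a disabled object if the factory fails, and build eagerly when the knob is off) might look like the sketch below. The names `LazyAttrProxy` and `DisabledRibosome`, the `lazy` constructor flag standing in for lazy_encoders, and the `re_rank` signature are assumptions made for illustration only.

```python
import threading
from typing import Any, Callable, List, Optional


class DisabledRibosome:
    """Hypothetical no-op fallback: returns candidates in their original order."""

    def re_rank(self, query: str, candidates: List[Any]) -> List[Any]:
        return list(candidates)


class LazyAttrProxy:
    """Hypothetical attribute-forwarding proxy in the spirit of LazyRibosome."""

    def __init__(self, factory: Callable[[], Any], fallback: Any, lazy: bool = True) -> None:
        self._factory = factory
        self._fallback = fallback
        self._lock = threading.Lock()
        self._target: Optional[Any] = None
        if not lazy:
            # Knob off: pay the construction cost at init, as before the change.
            self._materialize()

    @property
    def loaded(self) -> bool:
        return self._target is not None

    def _materialize(self) -> Any:
        if self._target is None:
            with self._lock:
                if self._target is None:
                    try:
                        self._target = self._factory()
                    except Exception:
                        # Permanent fallback: the factory is never retried.
                        self._target = self._fallback
        return self._target

    def __getattr__(self, name: str) -> Any:
        # Only reached when normal lookup fails. Private and dunder names are
        # refused so hasattr()/copy/pickle style introspection cannot force a load.
        if name.startswith("_"):
            raise AttributeError(name)
        return getattr(self._materialize(), name)
```

Because `__getattr__` is consulted only after normal lookup fails, the proxy's own attributes resolve without recursion, and a probe such as `hasattr(proxy, "__len__")` returns False without constructing anything.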
GET /admin/components now reports per-component loaded-ness WITHOUT forcing a load: "idle (not loaded)" vs running/idle plus an explicit "loaded" bool, probing only lazy-proxy peek()/loaded, the spaCy/SPLADE module globals, and BGEM3Codec._model. Also surfaces the BGE-M3 dense codec as a component and no longer crashes on a backend-less DeBERTa ribosome.

Expected impact: idle serving RSS drops by the resident encoder stack per process (MiniLM + projection; DeBERTa pair when configured), and multi-process deployments (tray + bench + workers) stop multiplying it; behavior after first use is identical by construction.

Tests: tests/test_lazy_encoders.py, 14 tests, monkeypatched counting constructors, no models:

- init constructs nothing;
- first use constructs exactly once;
- 8-thread concurrent first use constructs once;
- lazy_encoders=false restores eager init;
- /admin/components reports unloaded state without triggering loads.

Full non-live sweep green except 4 pre-existing environmental failures reproduced on pristine HEAD (test_build_fixture_matrix_parallel x2, batched_in x2).
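The counting-constructor test pattern and the non-forcing status probe described above can be illustrated with a self-contained sketch. None of this is taken from tests/test_lazy_encoders.py: `LazyHolder`, `CountingFake`, `component_status`, and `race_first_use` are hypothetical names, and the status dictionary only mimics the "idle (not loaded)" plus "loaded" bool shape mentioned in the commit message.

```python
import threading
from typing import Any, Callable, Dict, List, Optional


class LazyHolder:
    """Minimal hypothetical lazy holder, just enough to exercise the pattern."""

    def __init__(self, factory: Callable[[], Any]) -> None:
        self._factory = factory
        self._lock = threading.Lock()
        self._instance: Optional[Any] = None

    def peek(self) -> Optional[Any]:
        # Non-forcing: reports what exists without building anything.
        return self._instance

    def get(self) -> Any:
        if self._instance is None:
            with self._lock:
                if self._instance is None:
                    self._instance = self._factory()
        return self._instance


class CountingFake:
    """Fake encoder whose constructor only counts calls; no model is loaded."""

    built = 0
    _count_lock = threading.Lock()

    def __init__(self) -> None:
        with CountingFake._count_lock:
            CountingFake.built += 1


def component_status(holder: LazyHolder) -> Dict[str, Any]:
    """Report loaded-ness using peek() only, so the probe never triggers a load."""
    loaded = holder.peek() is not None
    return {"status": "idle" if loaded else "idle (not loaded)", "loaded": loaded}


def race_first_use(holder: LazyHolder, n_threads: int = 8) -> List[Any]:
    """Release n_threads at once so they all hit first use concurrently."""
    barrier = threading.Barrier(n_threads)
    results: List[Any] = []

    def worker() -> None:
        barrier.wait()
        results.append(holder.get())

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

A test built on this would assert that `CountingFake.built` is 0 after creating the holder and after calling `component_status`, and exactly 1 after `race_first_use`, with all eight threads receiving the same instance.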

mbachaud merged commit 3aa6c5c into master Jun 12, 2026
3 checks passed

mbachaud deleted the perf/lazy-encoder-loading branch June 12, 2026 22:18

mbachaud merged 1 commit into master from perf/lazy-encoder-loading
