perf(serving): lazy encoder loading - models load on first use, [hardware] lazy_encoders knob (#219 slice 2) #220
Merged
…ware] lazy_encoders knob (#219 slice 2)

A serving process at 829K genes ran 20.3 GB RSS because the encoder stack loaded eagerly on the import/__init__ path: every process (tray backend, bench server, each build worker) paid the full stack whether or not the workload touched it. This is the #176/#191 3-CUDA-context WDDM spill incident class.

Eager sites found and deferred (council Option-A rider):

- SemaCodec in HelixContextManager.__init__: constructed the MiniLM sentence-transformer + 20-anchor projection at boot. Now a LazySemaCodec proxy (backends/sema.py): availability is probed via find_spec (no import); construction happens on the first encode under a double-checked lock; pure-math statics (similarity/nearest) pass through without forcing a load; a construction failure is cached and degrades to "SEMA disabled" exactly like the old eager except-path.
- DeBERTaRibosome in HelixContextManager.__init__ (backend=deberta): two DeBERTa-v3 model loads at boot. Now a LazyRibosome proxy that builds on the first re_rank/splice/classify access with byte-identical ctor args; factory failure permanently falls back to the disabled ribosome (the old except-branch end state); private/dunder lookups never materialize, so introspection stays load-free.

Verified already-lazy and left untouched:

- SPLADE (module globals load inside _ensure_loaded on first encode), BGE-M3 (BGEM3Codec._load + get_shared_codec are first-use behind the instance lock), spaCy (tagger._get_nlp on first pack; CpuTagger ctor is light), NLI (lazy inside DeBERTaRibosome).
- scripts/backfill_* and build_fixture_matrix workers keep their intentional eager loads; scope is the serving path only.

[hardware] lazy_encoders = true (default) arms the lazy proxies; false restores the pre-slice eager warmup at manager init for operators who want first-query latency paid at boot.
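
For readers unfamiliar with the pattern, the proxy behavior described above (find_spec availability probe, double-checked-lock construction on first use, cached failure, no materialization on private/dunder lookups) can be sketched roughly as follows. This is a minimal illustration, not the code in backends/sema.py: the class name `LazyProxy`, its constructor parameters, and the `_UNSET` sentinel are hypothetical, and the real LazySemaCodec / LazyRibosome may differ in every detail.

```python
import importlib.util
import threading

_UNSET = object()  # sentinel: construction has not been attempted yet


class LazyProxy:
    """Hypothetical stand-in for a lazy encoder proxy (names are assumptions)."""

    def __init__(self, factory, required_module=None, fallback=None):
        self._factory = factory                  # builds the real, heavy object
        self._required_module = required_module  # top-level module to probe
        self._fallback = fallback                # "disabled" object used on failure
        self._target = _UNSET
        self._failed = False
        self._lock = threading.Lock()

    @property
    def available(self):
        # find_spec locates a top-level module without importing it.
        if self._required_module is None:
            return True
        return importlib.util.find_spec(self._required_module) is not None

    @property
    def loaded(self):
        # True only once the real object has been built successfully.
        return self._target is not _UNSET and not self._failed

    def peek(self):
        """Return the built object (or None) without triggering a build."""
        return None if self._target is _UNSET else self._target

    def _materialize(self):
        target = self._target
        if target is _UNSET:                 # first check, lock-free fast path
            with self._lock:
                target = self._target
                if target is _UNSET:         # second check, under the lock
                    try:
                        target = self._factory()
                    except Exception:
                        # Cache the failure: later calls reuse the fallback
                        # instead of retrying the expensive construction.
                        self._failed = True
                        target = self._fallback
                    self._target = target
        return target

    def __getattr__(self, name):
        # Called only when normal lookup fails. Private/dunder names never
        # force a build, so hasattr()/copy/introspection stay load-free.
        if name.startswith("_"):
            raise AttributeError(name)
        return getattr(self._materialize(), name)
```

In this sketch, the first public attribute access (for example `proxy.encode(...)`) pays the construction cost once; every later access goes through the lock-free fast path.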
GET /admin/components now reports per-component loaded-ness WITHOUT forcing a load: "idle (not loaded)" vs running/idle plus an explicit "loaded" bool, probing only lazy-proxy peek()/loaded, the spaCy/SPLADE module globals, and BGEM3Codec._model. Also surfaces the BGE-M3 dense codec as a component and no longer crashes on a backend-less DeBERTa ribosome.

Expected impact: idle serving RSS drops by the resident encoder stack per process (MiniLM + projection; DeBERTa pair when configured), and multi-process deployments (tray + bench + workers) stop multiplying it; behavior after first use is identical by construction.

Tests: tests/test_lazy_encoders.py, 14 tests, monkeypatched counting constructors, no models:

- init constructs nothing;
- first use constructs exactly once;
- 8-thread concurrent first use constructs once;
- lazy_encoders=false restores eager init;
- /admin/components reports unloaded state without triggering loads.

Full non-live sweep green except 4 pre-existing environmental failures reproduced on pristine HEAD (test_build_fixture_matrix_parallel x2, batched_in x2).
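
As an illustration of the non-forcing status probe, here is a rough sketch of how a panel row could be derived from a lazy holder's `loaded` flag alone. Everything here is assumed for the example: `CountingLazy`, `describe_component`, the `busy` flag, and the dict keys are hypothetical and do not reflect the actual /admin/components handler or its response schema. Only the status strings and the explicit "loaded" bool come from the description above.

```python
class CountingLazy:
    """Hypothetical lazy holder that counts how often it builds."""

    def __init__(self, factory):
        self._factory = factory
        self._obj = None
        self.builds = 0  # lets a test prove the probe never forced a load

    @property
    def loaded(self):
        return self._obj is not None

    def peek(self):
        # Non-forcing: returns the object if built, otherwise None.
        return self._obj

    def get(self):
        # Forcing path, used by real work (not by the status panel).
        if self._obj is None:
            self.builds += 1
            self._obj = self._factory()
        return self._obj


def describe_component(name, holder, busy=False):
    """Build one status row by reading holder.loaded only; never calls get()."""
    if not holder.loaded:
        status = "idle (not loaded)"
    else:
        status = "running" if busy else "idle"
    return {"name": name, "status": status, "loaded": holder.loaded}
```

The point of the design is that the status path only ever reads state that already exists, so polling the panel cannot inflate RSS on an otherwise idle process.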
Epic #219 slice 2 (council Option-A rider).

A serving process at 829K genes ran 20.3 GB RSS with the encoder stack loaded eagerly per process (tray backend + bench server + workers each paying it; the #176/#191 3-CUDA-context incident class).

Two eager sites were found on the serving path and proxied:

- SemaCodec in HelixContextManager.__init__ (LazySemaCodec: find_spec probe, double-checked-lock first-use construction, failure degrades to SEMA-disabled exactly like the legacy except path).
- DeBERTaRibosome (LazyRibosome, byte-identical ctor args on first re_rank/splice).

Verified already-lazy and left alone: SPLADE, BGE-M3 shared codec, spaCy tagger, NLI.

[hardware] lazy_encoders = true is the default (false restores eager boot warmup). /admin/components now reports idle-(not-loaded) vs loaded via non-forcing probes and includes the dense codec.

14 new tests (counting-fake ctors: init builds nothing, first use builds once, 8-thread race builds once, knob-off eager, panel non-forcing); 103 passed locally across lazy+server+config; ~2,240 sandbox sweep clean (4 pre-existing env failures).
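
The "8-thread race builds once" check can be approximated with a counting fake constructor and a barrier that releases all threads at the same moment. The sketch below is an assumption about the shape of such a test, not a copy of tests/test_lazy_encoders.py; `LazyOnce` and `race_first_use` are hypothetical names.

```python
import threading
import time


class LazyOnce:
    """Hypothetical holder using double-checked locking for first use."""

    def __init__(self, factory):
        self._factory = factory
        self._obj = None
        self._lock = threading.Lock()

    def get(self):
        if self._obj is None:              # lock-free fast path
            with self._lock:
                if self._obj is None:      # re-check under the lock
                    self._obj = self._factory()
        return self._obj


def race_first_use(n_threads=8):
    """Hit get() from n_threads at once; return (build_count, results)."""
    count = 0
    count_lock = threading.Lock()

    def counting_ctor():
        nonlocal count
        with count_lock:
            count += 1
        time.sleep(0.01)  # widen the window in which a race could occur
        return object()   # stands in for a model; nothing heavy is loaded

    lazy = LazyOnce(counting_ctor)
    barrier = threading.Barrier(n_threads)
    results = [None] * n_threads

    def worker(i):
        barrier.wait()            # all threads start their first use together
        results[i] = lazy.get()

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return count, results
```

A passing run shows exactly one construction and every thread receiving the same instance.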