perf(serving): lazy encoder loading - models load on first use, [hardware] lazy_encoders knob (#219 slice 2)#220

Merged
mbachaud merged 1 commit into master from perf/lazy-encoder-loading on Jun 12, 2026
@mbachaud

Owner

Epic #219 slice 2 (council Option-A rider). A serving process at 829K genes ran 20.3 GB RSS with the encoder stack loaded eagerly in every process: the tray backend, the bench server, and the workers each paid for it (the #176/#191 3-CUDA-context incident class).

Two eager sites were found on the serving path and replaced with lazy proxies:

- SemaCodec in HelixContextManager.__init__ is now LazySemaCodec: find_spec probe, double-checked-lock construction on first use, and a failure degrades to SEMA-disabled exactly like the legacy except path.
- DeBERTaRibosome is now LazyRibosome, built with byte-identical ctor args on the first re_rank/splice.

Verified already-lazy and left alone: SPLADE, BGE-M3 shared codec, spaCy tagger, NLI.

[hardware] lazy_encoders = true arms the lazy proxies (false restores eager boot warmup). /admin/components now reports "idle (not loaded)" vs loaded via non-forcing probes and includes the dense codec.

Tests: 14 new tests using counting-fake ctors (init builds nothing, first use builds once, 8-thread race builds once, knob-off eager, panel non-forcing); 103 passed locally across lazy+server+config; ~2,240 sandbox sweep clean (4 pre-existing env failures).

…ware] lazy_encoders knob (#219 slice 2)

A serving process at 829K genes ran 20.3 GB RSS because the encoder
stack loaded eagerly on the import/__init__ path: every process (tray
backend, bench server, each build worker) paid the full stack whether
or not the workload touched it — the #176/#191 3-CUDA-context WDDM
spill incident class.

Eager sites found and deferred (council Option-A rider):

- SemaCodec in HelixContextManager.__init__ — constructed the MiniLM
  sentence-transformer + 20-anchor projection at boot. Now a
  LazySemaCodec proxy (backends/sema.py): availability probed via
  find_spec (no import); construction happens on the first encode under
  a double-checked lock; pure-math statics (similarity/nearest) pass
  through without forcing a load; a construction failure is cached and
  degrades to "SEMA disabled" exactly like the old eager except-path.
- DeBERTaRibosome in HelixContextManager.__init__ (backend=deberta) —
  two DeBERTa-v3 model loads at boot. Now a LazyRibosome proxy that
  builds on the first re_rank/splice/classify access with byte-identical
  ctor args; factory failure permanently falls back to the disabled
  ribosome (the old except-branch end state); private/dunder lookups
  never materialize, so introspection stays load-free.
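
The proxy behaviour described in the two bullets above can be sketched roughly as follows. This is a minimal illustration, not the code in backends/sema.py: the class name `LazyProxy`, its `factory` argument, and the choice to raise `AttributeError` when the encoder is disabled are all assumptions made for the example. Only `peek()`, `loaded`, the double-checked lock, the cached failure, and the private/dunder rule come from the description.

```python
import threading
from typing import Any, Callable, Optional


class LazyProxy:
    """Hypothetical stand-in for LazySemaCodec / LazyRibosome."""

    def __init__(self, factory: Callable[[], Any]) -> None:
        self._factory = factory          # the expensive constructor call
        self._lock = threading.Lock()
        self._instance: Optional[Any] = None
        self._failed = False             # a failed build is cached, never retried

    @property
    def loaded(self) -> bool:
        return self._instance is not None

    def peek(self) -> Optional[Any]:
        """Return the built object or None without triggering a build."""
        return self._instance

    def _materialize(self) -> Optional[Any]:
        # First check, no lock: cheap once the outcome is known.
        if self._instance is not None or self._failed:
            return self._instance
        with self._lock:
            # Second check: another thread may have built while we waited.
            if self._instance is None and not self._failed:
                try:
                    self._instance = self._factory()
                except Exception:
                    self._failed = True  # degrade to "disabled"
        return self._instance

    def __getattr__(self, name: str) -> Any:
        # Called only for names not found on the proxy itself.
        # Private/dunder lookups never force a load (keeps introspection cheap).
        if name.startswith("_"):
            raise AttributeError(name)
        target = self._materialize()
        if target is None:
            raise AttributeError(f"{name!r} unavailable: encoder disabled")
        return getattr(target, name)
```

With this shape, `LazyProxy(SomeEncoder)` builds nothing at construction time, the first public method call builds the encoder once, and later calls go through the lock-free fast path.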

Verified already-lazy and left untouched:
- SPLADE (module globals load inside _ensure_loaded on first encode),
  BGE-M3 (BGEM3Codec._load + get_shared_codec are first-use behind the
  instance lock), spaCy (tagger._get_nlp on first pack; CpuTagger ctor
  is light), NLI (lazy inside DeBERTaRibosome).
- scripts/backfill_* and build_fixture_matrix workers keep their
  intentional eager loads — scope is the serving path only.

[hardware] lazy_encoders = true (default) arms the lazy proxies; false
restores the pre-slice eager warmup at manager init for operators who
want first-query latency paid at boot.
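
As a rough illustration of how such a knob could gate construction at manager init: the sketch below assumes the parsed config is available as a nested mapping, and the names `lazy_encoders_enabled`, `init_encoder`, and `Deferred` are hypothetical, not taken from the codebase. `Deferred` is a deliberately simplified, non-thread-safe stand-in for the real lazy proxies.

```python
from typing import Any, Callable, Mapping


def lazy_encoders_enabled(settings: Mapping[str, Any]) -> bool:
    """Read [hardware] lazy_encoders; a missing key means the default (true)."""
    return bool(settings.get("hardware", {}).get("lazy_encoders", True))


class Deferred:
    """Simplified placeholder for a lazy proxy (no locking, illustration only)."""

    def __init__(self, factory: Callable[[], Any]) -> None:
        self.factory = factory
        self.value = None

    def get(self) -> Any:
        if self.value is None:
            self.value = self.factory()
        return self.value


def init_encoder(settings: Mapping[str, Any], factory: Callable[[], Any]) -> Any:
    """Return a deferred wrapper when the knob is on, else build eagerly."""
    if lazy_encoders_enabled(settings):
        return Deferred(factory)   # nothing is constructed at boot
    return factory()               # knob off: load cost is paid at boot
```

The key property is that the same factory is used on both paths, so the object that eventually exists is the same regardless of when it is built.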

GET /admin/components now reports per-component loaded-ness WITHOUT
forcing a load: "idle (not loaded)" vs running/idle plus an explicit
"loaded" bool, probing only lazy-proxy peek()/loaded, the spaCy/SPLADE
module globals, and BGEM3Codec._model. Also surfaces the BGE-M3 dense
codec as a component and no longer crashes on a backend-less DeBERTa
ribosome.
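
A non-forcing probe of this kind might look like the sketch below. The function names `is_loaded` and `component_status`, the `busy` flag, and the response field names other than "loaded" are assumptions for illustration; the actual handler and payload shape are not shown in this PR text.

```python
from typing import Any, Dict


def is_loaded(component: Any) -> bool:
    """Report whether a component is resident, reading state only."""
    if component is None:
        # e.g. a ribosome with no backend: report unloaded instead of crashing
        return False
    # Look up peek() on the class so a proxy's __getattr__ is never invoked.
    peek = getattr(type(component), "peek", None)
    if callable(peek):
        return peek(component) is not None
    # Plain codec: treat a populated private `_model` slot as "loaded".
    return getattr(component, "_model", None) is not None


def component_status(name: str, component: Any, busy: bool = False) -> Dict[str, Any]:
    """Build one panel entry without triggering a model load."""
    loaded = is_loaded(component)
    if not loaded:
        status = "idle (not loaded)"
    else:
        status = "running" if busy else "idle"
    return {"name": name, "status": status, "loaded": loaded}
```

Probing through `type(component)` rather than the instance is the design choice that matters here: it avoids any attribute lookup that a lazy proxy might interpret as first use.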

Expected impact: idle serving RSS drops by the resident encoder stack
per process (MiniLM + projection; DeBERTa pair when configured), and
multi-process deployments (tray + bench + workers) stop multiplying it;
behavior after first use is identical by construction.

Tests: tests/test_lazy_encoders.py — 14 tests, monkeypatched counting
constructors, no models: init constructs nothing; first use constructs
exactly once; 8-thread concurrent first use constructs once;
lazy_encoders=false restores eager init; /admin/components reports
unloaded state without triggering loads. Full non-live sweep green
except 4 pre-existing environmental failures reproduced on pristine
HEAD (test_build_fixture_matrix_parallel x2, batched_in x2).
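
For readers unfamiliar with the counting-constructor approach, here is a minimal sketch of the 8-thread first-use check. It is not the contents of tests/test_lazy_encoders.py; `CountingEncoder`, `LazyOnce`, and `race_first_use` are hypothetical names, and `LazyOnce` is a reduced stand-in for the real proxies.

```python
import threading
import time
from typing import Any, Callable, List, Tuple


class CountingEncoder:
    """Fake encoder that records how many times it was constructed."""

    built = 0
    _count_lock = threading.Lock()

    def __init__(self) -> None:
        time.sleep(0.01)  # widen the race window so a missing lock would show
        with CountingEncoder._count_lock:
            CountingEncoder.built += 1


class LazyOnce:
    """Reduced lazy holder using double-checked locking."""

    def __init__(self, factory: Callable[[], Any]) -> None:
        self._factory = factory
        self._lock = threading.Lock()
        self._obj = None

    def get(self) -> Any:
        if self._obj is None:            # first check, no lock
            with self._lock:
                if self._obj is None:    # second check, under the lock
                    self._obj = self._factory()
        return self._obj


def race_first_use(n_threads: int = 8) -> Tuple[int, List[Any]]:
    """Release n threads at once onto the first use; return (builds, results)."""
    CountingEncoder.built = 0
    holder = LazyOnce(CountingEncoder)
    barrier = threading.Barrier(n_threads)
    seen: List[Any] = []

    def worker() -> None:
        barrier.wait()                   # all threads hit get() together
        seen.append(holder.get())

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return CountingEncoder.built, seen
```

A test built on this would assert that the build count is exactly one and that every thread received the same object.
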

@mbachaud merged commit 3aa6c5c into master on Jun 12, 2026
3 checks passed
@mbachaud deleted the perf/lazy-encoder-loading branch on June 12, 2026 22:18