fix(onnx): reduce retained CPU memory #212
Conversation
Thanks for finding the issue and proposing a fix.
Overall: right shape, right flags, right call sites. The unload-lock refactor is a nice side improvement (gc/trim run outside `_kompress_lock`, so heavy cleanup can't stall another thread trying to load a model). A few things are worth addressing before merge:
Bugs / correctness
- **`trim_process_heap()` return value mismatches the docstring.** `malloc_trim(0)` returns `1` iff memory was actually released back to the OS, `0` if nothing could be released. The docstring says the function returns whether trim was available — but `bool(libc.malloc_trim(0))` returns `False` when trim was called successfully but found nothing to release. Either:
  - update the docstring to "returns True iff memory was released", or
  - separate "called successfully" (return `True`) from the int result (log at debug).
- **Missing `argtypes`/`restype` on `malloc_trim`.** On glibc/x86_64 the default `int` return happens to work, but best practice is explicit:

  ```python
  libc.malloc_trim.argtypes = [ctypes.c_size_t]
  libc.malloc_trim.restype = ctypes.c_int
  ```

  Cheap insurance against future portability surprises.
- **`except Exception` is too broad.** Catching `OSError, AttributeError` is enough — a generic `Exception` here would swallow things like `KeyboardInterrupt` chaining or accidental typos during refactors.
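Folding the three points above together, one possible shape for the helper (a sketch, not the PR's actual code; it uses the "returns True iff memory was released" docstring option, and the function/module names come from the PR discussion):

```python
import ctypes
import ctypes.util
import logging
import sys

logger = logging.getLogger(__name__)


def trim_process_heap() -> bool:
    """Ask glibc to hand free heap pages back to the OS.

    Returns True iff malloc_trim reported that memory was released; False
    on non-Linux platforms, when libc/malloc_trim is unavailable, or when
    there was nothing to release.
    """
    if sys.platform != "linux":
        return False
    try:
        libc = ctypes.CDLL(ctypes.util.find_library("c") or "libc.so.6")
        # Explicit prototypes: cheap insurance against ABI surprises.
        libc.malloc_trim.argtypes = [ctypes.c_size_t]
        libc.malloc_trim.restype = ctypes.c_int
        released = libc.malloc_trim(0)  # 1 iff memory was returned to the OS
        logger.debug("malloc_trim(0) returned %d", released)
        return bool(released)
    except (OSError, AttributeError):
        # OSError: libc failed to load; AttributeError: no malloc_trim
        # symbol (e.g. musl). Deliberately not a blanket `except Exception`.
        return False
```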
Coverage gaps
- No test that `unload_kompress_model()` actually calls `gc.collect()` + `trim_process_heap()`. This is exactly the observation from the issue (303 → 122 MiB after unload). Without a test, someone could silently drop the trim call and nobody would notice until a customer's proxy RSS drifts again. A `patch("headroom.onnx_runtime.trim_process_heap")` + one assert would pin it.
- No test for `trim_process_heap()` itself — platform gate (non-Linux no-op), libc-missing path, success path with a mocked libc. One of those would have caught issue #1 above.
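As a sketch of the first suggested test, using a stand-in object instead of the real `headroom` module (names mirror the ones discussed in this review; in the real suite the patch target would be `"headroom.onnx_runtime.trim_process_heap"`):

```python
import gc
import unittest.mock as mock


class FakeKompressRuntime:
    """Stand-in for the PR's runtime module (hypothetical)."""

    def __init__(self):
        self._session = object()  # pretend ORT session

    def trim_process_heap(self):
        return True

    def unload_kompress_model(self):
        self._session = None
        gc.collect()
        self.trim_process_heap()


def test_unload_calls_trim():
    rt = FakeKompressRuntime()
    # Real suite: mock.patch("headroom.onnx_runtime.trim_process_heap")
    with mock.patch.object(rt, "trim_process_heap") as trim:
        rt.unload_kompress_model()
    trim.assert_called_once()  # the trim call can no longer be silently dropped
    assert rt._session is None
```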
Consistency
- **Thread caps are inconsistent across call sites.** `embedders.py` passes `intra_op_num_threads=1, inter_op_num_threads=1` (keeping the "avoid pthread_setaffinity_np in Docker" behavior). `kompress_compressor.py` and `onnx_router.py` don't — they'll inherit ORT's default, which scales with detected cores and has historically bitten us on pinned-CPU containers. Either thread the cap through everywhere, or document explicitly why Kompress/image router are safe without it.
- **`OnnxLocalEmbedder.close()` doesn't trim.** Kompress unload now trims; the embedder close sets `_session = None` and returns. The issue explicitly calls out "other ONNX session creators likely deserve the same review" — dropping a `gc.collect(); trim_process_heap()` into `close()` would make the story symmetric.
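For the `close()` symmetry point, the change is small. A trimmed-down sketch (field names and the injected `trim` callable are assumptions, not the PR's actual shape):

```python
import gc


class OnnxLocalEmbedder:
    """Minimal sketch; only the cleanup path is shown."""

    def __init__(self, session, trim=lambda: None):
        self._session = session
        self._trim = trim  # e.g. trim_process_heap from the shared helper

    def close(self) -> None:
        self._session = None  # drop the ORT session reference
        gc.collect()          # collect cycles that may pin session buffers
        self._trim()          # hand freed pages back to the OS when possible
```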
Nits (non-blocking)
- `create_cpu_session_options(ort, ...)` — passing `ort` as a parameter works and makes testing with a fake easy, but forces every caller to do `import onnxruntime as ort` first even though the helper could import it internally. Minor API ergonomics.
- Module name `headroom/onnx_runtime.py` reads very close to the third-party `onnxruntime` package. `onnx_utils.py` or `onnx_runtime_helpers.py` would avoid the double-take on import lines.
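On the ergonomics nit: keeping `ort` as a parameter does have the testability upside. A sketch of the helper with the thread cap threaded through (the `cap_threads` flag is my invention; the `SessionOptions` attributes are real ONNX Runtime ones):

```python
from types import SimpleNamespace


def create_cpu_session_options(ort, cap_threads=True):
    """Build SessionOptions biased toward low retained memory.

    `ort` is the onnxruntime module (or a test double).
    """
    so = ort.SessionOptions()
    so.enable_cpu_mem_arena = False  # don't grow a retained CPU arena
    so.enable_mem_pattern = False    # don't cache allocation patterns
    if cap_threads:
        # Avoid pthread_setaffinity_np trouble on pinned-CPU containers.
        so.intra_op_num_threads = 1
        so.inter_op_num_threads = 1
    return so


# Testing with a fake module object, no onnxruntime import needed:
fake_ort = SimpleNamespace(SessionOptions=lambda: SimpleNamespace())
opts = create_cpu_session_options(fake_ort)
```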
None of the above blocks landing the fix.
Thank you :)
Codecov Report: ❌ Patch coverage is …
Thank you for chasing this down — memory retention in long-lived proxy processes is the kind of bug that's easy to ignore and a real pain on small VMs. Really appreciate the contribution. I pulled the branch locally and measured on macOS + onnxruntime 1.24.4 using the cached kompress-int8 model (200 variable-length inferences, each arm in a fresh subprocess so peak-RSS can't bleed across):
Latency across 300 mixed-shape inferences: mean 396 → 400ms (~1%), p50 386 → 402ms (~4%), p95 flat. The "warm runtime stayed effectively unchanged" claim holds on this workload. Two small things worth mentioning in the description so operators aren't surprised:
Thinking beyond this PR — the arena is only ~20% of the resident set. A rough decomposition of the 456 MB:
A few directions for follow-up PRs, from lowest effort / highest leverage:
Maybe a follow-up "low-memory deployment profile" PR that packages 1–4 behind a single …

Thanks again — this is solid work and the approach of making it a shared helper makes follow-ups much easier.
Hi, thanks for your comments. I reran the Docker repro and posted the detailed numbers on issue #211. Short version: on my Linux/Docker run, this PR materially improves retained RSS after ONNX-heavy Kompress activity, and repeated load/infer/unload cycles stayed bounded. I also fixed the failing Python 3.12 CI job on this branch:
Your review notes make sense as next iteration items:
So at this point:
Fixes #211
Summary
This PR addresses retained CPU memory in Headroom’s ONNX-backed paths for long-running proxy processes.
During investigation, the proxy’s tracked memory stayed low, but process RSS / anonymous memory remained much higher after ONNX work completed. The strongest local culprit was ONNX Runtime session memory
retention, especially around the Kompress path, with the same pattern also relevant to other ONNX session users.
The goal of this change is to make ONNX-backed inference behave more predictably in a long-lived Headroom process, especially on smaller VMs, without changing the proxy’s concurrency model or request
routing behavior.
What changed
- `SessionOptions` for CPU sessions now set:
  - `enable_cpu_mem_arena = False`
  - `enable_mem_pattern = False`
- Applied at the ONNX call sites:
  - `headroom/transforms/kompress_compressor.py`
  - `headroom/memory/adapters/embedders.py`
  - `headroom/image/onnx_router.py`
- Extended `unload_kompress_model()` to do a more complete cleanup:
  - `gc.collect()`
  - `malloc_trim(0)` when available

Why
Headroom is a long-lived proxy process. In that kind of runtime, retaining large anonymous RSS after ONNX-heavy work is more harmful than in a short-lived batch process.
This change biases ONNX Runtime setup toward lower retained memory and better long-run stability rather than maximum reuse of CPU-side ONNX allocations.
Validation
I verified this locally with:
- `/readyz` reports healthy
- `/health` reports healthy

Local isolated measurements also showed materially lower retained RSS after ONNX-heavy work, while warm runtime stayed effectively unchanged.
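For anyone wanting to spot-check retained RSS the same way, here is a small helper along these lines (hypothetical, not part of the PR; reads Linux `/proc` only):

```python
import re


def rss_mib() -> float:
    """Current process resident set size in MiB.

    Parses VmRSS from Linux /proc/self/status; returns -1.0 where that is
    unavailable (e.g. macOS). Illustrative helper, not part of this PR.
    """
    try:
        with open("/proc/self/status") as f:
            m = re.search(r"VmRSS:\s+(\d+)\s+kB", f.read())
        return int(m.group(1)) / 1024.0 if m else -1.0
    except OSError:
        return -1.0
```

Call it before and after `unload_kompress_model()` to see how much anonymous memory the process actually hands back.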
Notes
This PR is intended as a bug fix for memory retention / stability, not as a new feature. It should improve behavior on memory-constrained machines without introducing a meaningful performance regression
in normal proxy operation.