
fix(onnx): reduce retained cpu memory#212

Merged
chopratejas merged 2 commits into chopratejas:main from Kayzo:fix/onnx-memory-retention
Apr 21, 2026

Conversation

Contributor

@Kayzo Kayzo commented Apr 20, 2026

Fixes #211

Summary

This PR addresses retained CPU memory in Headroom’s ONNX-backed paths for long-running proxy processes.

During investigation, the proxy’s tracked memory stayed low, but process RSS / anonymous memory remained much higher after ONNX work completed. The strongest local culprit was ONNX Runtime session memory
retention, especially around the Kompress path, with the same pattern also relevant to other ONNX session users.

The goal of this change is to make ONNX-backed inference behave more predictably in a long-lived Headroom process, especially on smaller VMs, without changing the proxy’s concurrency model or request
routing behavior.

What changed

  • added a shared ONNX Runtime helper for CPU-oriented SessionOptions
  • disabled ORT CPU memory retention features during session creation:
    • enable_cpu_mem_arena = False
    • enable_mem_pattern = False
  • applied those options to the ONNX-backed paths in:
    • headroom/transforms/kompress_compressor.py
    • headroom/memory/adapters/embedders.py
    • headroom/image/onnx_router.py
  • improved unload_kompress_model() to do a more complete cleanup:
    • gc.collect()
    • Linux heap trimming via malloc_trim(0) when available
  • added focused tests for the new ONNX session option helper
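
For reference, the shared helper described above could look roughly like this. This is a minimal sketch, not the exact code in `headroom/onnx_runtime.py`; the real helper may take extra keyword arguments (the review below notes it receives the `onnxruntime` module as a parameter, which this sketch follows):

```python
def create_cpu_session_options(ort):
    """Build SessionOptions biased toward low retained CPU memory.

    Sketch of the shared helper; the real version in
    headroom/onnx_runtime.py may accept additional options.
    """
    opts = ort.SessionOptions()
    # Don't pool freed CPU allocations in ORT's arena for the
    # lifetime of the session; hand them back to the allocator.
    opts.enable_cpu_mem_arena = False
    # Don't cache per-shape memory-allocation patterns across runs.
    opts.enable_mem_pattern = False
    return opts
```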

Why

Headroom is a long-lived proxy process. In that kind of runtime, retaining large anonymous RSS after ONNX-heavy work is more harmful than in a short-lived batch process.

This change biases ONNX Runtime setup toward lower retained memory and better long-run stability rather than maximum reuse of CPU-side ONNX allocations.

Validation

I verified this locally with:

  • focused tests for the ONNX helper
  • related compression/proxy tests
  • local rebuild/restart of the Headroom container from this branch
  • health checks after restart:
    • /readyz healthy
    • /health healthy
  • startup logs showed no tracebacks/errors
  • confirmed the running image contains the new ONNX helper and updated unload path

Local isolated measurements also showed materially lower retained RSS after ONNX-heavy work, while warm runtime stayed effectively unchanged.

Notes

This PR is intended as a bug fix for memory retention / stability, not as a new feature. It should improve behavior on memory-constrained machines without introducing a meaningful performance regression
in normal proxy operation.

Owner

@chopratejas chopratejas left a comment

Thanks for finding the issue and proposing a fix.

Overall: right shape, right flags, right call sites. The unload-lock refactor is a nice side-improvement (gc/trim run outside _kompress_lock, so heavy cleanup can't stall another thread trying to load a model). Few things worth addressing before merge:

Bugs / correctness

  1. trim_process_heap() return value mismatches docstring. malloc_trim(0) returns 1 iff memory was actually released back to the OS, 0 if nothing could be released. The docstring says the function returns whether trim was available — but bool(libc.malloc_trim(0)) returns False when trim was called successfully but found nothing to release. Either:

    • update the docstring to "returns True iff memory was released", or
    • separate "called successfully" (return True) from the int result (log at debug).
  2. Missing argtypes / restype on malloc_trim. On glibc/x86_64 the default int return happens to work, but best practice is explicit:

    libc.malloc_trim.argtypes = [ctypes.c_size_t]
    libc.malloc_trim.restype = ctypes.c_int

    Cheap insurance against future portability surprises.

  3. except Exception is too broad. Catching OSError, AttributeError is enough — a generic Exception here would silently swallow unrelated failures, e.g. a NameError introduced by an accidental typo during a refactor.
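
Putting the three points together, the helper could end up looking something like this — a sketch only, with the exact docstring wording and logging left to the author:

```python
import ctypes
import sys


def trim_process_heap() -> bool:
    """Ask glibc to return free heap pages to the OS.

    Returns True iff malloc_trim reported memory was actually
    released (point 1); always False off Linux or without glibc.
    """
    if not sys.platform.startswith("linux"):
        return False
    try:
        libc = ctypes.CDLL("libc.so.6")
        # Explicit signature (point 2): int malloc_trim(size_t pad)
        libc.malloc_trim.argtypes = [ctypes.c_size_t]
        libc.malloc_trim.restype = ctypes.c_int
        return bool(libc.malloc_trim(0))
    except (OSError, AttributeError):  # narrow catch (point 3)
        return False
```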

Coverage gaps

  1. No test that unload_kompress_model() actually calls gc.collect() + trim_process_heap(). This is exactly the observation from the issue (303 → 122 MiB after unload). Without a test, someone could silently drop the trim call and nobody would notice until a customer's proxy RSS drifts again. A patch("headroom.onnx_runtime.trim_process_heap") + one assert would pin it.

  2. No test for trim_process_heap() itself — platform gate (non-Linux no-op), libc-missing path, success path with a mocked libc. One of those would have caught issue #1 above.
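
The pinning test in point 1 is only a few lines. Here is a self-contained sketch of the pattern, using a stub module in place of headroom.onnx_runtime (all names here are stand-ins; the real test would patch "headroom.onnx_runtime.trim_process_heap"):

```python
import gc
import sys
import types
from unittest.mock import patch

# Stub standing in for headroom/onnx_runtime.py (hypothetical names).
stub = types.ModuleType("headroom_onnx_stub")
stub.trim_process_heap = lambda: True
sys.modules["headroom_onnx_stub"] = stub


def unload_kompress_model():
    # ... drop session references under _kompress_lock, then:
    gc.collect()
    stub.trim_process_heap()


def test_unload_trims_heap():
    # Real test would patch("headroom.onnx_runtime.trim_process_heap").
    with patch("headroom_onnx_stub.trim_process_heap") as mock_trim:
        unload_kompress_model()
        mock_trim.assert_called_once()
```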

Consistency

  1. Thread caps are inconsistent across call sites. embedders.py passes intra_op_num_threads=1, inter_op_num_threads=1 (keeping the "avoid pthread_setaffinity_np in Docker" behavior). kompress_compressor.py and onnx_router.py don't — they'll inherit ORT's default, which scales with detected cores and has historically bitten us on pinned-CPU containers. Either thread the cap through everywhere, or document explicitly why Kompress/image router are safe without it.

  2. OnnxLocalEmbedder.close() doesn't trim. Kompress unload now trims; the embedder close sets _session = None and returns. The issue explicitly calls out "other ONNX session creators likely deserve the same review" — dropping a gc.collect(); trim_process_heap() into close() would make the story symmetric.
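
For point 2, the symmetric close() is small. A sketch, with trim_process_heap stubbed so the snippet stands alone (the class shape is assumed; only close() is the point):

```python
import gc


def trim_process_heap() -> bool:
    # Stand-in for headroom.onnx_runtime's helper.
    return False


class OnnxLocalEmbedder:  # assumed shape for illustration
    def __init__(self, session=None):
        self._session = session

    def close(self) -> None:
        # Drop the session, then reclaim freed pages,
        # mirroring the Kompress unload path.
        self._session = None
        gc.collect()
        trim_process_heap()
```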

Nits (non-blocking)

  • create_cpu_session_options(ort, ...) — passing ort as a parameter works and makes testing with a fake easy, but forces every caller to do import onnxruntime as ort first even though the helper could import it internally. Minor API ergonomics.
  • Module name headroom/onnx_runtime.py reads very close to the third-party onnxruntime package. onnx_utils.py or onnx_runtime_helpers.py would avoid the double-take on import lines.

None of the above blocks landing the fix.

Thank you :)

@codecov

codecov Bot commented Apr 20, 2026

Codecov Report

❌ Patch coverage is 47.61905% with 22 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| headroom/onnx_runtime.py | 61.53% | 10 Missing ⚠️ |
| headroom/transforms/kompress_compressor.py | 16.66% | 10 Missing ⚠️ |
| headroom/image/onnx_router.py | 50.00% | 1 Missing ⚠️ |
| headroom/memory/adapters/embedders.py | 50.00% | 1 Missing ⚠️ |


@chopratejas
Owner

Thank you for chasing this down — memory retention in long-lived proxy processes is the kind of bug that's easy to ignore and a real pain on small VMs. Really appreciate the contribution.

I pulled the branch locally and measured on macOS + onnxruntime 1.24.4 using the cached kompress-int8 model (200 variable-length inferences, each arm in a fresh subprocess so peak-RSS can't bleed across):

|  | arena=ON (main) | arena=OFF (this PR) | Δ |
| --- | --- | --- | --- |
| loaded | 391 MB | 456 MB | +64 MB |
| peak during infer | 438 MB | 557 MB | +119 MB |
| after session drop | 438 MB | 327 MB | −110 MB |
| retained vs baseline | 390 MB | 276 MB | −113 MB / −29% |

Latency across 300 mixed-shape inferences: mean 396 → 400ms (~1%), p50 386 → 402ms (~4%), p95 flat. The "warm runtime stayed effectively unchanged" claim holds on this workload.

Two small things worth mentioning in the description so operators aren't surprised:

  1. Peak RSS during inference goes up ~27% (438 → 557 MB here). Retention drops, but there's a real trade on memory-pressured boxes. Worth calling out explicitly.
  2. In kompress_compressor._load_kompress_onnx, the old call was ort.InferenceSession(onnx_path) (ORT picks the best available provider). The new call pins providers=["CPUExecutionProvider"], which silently changes behavior for anyone who had onnxruntime-gpu installed. Probably the intended behavior, but worth making it explicit.

Thinking beyond this PR — the arena is only ~20% of the resident set. A rough decomposition of the 456 MB:

  • Python + imports: ~50 MB
  • ORT C++ binary: ~80–120 MB
  • Kompress INT8 weights: ~200–260 MB ← the real elephant
  • Graph optimizer temporaries: ~40–80 MB
  • Tokenizer + vocab: ~30–60 MB
  • CPU arena (this PR): ~100 MB retained

A few directions for follow-up PRs, from lowest effort / highest leverage:

  1. MALLOC_ARENA_MAX=2 + jemalloc LD_PRELOAD in the Dockerfile / systemd unit. Stacks with this PR; typically another 15–35% RSS reduction for long-lived Python processes. Pure deployment change, no code.
  2. ORT session.use_device_allocator_for_initializers=1 + optional ORT_ENABLE_BASIC graph opt via HEADROOM_ONNX_LOW_MEM=1. Small, free-ish wins you could add to the new helper.
  3. Lazy-load Kompress on first eligible request instead of at proxy startup. For proxies where most traffic doesn't trigger ML compression, this is ~400 MB saved per process until actually needed.
  4. Idle unload + RSS-based backpressure, leaning on the trim_process_heap() helper you just added. Gives ops an actual memory SLO.
  5. Distill a smaller Kompress (6-layer MiniLM-style, ~25 MB INT8). Biggest hammer — ~250 MB off resident RAM — but a real project. Worth scoping if "Headroom on a small VM" becomes a first-class goal.

Maybe a follow-up "low-memory deployment profile" PR that packages 1–4 behind a single HEADROOM_LOW_MEMORY=1 env + a short deployment doc, reusing the helpers from this one?

Thanks again — this is solid work and the approach of making it a shared helper makes follow-ups much easier.

@Kayzo
Contributor Author

Kayzo commented Apr 21, 2026

Hi,

Thanks for your comments. I reran the Docker repro and posted the detailed numbers on issue #211.

Short version: on my Linux/Docker run, this PR materially improves retained RSS after ONNX-heavy Kompress activity, and repeated load/infer/unload cycles stayed bounded.

I also fixed the failing Python 3.12 CI job on this branch:

  • the failure was ruff format --check .
  • I pushed a formatter-only follow-up: 1961ceb
  • verified locally:
    • ruff check .
    • ruff format --check .
    • targeted proxy tests ✅

Your review notes make sense as next iteration items:

  • clarify trim_process_heap() return semantics
  • add explicit malloc_trim argtypes/restype
  • narrow exception handling
  • add trim/unload coverage
  • keep thread caps consistent across ONNX call sites
  • make OnnxLocalEmbedder.close() symmetrical

So at this point:

  • the branch is green for the formatter issue
  • the Docker repro supports the memory-retention fix
  • your review comments still look like the right follow-up improvements

@chopratejas chopratejas merged commit 80920ed into chopratejas:main Apr 21, 2026
16 of 17 checks passed


Development

Successfully merging this pull request may close these issues.

[BUG] ONNX Runtime paths retain large anonymous RSS in long-running Headroom proxy processes

2 participants