
fix(onnx): reduce retained cpu memory#212

Merged
chopratejas merged 2 commits into chopratejas:main from Kayzo:fix/onnx-memory-retention
Apr 21, 2026

Conversation

Contributor

@Kayzo Kayzo commented Apr 20, 2026

Fixes #211

Summary

This PR addresses retained CPU memory in Headroom’s ONNX-backed paths for long-running proxy processes.

During investigation, the proxy’s tracked memory stayed low, but process RSS / anonymous memory remained much higher after ONNX work completed. The strongest local culprit was ONNX Runtime session memory
retention, especially around the Kompress path, with the same pattern also relevant to other ONNX session users.

The goal of this change is to make ONNX-backed inference behave more predictably in a long-lived Headroom process, especially on smaller VMs, without changing the proxy’s concurrency model or request
routing behavior.

What changed

  • added a shared ONNX Runtime helper for CPU-oriented SessionOptions
  • disabled ORT CPU memory retention features during session creation:
    • enable_cpu_mem_arena = False
    • enable_mem_pattern = False
  • applied those options to the ONNX-backed paths in:
    • headroom/transforms/kompress_compressor.py
    • headroom/memory/adapters/embedders.py
    • headroom/image/onnx_router.py
  • improved unload_kompress_model() to do a more complete cleanup:
    • gc.collect()
    • Linux heap trimming via malloc_trim(0) when available
  • added focused tests for the new ONNX session option helper
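
For reference, the shared helper described above could look roughly like this. This is a minimal sketch, not the exact code in `headroom/onnx_runtime.py`; the real helper may take extra keyword arguments (the review below notes it receives the `onnxruntime` module as a parameter, which this sketch follows):

```python
def create_cpu_session_options(ort):
    """Build SessionOptions biased toward low retained CPU memory.

    Sketch of the shared helper; the real version in
    headroom/onnx_runtime.py may accept additional options.
    """
    opts = ort.SessionOptions()
    # Don't pool freed CPU allocations in ORT's arena for the
    # lifetime of the session; hand them back to the allocator.
    opts.enable_cpu_mem_arena = False
    # Don't cache per-shape memory-allocation patterns across runs.
    opts.enable_mem_pattern = False
    return opts
```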

Why

Headroom is a long-lived proxy process. In that kind of runtime, retaining large anonymous RSS after ONNX-heavy work is more harmful than in a short-lived batch process.

This change biases ONNX Runtime setup toward lower retained memory and better long-run stability rather than maximum reuse of CPU-side ONNX allocations.

Validation

I verified this locally with:

  • focused tests for the ONNX helper
  • related compression/proxy tests
  • local rebuild/restart of the Headroom container from this branch
  • health checks after restart:
    • /readyz healthy
    • /health healthy
  • startup logs showed no tracebacks/errors
  • confirmed the running image contains the new ONNX helper and updated unload path

Local isolated measurements also showed materially lower retained RSS after ONNX-heavy work, while warm runtime stayed effectively unchanged.

Notes

This PR is intended as a bug fix for memory retention / stability, not as a new feature. It should improve behavior on memory-constrained machines without introducing a meaningful performance regression
in normal proxy operation.

Owner

@chopratejas chopratejas left a comment

Thanks for finding the issue and proposing a fix.

Overall: right shape, right flags, right call sites. The unload-lock refactor is a nice side-improvement (gc/trim run outside _kompress_lock, so heavy cleanup can't stall another thread trying to load a model). Few things worth addressing before merge:

Bugs / correctness

  1. trim_process_heap() return value mismatches docstring. malloc_trim(0) returns 1 iff memory was actually released back to the OS, 0 if nothing could be released. The docstring says the function returns whether trim was available — but bool(libc.malloc_trim(0)) returns False when trim was called successfully but found nothing to release. Either:

    • update the docstring to "returns True iff memory was released", or
    • separate "called successfully" (return True) from the int result (log at debug).
  2. Missing argtypes / restype on malloc_trim. On glibc/x86_64 the default int return happens to work, but best practice is explicit:

    libc.malloc_trim.argtypes = [ctypes.c_size_t]
    libc.malloc_trim.restype = ctypes.c_int

    Cheap insurance against future portability surprises.

  3. except Exception is too broad. Catching OSError, AttributeError is enough — a generic Exception here would silently swallow unrelated failures, e.g. a NameError introduced by an accidental typo during a refactor.
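
Putting the three points together, the helper could end up looking something like this — a sketch only, with the exact docstring wording and logging left to the author:

```python
import ctypes
import sys


def trim_process_heap() -> bool:
    """Ask glibc to return free heap pages to the OS.

    Returns True iff malloc_trim reported memory was actually
    released (point 1); always False off Linux or without glibc.
    """
    if not sys.platform.startswith("linux"):
        return False
    try:
        libc = ctypes.CDLL("libc.so.6")
        # Explicit signature (point 2): int malloc_trim(size_t pad)
        libc.malloc_trim.argtypes = [ctypes.c_size_t]
        libc.malloc_trim.restype = ctypes.c_int
        return bool(libc.malloc_trim(0))
    except (OSError, AttributeError):  # narrow catch (point 3)
        return False
```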

Coverage gaps

  1. No test that unload_kompress_model() actually calls gc.collect() + trim_process_heap(). This is exactly the observation from the issue (303 → 122 MiB after unload). Without a test, someone could silently drop the trim call and nobody would notice until a customer's proxy RSS drifts again. A patch("headroom.onnx_runtime.trim_process_heap") + one assert would pin it.

  2. No test for trim_process_heap() itself — platform gate (non-Linux no-op), libc-missing path, success path with a mocked libc. One of those would have caught issue #1 above.
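
The pinning test in point 1 is only a few lines. Here is a self-contained sketch of the pattern, using a stub module in place of headroom.onnx_runtime (all names here are stand-ins; the real test would patch "headroom.onnx_runtime.trim_process_heap"):

```python
import gc
import sys
import types
from unittest.mock import patch

# Stub standing in for headroom/onnx_runtime.py (hypothetical names).
stub = types.ModuleType("headroom_onnx_stub")
stub.trim_process_heap = lambda: True
sys.modules["headroom_onnx_stub"] = stub


def unload_kompress_model():
    # ... drop session references under _kompress_lock, then:
    gc.collect()
    stub.trim_process_heap()


def test_unload_trims_heap():
    # Real test would patch("headroom.onnx_runtime.trim_process_heap").
    with patch("headroom_onnx_stub.trim_process_heap") as mock_trim:
        unload_kompress_model()
        mock_trim.assert_called_once()
```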

Consistency

  1. Thread caps are inconsistent across call sites. embedders.py passes intra_op_num_threads=1, inter_op_num_threads=1 (keeping the "avoid pthread_setaffinity_np in Docker" behavior). kompress_compressor.py and onnx_router.py don't — they'll inherit ORT's default, which scales with detected cores and has historically bitten us on pinned-CPU containers. Either thread the cap through everywhere, or document explicitly why Kompress/image router are safe without it.

  2. OnnxLocalEmbedder.close() doesn't trim. Kompress unload now trims; the embedder close sets _session = None and returns. The issue explicitly calls out "other ONNX session creators likely deserve the same review" — dropping a gc.collect(); trim_process_heap() into close() would make the story symmetric.
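
For point 2, the symmetric close() is small. A sketch, with trim_process_heap stubbed so the snippet stands alone (the class shape is assumed; only close() is the point):

```python
import gc


def trim_process_heap() -> bool:
    # Stand-in for headroom.onnx_runtime's helper.
    return False


class OnnxLocalEmbedder:  # assumed shape for illustration
    def __init__(self, session=None):
        self._session = session

    def close(self) -> None:
        # Drop the session, then reclaim freed pages,
        # mirroring the Kompress unload path.
        self._session = None
        gc.collect()
        trim_process_heap()
```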

Nits (non-blocking)

  • create_cpu_session_options(ort, ...) — passing ort as a parameter works and makes testing with a fake easy, but forces every caller to do import onnxruntime as ort first even though the helper could import it internally. Minor API ergonomics.
  • Module name headroom/onnx_runtime.py reads very close to the third-party onnxruntime package. onnx_utils.py or onnx_runtime_helpers.py would avoid the double-take on import lines.

None of the above blocks landing the fix.

Thank you :)

@codecov

codecov Bot commented Apr 20, 2026

Codecov Report

❌ Patch coverage is 47.61905% with 22 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| headroom/onnx_runtime.py | 61.53% | 10 Missing ⚠️ |
| headroom/transforms/kompress_compressor.py | 16.66% | 10 Missing ⚠️ |
| headroom/image/onnx_router.py | 50.00% | 1 Missing ⚠️ |
| headroom/memory/adapters/embedders.py | 50.00% | 1 Missing ⚠️ |


@chopratejas
Owner

Thank you for chasing this down — memory retention in long-lived proxy processes is the kind of bug that's easy to ignore and a real pain on small VMs. Really appreciate the contribution.

I pulled the branch locally and measured on macOS + onnxruntime 1.24.4 using the cached kompress-int8 model (200 variable-length inferences, each arm in a fresh subprocess so peak-RSS can't bleed across):

|  | arena=ON (main) | arena=OFF (this PR) | Δ |
| --- | --- | --- | --- |
| loaded | 391 MB | 456 MB | +64 MB |
| peak during infer | 438 MB | 557 MB | +119 MB |
| after session drop | 438 MB | 327 MB | −110 MB |
| retained vs baseline | 390 MB | 276 MB | −113 MB / −29% |

Latency across 300 mixed-shape inferences: mean 396 → 400ms (~1%), p50 386 → 402ms (~4%), p95 flat. The "warm runtime stayed effectively unchanged" claim holds on this workload.

Two small things worth mentioning in the description so operators aren't surprised:

  1. Peak RSS during inference goes up ~27% (438 → 557 MB here). Retention drops, but there's a real trade on memory-pressured boxes. Worth calling out explicitly.
  2. In kompress_compressor._load_kompress_onnx, the old call was ort.InferenceSession(onnx_path) (ORT picks the best available provider). The new call pins providers=["CPUExecutionProvider"], which silently changes behavior for anyone who had onnxruntime-gpu installed. Probably the intended behavior, but worth making it explicit.

Thinking beyond this PR — the arena is only ~20% of the resident set. A rough decomposition of the 456 MB:

  • Python + imports: ~50 MB
  • ORT C++ binary: ~80–120 MB
  • Kompress INT8 weights: ~200–260 MB ← the real elephant
  • Graph optimizer temporaries: ~40–80 MB
  • Tokenizer + vocab: ~30–60 MB
  • CPU arena (this PR): ~100 MB retained

A few directions for follow-up PRs, from lowest effort / highest leverage:

  1. MALLOC_ARENA_MAX=2 + jemalloc LD_PRELOAD in the Dockerfile / systemd unit. Stacks with this PR; typically another 15–35% RSS reduction for long-lived Python processes. Pure deployment change, no code.
  2. ORT session.use_device_allocator_for_initializers=1 + optional ORT_ENABLE_BASIC graph opt via HEADROOM_ONNX_LOW_MEM=1. Small, free-ish wins you could add to the new helper.
  3. Lazy-load Kompress on first eligible request instead of at proxy startup. For proxies where most traffic doesn't trigger ML compression, this is ~400 MB saved per process until actually needed.
  4. Idle unload + RSS-based backpressure, leaning on the trim_process_heap() helper you just added. Gives ops an actual memory SLO.
  5. Distill a smaller Kompress (6-layer MiniLM-style, ~25 MB INT8). Biggest hammer — ~250 MB off resident RAM — but a real project. Worth scoping if "Headroom on a small VM" becomes a first-class goal.

Maybe a follow-up "low-memory deployment profile" PR that packages 1–4 behind a single HEADROOM_LOW_MEMORY=1 env + a short deployment doc, reusing the helpers from this one?

Thanks again — this is solid work and the approach of making it a shared helper makes follow-ups much easier.

@Kayzo
Contributor Author

Kayzo commented Apr 21, 2026

Hi,

Thanks for your comments. I reran the Docker repro and posted the detailed numbers on issue #211.

Short version: on my Linux/Docker run, this PR materially improves retained RSS after ONNX-heavy Kompress activity, and repeated load/infer/unload cycles stayed bounded.

I also fixed the failing Python 3.12 CI job on this branch:

  • the failure was ruff format --check .
  • I pushed a formatter-only follow-up: 1961ceb
  • verified locally:
    • ruff check .
    • ruff format --check .
    • targeted proxy tests ✅

Your review notes make sense as next iteration items:

  • clarify trim_process_heap() return semantics
  • add explicit malloc_trim argtypes/restype
  • narrow exception handling
  • add trim/unload coverage
  • keep thread caps consistent across ONNX call sites
  • make OnnxLocalEmbedder.close() symmetrical

So at this point:

  • the branch is green for the formatter issue
  • the Docker repro supports the memory-retention fix
  • your review comments still look like the right follow-up improvements

@chopratejas chopratejas merged commit 80920ed into chopratejas:main Apr 21, 2026
16 of 17 checks passed


Development

Successfully merging this pull request may close these issues.

[BUG] ONNX Runtime paths retain large anonymous RSS in long-running Headroom proxy processes

2 participants