
feat(kompress): HEADROOM_KOMPRESS_BACKEND env + GPU/MPS auto-detect#204

Open
SwiftWing21 wants to merge 5 commits intochopratejas:mainfrom
SwiftWing21:feat/kompress-backend-selection

Conversation


@SwiftWing21 SwiftWing21 commented Apr 19, 2026

Summary

Closes #202.

Adds HEADROOM_KOMPRESS_BACKEND env var (auto / onnx / pytorch) and
teaches auto mode to prefer the PyTorch backend when CUDA or Apple-Silicon
MPS is available. Previously _load_kompress always tried ONNX first
whenever onnxruntime was importable — which is always true for
headroom-ai[proxy]. This left GPU-equipped users on a CPU-only path.

Selection order in auto (default):

  1. If PyTorch is installed AND (torch.cuda.is_available() OR
    Apple-Silicon MPS), prefer PyTorch; fall back to ONNX on failure.
  2. Else, prefer ONNX; fall back to PyTorch on failure.
  3. Raise ImportError if neither is available.

Apple Silicon detection uses
platform.machine() == "arm64" and platform.system() == "Darwin", a
combination that has held across all M-series generations to date (M1 / M2
/ M3 / M4 / ...).
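The detection check in isolation (a stand-alone sketch; the helper name is illustrative):

```python
import platform


def is_apple_silicon() -> bool:
    """macOS on arm64 covers every M-series machine (M1, M2, M3, M4, ...).

    Intel Macs report machine() == "x86_64", so they return False here.
    """
    return platform.system() == "Darwin" and platform.machine() == "arm64"
```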

Invalid values (e.g. HEADROOM_KOMPRESS_BACKEND=tensorflow) log a warning
rather than silently falling back to auto, so misconfiguration is visible.

Behavior on existing deployments

  • Linux/Windows CPU-only: no change — auto falls through to ONNX
    exactly like before.
  • NVIDIA + PyTorch installed: now auto-selects CUDA via PyTorch.
  • Apple Silicon + PyTorch installed: now auto-selects MPS.
  • Anyone can revert to the old behavior by setting
    HEADROOM_KOMPRESS_BACKEND=onnx.

Test plan

  • Unit: HEADROOM_KOMPRESS_BACKEND=onnx forces ONNX
  • Unit: HEADROOM_KOMPRESS_BACKEND=pytorch forces PyTorch
  • Unit: auto + fake Apple-Silicon + MPS -> PyTorch
  • Unit: auto + fake CUDA -> PyTorch
  • Unit: auto + no accelerator -> ONNX (regression guard)
  • Unit: auto + PyTorch load error -> falls back to ONNX
  • Sanity: invalid env var logs warning and falls through
  • Full kompress_compressor.py test file: 27/27 pass
  • Full transforms/ test directory: 545 pass, 35 skip, 1 pre-existing fail unrelated to this PR (see below)
  • Manual: NVIDIA box — verify real CUDA path works end-to-end
  • Manual: Apple Silicon — verify real MPS path works end-to-end and
    confirm the speedup from issue #202's field data (2206ms avg -> sub-1s
    expected)
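One way the fake-accelerator unit tests above can be structured: register a stand-in `torch` module in `sys.modules` that exposes only the attributes the backend probe reads. (This helper is illustrative, not the PR's actual test code; in a real pytest suite you would route the `sys.modules` edit through the `monkeypatch` fixture so the fake is removed after each test.)

```python
import sys
import types


def install_fake_torch(cuda: bool = False, mps: bool = False) -> None:
    """Register a stand-in 'torch' exposing only cuda/mps availability."""
    torch = types.ModuleType("torch")
    torch.cuda = types.SimpleNamespace(is_available=lambda: cuda)
    torch.backends = types.SimpleNamespace(
        mps=types.SimpleNamespace(is_available=lambda: mps)
    )
    sys.modules["torch"] = torch


# Shown bare here for brevity; wrap in monkeypatch.setitem in real tests.
install_fake_torch(cuda=True)
import torch

assert torch.cuda.is_available()  # auto mode would prefer PyTorch here
assert not torch.backends.mps.is_available()
```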

I don't have Apple Silicon or NVIDIA GPU hardware locally. Requesting a
maintainer or community reviewer to run the two manual checks before
merge. Code-level correctness is fully covered by unit tests; wall-clock
speedup is not.

Pre-existing CI state: tests/test_transforms/test_universal_json_crush.py::TestFullPipelineIntegration::test_number_array_via_compress
fails on current `main` (verified via git diff origin/main — this PR
does not touch universal_json_crush.py or its test file). The red CI
on this PR matches what main already shows; not introduced here.

Docs

  • New `### Kompress backend selection` subsection in `wiki/configuration.md`
    covering the env var and the backend comparison table.
  • `CHANGELOG.md` entry under `[Unreleased]` / `Added`.

Commit structure

Five focused commits on the branch (keeping them separate for easier
review + surgical revert if needed). Squash-on-merge is fine if that
matches house style.

🤖 Generated with Claude Code

Allow explicit onnx/pytorch backend selection via env var, and add
auto-detect so PyTorch (with CUDA or Apple-Silicon MPS) is preferred
when an accelerator is available. Falls back gracefully on load
failure. Refs chopratejas#202.

Review followup — the first override test had misleading fake-torch
setup for a code path that short-circuits before accelerator probing.
Rename and trim so the test matches what it actually exercises; the
auto-detect path is tested in follow-up commits per plan.

Also warn when HEADROOM_KOMPRESS_BACKEND is set to an unknown value
rather than silently falling back to auto. Refs chopratejas#202.

Four regression tests for _load_kompress auto mode:

- Apple Silicon + MPS available → picks PyTorch
- CUDA available on any OS → picks PyTorch
- No accelerator → picks ONNX (regression guard for legacy behavior)
- PyTorch load failure → falls back to ONNX

Refs chopratejas#202.
@SwiftWing21
Contributor Author

Let me know if I need to make any changes.

Kompressor is a core part of the Helix-Context stack, so having the auto-detect makes it "easier" on end users.
I'm expecting a friend will be able to test with DGX Sparks in a week or two to check off the NVIDIA devices.

@chopratejas
Owner

Hi,

Thanks for making the changes. Let me test it on my machine (I have a Mac) and see if I can observe the speedup.



Development

Successfully merging this pull request may close these issues.

[FEATURE] Kompressor Batching Support for Nvidia + Apple M Silicon
