feat(memory-map): container slot attribution via podman cgroup by thinmintdev · Pull Request #672 · Hal0ai/hal0

thinmintdev · 2026-06-08T13:03:05Z

Summary

Backend: _container_cgroup_mem_bytes(slot_name) in capacity.py probes the live podman/docker cgroup (memory.current) for hal0-slot-<name>. build_per_slot tries this first for every non-NPU resident slot; returns 0 for lemond slots (no container exists) so the lemond file-size + KV estimate path is unchanged.
UI: slotLegendSub() in memory-map.jsx shows image tag (truncated to last 32 chars) or profile slug for container slots instead of the uninformative device=rocm token. Both sidebar and expanded legend variants updated.
Tests: 12 unit tests for the cgroup probe (happy path, no runtime, container absent, timeout, cgroupv1, docker fallback) + 2 new Playwright specs in memory-map-v3.spec.ts.

Closes #660
Parent: #652

Test plan

pytest tests/slots/test_container_cgroup_mem.py -q — 12 passed
ruff check + format --check — clean
UI build (vite build) — no errors
playwright test memory-map-v3.spec.ts — 9 passed (includes 2 new container specs)

🤖 Generated with Claude Code

) Populate per-slot mem_mb for container slots from the podman/docker cgroup (memory.current) so the memory map shows real resident bytes instead of a file-size estimate. Backend (capacity.py): - Add `_container_cgroup_mem_bytes(slot_name)`: resolves podman/docker runtime via shutil.which, runs `<runtime> inspect -f {{.State.Pid}} hal0-slot-<name>`, walks to cgroupv2 memory.current. Returns 0 on any error (container absent = lemond slot, fallback is automatic). - `build_per_slot` path 2: try cgroup probe for every non-NPU slot; use result if >0 (no KV estimate — cgroup already includes all resident bytes). Path 3 (lemond file-size + KV estimate) unchanged. UI (memory-map.jsx): - Add `slotLegendSub(slot)`: container slots show image tag (truncated to last 32 chars) or profile slug in legend; lemond slots keep device token as before. - Expand slot model with `isContainer`, `image`, `profile` fields. - Both sidebar and expanded legend variants use slotLegendSub. Tests: - 12 unit tests for _container_cgroup_mem_bytes (happy path, no runtime, container absent, timeout, cgroupv1, unreadable, docker fallback) and build_per_slot container vs lemond vs NPU paths. - 2 Playwright specs: container legend shows image tag (HAL0_DATA injection via addInitScript); pool bar has ≥2 segments. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

thinmintdev

VERDICT: APPROVE-ON-GREEN — merge-ready once python matrix + γ-suite pass

ui already green; python (3.11/3.12) + γ-suite pending. Value-parity for lemond/NPU is clean and the structure is sound. One correctness caveat (lead, non-blocking) requires CT105 validation; the cost note is secondary.

SPEC (vs #660): ACs met structurally.

_container_cgroup_mem_bytes: podman→docker runtime detect → inspect PID → /proc//cgroup (v2 unified) → /sys/fs/cgroup//memory.current. Error/timeout(1.5s)/no-runtime/PID≤0/cgroupv1(no "::")/unreadable all → 0. 12 unit tests cover every branch incl docker fallback.
build_per_slot: probe runs only for RESIDENT, non-NPU slots — NPU/FLM branch sets its row and continues BEFORE the probe (test_npu_slot_skips_cgroup_probe asserts the probe is never called). Lemond slots: probe returns 0 → falls through to the ORIGINAL file-size + KV registry estimate. Lemond + NPU mem_mb values are byte-unchanged — verified.
UI: attributeFallback (GTT-split / NPU-divide) is UNTOUCHED and still gated on hasMemMetrics (used only when no slot reports mem_mb). slotLegendSub change is legend SUB-LABEL TEXT only (image tag/profile for container, device token for lemond) — does not touch attribution math. GTT pool ceiling logic unchanged.

CAVEATS:

(LEAD — correctness, non-blocking, needs CT105 validation) The "memory.current includes weights + KV" assumption is the riskiest claim and is unverified on this hardware. These are Strix Halo iGPU slots — model weights live in GTT (system RAM mapped via amdgpu/TTM). Whether GTT/DRM buffers are charged to the memory cgroup is kernel/driver-dependent and historically often NOT (the controller charges anon + page-cache, frequently not DRM buffers). --no-mmap (in every profile) makes the initial read anonymous+charged, but once llama.cpp copies tensors to HIP/GTT and frees the host copy, the resident ~22 GB may sit in GTT that memory.current does not count.
Why this is sharper than a cost note: the if cgroup_bytes > 0: branch has NO sanity floor and continues — ANY positive value wins, even 2 GB. So a partial-accounting cgroup (e.g. ~2 GB CPU-side runtime while the model is ~22 GB) REPLACES the correct ~22 GB registry estimate with a wrong ~2 GB, i.e. under-reports container slots below what the old path showed — only EXACTLY 0 falls back. That directly threatens AC1 ("show resident+KV memory"). All 12 tests mock the cgroup value, so neither they nor green CI can catch this.
VALIDATION (maintainer, post-merge on CT105): load the ace-saber container slot, compare /api/slots mem_mb against the ~22 GB GGUF size. If it reads single-digit GB, the probe is missing GTT and the >0 branch is degrading attribution. Cheap hardening: floor the cgroup value against the file-size estimate — mem = max(cgroup_mb, file_est_mb) — or treat an implausibly-small cgroup (well below model size) like 0 so the fallback fires. Fits the epics deferred-live-validation pattern (#666 tok/s, #661/#669 e2e), so APPROVE-ON-GREEN, but please run the CT105 check before relying on the container numbers.
(P2 — cost) Unconditional inspect subprocess per resident non-NPU slot per memory-map poll. The probe is not gated on any container signal, so an all-lemond install forks a podman/docker inspect for every resident GPU slot on every dashboard poll (returns fast when the container is absent; the 1.5s timeout only bites if the runtime hangs). I confirmed theres no cheap container signal at this layer — the Slot object exposes only state/model_id/metadata/backend, not runtime/profile — so the authors "probe everything" is a defensible tradeoff, not an oversight. If metadata carries profile, gating on meta.get("profile") would skip the fork for lemond slots; otherwise plumb a runtime flag into the Slot. P2.

TRIVIA / OUT OF SCOPE:

Expanded-legend variant now shows · ≈ for approx lemond slots (old code was bare s.device) — trivially additive, cosmetic.
memory-map.jsx imports isSlotLive from slot-status.js (#668-inherited parity item) — NOT touched by this PR; out of scope, remains a #668 follow-up.

MERGE-READY: yes, on green python + γ. Caveat #1 is a post-merge CT105 validation + likely a small follow-up floor, not a code blocker; #2 is P2. Closes the #652 container-runtime epic chain.

#672 review: on Strix Halo (UMA) model weights live in GTT (system RAM via amdgpu/TTM) and are often NOT charged to the process memory cgroup. A live container can report a cgroup of only ~2 GB (runtime/buffers) while holding a ~22 GB model. The previous `if cgroup_bytes > 0` used the cgroup unconditionally, REPLACING the correct registry estimate with a wrong-LOW value — under-reporting container memory. Fix: compute BOTH the cgroup value and the registry file-size + KV estimate for container slots, and use `max(cgroup_mb, estimate_mb)`. When the cgroup accounts for weights it wins (>= estimate); when GTT isn't charged the estimate wins → never under-reports. lemond/NPU paths unchanged. Tests: add under-report case (2 GB cgroup, 22 GB model -> estimate wins) and cgroup-wins case (24 GB cgroup > 22 GB model -> cgroup wins). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

thinmintdev commented Jun 8, 2026

View reviewed changes

thinmintdev mentioned this pull request Jun 8, 2026

feat: container image-pull progress SSE + UI pull bar #673

Merged

10 tasks

thinmintdev merged commit 0533394 into main Jun 8, 2026
4 checks passed

thinmintdev deleted the issue-660-memmap branch June 8, 2026 13:21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(memory-map): container slot attribution via podman cgroup#672

feat(memory-map): container slot attribution via podman cgroup#672
thinmintdev merged 2 commits into
mainfrom
issue-660-memmap

thinmintdev commented Jun 8, 2026

Uh oh!

thinmintdev left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

thinmintdev commented Jun 8, 2026

Summary

Test plan

Uh oh!

thinmintdev left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant