fix(providers): production-correct container slots under podman (GPU devices + ctx_size + extra_args) by thinmintdev · Pull Request #674 · Hal0ai/hal0

thinmintdev · 2026-06-09T01:03:50Z

Why

Live-cutover prep for the container-runtime epic (#652, cutover #662). Two latent bugs in the merged container path prevent real GPU slots from running under podman (the design's intended runtime) at the bench-tuned config. Both were found while standing up podman on CT105.

Fixes

1. Explicit GPU device nodes (podman can't recurse `/dev/dri`)

Podman fails with `no devices found in /dev/dri` where docker silently recurses the directory. CT105's /dev/dri holds an amdgpu node and renderD128 but no card0, so every container slot fails to start under podman. The new `resolve_gpu_device_paths()` enumerates /dev/kfd plus each character device under /dev/dri and passes them explicitly. This is correct for docker too, and faithful to docker's prior whole-directory passthrough. Off-GPU it falls back to the legacy directory paths so CI rendering stays deterministic.
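The enumeration described above can be sketched roughly as follows. This is an illustration only, not the code in `_gpu.py`: the names `enumerate_gpu_devices`, `device_flags`, and `LEGACY_DEVICES` are hypothetical, and the exact fallback condition (nothing enumerated at all) is an assumption. The only external behaviour relied on is the `--device=<path>` flag that both podman and docker accept.

```python
from __future__ import annotations

import os
import stat
from pathlib import Path

# Legacy whole-directory passthrough, used when no GPU nodes are found (e.g. CI).
LEGACY_DEVICES = ["/dev/kfd", "/dev/dri"]


def _is_char_device(path: Path) -> bool:
    """True if path exists and resolves (through symlinks) to a character device."""
    try:
        return stat.S_ISCHR(os.stat(path).st_mode)
    except OSError:
        return False


def enumerate_gpu_devices(kfd: str = "/dev/kfd", dri_dir: str = "/dev/dri") -> list[str]:
    """Return explicit device nodes: kfd when present, then each char device under dri_dir."""
    devices: list[str] = []
    if _is_char_device(Path(kfd)):
        devices.append(str(kfd))
    dri = Path(dri_dir)
    if dri.is_dir():
        # Sorted so the rendered unit is stable from run to run.
        for entry in sorted(dri.iterdir()):
            if _is_char_device(entry):
                devices.append(str(entry))
    # Assumption: fall back to the legacy paths only when nothing was found.
    return devices or list(LEGACY_DEVICES)


def device_flags(devices: list[str]) -> list[str]:
    """Turn device paths into one --device flag per node."""
    return [f"--device={d}" for d in devices]
```

Taking the paths as parameters keeps the sketch testable without a GPU: a temporary directory of symlinks to /dev/null stands in for /dev/dri, in the same spirit as the hermetic tests mentioned under Tests.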

2. `context_size` + `[server].extra_args` were dropped

load_sync passed only --host/--port/--model plus profile flags, so context_size=131072 in a slot TOML was silently ignored and llama-server booted at its 4096 default. That is a severe regression for the agent/chat slots, whose purpose is 131k context. schema.py:260 lists these as supplied ("model, context_size, and port"), but they were never wired. Now _render_unit / resolved_command_for_slot emit --ctx-size <n> and append extra_args after the profile flags, so slot overrides win.
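A minimal sketch of the argument ordering described above. The function names `slot_launch_params` and `build_server_args` are hypothetical and do not come from the PR. The slot config is modelled as a plain dict with `model.context_size` and `server.extra_args` keys, and placing `--ctx-size` directly after the base flags is an assumption; the description only fixes that extra_args come after the profile flags.

```python
from __future__ import annotations


def slot_launch_params(slot_cfg: dict) -> tuple[int | None, list[str]]:
    """Pull the two slot-supplied launch params out of a parsed slot config."""
    context_size = slot_cfg.get("model", {}).get("context_size")
    extra_args = list(slot_cfg.get("server", {}).get("extra_args", []))
    return context_size, extra_args


def build_server_args(
    host: str,
    port: int,
    model: str,
    profile_flags: list[str],
    context_size: int | None = None,
    extra_args: list[str] | None = None,
) -> list[str]:
    """Assemble server arguments: base flags, ctx size, profile flags, then slot extras."""
    args = ["--host", host, "--port", str(port), "--model", model]
    if context_size is not None:
        # Without this flag the server starts at its own default context.
        args += ["--ctx-size", str(context_size)]
    args += list(profile_flags)
    # Slot extras go last so they come after any profile flag they repeat.
    args += list(extra_args or [])
    return args
```

Appending the slot's extra_args last is what the description means by "slot overrides win": the slot's value is the final one on the command line, which takes effect as long as the server honours the last occurrence of a repeated flag.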

Live validation (CT105, podman, ace-saber 35B MoE)

- n_ctx = 131072 confirmed via /props (ctx threading works).
- 51.7 gen tok/s at real 131k (361-tok prompt, 723 prefill) = parity with the bench (52.8 container / 53.6 baremetal).
- 19.86 GB GTT, no OOM; podman accepts the full enumerated device set.

Note for the cutover (not in this PR)

Container cgroup memory.current reads 0.88 GB vs 19.86 GB real GTT — GPU weights aren't in cgroup RSS, so the #660 max(cgroup, file_estimate) floor is load-bearing for GPU-slot memory attribution (confirm the dashboard reads the floored path). This is the #662 step-7 deferred item.

Tests

TDD throughout — tests/providers/test_gpu.py (device enumeration, hermetic via /dev/null symlinks) + test_container.py (explicit nodes, ctx-size, extra_args, load_sync wiring, resolved_command). 233 passed; ruff clean.

Relates to #652, #662.

🤖 Generated with Claude Code

Podman cannot recurse a `--device=/dev/dri` directory the way docker does. On CT105 (and any host whose /dev/dri holds an `amdgpu` node and no `card0`) it errors `no devices found in /dev/dri`, so every container slot fails to start under podman, the design's intended runtime.

Add `resolve_gpu_device_paths()` (_gpu.py): includes /dev/kfd when present plus every character device under /dev/dri, passed explicitly. Falls back to the legacy `["/dev/kfd","/dev/dri"]` off-GPU so unit rendering stays deterministic in CI. `_render_unit` and `container_spec` now source devices from it.

This is correct for docker too (explicit nodes work either way) and faithful to docker's prior whole-directory passthrough.

Verified on CT105: ace-saber loads under podman at 53.6 tok/s (parity with the docker/baremetal bench); enumerated set is ['/dev/kfd','/dev/dri/amdgpu','/dev/dri/renderD128'].

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ner slots

The container path dropped both slot-supplied launch params: load_sync passed only --host/--port/--model + profile flags, so `context_size=131072` in a slot TOML was silently ignored and llama-server booted at its 4096 default. That is a severe context regression for the agent/chat slots whose whole purpose is 131k. Listed in schema.py:260 as supplied ("model, context_size, and port") but never wired.

- `_render_unit` + `resolved_command_for_slot` now emit `--ctx-size <n>` and append `[server].extra_args` (after profile flags, so slot overrides win).
- `load_sync` extracts both from slot_cfg[model].context_size / [server].extra_args.

Gating fix for the container-runtime cutover (#662): must land before any slot flips to runtime=container.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

thinmintdev and others added 2 commits June 8, 2026 20:53

thinmintdev merged commit e8f0cec into main Jun 9, 2026
4 checks passed

thinmintdev deleted the fix/container-explicit-gpu-devices-podman branch June 9, 2026 01:13

This was referenced Jun 9, 2026

fix(dispatcher): container slot preempts composite registry binding (cutover routing) #676

Merged

P2 live cutover: agent + chat to containers on CT105 #662

Closed
