fix(providers): production-correct container slots under podman (GPU devices + ctx_size + extra_args)#674

Merged
thinmintdev merged 2 commits into main from fix/container-explicit-gpu-devices-podman
Jun 9, 2026

@thinmintdev

Why

Live-cutover prep for the container-runtime epic (#652, cutover #662). Two latent bugs in the merged container path block running real GPU slots under podman (the design's intended runtime) at the bench-tuned config. Found while standing up podman on CT105.

Fixes

1. Explicit GPU device nodes (podman can't recurse /dev/dri)

Podman fails with `no devices found in /dev/dri` where docker silently recurses the directory. CT105's /dev/dri holds an amdgpu node and renderD128 but no card0, so every container slot fails to start under podman. The new resolve_gpu_device_paths() enumerates /dev/kfd plus each character device under /dev/dri and passes them explicitly (this is correct for docker too, and faithful to docker's prior whole-directory passthrough). Off-GPU it falls back to the legacy directory paths so CI rendering stays deterministic.
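For readers who have not opened the diff, the enumeration described above can be sketched roughly as follows. This is an illustration, not the merged code: the function name comes from this PR, but the `kfd` and `dri_dir` parameters, the sorted ordering, and the exact fallback condition (no character device found under the DRI directory) are assumptions.

```python
import os
import stat
from pathlib import Path


def resolve_gpu_device_paths(kfd="/dev/kfd", dri_dir="/dev/dri"):
    """Return explicit device paths to pass as --device arguments.

    Sketch only: the parameters and the fallback condition are assumptions,
    not the signature of the real helper in _gpu.py.
    """
    dri_nodes = []
    dri = Path(dri_dir)
    if dri.is_dir():
        for entry in sorted(dri.iterdir()):
            try:
                mode = entry.stat().st_mode  # follows symlinks
            except OSError:
                continue  # dangling link or permission problem: skip it
            if stat.S_ISCHR(mode):
                dri_nodes.append(str(entry))

    if not dri_nodes:
        # Assumed meaning of "off-GPU": nothing usable under the DRI
        # directory, so return the legacy pair for deterministic rendering.
        return [kfd, dri_dir]

    devices = [kfd] if os.path.exists(kfd) else []
    return devices + dri_nodes


# One --device flag per node instead of a single directory flag.
device_flags = [f"--device={p}" for p in resolve_gpu_device_paths()]
```

On a host laid out like CT105, a sketch like this would be expected to yield /dev/kfd followed by the individual nodes under /dev/dri, which is the shape podman accepts.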

2. context_size + [server].extra_args were dropped

load_sync passed only --host/--port/--model plus profile flags, so context_size=131072 in a slot TOML was silently ignored and llama-server booted at its 4096 default. That is a severe regression for the agent/chat slots, whose purpose is 131k context. schema.py:260 lists these as supplied ("model, context_size, and port") but they were never wired. Now _render_unit / resolved_command_for_slot emit --ctx-size <n> and append extra_args after profile flags (slot overrides win).
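The argument ordering can be illustrated with a small sketch. `build_server_args` is a hypothetical helper, not a function in this repository, and the position of `--ctx-size` relative to the profile flags is an assumption. The only ordering the PR states is that `extra_args` come after profile flags. "Slot overrides win" then relies on the server honouring the last occurrence of a repeated flag, which is also assumed here.

```python
def build_server_args(host, port, model, profile_flags,
                      context_size=None, extra_args=()):
    """Assemble a llama-server argument list (hypothetical helper).

    Order: base flags, --ctx-size when set, profile flags, then the
    slot's extra_args last so a repeated flag from the slot comes later.
    """
    args = ["--host", host, "--port", str(port), "--model", model]
    if context_size is not None:
        args += ["--ctx-size", str(context_size)]
    args += list(profile_flags)
    args += list(extra_args)  # slot-level overrides go last
    return args


args = build_server_args(
    host="0.0.0.0",
    port=8080,                      # illustrative values only
    model="/models/example.gguf",
    profile_flags=["--n-gpu-layers", "99"],
    context_size=131072,
    extra_args=["--n-gpu-layers", "40"],
)
```

Leaving `context_size` unset omits the flag entirely, which reproduces the pre-fix behaviour of booting at the server default.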

Live validation (CT105, podman, ace-saber 35B MoE)

  • n_ctx = 131072 confirmed via /props (ctx threading works).
  • 51.7 gen tok/s at real 131k (361-tok prompt, 723 prefill) = parity with the bench (52.8 container / 53.6 baremetal).
  • 19.86 GB GTT, no OOM; podman accepts the full enumerated device set.

Note for the cutover (not in this PR)

Container cgroup memory.current reads 0.88 GB vs 19.86 GB real GTT — GPU weights aren't in cgroup RSS, so the #660 max(cgroup, file_estimate) floor is load-bearing for GPU-slot memory attribution (confirm the dashboard reads the floored path). This is the #662 step-7 deferred item.
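To make the floor concrete, here is a minimal sketch of the `max(cgroup, file_estimate)` idea. Both function names are hypothetical and are not the #660 implementation. The 19 GB file estimate in the usage lines is an illustrative placeholder, not a measured value. The only fact taken from the kernel is that cgroup v2 exposes current usage in bytes in a `memory.current` file.

```python
from pathlib import Path


def read_cgroup_memory_current(cgroup_dir):
    """Read memory.current (bytes) from a cgroup v2 directory."""
    return int((Path(cgroup_dir) / "memory.current").read_text().strip())


def attributed_memory_bytes(cgroup_bytes, file_estimate_bytes):
    """Floor the cgroup reading at the model-file estimate.

    GPU-resident weights do not appear in cgroup RSS, so the cgroup
    number alone under-reports a GPU slot.
    """
    return max(cgroup_bytes, file_estimate_bytes)


cgroup_bytes = 880_000_000        # roughly the 0.88 GB reading noted above
file_estimate = 19_000_000_000    # hypothetical estimate, for illustration
floored = attributed_memory_bytes(cgroup_bytes, file_estimate)
```

With these inputs the attributed figure is the file estimate, which is why the dashboard needs to read the floored path rather than the raw cgroup value.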

Tests

TDD throughout — tests/providers/test_gpu.py (device enumeration, hermetic via /dev/null symlinks) + test_container.py (explicit nodes, ctx-size, extra_args, load_sync wiring, resolved_command). 233 passed; ruff clean.
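The /dev/null symlink trick works because `os.stat` follows symlinks and /dev/null is itself a character device, so a temporary directory of links looks like a directory of device nodes without needing root or real GPU hardware. A rough sketch of such a fixture follows. The helper names are hypothetical and are not the fixtures in tests/providers/test_gpu.py.

```python
import os
import stat
import tempfile


def is_char_device(path):
    """True if path resolves (through symlinks) to a character device."""
    try:
        return stat.S_ISCHR(os.stat(path).st_mode)
    except OSError:
        return False


def make_fake_dri(root, node_names):
    """Create root/dri populated with symlinks to /dev/null."""
    dri = os.path.join(root, "dri")
    os.makedirs(dri)
    for name in node_names:
        os.symlink("/dev/null", os.path.join(dri, name))
    return dri


with tempfile.TemporaryDirectory() as root:
    dri = make_fake_dri(root, ["amdgpu", "renderD128"])
    fake_nodes_ok = all(
        is_char_device(os.path.join(dri, n)) for n in os.listdir(dri)
    )
    # A regular file in the same directory must not count as a device.
    plain = os.path.join(dri, "README")
    with open(plain, "w") as fh:
        fh.write("not a device")
    plain_is_device = is_char_device(plain)
```

This approach assumes a POSIX host where /dev/null exists and symlink creation is permitted.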

Relates to #652, #662.

🤖 Generated with Claude Code

thinmintdev and others added 2 commits June 8, 2026 20:53
Podman cannot recurse a `--device=/dev/dri` directory the way docker
does — on CT105 (and any host whose /dev/dri holds an `amdgpu` node and
no `card0`) it errors `no devices found in /dev/dri`, so every container
slot fails to start under podman, the design's intended runtime.

Add `resolve_gpu_device_paths()` (_gpu.py): includes /dev/kfd when present
plus every character device under /dev/dri, passed explicitly. Falls back
to the legacy `["/dev/kfd","/dev/dri"]` off-GPU so unit rendering stays
deterministic in CI. `_render_unit` and `container_spec` now source devices
from it. This is correct for docker too (explicit nodes work either way)
and faithful to docker's prior whole-directory passthrough.

Verified on CT105: ace-saber loads under podman at 53.6 tok/s (parity with
the docker/baremetal bench); enumerated set is
['/dev/kfd','/dev/dri/amdgpu','/dev/dri/renderD128'].

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ner slots

The container path dropped both slot-supplied launch params: load_sync passed
only --host/--port/--model + profile flags, so `context_size=131072` in a slot
TOML was silently ignored and llama-server booted at its 4096 default — a severe
context regression for the agent/chat slots whose whole purpose is 131k. Listed
in schema.py:260 as supplied ("model, context_size, and port") but never wired.

- `_render_unit` + `resolved_command_for_slot` now emit `--ctx-size <n>` and
  append `[server].extra_args` (after profile flags, so slot overrides win).
- `load_sync` extracts both from slot_cfg[model].context_size / [server].extra_args.

Gating fix for the container-runtime cutover (#662): must land before any slot
flips to runtime=container.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@thinmintdev thinmintdev merged commit e8f0cec into main Jun 9, 2026
4 checks passed
@thinmintdev thinmintdev deleted the fix/container-explicit-gpu-devices-podman branch June 9, 2026 01:13