fix(providers): production-correct container slots under podman (GPU devices + ctx_size + extra_args) #674
Merged
Podman cannot recurse a `--device=/dev/dri` directory the way docker does. On CT105 (and any host whose /dev/dri holds an `amdgpu` node and no `card0`) it errors with `no devices found in /dev/dri`, so every container slot fails to start under podman, the design's intended runtime.

Add `resolve_gpu_device_paths()` (_gpu.py): it includes /dev/kfd when present plus every character device under /dev/dri, each passed explicitly. It falls back to the legacy `["/dev/kfd","/dev/dri"]` off-GPU so unit rendering stays deterministic in CI. `_render_unit` and `container_spec` now source devices from it. This is correct for docker too (explicit nodes work either way) and faithful to docker's prior whole-directory passthrough.

Verified on CT105: ace-saber loads under podman at 53.6 tok/s (parity with the docker/baremetal bench); the enumerated set is ['/dev/kfd','/dev/dri/amdgpu','/dev/dri/renderD128'].

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
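The enumeration described in this commit message could look roughly like the sketch below. It is not the code from `_gpu.py`: the `kfd` and `dri_dir` parameters are hypothetical (added so the function can be pointed at a temporary directory), and the exact fallback condition (no character device found under the DRI directory) is an assumption.

```python
import os
import stat
from pathlib import Path

# Legacy whole-directory passthrough, as quoted in the commit message.
LEGACY_DEVICES = ["/dev/kfd", "/dev/dri"]


def resolve_gpu_device_paths(kfd="/dev/kfd", dri_dir="/dev/dri"):
    """Return explicit device node paths instead of the /dev/dri directory.

    Hypothetical signature: the real function may take no arguments.
    """
    devices = []
    if os.path.exists(kfd):
        devices.append(kfd)

    dri_nodes = []
    dri = Path(dri_dir)
    if dri.is_dir():
        for entry in sorted(dri.iterdir()):
            try:
                mode = entry.stat().st_mode  # stat() follows symlinks
            except OSError:
                continue  # dangling symlink or unreadable entry
            if stat.S_ISCHR(mode):
                dri_nodes.append(str(entry))

    # Assumption: with no DRI character device the host counts as "off-GPU",
    # so return the legacy pair to keep rendered units deterministic.
    if not dri_nodes:
        return list(LEGACY_DEVICES)
    return devices + dri_nodes
```

Each returned path would then be passed as its own flag, for example `[f"--device={p}" for p in resolve_gpu_device_paths()]`, so neither runtime has to recurse a directory.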
…ner slots
The container path dropped both slot-supplied launch params: load_sync passed
only --host/--port/--model + profile flags, so `context_size=131072` in a slot
TOML was silently ignored and llama-server booted at its 4096 default — a severe
context regression for the agent/chat slots whose whole purpose is 131k. Listed
in schema.py:260 as supplied ("model, context_size, and port") but never wired.
- `_render_unit` + `resolved_command_for_slot` now emit `--ctx-size <n>` and
append `[server].extra_args` (after profile flags, so slot overrides win).
- `load_sync` extracts both from slot_cfg[model].context_size / [server].extra_args.
Gating fix for the container-runtime cutover (#662): must land before any slot
flips to runtime=container.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
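A minimal sketch of the wiring this commit describes, assuming `slot_cfg` is the parsed slot TOML as a nested dict. The helper name `build_server_args` and its parameters are hypothetical and not taken from the PR; placing `--ctx-size` after the profile flags is also an assumption, since the message only states that `extra_args` go after them.

```python
def build_server_args(host, port, model_path, profile_flags, slot_cfg):
    """Assemble llama-server arguments for a container slot.

    Hypothetical helper: illustrates the argument order only.
    """
    # Slot-supplied launch params that the container path previously dropped.
    context_size = slot_cfg.get("model", {}).get("context_size")
    extra_args = slot_cfg.get("server", {}).get("extra_args", [])

    args = ["--host", host, "--port", str(port), "--model", model_path]
    args += list(profile_flags)

    if context_size is not None:
        args += ["--ctx-size", str(context_size)]

    # Appended last so a slot-level flag follows any profile flag it repeats.
    args += list(extra_args)
    return args


slot_cfg = {
    "model": {"context_size": 131072},
    "server": {"extra_args": ["--parallel", "2"]},  # example values only
}
print(build_server_args("0.0.0.0", 8080, "/models/model.gguf",
                        ["--n-gpu-layers", "99"], slot_cfg))
```

When `context_size` is absent from the slot, no `--ctx-size` flag is emitted and the server's own default applies.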
This was referenced Jun 9, 2026
Why
Live-cutover prep for the container-runtime epic (#652, cutover #662). Two latent bugs in the merged container path block running real GPU slots under podman (the design's intended runtime) at the bench-tuned config. Found while standing up podman on CT105.
Fixes
1. Explicit GPU device nodes (podman can't recurse `/dev/dri`)
Podman errors with `no devices found in /dev/dri` where docker silently recurses the directory. CT105's `/dev/dri` holds an `amdgpu` node plus `renderD128` and no `card0`, so every container slot fails to start under podman. The new `resolve_gpu_device_paths()` enumerates `/dev/kfd` plus each character device under `/dev/dri` and passes them explicitly (correct for docker too; faithful to docker's prior whole-dir passthrough). It falls back to the legacy directory paths off-GPU so CI rendering stays deterministic.
2. `context_size` + `[server].extra_args` were dropped
`load_sync` passed only `--host`/`--port`/`--model` + profile flags, so `context_size=131072` in a slot TOML was silently ignored and llama-server booted at its 4096 default. This is a severe regression for the agent/chat slots whose purpose is 131k. `schema.py:260` lists these as supplied ("model, context_size, and port") but they were never wired. Now `_render_unit`/`resolved_command_for_slot` emit `--ctx-size <n>` and append `extra_args` after profile flags (slot overrides win).
Live validation (CT105, podman, ace-saber 35B MoE)
`n_ctx = 131072` confirmed via `/props` (ctx threading works).
Note for the cutover (not in this PR)
Container cgroup `memory.current` reads 0.88 GB vs 19.86 GB real GTT. GPU weights aren't in cgroup RSS, so the #660 `max(cgroup, file_estimate)` floor is load-bearing for GPU-slot memory attribution (confirm the dashboard reads the floored path). This is the #662 step-7 deferred item.
Tests
TDD throughout: `tests/providers/test_gpu.py` (device enumeration, hermetic via `/dev/null` symlinks) + `test_container.py` (explicit nodes, ctx-size, extra_args, load_sync wiring, resolved_command). 233 passed; ruff clean.
Relates to #652, #662.
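The `max(cgroup, file_estimate)` floor mentioned in the cutover note can be sketched as follows. The function names are hypothetical and this is not the #660 implementation. It assumes cgroup v2, where `memory.current` holds a byte count, and that the file estimate is derived elsewhere (for example from the model file size).

```python
from pathlib import Path


def read_cgroup_memory_bytes(cgroup_dir):
    """Read memory.current from a cgroup v2 directory; 0 if unavailable."""
    try:
        return int((Path(cgroup_dir) / "memory.current").read_text().strip())
    except (OSError, ValueError):
        return 0


def attributed_memory_bytes(cgroup_bytes, file_estimate_bytes):
    """Floor the cgroup reading at the file-based estimate.

    GPU weights do not show up in cgroup RSS, so for GPU slots the cgroup
    figure alone under-reports; taking the larger value guards against that.
    """
    return max(cgroup_bytes, file_estimate_bytes)
```

For a GPU slot the cgroup reading is typically the smaller of the two, so the estimate becomes the reported value; for a CPU-only slot the cgroup reading would usually win.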
🤖 Generated with Claude Code