feat(docker): ROCm/HIP image variant (:rocm) #355

Merged
davide221 merged 7 commits into main from feat/docker-rocm
Jun 9, 2026
@davide221

Contributor

Adds the AMD/ROCm sibling of the cuda12 docker image so we publish both :cuda12 and :rocm to ghcr.io/luce-org/lucebox-hub.

What's in it

  • Dockerfile.rocm: rocm/dev-ubuntu base, DFLASH27B_GPU_BACKEND=hip, hipblas/rocblas, ggml-hip rpath + ld.so, gfx1151 default. Carries the same COPY server/share status.html fix as the cuda Dockerfile (the bug fixed in #334, "build(docker): lucebox-hub container image + CI release pipeline").
  • docker-bake.hcl: DFLASH_HIP_ARCHES + ROCM_VERSION vars; _rocm-base / rocm / rocm-local targets; an all group that builds cuda + rocm.
  • docker.yml: rocm joins the build matrix; Dockerfile.rocm added to the PR paths filter.
  • README: AMD run command (--device /dev/kfd --device /dev/dri) in the Docker quick-start.
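
As a rough illustration of how the run commands for the two images differ, the sketch below assembles a `docker run` argument list per image tag. The `--device /dev/kfd --device /dev/dri` flags and the image name come from this PR; the `--gpus all` flag for the CUDA image, the `--rm` flag, and the helper name `docker_run_args` are assumptions for illustration, and the README's actual command may carry further options (ports, volumes).

```python
IMAGE = "ghcr.io/luce-org/lucebox-hub"


def docker_run_args(tag: str) -> list[str]:
    """Build a `docker run` argv for the given image tag (hypothetical helper)."""
    args = ["docker", "run", "--rm"]  # --rm is an assumption, not from the PR
    if tag == "rocm":
        # AMD: pass through the compute (kfd) and render (dri) device nodes,
        # as in the README change described above.
        args += ["--device", "/dev/kfd", "--device", "/dev/dri"]
    elif tag == "cuda12":
        # Assumption: GPU passthrough via Docker's --gpus flag.
        args += ["--gpus", "all"]
    else:
        raise ValueError(f"unknown image tag: {tag}")
    args.append(f"{IMAGE}:{tag}")
    return args


if __name__ == "__main__":
    print(" ".join(docker_run_args("rocm")))
```

Keeping the per-backend flags in one place makes it harder for the CUDA and ROCm quick-start commands to drift apart.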

Notes

🧙 Built with WOZCODE

AMD/HIP sibling of the cuda12 image so we publish both
ghcr.io/luce-org/lucebox-hub:cuda12 and :rocm.

- Dockerfile.rocm: rocm/dev-ubuntu base, DFLASH27B_GPU_BACKEND=hip,
  hipblas/rocblas, ggml-hip rpath + ld.so, gfx1151 default. Carries the
  same COPY server/share status.html fix as the cuda Dockerfile.
- docker-bake.hcl: DFLASH_HIP_ARCHES + ROCM_VERSION vars; _rocm-base /
  rocm / rocm-local targets; all group builds cuda + rocm.
- docker.yml: rocm joins the build matrix; Dockerfile.rocm in PR paths.
- README: AMD run command (--device /dev/kfd --device /dev/dri) in the
  Docker quick-start.

gfx1151 (Strix Halo) by default; widen via DFLASH_HIP_ARCHES. BSA is
CUDA-only and disabled for HIP.

Co-Authored-By: WOZCODE <contact@withwoz.com>

@cubic-dev-ai (bot) left a comment

Contributor

7 issues found across 15 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name=".github/workflows/ci.yml">

<violation number="1" location=".github/workflows/ci.yml:23">
P2: `ruff check .` is a no-op because `[tool.ruff] include = []` excludes every file. The step passes with zero files checked, giving a false sense of lint coverage.</violation>
</file>

Comment thread Dockerfile.rocm Outdated
Comment thread Makefile Outdated
Comment thread pyproject.toml Outdated
Comment thread Dockerfile Outdated
Comment thread .github/workflows/ci.yml
run: bash scripts/check_uv_workspace.sh

- name: Lint Python surfaces touched by lucebox tooling
run: uv run --frozen --extra dev ruff check .

P2: `ruff check .` is a no-op because `[tool.ruff] include = []` excludes every file. The step passes with zero files checked, giving a false sense of lint coverage.
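
To see why an empty include list makes the lint step vacuous, here is a toy model of include-based file selection. It is not ruff's implementation: `select_files` and the sample paths are hypothetical, and glob semantics are only approximated with `fnmatch`.

```python
from fnmatch import fnmatch


def select_files(paths, include):
    """Return the paths matched by at least one include pattern."""
    return [p for p in paths if any(fnmatch(p, pat) for pat in include)]


# Hypothetical repository layout, for illustration only.
paths = ["harness/cli.py", "scripts/build.py", "server/app.py"]

# Empty include: nothing is selected, so a lint run over this set
# has no files to report on and trivially "passes".
print(select_files(paths, []))

# Scoped include: only files under the listed directories are selected.
print(select_files(paths, ["harness/*.py", "scripts/*.py"]))
```

A gate that selects no files cannot fail, which is why the include list has to name at least the directories the step claims to cover.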

Comment thread scripts/build_image.sh
Comment thread Dockerfile Outdated
davide221 and others added 3 commits June 9, 2026 19:21
- docker.yml: add a paths filter to the push:[main] trigger so the publish
  build only fires when an image-affecting file changes, not on every main
  commit (was a ~2h full-arch rebuild per commit, e.g. a docs typo).
- README: widen the Docker image to 62% with a framed border so it reads at
  harness-hero size instead of a thin strip; GPU/tag table beside it; add
  explicit install steps (docker pull + model download + run) for both
  cuda12 and rocm, not just the run command.

Co-Authored-By: WOZCODE <contact@withwoz.com>
- Makefile serve: publish on host :8000 (was :8080) to match the documented
  OpenAI endpoint. (cubic P2)
- Dockerfile + Dockerfile.rocm: drop the unused docker.io runtime package
  (it pulls in containerd + iptables); the entrypoint never shells out to
  docker, only references it in comments. Slims the image and trims
  privileged tooling. (cubic P3)

Co-Authored-By: WOZCODE <contact@withwoz.com>
- pyproject: drop the conflicting [dependency-groups] dev (pytest-only) so
  [project.optional-dependencies] dev (pytest+mypy+ruff, used via
  uv sync --extra dev) is the single source of truth. Re-locked.
- Dockerfile + Dockerfile.rocm: install uv via a pinned
  COPY --from=ghcr.io/astral-sh/uv:0.11.2 instead of curl | sh (no remote
  installer runs at build time; version fixed).
- ruff gate is now real instead of a no-op: include scoped to the host-CLI
  tooling (harness/, scripts/) with F/I/UP/B (line-length/style staged
  out). Auto-fixed + hand-fixed the 15 violations there; server internals
  stay staged-excluded until cleaned.
- build_image.sh: document the untagged-tree pinned sha tag.

ruff check . and uv lock --check both pass.

Co-Authored-By: WOZCODE <contact@withwoz.com>
easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 9, 2026
davide221 and others added 3 commits June 9, 2026 22:43
…x hosts)

Runtime verification on lucebox2 (gfx1151 Strix Halo, host ROCm 7.2.2):
the 6.4.1-userspace image finds the device but SIGSEGVs during model
load (and reports a bogus 1.28 TB VRAM total) — a 6.4.x-userspace /
7.x-host-driver mismatch. Rebuilding the same image with
ROCM_VERSION=7.2.2 works end-to-end: server up, /health + /props OK,
coherent chat completion at 12 tok/s decode on the iGPU.

Default both Dockerfile.rocm and docker-bake.hcl to 7.2.2 so the
published :rocm image runs on current ROCm 7.x host stacks; 6.4.x
remains available via the build arg for hosts still on a 6.x driver.

Co-Authored-By: WOZCODE <contact@withwoz.com>
…trix Halo)

Revert the 7.2.2 default: the 7.2.x stack has shown intermittent
problems on Strix Halo, so the published :rocm image stays on 6.4.1.
The 6.4.x-userspace / 7.x-host-driver segfault from the previous commit
remains documented in Dockerfile.rocm + docker-bake.hcl: on a ROCm 7.x
host, rebuild with ROCM_VERSION=7.2.2 to match the host driver.

Co-Authored-By: WOZCODE <contact@withwoz.com>
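
The userspace/host mismatch described in the two commits above could be caught before launch with a small pre-flight check. The sketch below compares the major version of the image's ROCm userspace with the host's. Treating any major-version difference as risky is an assumption generalized from the single 6.4.1-on-7.2.2 failure reported here, and `rocm_major` and `rebuild_hint` are hypothetical helper names. How the host version is discovered is deliberately left out.

```python
from typing import Optional


def rocm_major(version: str) -> int:
    """Extract the major component from a version string such as '6.4.1'."""
    return int(version.split(".")[0])


def rebuild_hint(image_rocm: str, host_rocm: str) -> Optional[str]:
    """Return a rebuild suggestion when userspace and host majors differ, else None."""
    if rocm_major(image_rocm) == rocm_major(host_rocm):
        return None
    # Mirrors the advice in the commit message: match the image to the host driver.
    return f"rebuild with ROCM_VERSION={host_rocm} to match the host driver"


if __name__ == "__main__":
    print(rebuild_hint("6.4.1", "7.2.2"))  # mismatch: a hint string
    print(rebuild_hint("6.4.1", "6.4.1"))  # same major: None
```

A check like this only flags the version pairing; it says nothing about the intermittent 7.2.x problems on Strix Halo that motivated the revert.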
@davide221 davide221 merged commit d2e58c1 into main Jun 9, 2026
5 checks passed