bfhp: pre-zero quantized padding in loader; add --gpu-host-import flag #9

Open: doctorjei wants to merge 3 commits into domvox:main from doctorjei:tq-hip-bfhp-pr

@doctorjei (Contributor)

Overview

Fix-forward for 4b95731 (merged as a9f3521). That earlier review-feedback commit moved padding-zeroing for external (bfhp) buffers from a GPU-side cudaMemset to a host-side memset through ctx->host_ptr, but the assumption that the host mapping is writable doesn't always hold. It holds for --hugepages (PROT_RW during load, PROT_R after) but not for file-mmap (PROT_R from the start), so any non-hugepages model load with bfhp active hits a SIGSEGV (yikes).

This Fix

Moves the zero-pass into the loader (where the mapping's writability and lifetime are known), removes the unsafe host memset from ggml-cuda.cu, and adds a --gpu-host-import flag so users can opt out (or explicitly opt in) for non-hugepages mmap loads.

The fix is broken into 3 commits for readability:

  1. Adds the GPU host import switch (bfhp). Pure plumbing: --gpu-host-import / --no-gpu-host-import CLI flags (env LLAMA_ARG_GPU_HOST_IMPORT) and the common→model wiring. No behavioral change on its own.

  2. Adds zero-fill protection for buffers from host pages in the model loader. Walks the tensors backed by mapping idx and zeros out each padding slice in the host mapping. Early-returns for hugetlb mappings (the kernel already zero-fills the anon allocation). For file-mmap, does a scoped mprotect(PROT_RW) → memset → restore.

  3. Streamlines zero-fill to avoid a regression. init_tensor no longer host-memsets external buffers (that now happens in the loader), so the external branch collapses to a no-op and the surviving cudaMemset only runs for owned device buffers. Also replaces the four NULL vtable slots in the imported-buffer interface with GGML_ABORT wrappers.
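The scoped mprotect → memset → restore step from commit 2 can be sketched roughly as below. This is a minimal illustration, not the code in this PR: `mapping_kind`, `zero_padding`, and `zero_slice_in_readonly_mapping` are hypothetical names, and the sketch assumes a POSIX system and a mapping the kernel allows to be made writable (for example a private mapping). mprotect works on whole pages, so the slice is widened to page boundaries before the protection is changed.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical helper (not from the PR): zero `len` bytes at `base + offset`
// inside a mapping that is currently PROT_READ. Returns 0 on success, else errno.
static int zero_slice_in_readonly_mapping(void * base, size_t offset, size_t len) {
    assert(base != nullptr);
    if (len == 0) {
        return 0;
    }
    const uintptr_t page  = (uintptr_t) sysconf(_SC_PAGESIZE);
    const uintptr_t start = (uintptr_t) base + offset;
    // Widen [start, start + len) to whole pages, as mprotect requires.
    const uintptr_t lo = start & ~(page - 1);
    const uintptr_t hi = (start + len + page - 1) & ~(page - 1);

    if (mprotect((void *) lo, hi - lo, PROT_READ | PROT_WRITE) != 0) {
        return errno; // mapping cannot be made writable; nothing was modified
    }
    memset((void *) start, 0, len);
    // Restore the read-only protection the mapping had before the call.
    if (mprotect((void *) lo, hi - lo, PROT_READ) != 0) {
        return errno;
    }
    return 0;
}

// Hypothetical classification of the host mapping behind a buffer.
enum class mapping_kind { hugetlb, file_mmap };

// Assumed dispatch: hugetlb pages are already zero-filled by the kernel,
// so only file-mmap mappings need the scoped write.
static int zero_padding(mapping_kind kind, void * base, size_t offset, size_t len) {
    if (kind == mapping_kind::hugetlb) {
        return 0;
    }
    return zero_slice_in_readonly_mapping(base, offset, len);
}
```

Whether the first mprotect call succeeds depends on how the file was mapped, so a real loader would need to handle the error return rather than assume the write went through.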

Results (Benchmarks)

| build | path | PP t/s | TG t/s | notes |
| --- | --- | --- | --- | --- |
| pre-bfhp 73a6481 | mmap | (baseline) | (baseline) | |
| this PR | mmap, --no-gpu-host-import | ≈ baseline | ≈ baseline | bfhp disabled, loader zero-pass skipped |
| this PR | mmap, --gpu-host-import | ≈ baseline | −5.2% TG | bfhp on hipMalloc-cheaper path |
| this PR | --hugepages (AUTO ⇒ on) | ≈ baseline | +4.0% TG | one physical copy of weights, +17.5 GiB free VRAM |

Considerations

The --gpu-host-import and --no-gpu-host-import flags were added because, at the moment, host import is only performant with the hugepages implementation. However, this approach could likely be extended to solve problems on other architectures, and there may be cases where a user or developer needs or wants to use imported host pages without hugepages.

Because bfhp performs poorly without hugepages, the AUTO setting for bfhp follows the hugepages setting: the two are on or off together.
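As a rough sketch of the AUTO behavior described above (the enum and function names here are hypothetical and not taken from the PR), the tri-state flag could be resolved like this: an explicit --gpu-host-import or --no-gpu-host-import wins, and AUTO simply mirrors whether hugepages are in use.

```cpp
// Hypothetical tri-state for the host-import flag (names are assumptions).
enum class host_import_mode { automatic, on, off };

// Resolve the effective bfhp setting: explicit choices win,
// AUTO follows the hugepages setting.
static bool resolve_gpu_host_import(host_import_mode mode, bool hugepages_enabled) {
    switch (mode) {
        case host_import_mode::on:  return true;
        case host_import_mode::off: return false;
        case host_import_mode::automatic:
        default:                    return hugepages_enabled;
    }
}
```

With this shape, the mmap rows in the benchmark table correspond to the explicit on/off modes, while the --hugepages row corresponds to AUTO resolving to on.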

Jeremiah Blanchard added 3 commits April 25, 2026 23:14
bfhp = buffer from host pointer. First commit to add support for the flag; implementation to follow.
When bfhp is active, add zero-fill protection; collapses to no-op when it is known that pages are already "clean".
Activated more nuanced zero-fill to avoid penalty when zero-fill is already known (e.g., hugepages)