cuda: fix HMM path on coherent unified-memory systems (GB10 NVLink-C2C) #158

Open

Mikehutu wants to merge 1 commit into
Conversation
When cudaHostRegister() returns cudaErrorNotSupported, the GPU already has
full HMM access to all host virtual addresses; no registration is needed.
The previous code left the else-branch empty after clearing the error,
so g_model_hmm_direct was never set and cuda_model_range_is_cached()
kept returning 0 for every tensor range.
This caused cuda_model_range_ptr() to fall through to the per-range
cudaMalloc+cudaMemcpy path for every tensor on first access, silently
allocating ~87 GB of redundant device copies of the model weights on a
DGX Spark GB10 (128 GB unified LPDDR5x):
Before: free -h shows ~98 GB used / ~23 GB available; nvidia-smi process entry: 87,992 MiB
After: free -h shows ~26 GB used / ~95 GB available; nvidia-smi process entry: ~26,000 MiB (KV cache only)
Fix: when registration is skipped, call cuda_model_prefetch_range() which
was already fully implemented but had no call site on the normal Linux/CUDA
path. It issues cudaMemPrefetchAsync over the full mmap region (async, on a
side stream) and sets g_model_hmm_direct=1 on success.
The companion hunk in cuda_model_range_is_cached() short-circuits to 1
when g_model_hmm_direct is set, so the per-range copy path is never reached.
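For reference, a minimal sketch of what the prefetch helper does, reconstructed from the description above (not the literal patch; it assumes the CUDA 13 `cudaMemLocation` form of `cudaMemPrefetchAsync` and a guessed argument list):

```c
#include <cuda_runtime.h>

// Sketch reconstructed from the PR description -- not the literal patch.
// g_model_hmm_direct and cuda_model_prefetch_range() are the names used in
// this PR; the body and argument list below are illustrative.
static int g_model_hmm_direct = 0;

static void cuda_model_prefetch_range(const void *base, size_t len, int device)
{
    cudaStream_t side;  // side stream so the prefetch overlaps model setup
    if (cudaStreamCreateWithFlags(&side, cudaStreamNonBlocking) != cudaSuccess)
        return;

    // CUDA 13 expresses the prefetch destination as a cudaMemLocation.
    struct cudaMemLocation loc;
    loc.type = cudaMemLocationTypeDevice;
    loc.id   = device;

    // Async prefetch over the full mmap region; on a coherent system this
    // populates GPU mappings without allocating a separate device copy.
    if (cudaMemPrefetchAsync(base, len, loc, /*flags=*/0, side) == cudaSuccess)
        g_model_hmm_direct = 1;

    cudaStreamDestroy(side);  // destruction is deferred until work drains
}
```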
Tested manually on DGX Spark GB10 (SM_121, 128 GB, CUDA 13.0, driver 580.142)
with DeepSeek V4 Flash IQ2_XXS-w2Q2K (80.76 GiB): 13.81 t/s, 94% GPU util.
Author

Regression test results — GB10 (SM_121, CUDA 13.0, driver 580.142): passes. The first cold run took 2.13 s (CUDA JIT compiling the top-k kernel to SASS for the first time); subsequent runs are ~0.009 s once the compute cache is warm. Not related to this patch — the top-k kernel is untouched.
Problem
On systems where `cudaHostRegister()` returns `cudaErrorNotSupported` (e.g. NVIDIA Grace-Blackwell GB10 / DGX Spark with NVLink-C2C unified memory), the `else` branch in `ds4_gpu_set_model_map()` only logged the error and returned. `g_model_hmm_direct` was never set, so `cuda_model_range_is_cached()` returned 0 for every tensor range.

This caused `cuda_model_range_ptr()` to fall through to the per-range `cudaMalloc` + `cudaMemcpy` path on every first tensor access, silently allocating ~87 GB of redundant device copies of the 80.76 GiB model.
The model weights end up in CUDA device-private pools (not evictable, invisible to `/proc/PID/smaps`, only visible via `MemFree` shrinking and nvidia-smi's per-process column). A second model cannot be co-loaded.
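For context, the pre-fix fall-through looks roughly like this (a simplified sketch, not the actual source; the helpers and signatures are hypothetical stand-ins for the project's range bookkeeping):

```c
#include <cuda_runtime.h>

// Hypothetical helpers standing in for the project's range bookkeeping.
int   cuda_model_range_is_cached(const void *host, size_t len);
void *cached_or_direct_ptr(const void *host);
void  remember_range(const void *host, void *dev, size_t len);

// Simplified sketch of the pre-fix flow -- not the actual source.
void *cuda_model_range_ptr(const void *host, size_t len)
{
    if (cuda_model_range_is_cached(host, len))   // always 0 pre-fix on GB10
        return cached_or_direct_ptr(host);

    // Fall-through: one private device copy per tensor range.
    void *dev = NULL;
    if (cudaMalloc(&dev, len) != cudaSuccess)
        return NULL;
    cudaMemcpy(dev, host, len, cudaMemcpyHostToDevice);
    remember_range(host, dev, len);
    return dev;   // summed over all tensors: ~87 GB of duplicate weights
}
```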
Root cause
`cuda_model_prefetch_range()` was fully implemented and correct — but its only call site on Linux/CUDA was behind the `DS4_CUDA_COPY_MODEL_CHUNKED` env var. On a normal GB10 run:

- `DS4_CUDA_COPY_MODEL_CHUNKED` is not set
- `cudaHostRegister` fails with `cudaErrorNotSupported`
- the `else` branch clears the error and returns; prefetch is never called
- `g_model_hmm_direct` stays 0 → the per-range copy path runs for every tensor

Fix
Two hunks, 8 lines total:
Hunk 1 — `ds4_gpu_set_model_map()`: when registration is skipped, call `cuda_model_prefetch_range()`, which issues `cudaMemPrefetchAsync` over the full mmap region and sets `g_model_hmm_direct = 1` on success.

Hunk 2 — `cuda_model_range_is_cached()`: short-circuit to 1 when `g_model_hmm_direct` is set, so the per-range copy path is never reached (the existing check at line ~200 in `cuda_model_range_ptr` is not reached because `is_cached` runs first).
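The shape of the two hunks, reconstructed from the description (surrounding names such as `map_base`, `map_len`, `err`, and `device` are placeholders; the prefetch helper is sketched earlier in this thread):

```c
// Hunk 1 (sketch) -- ds4_gpu_set_model_map(), the previously empty branch.
if (err == cudaErrorNotSupported) {
    cudaGetLastError();  // clear the sticky error, as the old code already did
    // New: prefetch the whole mmap region; on success this sets
    // g_model_hmm_direct = 1 inside the helper.
    cuda_model_prefetch_range(map_base, map_len, device);
}

// Hunk 2 (sketch) -- cuda_model_range_is_cached(), added at the top.
if (g_model_hmm_direct)
    return 1;  // GPU already has coherent access; skip the per-range copy path
```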
Results on DGX Spark GB10

- Hardware: SM_121, 128 GB unified LPDDR5x, NVLink-C2C ~900 GB/s
- Model: DeepSeek V4 Flash IQ2_XXS-w2Q2K (80.76 GiB)
- CUDA 13.0, driver 580.142
| | free -h used | free -h available |
|---|---|---|
| Before | ~98 GB | ~23 GB |
| After | ~26 GB | ~95 GB |

Testing
Manually verified on GB10. Full `make cuda-regression` suite not yet run — happy to do so if you can confirm there's no known issue running the suite on SM_121 / CUDA 13.0.
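If it helps gate the suite on this class of hardware, coherent pageable access can be probed at runtime with standard CUDA device attributes; a small sketch:

```c
#include <cuda_runtime.h>

// Sketch: probe for a coherent unified-memory system (e.g. GB10).
// Both attributes are standard CUDA device attributes.
static int is_coherent_unified_memory(int device)
{
    int pageable = 0, host_tables = 0;
    cudaDeviceGetAttribute(&pageable, cudaDevAttrPageableMemoryAccess, device);
    cudaDeviceGetAttribute(&host_tables,
                           cudaDevAttrPageableMemoryAccessUsesHostPageTables,
                           device);
    // On such systems cudaHostRegister() may return cudaErrorNotSupported,
    // as observed on GB10 in this PR.
    return pageable && host_tables;
}
```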