
cuda: fix HMM path on coherent unified-memory systems (GB10 NVLink-C2C) #158

Open

Mikehutu wants to merge 1 commit into antirez:main from Mikehutu:hmm-nvlink-c2c-fix

Conversation

@Mikehutu

Problem

On systems where cudaHostRegister() returns cudaErrorNotSupported (e.g.
NVIDIA Grace-Blackwell GB10 / DGX Spark with NVLink-C2C unified memory), the
else branch in ds4_gpu_set_model_map() only logged the error and returned.
g_model_hmm_direct was never set, so cuda_model_range_is_cached() returned
0 for every tensor range.

This caused cuda_model_range_ptr() to fall through to the per-range
cudaMalloc + cudaMemcpy path on every first tensor access — silently
allocating ~87 GB of redundant device copies of the 80.76 GiB model:

# Without fix, on GB10 running ds4_test:
free -h:       Mem: 121Gi used 98Gi, available 23Gi
nvidia-smi:    ds4_test   87,992 MiB

The model weights end up in CUDA device-private pools (not evictable, invisible
to /proc/PID/smaps, only visible via MemFree shrinking and nvidia-smi's
per-process column). A second model cannot be co-loaded.

Root cause

cuda_model_prefetch_range() was fully implemented and correct, but its only
call site on Linux/CUDA was behind the DS4_CUDA_COPY_MODEL_CHUNKED env var
(see the sketch after the list below). On a normal GB10 run:

  1. DS4_CUDA_COPY_MODEL_CHUNKED is not set
  2. cudaHostRegister fails with cudaErrorNotSupported
  3. The else branch clears the error and returns — prefetch never called
  4. g_model_hmm_direct stays 0 → per-range copy runs for every tensor
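A minimal sketch of the registration branch as it stood, reconstructed from the steps above (map_base, map_len, and the logging call are placeholders, not the real identifiers):

```c
/* Pre-fix shape of ds4_gpu_set_model_map()'s registration step (paraphrased). */
cudaError_t err = cudaHostRegister(map_base, map_len, cudaHostRegisterDefault);
if (err != cudaSuccess) {
    /* GB10 / NVLink-C2C: cudaErrorNotSupported lands here even though the GPU
     * already has coherent access to the whole mapping. */
    log_warn("cudaHostRegister: %s", cudaGetErrorString(err)); /* placeholder logger */
    cudaGetLastError();  /* clear the sticky error */
    return;              /* bug: g_model_hmm_direct stays 0, and
                          * cuda_model_prefetch_range() only ever runs when
                          * DS4_CUDA_COPY_MODEL_CHUNKED is set */
}
```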

Fix

Two hunks, 8 lines total:

Hunk 1 (ds4_gpu_set_model_map()): when registration is skipped, call
cuda_model_prefetch_range(), which issues cudaMemPrefetchAsync over the
full mmap region and sets g_model_hmm_direct = 1 on success.

Hunk 2 (cuda_model_range_is_cached()): short-circuit to 1 when
g_model_hmm_direct is set, so the per-range copy path is never reached
(the existing check around line ~200 in cuda_model_range_ptr never fires,
because is_cached runs first).
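Put together, the two hunks amount to something like the following (a sketch of the intent, not the literal diff; the is_cached() signature and range_lookup() are assumptions):

```c
/* Hunk 1, in ds4_gpu_set_model_map(): registration skipped -> prefetch instead. */
if (err == cudaErrorNotSupported) {
    cudaGetLastError();                            /* clear the sticky error */
    cuda_model_prefetch_range(map_base, map_len);  /* prefetches the whole mmap and
                                                    * sets g_model_hmm_direct = 1
                                                    * on success */
}

/* Hunk 2, in cuda_model_range_is_cached(): with HMM direct access every range
 * counts as cached, so cuda_model_range_ptr() never reaches the copy path. */
int cuda_model_range_is_cached(const void *ptr, size_t len) {
    if (g_model_hmm_direct) return 1;   /* short-circuit added by the patch */
    return range_lookup(ptr, len);      /* existing per-range lookup, unchanged
                                         * (placeholder name) */
}
```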

Results on DGX Spark GB10

Hardware: SM_121, 128 GB unified LPDDR5x, NVLink-C2C ~900 GB/s
Model: DeepSeek V4 Flash IQ2_XXS-w2Q2K (80.76 GiB)
CUDA 13.0, driver 580.142

| Metric | Before | After |
| --- | --- | --- |
| free -h used | ~98 GB | ~26 GB |
| free -h available | ~23 GB | ~95 GB |
| nvidia-smi process | 87,992 MiB | ~26,000 MiB |
| Weight memory type | CUDA device allocs (locked) | File page cache (reclaimable) |
| GPU utilization | ~50% (copy stalls) | 94% |
| Throughput | degraded | 13.81 t/s |
| Second model co-loadable | No | Yes |

Testing

Manually verified on GB10. Full make cuda-regression suite not yet run —
happy to do so if you can confirm there's no known issue running the suite on
SM_121 / CUDA 13.0.

When cudaHostRegister() returns cudaErrorNotSupported, the GPU already has
full HMM access to all host virtual addresses; no registration is needed.
The previous code left the else branch empty after clearing the error,
so g_model_hmm_direct was never set and cuda_model_range_is_cached()
kept returning 0 for every tensor range.

This caused cuda_model_range_ptr() to fall through to the per-range
cudaMalloc+cudaMemcpy path for every tensor on first access, silently
allocating ~87 GB of redundant device copies of the model weights on a
DGX Spark GB10 (128 GB unified LPDDR5x):

  Before: free -h shows ~98 GB used / ~23 GB available
          nvidia-smi process entry: 87,992 MiB

  After:  free -h shows ~26 GB used / ~95 GB available
          nvidia-smi process entry: ~26,000 MiB (KV cache only)

Fix: when registration is skipped, call cuda_model_prefetch_range(), which
was already fully implemented but had no call site on the normal Linux/CUDA
path. It issues cudaMemPrefetchAsync over the full mmap region (async, on a
side stream) and sets g_model_hmm_direct=1 on success.
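For reference, the prefetch helper boils down to roughly the following (a sketch assuming the classic device-index cudaMemPrefetchAsync signature from CUDA 12.x and earlier; newer toolkits also offer a cudaMemLocation-based overload, and the real helper's stream handling may differ — g_prefetch_stream is a placeholder):

```c
/* Sketch of what cuda_model_prefetch_range() does per the description above. */
static int cuda_model_prefetch_range(const void *base, size_t len) {
    int dev = 0;
    if (cudaGetDevice(&dev) != cudaSuccess) return 0;
    /* Ask the driver to populate/migrate the whole mmap'd region for the GPU,
     * asynchronously, so model load is not serialized behind the copy. */
    if (cudaMemPrefetchAsync(base, len, dev, g_prefetch_stream) != cudaSuccess) {
        cudaGetLastError();   /* clear the error and fall back to the copy path */
        return 0;
    }
    g_model_hmm_direct = 1;
    return 1;
}
```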

The companion hunk in cuda_model_range_is_cached() short-circuits to 1
when g_model_hmm_direct is set, so the per-range copy path is never reached.

Tested manually on DGX Spark GB10 (SM_121, 128 GB, CUDA 13.0, driver 580.142)
with DeepSeek V4 Flash IQ2_XXS-w2Q2K (80.76 GiB): 13.81 t/s, 94% GPU util.
@Mikehutu (Author)

Regression test results — GB10 (SM_121, CUDA 13.0, driver 580.142)

$ make cuda-spark && make cuda-regression
./tests/cuda_long_context_smoke
ds4: CUDA backend initialized on NVIDIA GB10 (sm_121)
cuda-regression: top-k n_comp=32768 n_tokens=32 elapsed=0.009s
cuda long-context regression: OK

Passes. First cold run took 2.13s (CUDA JIT compiling the top-k kernel to SASS for the first time); subsequent runs are ~0.009s once the compute cache is warm. Not related to this patch — the top-k kernel is untouched.

