
cuda: fix HMM path on coherent unified-memory systems (GB10 NVLink-C2C) #158

Open

Mikehutu wants to merge 1 commit into antirez:main from Mikehutu:hmm-nvlink-c2c-fix

Conversation

@Mikehutu

Problem

On systems where cudaHostRegister() returns cudaErrorNotSupported (e.g.
NVIDIA Grace-Blackwell GB10 / DGX Spark with NVLink-C2C unified memory), the
else branch in ds4_gpu_set_model_map() only logged the error and returned.
g_model_hmm_direct was never set, so cuda_model_range_is_cached() returned
0 for every tensor range.

This caused cuda_model_range_ptr() to fall through to the per-range
cudaMalloc + cudaMemcpy path on every first tensor access — silently
allocating ~87 GB of redundant device copies of the 80.76 GiB model:

# Without fix, on GB10 running ds4_test:
free -h:       Mem: 121Gi used 98Gi, available 23Gi
nvidia-smi:    ds4_test   87,992 MiB

The model weights end up in CUDA device-private pools (not evictable, invisible
to /proc/PID/smaps, only visible via MemFree shrinking and nvidia-smi's
per-process column). A second model cannot be co-loaded.

Root cause

cuda_model_prefetch_range() was fully implemented and correct, but its only
call site on Linux/CUDA was behind the DS4_CUDA_COPY_MODEL_CHUNKED env var
(see the sketch after the list below). On a normal GB10 run:

  1. DS4_CUDA_COPY_MODEL_CHUNKED is not set
  2. cudaHostRegister fails with cudaErrorNotSupported
  3. The else branch clears the error and returns — prefetch never called
  4. g_model_hmm_direct stays 0 → per-range copy runs for every tensor
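A minimal sketch of the registration branch as it stood, reconstructed from the steps above (map_base, map_len, and the logging call are placeholders, not the real identifiers):

```c
/* Pre-fix shape of ds4_gpu_set_model_map()'s registration step (paraphrased). */
cudaError_t err = cudaHostRegister(map_base, map_len, cudaHostRegisterDefault);
if (err != cudaSuccess) {
    /* GB10 / NVLink-C2C: cudaErrorNotSupported lands here even though the GPU
     * already has coherent access to the whole mapping. */
    log_warn("cudaHostRegister: %s", cudaGetErrorString(err)); /* placeholder logger */
    cudaGetLastError();  /* clear the sticky error */
    return;              /* bug: g_model_hmm_direct stays 0, and
                          * cuda_model_prefetch_range() only ever runs when
                          * DS4_CUDA_COPY_MODEL_CHUNKED is set */
}
```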

Fix

Two hunks, 8 lines total:

Hunk 1 (ds4_gpu_set_model_map()): when registration is skipped, call
cuda_model_prefetch_range(), which issues cudaMemPrefetchAsync over the
full mmap region and sets g_model_hmm_direct = 1 on success.

Hunk 2 (cuda_model_range_is_cached()): short-circuit to 1 when
g_model_hmm_direct is set, so the per-range copy path is never reached
(the existing check around line ~200 in cuda_model_range_ptr never fires,
because is_cached runs first).
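Put together, the two hunks amount to something like the following (a sketch of the intent, not the literal diff; the is_cached() signature and range_lookup() are assumptions):

```c
/* Hunk 1, in ds4_gpu_set_model_map(): registration skipped -> prefetch instead. */
if (err == cudaErrorNotSupported) {
    cudaGetLastError();                            /* clear the sticky error */
    cuda_model_prefetch_range(map_base, map_len);  /* prefetches the whole mmap and
                                                    * sets g_model_hmm_direct = 1
                                                    * on success */
}

/* Hunk 2, in cuda_model_range_is_cached(): with HMM direct access every range
 * counts as cached, so cuda_model_range_ptr() never reaches the copy path. */
int cuda_model_range_is_cached(const void *ptr, size_t len) {
    if (g_model_hmm_direct) return 1;   /* short-circuit added by the patch */
    return range_lookup(ptr, len);      /* existing per-range lookup, unchanged
                                         * (placeholder name) */
}
```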

Results on DGX Spark GB10

Hardware: SM_121, 128 GB unified LPDDR5x, NVLink-C2C ~900 GB/s
Model: DeepSeek V4 Flash IQ2_XXS-w2Q2K (80.76 GiB)
CUDA 13.0, driver 580.142

| Metric | Before | After |
| --- | --- | --- |
| free -h used | ~98 GB | ~26 GB |
| free -h available | ~23 GB | ~95 GB |
| nvidia-smi process | 87,992 MiB | ~26,000 MiB |
| Weight memory type | CUDA device allocs (locked) | File page cache (reclaimable) |
| GPU utilization | ~50% (copy stalls) | 94% |
| Throughput | degraded | 13.81 t/s |
| Second model co-loadable | No | Yes |

Testing

Manually verified on GB10. Full make cuda-regression suite not yet run —
happy to do so if you can confirm there's no known issue running the suite on
SM_121 / CUDA 13.0.

When cudaHostRegister() returns cudaErrorNotSupported, the GPU already has
full HMM access to all host virtual addresses; no registration is needed.
The previous code left the else branch empty after clearing the error,
so g_model_hmm_direct was never set and cuda_model_range_is_cached()
kept returning 0 for every tensor range.

This caused cuda_model_range_ptr() to fall through to the per-range
cudaMalloc+cudaMemcpy path for every tensor on first access, silently
allocating ~87 GB of redundant device copies of the model weights on a
DGX Spark GB10 (128 GB unified LPDDR5x):

  Before: free -h shows ~98 GB used / ~23 GB available
          nvidia-smi process entry: 87,992 MiB

  After:  free -h shows ~26 GB used / ~95 GB available
          nvidia-smi process entry: ~26,000 MiB (KV cache only)

Fix: when registration is skipped, call cuda_model_prefetch_range(), which
was already fully implemented but had no call site on the normal Linux/CUDA
path. It issues cudaMemPrefetchAsync over the full mmap region (async, on a
side stream) and sets g_model_hmm_direct=1 on success.
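For reference, the prefetch helper boils down to roughly the following (a sketch assuming the classic device-index cudaMemPrefetchAsync signature from CUDA 12.x and earlier; newer toolkits also offer a cudaMemLocation-based overload, and the real helper's stream handling may differ — g_prefetch_stream is a placeholder):

```c
/* Sketch of what cuda_model_prefetch_range() does per the description above. */
static int cuda_model_prefetch_range(const void *base, size_t len) {
    int dev = 0;
    if (cudaGetDevice(&dev) != cudaSuccess) return 0;
    /* Ask the driver to populate/migrate the whole mmap'd region for the GPU,
     * asynchronously, so model load is not serialized behind the copy. */
    if (cudaMemPrefetchAsync(base, len, dev, g_prefetch_stream) != cudaSuccess) {
        cudaGetLastError();   /* clear the error and fall back to the copy path */
        return 0;
    }
    g_model_hmm_direct = 1;
    return 1;
}
```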

The companion hunk in cuda_model_range_is_cached() short-circuits to 1
when g_model_hmm_direct is set, so the per-range copy path is never reached.

Tested manually on DGX Spark GB10 (SM_121, 128 GB, CUDA 13.0, driver 580.142)
with DeepSeek V4 Flash IQ2_XXS-w2Q2K (80.76 GiB): 13.81 t/s, 94% GPU util.
@Mikehutu (Author)

Regression test results — GB10 (SM_121, CUDA 13.0, driver 580.142)

$ make cuda-spark && make cuda-regression
./tests/cuda_long_context_smoke
ds4: CUDA backend initialized on NVIDIA GB10 (sm_121)
cuda-regression: top-k n_comp=32768 n_tokens=32 elapsed=0.009s
cuda long-context regression: OK

Passes. First cold run took 2.13s (CUDA JIT compiling the top-k kernel to SASS for the first time); subsequent runs are ~0.009s once the compute cache is warm. Not related to this patch — the top-k kernel is untouched.

