cuda: add direct-model partial weight cache #153
Open
ddxxlao wants to merge 3 commits into
Conversation
Pull request overview
Adds an opt-in CUDA partial weight cache for direct-model operation, targeting GPUs that cannot fit the full GGUF weight image in VRAM.
Changes:
- Adds prioritized DS4 weight candidate collection and partial CUDA startup caching.
- Updates CUDA cache allocation/lookup paths, cache limits, and direct fallback behavior.
- Documents the new environment variables and adds CUDA smoke coverage.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| ds4.c | Adds partial-cache candidate selection and startup integration. |
| ds4_cuda.cu | Implements forced device range caching, budget handling, and partial/direct fallback logic. |
| tests/cuda_long_context_smoke.c | Adds CUDA smoke tests for partial direct-model cache behavior. |
| README.md | Documents partial CUDA weight-cache usage and controls. |
Comments suppressed due to low confidence (2)
ds4_cuda.cu:627
- This has the same partial-mode inconsistency as the FP16 Q8 cache: partial startup is enabled by `DS4_CUDA_PARTIAL_WEIGHT_CACHE` alone, but this check only blocks FP32 expansion for uncached weights when `DS4_CUDA_DIRECT_MODEL` is also set. In partial-only or HMM-direct configurations, runtime FP32 Q8 expansions can still allocate VRAM outside the partial weight-cache selection.

```c
if (cuda_partial_weight_cache_enabled() && cuda_direct_model_enabled() &&
    !cuda_model_range_is_cached(model_map, offset, weight_bytes)) {
    return NULL;
}
```
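A minimal sketch of the direction this comment suggests (not a committed fix; it reuses only the helpers visible above): gate on the partial cache alone, so partial-only and HMM-direct configurations also skip FP32 expansion for uncached weights.

```c
/* Sketch of the suggested gating: drop the direct-model requirement so
 * the check covers every configuration that enables partial caching. */
if (cuda_partial_weight_cache_enabled() &&
    !cuda_model_range_is_cached(model_map, offset, weight_bytes)) {
    return NULL;
}
```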
ds4.c:1516
- Each candidate stores a formatted label here, but the partial-cache implementation never reads `c->label` (span labels are regenerated later from priority/span id). This adds per-candidate work and makes the candidate structure look more informative than it is; either use these labels in diagnostics or drop the unused field/population.

```c
snprintf(c->label,
         sizeof(c->label),
         "%s:%.*s",
         role ? role : "tensor",
         (int)t->name.len,
         t->name.ptr);
```
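One hedged way to act on this comment (a hypothetical diagnostic; the alternative is simply deleting the field and this `snprintf`):

```c
/* Hypothetical: surface the label in startup diagnostics so the
 * formatting work above pays for itself. */
fprintf(stderr,
        "partial-cache candidate %s: %llu bytes, priority %u\n",
        c->label,
        (unsigned long long)c->bytes,
        c->priority);
```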
Comment on lines +1223 to +1234
```c
if (model_map == g_model_host_base && (g_model_device_owned || g_model_registered)) {
    return cuda_model_ptr(model_map, offset);
}

const char *cached = cuda_model_range_lookup_cached(model_map, offset, bytes);
if (cached) return cached;

if (getenv("DS4_CUDA_NO_FD_CACHE") == NULL) {
    const char *fd_ptr = cuda_model_range_ptr_from_fd(model_map, offset, bytes, what, 0);
    if (fd_ptr) {
        cached = cuda_model_range_lookup_cached(model_map, offset, bytes);
        if (cached) return cached;
    }
}
if (!cuda_q8_f16_cache_allowed(label, in_dim, out_dim)) return NULL;
if (cuda_partial_weight_cache_enabled() && cuda_direct_model_enabled() &&
```
Comment on lines +174 to +189
```c
    ds4_gpu_tensor_write(x, 0, x_host, sizeof(x_host)) &&
    ds4_gpu_cache_model_range(host_model, model_size, 0, model_size, "test_partial_rms") &&
    ds4_gpu_rms_norm_weight_tensor(out, x, host_model, model_size, 0, 4, TEST_DS4_RMS_EPS) &&
    ds4_gpu_synchronize() &&
    ds4_gpu_tensor_read(out, 0, out_host, sizeof(out_host))) {
    rc = 0;
    const float scale = 1.0f / sqrtf(1.0f + TEST_DS4_RMS_EPS);
    for (uint32_t i = 0; i < 4; i++) {
        const float want = rms_weight[i] * scale;
        if (fabsf(out_host[i] - want) > 1.0e-3f) {
            fprintf(stderr,
                    "partial direct cache rms mismatch index=%u got=%f expected=%f\n",
                    i,
                    (double)out_host[i],
                    (double)want);
            rc = 1;
```
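A worked check of the expected value, assuming the test input `x_host` is all ones (its initialization is outside this excerpt): RMS norm computes y_i = w_i * x_i / sqrt(mean_j(x_j^2) + eps), and with every x_j = 1 the denominator reduces to sqrt(1 + eps). The expected output is therefore rms_weight[i] / sqrt(1 + eps), exactly the `want = rms_weight[i] * scale` computed above.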
Comment on lines +1435 to +1443
```c
typedef struct {
    const ds4_tensor *tensor;
    uint64_t off;
    uint64_t end;
    uint64_t bytes;
    uint32_t priority;
    uint32_t layer;
    uint32_t group;
    char label[128];
```
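A minimal sketch of how such candidates might be ordered before the budget fill (a hypothetical comparator; the typedef name `candidate` is assumed, since the excerpt truncates before the struct's name). The priority values used below (0, 5, 25, 30) suggest lower values are cached first:

```c
/* Hypothetical qsort comparator: lower priority value wins; within a
 * priority class, larger spans first so hot ranges land early. */
static int candidate_cmp(const void *pa, const void *pb) {
    const candidate *a = (const candidate *)pa;
    const candidate *b = (const candidate *)pb;
    if (a->priority != b->priority)
        return (a->priority < b->priority) ? -1 : 1;
    if (a->bytes != b->bytes)
        return (a->bytes > b->bytes) ? -1 : 1;
    return 0;
}

/* Usage: qsort(cands, n_cands, sizeof(candidate), candidate_cmp); */
```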
Comment on lines +1540 to +1560
```c
ADD_GLOBAL(w->output_hc_base, 0, "output_hc_base");
ADD_GLOBAL(w->output_hc_scale, 0, "output_hc_scale");
ADD_GLOBAL(w->output_norm, 0, "output_norm");
ADD_GLOBAL(w->output_hc_fn, 0, "output_hc_fn");
ADD_GLOBAL(w->output, 25, "output");
ADD_GLOBAL(w->token_embd, 30, "token_embd");

for (uint32_t il = 0; il < DS4_N_LAYER; il++) {
    const ds4_layer_weights *l = &w->layer[il];
    ADD_LAYER(l->hc_attn_scale, 5, "hc_attn_scale");
    ADD_LAYER(l->hc_attn_base, 5, "hc_attn_base");
    ADD_LAYER(l->attn_norm, 5, "attn_norm");
    ADD_LAYER(l->attn_q_a_norm, 5, "attn_q_a_norm");
    ADD_LAYER(l->attn_kv_a_norm, 5, "attn_kv_a_norm");
    ADD_LAYER(l->attn_sinks, 5, "attn_sinks");
    ADD_LAYER(l->attn_compressor_norm, 5, "attn_compressor_norm");
    ADD_LAYER(l->indexer_compressor_norm, 5, "indexer_compressor_norm");
    ADD_LAYER(l->hc_ffn_scale, 5, "hc_ffn_scale");
    ADD_LAYER(l->hc_ffn_base, 5, "hc_ffn_base");
    ADD_LAYER(l->ffn_norm, 5, "ffn_norm");
    ADD_LAYER(l->ffn_exp_probs_b, 5, "ffn_exp_probs_b");
```
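For orientation, a hedged sketch of what these collection macros might expand to (hypothetical; the real definitions live elsewhere in the diff, and `cands`, `n_cands`, and the loop index `il` are assumed names in scope):

```c
/* Hypothetical expansion sketch of the candidate-collection macros:
 * each call appends one prioritized candidate and skips NULL tensors. */
#define ADD_GLOBAL(t_, prio_, role_)              \
    do {                                          \
        if ((t_) != NULL) {                       \
            candidate *c = &cands[n_cands++];     \
            c->tensor = (t_);                     \
            c->priority = (prio_);                \
            c->layer = UINT32_MAX; /* global */   \
        }                                         \
    } while (0)

#define ADD_LAYER(t_, prio_, role_)               \
    do {                                          \
        if ((t_) != NULL) {                       \
            candidate *c = &cands[n_cands++];     \
            c->tensor = (t_);                     \
            c->priority = (prio_);                \
            c->layer = il; /* current layer */    \
        }                                         \
    } while (0)
```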
Force-pushed from d8521d5 to 93ebe25
Summary
This PR adds an opt-in CUDA direct-model partial weight cache for GPUs that
cannot keep the full GGUF weight image in VRAM.
When `DS4_CUDA_PARTIAL_WEIGHT_CACHE=1` is set, startup preloads a prioritized
subset of high-benefit DS4 weights into device memory while leaving uncached
weights on the existing direct-model path. Existing CUDA full-cache and
direct-model behavior is unchanged unless the new environment variable is set.
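For reference, a minimal sketch of the opt-in gate (the helper name `cuda_partial_weight_cache_enabled()` appears in the review excerpts earlier on this page; its body here is an assumption):

```c
#include <stdlib.h>
#include <string.h>

/* Sketch: partial caching stays off unless DS4_CUDA_PARTIAL_WEIGHT_CACHE=1
 * is present, leaving full-cache and direct-model behavior untouched. */
static int cuda_partial_weight_cache_enabled(void) {
    const char *v = getenv("DS4_CUDA_PARTIAL_WEIGHT_CACHE");
    return v != NULL && strcmp(v, "1") == 0;
}
```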
Motivation
On 24 GB RTX 4090-class GPUs, the full CUDA model cache can run out of VRAM
before startup completes. Direct-model mode avoids the startup failure, but
decode is much slower because every generated token repeatedly reads hot dense
weights through the direct path.
The partial weight cache is intended to keep the frequently reused dense path in
VRAM without requiring enough memory for the full model image.
Benchmark Results
Tested on an RTX 4090-class 24 GB GPU with CUDA direct-model mode.
Decode Speed

End-to-End, ctx=2048, gen=128

Analysis
The partial weight cache substantially improves decode/generation speed, from
about 1.66 tok/s in direct-model mode to 6.9-7.8 tok/s, or more than a 4x
speedup.
The end-to-end benchmark improves by about 1.4-1.5x, because wall-clock time
also includes prefill, startup cache preparation, and other work that benefits
less from cached weights.
The 8 GB cache mostly fits the high-frequency dense path: `attn_q_a`,
`attn_q_b`, `attn_kv`, `attn_output_a`, `attn_output_b`.
These weights are read repeatedly for every generated token and every layer, so
placing them in VRAM has a large effect on decode speed.
The 10 GB cache additionally includes `token_embd`.
This gives another modest decode improvement, but the marginal gain is smaller.
The routed expert weights are large, but each token only uses a small subset of
experts. The larger raw cache also consumes more VRAM, which can reduce room for
secondary q8-to-fp16 expansion caches and partially offset the benefit.
In short: the main win comes from caching dense hot weights. Going from 8 GB to
10 GB still helps, but the returns are already diminishing.
Testing

- `git diff --check upstream/main...HEAD`
- `make cuda-regression`

Notable local observation: one run hit an out-of-memory error after caching
about 16.02 GiB, while the working configuration cached 15.46 GiB in 486 ranges
from 1328 candidates before falling back to the direct path for the remaining
weights.