cuda: add direct-model partial weight cache #153

Open

ddxxlao wants to merge 3 commits into antirez:main from ddxxlao:codex/cuda-partial-weight-cache

Conversation

ddxxlao commented on May 15, 2026

Summary

This PR adds an opt-in CUDA direct-model partial weight cache for GPUs that
cannot keep the full GGUF weight image in VRAM.

When DS4_CUDA_PARTIAL_WEIGHT_CACHE=1 is set, startup preloads a prioritized
subset of high-benefit DS4 weights into device memory, while uncached weights
stay on the existing direct-model path. Existing CUDA full-cache and
direct-model behavior is unchanged unless the new environment variable is set.
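
For concreteness, a minimal sketch of the opt-in shape, assuming the helper
name from the diff excerpts later in this thread (the body here is an
illustration, not the PR's code):

```c
#include <stdlib.h>

/* Assumed gate: partial weight caching activates only when the environment
 * variable DS4_CUDA_PARTIAL_WEIGHT_CACHE is set to a truthy value. */
static int cuda_partial_weight_cache_enabled(void) {
    const char *v = getenv("DS4_CUDA_PARTIAL_WEIGHT_CACHE");
    return v != NULL && v[0] != '\0' && v[0] != '0';
}
```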

Motivation

On 24 GB RTX 4090-class GPUs, the full CUDA model cache can run out of VRAM
before startup completes. Direct-model mode avoids the startup failure, but
decode is much slower because every generated token repeatedly reads hot dense
weights through the direct path.

The partial weight cache is intended to keep the frequently reused dense path in
VRAM without requiring enough memory for the full model image.

Benchmark Results

Tested on an RTX 4090-class 24 GB GPU with CUDA direct-model mode.

Decode Speed

| Mode | Cached weights | Decode speed | vs direct | vs 8 GB |
|---|---|---|---|---|
| Direct baseline | 0 | 1.66 tok/s | 1.00x | - |
| Partial cache, 8 GB | 7.21 GiB | 6.93 tok/s | 4.17x | 1.00x |
| Partial cache, 10 GB | 9.88 GiB | 7.77 tok/s | 4.68x | 1.12x |
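
(The "vs" columns are plain ratios: e.g. 6.93 tok/s ÷ 1.66 tok/s ≈ 4.17x for
the 8 GB row, and 7.77 ÷ 6.93 ≈ 1.12x against the 8 GB cache.)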

End-to-End, ctx=2048, gen=128

| Mode | Wall time | vs direct |
|---|---|---|
| Direct baseline | 227.46 s | 1.00x |
| Partial cache, 8 GB | 161.27 s | 1.41x |
| Partial cache, 10 GB | 152.32 s | 1.49x |

Analysis

The partial weight cache substantially improves decode/generation speed, from
about 1.66 tok/s in direct-model mode to 6.9-7.8 tok/s, or more than a 4x
speedup.

The end-to-end benchmark improves by about 1.4-1.5x, because wall-clock time
also includes prefill, startup cache preparation, and other work that benefits
less from cached weights.

The 8 GB cache mostly fits the high-frequency dense path:

  • attention weights: attn_q_a, attn_q_b, attn_kv, attn_output_a,
    attn_output_b
  • compressor and indexer weights
  • shared FFN weights
  • output head

These weights are read repeatedly for every generated token and every layer, so
placing them in VRAM has a large effect on decode speed.
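
To see why these weights dominate decode time, consider the per-weight read
path. A minimal sketch, assuming the two-level lookup implied by the diff
excerpts later in this thread (the lookup prototype is reconstructed from a
call site in the diff; the fallback body is an assumption):

```c
#include <stdint.h>

/* Reconstructed from a call site in the PR's diff: returns a device pointer
 * when the requested byte range was preloaded at startup, NULL otherwise. */
const char *cuda_model_range_lookup_cached(const char *model_map,
                                           uint64_t offset, uint64_t bytes);

/* Assumed resolution order: cached dense weights hit VRAM on every token of
 * every layer; anything uncached falls back to the direct-model map. */
static const char *weight_ptr(const char *model_map, uint64_t offset, uint64_t bytes) {
    const char *cached = cuda_model_range_lookup_cached(model_map, offset, bytes);
    if (cached) return cached;   /* VRAM-resident: the ~4x decode win */
    return model_map + offset;   /* uncached: slow direct-model read */
}
```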

The 10 GB cache additionally includes:

  • token_embd
  • a small amount of routed expert weights

This gives another modest decode improvement, but the marginal gain is smaller.
The routed expert weights are large, but each token only uses a small subset of
experts. The larger raw cache also consumes more VRAM, which can reduce room for
secondary q8-to-fp16 expansion caches and partially offset the benefit.
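
For rough scale (back-of-envelope from the numbers above, not a measured
breakdown): on a 24 GB card, the 9.88 GiB raw cache of the 10 GB configuration
leaves roughly 14 GiB for the KV cache, activations, expansion buffers, and
driver overhead, so growing the raw cache further trades directly against
those secondary allocations.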

In short: the main win comes from caching dense hot weights. Going from 8 GB to
10 GB still helps, but the returns are already diminishing.
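
A minimal sketch of the selection this implies, assuming a greedy fill that
walks candidates in priority order and skips spans that would exceed the byte
budget (the reduced struct below mirrors fields from the ds4.c excerpt quoted
later in the thread; the loop itself is an assumption):

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t bytes;     /* span size in the model image */
    uint32_t priority;  /* lower value = hotter class, cached first */
} cand_t;               /* hypothetical reduced view of the PR's candidate */

/* Returns the bytes budgeted; cands assumed already sorted by priority. */
static uint64_t greedy_fill(const cand_t *cands, size_t n, uint64_t budget) {
    uint64_t used = 0;
    for (size_t i = 0; i < n; i++) {
        if (used + cands[i].bytes > budget) continue;  /* does not fit: skip */
        used += cands[i].bytes;                        /* real code uploads here */
    }
    return used;
}
```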

Testing

  • git diff --check upstream/main...HEAD
  • make cuda-regression

Notable local observation:

  • Default CUDA full-cache startup on the same RTX 4090-class machine failed
    after caching about 16.02 GiB with an out-of-memory error.
  • With partial direct-model caching enabled, startup prepared about 15.46 GiB
    in 486 ranges from 1328 candidates before direct fallback (one plausible
    mechanism for the range count is sketched below).
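
A plausible reading of "486 ranges from 1328 candidates" is that candidate
spans surviving the budget are merged into contiguous device ranges when they
touch or overlap, so neighboring tensors upload as one range. A minimal merge
pass under that assumption; nothing here is quoted from the PR:

```c
#include <stddef.h>
#include <stdint.h>

typedef struct { uint64_t off, end; } span_t;   /* hypothetical [off, end) span */

/* Coalesce spans sorted by off in place; returns the merged range count. */
static size_t coalesce_spans(span_t *s, size_t n) {
    if (n == 0) return 0;
    size_t m = 0;                                 /* index of last merged range */
    for (size_t i = 1; i < n; i++) {
        if (s[i].off <= s[m].end) {               /* touches or overlaps: extend */
            if (s[i].end > s[m].end) s[m].end = s[i].end;
        } else {
            s[++m] = s[i];                        /* gap: start a new range */
        }
    }
    return m + 1;
}
```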

ddxxlao marked this pull request as ready for review on May 15, 2026 02:39
Copilot AI review requested due to automatic review settings on May 15, 2026 02:39

Copilot AI left a comment

Pull request overview

Adds an opt-in CUDA partial weight cache for direct-model operation, targeting GPUs that cannot fit the full GGUF weight image in VRAM.

Changes:

  • Adds prioritized DS4 weight candidate collection and partial CUDA startup caching.
  • Updates CUDA cache allocation/lookup paths, cache limits, and direct fallback behavior.
  • Documents the new environment variables and adds CUDA smoke coverage.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

| File | Description |
|---|---|
| ds4.c | Adds partial-cache candidate selection and startup integration. |
| ds4_cuda.cu | Implements forced device range caching, budget handling, and partial/direct fallback logic. |
| tests/cuda_long_context_smoke.c | Adds CUDA smoke tests for partial direct-model cache behavior. |
| README.md | Documents partial CUDA weight-cache usage and controls. |

Comments suppressed due to low confidence (2)

ds4_cuda.cu:627

  • This has the same partial-mode inconsistency as the FP16 Q8 cache: partial
    startup is enabled by DS4_CUDA_PARTIAL_WEIGHT_CACHE alone, but this check
    only blocks FP32 expansion for uncached weights when DS4_CUDA_DIRECT_MODEL
    is also set. In partial-only or HMM-direct configurations, runtime FP32 Q8
    expansions can still allocate VRAM outside the partial weight-cache
    selection.

```c
if (cuda_partial_weight_cache_enabled() && cuda_direct_model_enabled() &&
    !cuda_model_range_is_cached(model_map, offset, weight_bytes)) {
    return NULL;
}
```
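
One way to resolve the inconsistency the comment describes would be to gate on
partial mode alone; a sketch of that direction (my reading of the suggested
semantics, not a change the PR author has made):

```c
/* Block FP32 Q8 expansion for any weight outside the partial selection,
 * regardless of whether DS4_CUDA_DIRECT_MODEL is also set. */
if (cuda_partial_weight_cache_enabled() &&
    !cuda_model_range_is_cached(model_map, offset, weight_bytes)) {
    return NULL;
}
```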

ds4.c:1516

  • Each candidate stores a formatted label here, but the partial-cache
    implementation never reads c->label (span labels are regenerated later from
    priority/span id). This adds per-candidate work and makes the candidate
    structure look more informative than it is; either use these labels in
    diagnostics or drop the unused field/population.

```c
snprintf(c->label,
         sizeof(c->label),
         "%s:%.*s",
         role ? role : "tensor",
         (int)t->name.len,
         t->name.ptr);
```
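
If the labels are kept, the cheapest way to make them earn their cost is to
surface them in startup diagnostics; a sketch using only fields visible in the
candidate struct quoted below (the log format is invented for illustration):

```c
fprintf(stderr, "partial-cache: selected %s (%llu bytes, priority %u)\n",
        c->label, (unsigned long long)c->bytes, (unsigned)c->priority);
```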

Comment threads on ds4_cuda.cu (outdated), around lines +1223 to +1234:

```c
if (model_map == g_model_host_base && (g_model_device_owned || g_model_registered)) {
    return cuda_model_ptr(model_map, offset);            /* full image resident */
}

const char *cached = cuda_model_range_lookup_cached(model_map, offset, bytes);
if (cached) return cached;                               /* partial-cache hit */

if (getenv("DS4_CUDA_NO_FD_CACHE") == NULL) {
    const char *fd_ptr = cuda_model_range_ptr_from_fd(model_map, offset, bytes, what, 0);
    if (fd_ptr) {
        cached = cuda_model_range_lookup_cached(model_map, offset, bytes);
        if (cached) return cached;                       /* fd-staged range now cached */
    }
}
if (!cuda_q8_f16_cache_allowed(label, in_dim, out_dim)) return NULL;
if (cuda_partial_weight_cache_enabled() && cuda_direct_model_enabled() &&
```
Comment thread on tests/cuda_long_context_smoke.c (outdated), lines +174 to +189:

```c
    ds4_gpu_tensor_write(x, 0, x_host, sizeof(x_host)) &&
    ds4_gpu_cache_model_range(host_model, model_size, 0, model_size, "test_partial_rms") &&
    ds4_gpu_rms_norm_weight_tensor(out, x, host_model, model_size, 0, 4, TEST_DS4_RMS_EPS) &&
    ds4_gpu_synchronize() &&
    ds4_gpu_tensor_read(out, 0, out_host, sizeof(out_host))) {
    rc = 0;
    const float scale = 1.0f / sqrtf(1.0f + TEST_DS4_RMS_EPS);
    for (uint32_t i = 0; i < 4; i++) {
        const float want = rms_weight[i] * scale;
        if (fabsf(out_host[i] - want) > 1.0e-3f) {
            fprintf(stderr,
                    "partial direct cache rms mismatch index=%u got=%f expected=%f\n",
                    i,
                    (double)out_host[i],
                    (double)want);
            rc = 1;
```
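
On the expected value: assuming x_host is an all-ones vector of length 4 (its
initialization sits outside the quoted hunk), the RMS factor is
sqrt(mean(x²) + eps) = sqrt(1 + TEST_DS4_RMS_EPS), so each output element
should be rms_weight[i] / sqrt(1 + TEST_DS4_RMS_EPS), which is exactly the
scale the test precomputes.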
Comment thread on ds4.c (outdated), lines +1435 to +1443:

```c
typedef struct {
    const ds4_tensor *tensor;   /* source tensor descriptor */
    uint64_t off;               /* span start offset in the model image */
    uint64_t end;               /* span end offset */
    uint64_t bytes;             /* span size in bytes */
    uint32_t priority;          /* selection priority class (see ADD_* excerpt below) */
    uint32_t layer;             /* owning layer index */
    uint32_t group;
    char label[128];            /* formatted diagnostic label (see comment above) */
```
Comment thread on ds4.c (outdated), lines +1540 to +1560:

```c
ADD_GLOBAL(w->output_hc_base, 0, "output_hc_base");
ADD_GLOBAL(w->output_hc_scale, 0, "output_hc_scale");
ADD_GLOBAL(w->output_norm, 0, "output_norm");
ADD_GLOBAL(w->output_hc_fn, 0, "output_hc_fn");
ADD_GLOBAL(w->output, 25, "output");
ADD_GLOBAL(w->token_embd, 30, "token_embd");

for (uint32_t il = 0; il < DS4_N_LAYER; il++) {
    const ds4_layer_weights *l = &w->layer[il];
    ADD_LAYER(l->hc_attn_scale, 5, "hc_attn_scale");
    ADD_LAYER(l->hc_attn_base, 5, "hc_attn_base");
    ADD_LAYER(l->attn_norm, 5, "attn_norm");
    ADD_LAYER(l->attn_q_a_norm, 5, "attn_q_a_norm");
    ADD_LAYER(l->attn_kv_a_norm, 5, "attn_kv_a_norm");
    ADD_LAYER(l->attn_sinks, 5, "attn_sinks");
    ADD_LAYER(l->attn_compressor_norm, 5, "attn_compressor_norm");
    ADD_LAYER(l->indexer_compressor_norm, 5, "indexer_compressor_norm");
    ADD_LAYER(l->hc_ffn_scale, 5, "hc_ffn_scale");
    ADD_LAYER(l->hc_ffn_base, 5, "hc_ffn_base");
    ADD_LAYER(l->ffn_norm, 5, "ffn_norm");
    ADD_LAYER(l->ffn_exp_probs_b, 5, "ffn_exp_probs_b");
```
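
The numeric middle argument reads as a priority class: small per-layer norms at
0-5, the output head at 25, token_embd at 30, matching which weights land in
the 8 GB versus 10 GB caches in the benchmarks above. A minimal comparator
sketch under the assumption that lower values are selected first (not quoted
from the PR):

```c
#include <stdint.h>

typedef struct { uint32_t priority, layer; } cand_key_t;  /* hypothetical view */

/* qsort-style comparator: hotter priority class first, then layer order. */
static int cand_cmp(const void *pa, const void *pb) {
    const cand_key_t *a = pa, *b = pb;
    if (a->priority != b->priority) return a->priority < b->priority ? -1 : 1;
    if (a->layer != b->layer) return a->layer < b->layer ? -1 : 1;
    return 0;
}
```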
ddxxlao force-pushed the codex/cuda-partial-weight-cache branch from d8521d5 to 93ebe25 on May 15, 2026 08:14