cuda: add direct-model partial weight cache #153

Open

ddxxlao wants to merge 3 commits into antirez:main from ddxxlao:codex/cuda-partial-weight-cache

Conversation

ddxxlao commented on May 15, 2026

Summary

This PR adds an opt-in CUDA direct-model partial weight cache for GPUs that
cannot keep the full GGUF weight image in VRAM.

When DS4_CUDA_PARTIAL_WEIGHT_CACHE=1 is set, startup preloads a prioritized
subset of high-benefit DS4 weights into device memory, while uncached weights
stay on the existing direct-model path. Existing CUDA full-cache and
direct-model behavior is unchanged unless the new environment variable is set.
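
For concreteness, a minimal sketch of the opt-in shape, assuming the helper
name from the diff excerpts later in this thread (the body here is an
illustration, not the PR's code):

```c
#include <stdlib.h>

/* Assumed gate: partial weight caching activates only when the environment
 * variable DS4_CUDA_PARTIAL_WEIGHT_CACHE is set to a truthy value. */
static int cuda_partial_weight_cache_enabled(void) {
    const char *v = getenv("DS4_CUDA_PARTIAL_WEIGHT_CACHE");
    return v != NULL && v[0] != '\0' && v[0] != '0';
}
```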

Motivation

On 24 GB RTX 4090-class GPUs, the full CUDA model cache can run out of VRAM
before startup completes. Direct-model mode avoids the startup failure, but
decode is much slower because every generated token repeatedly reads hot dense
weights through the direct path.

The partial weight cache is intended to keep the frequently reused dense path in
VRAM without requiring enough memory for the full model image.

Benchmark Results

Tested on an RTX 4090-class 24 GB GPU with CUDA direct-model mode.

Decode Speed

| Mode | Cached weights | Decode speed | vs direct | vs 8 GB |
|---|---|---|---|---|
| Direct baseline | 0 | 1.66 tok/s | 1.00x | - |
| Partial cache, 8 GB | 7.21 GiB | 6.93 tok/s | 4.17x | 1.00x |
| Partial cache, 10 GB | 9.88 GiB | 7.77 tok/s | 4.68x | 1.12x |
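
(The "vs" columns are plain ratios: e.g. 6.93 tok/s ÷ 1.66 tok/s ≈ 4.17x for
the 8 GB row, and 7.77 ÷ 6.93 ≈ 1.12x against the 8 GB cache.)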

End-to-End, ctx=2048, gen=128

| Mode | Wall time | vs direct |
|---|---|---|
| Direct baseline | 227.46 s | 1.00x |
| Partial cache, 8 GB | 161.27 s | 1.41x |
| Partial cache, 10 GB | 152.32 s | 1.49x |

Analysis

The partial weight cache substantially improves decode/generation speed, from
about 1.66 tok/s in direct-model mode to 6.9-7.8 tok/s, or more than a 4x
speedup.

The end-to-end benchmark improves by about 1.4-1.5x, because wall-clock time
also includes prefill, startup cache preparation, and other work that benefits
less from cached weights.

The 8 GB cache mostly fits the high-frequency dense path:

  • attention weights: attn_q_a, attn_q_b, attn_kv, attn_output_a,
    attn_output_b
  • compressor and indexer weights
  • shared FFN weights
  • output head

These weights are read repeatedly for every generated token and every layer, so
placing them in VRAM has a large effect on decode speed.
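
To see why these weights dominate decode time, consider the per-weight read
path. A minimal sketch, assuming the two-level lookup implied by the diff
excerpts later in this thread (the lookup prototype is reconstructed from a
call site in the diff; the fallback body is an assumption):

```c
#include <stdint.h>

/* Reconstructed from a call site in the PR's diff: returns a device pointer
 * when the requested byte range was preloaded at startup, NULL otherwise. */
const char *cuda_model_range_lookup_cached(const char *model_map,
                                           uint64_t offset, uint64_t bytes);

/* Assumed resolution order: cached dense weights hit VRAM on every token of
 * every layer; anything uncached falls back to the direct-model map. */
static const char *weight_ptr(const char *model_map, uint64_t offset, uint64_t bytes) {
    const char *cached = cuda_model_range_lookup_cached(model_map, offset, bytes);
    if (cached) return cached;   /* VRAM-resident: the ~4x decode win */
    return model_map + offset;   /* uncached: slow direct-model read */
}
```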

The 10 GB cache additionally includes:

  • token_embd
  • a small amount of routed expert weights

This gives another modest decode improvement, but the marginal gain is smaller.
The routed expert weights are large, but each token only uses a small subset of
experts. The larger raw cache also consumes more VRAM, which can reduce room for
secondary q8-to-fp16 expansion caches and partially offset the benefit.
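
For rough scale (back-of-envelope from the numbers above, not a measured
breakdown): on a 24 GB card, the 9.88 GiB raw cache of the 10 GB configuration
leaves roughly 14 GiB for the KV cache, activations, expansion buffers, and
driver overhead, so growing the raw cache further trades directly against
those secondary allocations.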

In short: the main win comes from caching dense hot weights. Going from 8 GB to
10 GB still helps, but the returns are already diminishing.
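
A minimal sketch of the selection this implies, assuming a greedy fill that
walks candidates in priority order and skips spans that would exceed the byte
budget (the reduced struct below mirrors fields from the ds4.c excerpt quoted
later in the thread; the loop itself is an assumption):

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t bytes;     /* span size in the model image */
    uint32_t priority;  /* lower value = hotter class, cached first */
} cand_t;               /* hypothetical reduced view of the PR's candidate */

/* Returns the bytes budgeted; cands assumed already sorted by priority. */
static uint64_t greedy_fill(const cand_t *cands, size_t n, uint64_t budget) {
    uint64_t used = 0;
    for (size_t i = 0; i < n; i++) {
        if (used + cands[i].bytes > budget) continue;  /* does not fit: skip */
        used += cands[i].bytes;                        /* real code uploads here */
    }
    return used;
}
```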

Testing

  • git diff --check upstream/main...HEAD
  • make cuda-regression

Notable local observation:

  • Default CUDA full-cache startup on the same RTX 4090-class machine failed
    after caching about 16.02 GiB with an out-of-memory error.
  • With partial direct-model caching enabled, startup prepared about 15.46 GiB
    in 486 ranges from 1328 candidates before direct fallback (one plausible
    mechanism for the range count is sketched below).
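
A plausible reading of "486 ranges from 1328 candidates" is that candidate
spans surviving the budget are merged into contiguous device ranges when they
touch or overlap, so neighboring tensors upload as one range. A minimal merge
pass under that assumption; nothing here is quoted from the PR:

```c
#include <stddef.h>
#include <stdint.h>

typedef struct { uint64_t off, end; } span_t;   /* hypothetical [off, end) span */

/* Coalesce spans sorted by off in place; returns the merged range count. */
static size_t coalesce_spans(span_t *s, size_t n) {
    if (n == 0) return 0;
    size_t m = 0;                                 /* index of last merged range */
    for (size_t i = 1; i < n; i++) {
        if (s[i].off <= s[m].end) {               /* touches or overlaps: extend */
            if (s[i].end > s[m].end) s[m].end = s[i].end;
        } else {
            s[++m] = s[i];                        /* gap: start a new range */
        }
    }
    return m + 1;
}
```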

ddxxlao marked this pull request as ready for review on May 15, 2026 02:39
Copilot AI review requested due to automatic review settings on May 15, 2026 02:39

Copilot AI left a comment

Pull request overview

Adds an opt-in CUDA partial weight cache for direct-model operation, targeting GPUs that cannot fit the full GGUF weight image in VRAM.

Changes:

  • Adds prioritized DS4 weight candidate collection and partial CUDA startup caching.
  • Updates CUDA cache allocation/lookup paths, cache limits, and direct fallback behavior.
  • Documents the new environment variables and adds CUDA smoke coverage.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

| File | Description |
|---|---|
| ds4.c | Adds partial-cache candidate selection and startup integration. |
| ds4_cuda.cu | Implements forced device range caching, budget handling, and partial/direct fallback logic. |
| tests/cuda_long_context_smoke.c | Adds CUDA smoke tests for partial direct-model cache behavior. |
| README.md | Documents partial CUDA weight-cache usage and controls. |

Comments suppressed due to low confidence (2)

ds4_cuda.cu:627

  • This has the same partial-mode inconsistency as the FP16 Q8 cache: partial
    startup is enabled by DS4_CUDA_PARTIAL_WEIGHT_CACHE alone, but this check
    only blocks FP32 expansion for uncached weights when DS4_CUDA_DIRECT_MODEL
    is also set. In partial-only or HMM-direct configurations, runtime FP32 Q8
    expansions can still allocate VRAM outside the partial weight-cache
    selection.

```c
if (cuda_partial_weight_cache_enabled() && cuda_direct_model_enabled() &&
    !cuda_model_range_is_cached(model_map, offset, weight_bytes)) {
    return NULL;
}
```
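
One way to resolve the inconsistency the comment describes would be to gate on
partial mode alone; a sketch of that direction (my reading of the suggested
semantics, not a change the PR author has made):

```c
/* Block FP32 Q8 expansion for any weight outside the partial selection,
 * regardless of whether DS4_CUDA_DIRECT_MODEL is also set. */
if (cuda_partial_weight_cache_enabled() &&
    !cuda_model_range_is_cached(model_map, offset, weight_bytes)) {
    return NULL;
}
```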

ds4.c:1516

  • Each candidate stores a formatted label here, but the partial-cache
    implementation never reads c->label (span labels are regenerated later from
    priority/span id). This adds per-candidate work and makes the candidate
    structure look more informative than it is; either use these labels in
    diagnostics or drop the unused field/population.

```c
snprintf(c->label,
         sizeof(c->label),
         "%s:%.*s",
         role ? role : "tensor",
         (int)t->name.len,
         t->name.ptr);
```
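
If the labels are kept, the cheapest way to make them earn their cost is to
surface them in startup diagnostics; a sketch using only fields visible in the
candidate struct quoted below (the log format is invented for illustration):

```c
fprintf(stderr, "partial-cache: selected %s (%llu bytes, priority %u)\n",
        c->label, (unsigned long long)c->bytes, (unsigned)c->priority);
```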

Comment threads on ds4_cuda.cu (outdated), around lines +1223 to +1234:

```c
if (model_map == g_model_host_base && (g_model_device_owned || g_model_registered)) {
    return cuda_model_ptr(model_map, offset);            /* full image resident */
}

const char *cached = cuda_model_range_lookup_cached(model_map, offset, bytes);
if (cached) return cached;                               /* partial-cache hit */

if (getenv("DS4_CUDA_NO_FD_CACHE") == NULL) {
    const char *fd_ptr = cuda_model_range_ptr_from_fd(model_map, offset, bytes, what, 0);
    if (fd_ptr) {
        cached = cuda_model_range_lookup_cached(model_map, offset, bytes);
        if (cached) return cached;                       /* fd-staged range now cached */
    }
}
if (!cuda_q8_f16_cache_allowed(label, in_dim, out_dim)) return NULL;
if (cuda_partial_weight_cache_enabled() && cuda_direct_model_enabled() &&
```
Comment thread on tests/cuda_long_context_smoke.c (outdated), lines +174 to +189:

```c
    ds4_gpu_tensor_write(x, 0, x_host, sizeof(x_host)) &&
    ds4_gpu_cache_model_range(host_model, model_size, 0, model_size, "test_partial_rms") &&
    ds4_gpu_rms_norm_weight_tensor(out, x, host_model, model_size, 0, 4, TEST_DS4_RMS_EPS) &&
    ds4_gpu_synchronize() &&
    ds4_gpu_tensor_read(out, 0, out_host, sizeof(out_host))) {
    rc = 0;
    const float scale = 1.0f / sqrtf(1.0f + TEST_DS4_RMS_EPS);
    for (uint32_t i = 0; i < 4; i++) {
        const float want = rms_weight[i] * scale;
        if (fabsf(out_host[i] - want) > 1.0e-3f) {
            fprintf(stderr,
                    "partial direct cache rms mismatch index=%u got=%f expected=%f\n",
                    i,
                    (double)out_host[i],
                    (double)want);
            rc = 1;
```
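
On the expected value: assuming x_host is an all-ones vector of length 4 (its
initialization sits outside the quoted hunk), the RMS factor is
sqrt(mean(x²) + eps) = sqrt(1 + TEST_DS4_RMS_EPS), so each output element
should be rms_weight[i] / sqrt(1 + TEST_DS4_RMS_EPS), which is exactly the
scale the test precomputes.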
Comment thread on ds4.c (outdated), lines +1435 to +1443:

```c
typedef struct {
    const ds4_tensor *tensor;   /* source tensor descriptor */
    uint64_t off;               /* span start offset in the model image */
    uint64_t end;               /* span end offset */
    uint64_t bytes;             /* span size in bytes */
    uint32_t priority;          /* selection priority class (see ADD_* excerpt below) */
    uint32_t layer;             /* owning layer index */
    uint32_t group;
    char label[128];            /* formatted diagnostic label (see comment above) */
```
Comment thread on ds4.c (outdated), lines +1540 to +1560:

```c
ADD_GLOBAL(w->output_hc_base, 0, "output_hc_base");
ADD_GLOBAL(w->output_hc_scale, 0, "output_hc_scale");
ADD_GLOBAL(w->output_norm, 0, "output_norm");
ADD_GLOBAL(w->output_hc_fn, 0, "output_hc_fn");
ADD_GLOBAL(w->output, 25, "output");
ADD_GLOBAL(w->token_embd, 30, "token_embd");

for (uint32_t il = 0; il < DS4_N_LAYER; il++) {
    const ds4_layer_weights *l = &w->layer[il];
    ADD_LAYER(l->hc_attn_scale, 5, "hc_attn_scale");
    ADD_LAYER(l->hc_attn_base, 5, "hc_attn_base");
    ADD_LAYER(l->attn_norm, 5, "attn_norm");
    ADD_LAYER(l->attn_q_a_norm, 5, "attn_q_a_norm");
    ADD_LAYER(l->attn_kv_a_norm, 5, "attn_kv_a_norm");
    ADD_LAYER(l->attn_sinks, 5, "attn_sinks");
    ADD_LAYER(l->attn_compressor_norm, 5, "attn_compressor_norm");
    ADD_LAYER(l->indexer_compressor_norm, 5, "indexer_compressor_norm");
    ADD_LAYER(l->hc_ffn_scale, 5, "hc_ffn_scale");
    ADD_LAYER(l->hc_ffn_base, 5, "hc_ffn_base");
    ADD_LAYER(l->ffn_norm, 5, "ffn_norm");
    ADD_LAYER(l->ffn_exp_probs_b, 5, "ffn_exp_probs_b");
```
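
The numeric middle argument reads as a priority class: small per-layer norms at
0-5, the output head at 25, token_embd at 30, matching which weights land in
the 8 GB versus 10 GB caches in the benchmarks above. A minimal comparator
sketch under the assumption that lower values are selected first (not quoted
from the PR):

```c
#include <stdint.h>

typedef struct { uint32_t priority, layer; } cand_key_t;  /* hypothetical view */

/* qsort-style comparator: hotter priority class first, then layer order. */
static int cand_cmp(const void *pa, const void *pb) {
    const cand_key_t *a = pa, *b = pb;
    if (a->priority != b->priority) return a->priority < b->priority ? -1 : 1;
    if (a->layer != b->layer) return a->layer < b->layer ? -1 : 1;
    return 0;
}
```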
ddxxlao force-pushed the codex/cuda-partial-weight-cache branch from d8521d5 to 93ebe25 on May 15, 2026 08:14