feat(dflash): add DeepSeek V4 Flash backend #353

Draft
howard0su wants to merge 22 commits into Luce-Org:main from howard0su:ds4
@howard0su

Implement full DS4 Flash model backend for AR-only decode:

  • deepseek4_internal.h: data structures (layer, weights, cache, config)
  • deepseek4_loader.cpp: GGUF loader with all DS4 metadata/tensor binding
  • deepseek4_graph.cpp: ggml compute graph (MLA attention, KV compression with ratio-4/ratio-128, indexer selective attention, MoE with sqrt(softplus) routing, hash routing, HC residual streams)
  • deepseek4_backend.cpp: ModelBackend subclass with hybrid hot/cold expert placement (DFLASH_DS4_HYBRID=1)
  • deepseek4_daemon.cpp: daemon entry point
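
As a rough illustration of the sqrt(softplus) routing mentioned above, here is a minimal CPU sketch. The helper name `route_sqrt_softplus`, the `RoutedExpert` struct, and the renormalization of the selected scores are assumptions made for illustration; they are not taken from deepseek4_graph.cpp.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Numerically stable softplus: log(1 + exp(x)).
inline float softplus(float x) {
    return x > 0.0f ? x + std::log1p(std::exp(-x)) : std::log1p(std::exp(x));
}

struct RoutedExpert {
    std::size_t id;      // expert index
    float       weight;  // routing weight after renormalization
};

// Hypothetical helper: score each expert with sqrt(softplus(logit)), keep the
// top_k highest scores, and renormalize them to sum to 1 (an assumption).
std::vector<RoutedExpert> route_sqrt_softplus(const std::vector<float>& logits,
                                              std::size_t top_k) {
    assert(top_k > 0 && top_k <= logits.size());

    std::vector<float> score(logits.size());
    for (std::size_t i = 0; i < logits.size(); ++i) {
        score[i] = std::sqrt(softplus(logits[i]));
    }

    std::vector<std::size_t> order(logits.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::partial_sort(order.begin(),
                      order.begin() + static_cast<std::ptrdiff_t>(top_k),
                      order.end(),
                      [&](std::size_t a, std::size_t b) { return score[a] > score[b]; });

    float sum = 0.0f;
    for (std::size_t k = 0; k < top_k; ++k) sum += score[order[k]];

    std::vector<RoutedExpert> out;
    out.reserve(top_k);
    for (std::size_t k = 0; k < top_k; ++k) {
        out.push_back({order[k], score[order[k]] / sum});
    }
    return out;
}
```

Because softplus is strictly positive, the scores never need clamping before the square root, and the sum used for renormalization is always non-zero.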

Integration:

  • Register 'deepseek4' arch in backend_factory.cpp
  • Add to CMakeLists.txt (include path + sources)

Tests:

  • test_deepseek4_unit.cpp: CPU-only unit tests with synthetic weights (compressor pooling, MoE routing, RMSNorm, grouped output shape, hash routing lookup)
  • deepseek4-vectors/: official API test vectors ported from ds4 project (greedy decode logprob fixtures for integration testing)

howard0su added 22 commits June 9, 2026 10:47
The DS4 Flash GGUF stores rope.scaling.original_context_length as u32
and compress_ratios as i32 array. Handle both type widths gracefully.
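
One way to tolerate differing integer types in metadata is to widen every supported type to a single signed 64-bit value at read time. The sketch below uses a hypothetical `MetaInt` variant as a stand-in for a GGUF metadata value; it does not use the real gguf API, and the exact set of accepted types is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <optional>
#include <type_traits>
#include <variant>
#include <vector>

// Hypothetical stand-in for an integer-valued GGUF metadata entry.
using MetaInt = std::variant<std::uint32_t, std::int32_t, std::uint64_t, std::int64_t>;

// Widen any supported integer type to int64_t.
// Returns nullopt if an unsigned 64-bit value does not fit.
std::optional<std::int64_t> meta_as_i64(const MetaInt& v) {
    return std::visit([](auto x) -> std::optional<std::int64_t> {
        using T = decltype(x);
        if constexpr (std::is_same_v<T, std::uint64_t>) {
            if (x > static_cast<std::uint64_t>(std::numeric_limits<std::int64_t>::max())) {
                return std::nullopt;
            }
        }
        return static_cast<std::int64_t>(x);
    }, v);
}

// Same idea for an array entry such as compress_ratios.
std::optional<std::vector<std::int64_t>> meta_array_as_i64(const std::vector<MetaInt>& arr) {
    std::vector<std::int64_t> out;
    out.reserve(arr.size());
    for (const MetaInt& v : arr) {
        const auto x = meta_as_i64(v);
        if (!x) return std::nullopt;
        out.push_back(*x);
    }
    return out;
}
```

Callers then work with a single integer type regardless of how the file encoded the value.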
The previous approach set dst->data directly but didn't associate the
tensor with its backend buffer, causing 'tensor buffer not set' assert.
Now uses ggml_backend_tensor_alloc (matching qwen35 loader pattern).
Also keeps token_embd on CPU for embedding lookup.

TargetLoadPlan.layer_end defaults to -1 (not 0), so check for < 0.
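
The sentinel check can be pictured with a small sketch. `TargetLoadPlanSketch` and `resolve_layer_end` are hypothetical names, and treating a negative `layer_end` as "load through the last layer" with an exclusive end index is an assumption about the real TargetLoadPlan.

```cpp
#include <cassert>

// Hypothetical mirror of the relevant TargetLoadPlan fields.
struct TargetLoadPlanSketch {
    int layer_begin = 0;
    int layer_end   = -1;  // negative means "not set"
};

// Resolve the (assumed exclusive) end layer. Testing for == 0 would miss the
// -1 default, so the unset case is detected with < 0.
int resolve_layer_end(const TargetLoadPlanSketch& plan, int n_layer) {
    assert(n_layer > 0);
    return plan.layer_end < 0 ? n_layer : plan.layer_end;
}
```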
When full model load fails (e.g., 81GB model on 24GB GPU), automatically
fall back to hybrid mode (experts on CPU, core on GPU).
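
The fallback order described here could look roughly like the following. The `Placement` enum and the two loader callbacks are hypothetical; the real backend presumably calls its own loaders rather than `std::function` objects.

```cpp
#include <cassert>
#include <functional>

enum class Placement { FullGpu, Hybrid, Failed };

// Try the full GPU load first; if it fails (for example, out of device
// memory), retry in hybrid mode with experts on CPU and the core on GPU.
// Both callbacks are hypothetical and return true on success.
Placement load_with_fallback(const std::function<bool()>& load_full,
                             const std::function<bool()>& load_hybrid) {
    assert(load_full && load_hybrid);
    if (load_full())   return Placement::FullGpu;
    if (load_hybrid()) return Placement::Hybrid;
    return Placement::Failed;
}
```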
…er shapes

- Output projection now correctly uses batched 3D matmul for grouped
  low-rank: reshape out_a [4096,8192] to [4096,1024,8], reshape q to
  [4096,8,n_tok], batched matmul → [1024,8,n_tok] → out_b [8192,4096]
- Attention placeholder: use reshaped q (correct shape [32768,n_tok])
  instead of broken kv×q matmul
- Disable compressed context block (shapes incompatible with placeholder)
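
To make the grouped low-rank shapes easier to follow, here is a plain CPU reference for a single token. It is a sketch with row-major storage and hypothetical names (`GroupedLowRank`, `project`), not the ggml graph code. With the dimensions quoted above it would use n_group = 8, d_in = 4096, rank = 1024 and d_out = 4096.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container for the two projection matrices.
struct GroupedLowRank {
    std::size_t n_group, d_in, rank, d_out;
    std::vector<float> a;  // [n_group][rank][d_in], one down-projection per group
    std::vector<float> b;  // [d_out][n_group * rank], shared up-projection
};

// q holds n_group slices of d_in values each. Every group is projected by its
// own slice of `a`; the concatenated result is then projected by `b`.
std::vector<float> project(const GroupedLowRank& w, const std::vector<float>& q) {
    assert(q.size() == w.n_group * w.d_in);
    assert(w.a.size() == w.n_group * w.rank * w.d_in);
    assert(w.b.size() == w.d_out * w.n_group * w.rank);

    const std::size_t mid_dim = w.n_group * w.rank;
    std::vector<float> mid(mid_dim, 0.0f);
    for (std::size_t g = 0; g < w.n_group; ++g) {
        for (std::size_t r = 0; r < w.rank; ++r) {
            float acc = 0.0f;
            for (std::size_t i = 0; i < w.d_in; ++i) {
                acc += w.a[(g * w.rank + r) * w.d_in + i] * q[g * w.d_in + i];
            }
            mid[g * w.rank + r] = acc;
        }
    }

    std::vector<float> out(w.d_out, 0.0f);
    for (std::size_t o = 0; o < w.d_out; ++o) {
        for (std::size_t j = 0; j < mid_dim; ++j) {
            out[o] += w.b[o * mid_dim + j] * mid[j];
        }
    }
    return out;
}
```

The outer loop over `g` plays the role of the batch dimension in the batched 3D matmul: each group only ever sees its own slice of `q`.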
HC build_hc_pre returns [n_embd] (1D) but the graph expects [n_embd, n_tokens].
Bypass HC entirely until proper multi-token HC state management is implemented.

The 3D matmul batch dimension (ne[2]) must match between weight and input.
Use permute to put n_out_group in ne[2] for both tensors so ggml can
broadcast correctly across the group dimension.

Ratio-4 layers use comp_width = 2*head_dim (1024) with 2*ratio state rows.
Ratio-128 layers use comp_width = head_dim (512).

Indexer uses n_indexer_head_dim (128) as output, not full multi-head width.
Pooling placeholder just takes first head_dim elements for now.

sum_rows operates on ne[0] (heads) producing [1, n_comp].
Don't transpose first or elements won't match reshape.

Without ggml_set_input, the graph allocator doesn't allocate
buffers for the position tensors, causing 'tensor buffer not set'
when we try to set their values before compute.

The I32 position tensors for RoPE in side-effect subgraphs (cpy to
external cache buffers) don't get their buffers allocated by gallocr.
Skip RoPE for now - output is placeholder anyway. Will fix properly
when implementing full compressor pooling logic.

Keep only meaningful error/info prints in the backend.

…ooling

- Implement proper tail RoPE: split last n_rot=64 dims, apply rope, concat
  back. Per-layer freq_base (compressed vs non-compressed layers) with YaRN
  scaling for compressed layers.
- Replace attention placeholder with full SWA dot-product attention: Q@KV^T
  scaled softmax over ring buffer, weighted sum, inverse tail RoPE on output.
- Implement per-dim softmax-weighted pooling for compressor state, replacing
  the first-row placeholder.
- Add I32 array bindings for multi-element position tensors.
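
The per-dim softmax-weighted pooling can be read as follows: for each output dimension, take a softmax over the state rows and use it to average that dimension's values. The sketch below assumes a separate score tensor with the same shape as the values; where the scores come from in the actual compressor is not stated here, and `softmax_pool` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// value and score are both [rows][width], row-major.
// out[d] = sum_r softmax_r(score[r][d]) * value[r][d]
std::vector<float> softmax_pool(const std::vector<float>& value,
                                const std::vector<float>& score,
                                std::size_t rows, std::size_t width) {
    assert(rows > 0 && width > 0);
    assert(value.size() == rows * width && score.size() == rows * width);

    std::vector<float> out(width, 0.0f);
    for (std::size_t d = 0; d < width; ++d) {
        // Subtract the per-dimension maximum for numerical stability.
        float m = score[d];
        for (std::size_t r = 1; r < rows; ++r) m = std::max(m, score[r * width + d]);

        float denom = 0.0f;
        float acc   = 0.0f;
        for (std::size_t r = 0; r < rows; ++r) {
            const float e = std::exp(score[r * width + d] - m);
            denom += e;
            acc   += e * value[r * width + d];
        }
        out[d] = acc / denom;
    }
    return out;
}
```

With all scores equal this reduces to a plain mean over rows, which is a convenient sanity check against the earlier first-row placeholder.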
Implement the full HC mechanism on CPU for the hybrid path:
- HC pre: RMSNorm → matmul with fn tensor → Sinkhorn normalization (20 iters
  on 4×4 combine matrix) → weighted sum of 4 residual streams
- HC post: update all 4 streams using post gates + combine matrix
- Output HC pre: sigmoid-weighted stream merge before final norm/logits
- Lazy-load HC weight tensors from GPU to CPU on first use (~65MB total)
- Restructure hybrid loop: separate attention and FFN into independent graphs
  with HC pre/post between them (eliminates incorrect residual additions)
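
For readers unfamiliar with Sinkhorn normalization, the sketch below alternates row and column normalization on a 4×4 matrix for 20 iterations, which drives a strictly positive matrix toward one whose rows and columns each sum to 1. How the PR makes the combine matrix positive and in which order it normalizes are not stated, so both are assumptions here, as are the names `Mat4` and `sinkhorn`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Mat4 = std::array<std::array<float, 4>, 4>;

// Alternate row and column normalization. Entries must be strictly positive
// (for example after an exp), which this sketch assumes and asserts.
Mat4 sinkhorn(Mat4 m, int iters = 20) {
    for (const auto& row : m) {
        for (float v : row) assert(v > 0.0f);
    }
    for (int it = 0; it < iters; ++it) {
        // Scale each row to sum to 1.
        for (std::size_t i = 0; i < 4; ++i) {
            float s = 0.0f;
            for (std::size_t j = 0; j < 4; ++j) s += m[i][j];
            for (std::size_t j = 0; j < 4; ++j) m[i][j] /= s;
        }
        // Scale each column to sum to 1.
        for (std::size_t j = 0; j < 4; ++j) {
            float s = 0.0f;
            for (std::size_t i = 0; i < 4; ++i) s += m[i][j];
            for (std::size_t i = 0; i < 4; ++i) m[i][j] /= s;
        }
    }
    return m;
}
```

A matrix this small costs only a few hundred floating-point operations per call, which is consistent with running this step on the CPU in the hybrid path.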
Previously only the last token's KV was written to the ring buffer during
prefill, causing decode to attend to a nearly empty cache. Now all tokens'
KV entries are written to their correct ring buffer positions.
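
The fix amounts to mapping every prefill token to its own ring slot instead of writing only the last one. A minimal sketch, assuming the slot is the absolute position modulo the window size (the names `KvRing` and `ring_write_all` are hypothetical):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct KvRing {
    std::size_t window;             // number of slots
    std::size_t width;              // floats per KV row
    std::vector<float> data;        // [window][width]
    std::vector<std::int64_t> pos;  // absolute position held by each slot, -1 if empty

    KvRing(std::size_t w, std::size_t d)
        : window(w), width(d), data(w * d, 0.0f), pos(w, -1) {}
};

// Write every token's KV row, not just the last one. kv is [n_tok][width] and
// holds the rows for absolute positions pos0, pos0 + 1, ...
void ring_write_all(KvRing& ring, std::int64_t pos0, const std::vector<float>& kv) {
    assert(pos0 >= 0 && ring.window > 0 && ring.width > 0);
    assert(kv.size() % ring.width == 0);
    const std::size_t n_tok = kv.size() / ring.width;
    for (std::size_t t = 0; t < n_tok; ++t) {
        const std::int64_t p    = pos0 + static_cast<std::int64_t>(t);
        const std::size_t  slot = static_cast<std::size_t>(p) % ring.window;
        std::copy_n(kv.data() + t * ring.width, ring.width,
                    ring.data.data() + slot * ring.width);
        ring.pos[slot] = p;
    }
}
```

When the prompt is longer than the window, later tokens overwrite earlier slots, so after prefill the ring holds exactly the most recent `window` positions.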
DS4's rope_tail_ext_inplace rotates consecutive pairs (i, i+1), which is
GGML_ROPE_TYPE_DEFAULT. NEOX mode (which pairs i with i + n_rot/2) was
incorrect and caused completely wrong position encodings.
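
The pairing convention is easy to get wrong, so here is a small reference for a consecutive-pair rotation applied only to the last n_rot dimensions of a vector. It is a sketch of the general technique, not a port of rope_tail_ext_inplace: it omits YaRN scaling, and the function name and signature are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Rotate only the last n_rot dims of x in place. Pairs are consecutive
// elements (i, i+1); the leading x.size() - n_rot dims are left untouched.
void rope_tail_pairs(std::vector<float>& x, std::size_t n_rot, float pos, float freq_base) {
    assert(n_rot % 2 == 0 && n_rot <= x.size());
    const std::size_t off = x.size() - n_rot;
    for (std::size_t i = 0; i < n_rot; i += 2) {
        // Standard RoPE frequency for pair index i / 2.
        const float theta = pos * std::pow(freq_base,
                                           -static_cast<float>(i) / static_cast<float>(n_rot));
        const float c = std::cos(theta);
        const float s = std::sin(theta);
        const float a = x[off + i];
        const float b = x[off + i + 1];
        x[off + i]     = a * c - b * s;
        x[off + i + 1] = a * s + b * c;
    }
}
```

Calling the same function with `-pos` undoes the rotation, which is one way to picture the inverse tail RoPE applied to the attention output.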