feat(dflash): add DeepSeek V4 Flash backend #353
Draft
howard0su wants to merge 22 commits into
Implement full DS4 Flash model backend for AR-only decode:
- deepseek4_internal.h: data structures (layer, weights, cache, config)
- deepseek4_loader.cpp: GGUF loader with all DS4 metadata/tensor binding
- deepseek4_graph.cpp: ggml compute graph (MLA attention, KV compression with ratio-4/ratio-128, indexer selective attention, MoE with sqrt(softplus) routing, hash routing, HC residual streams)
- deepseek4_backend.cpp: ModelBackend subclass with hybrid hot/cold expert placement (DFLASH_DS4_HYBRID=1)
- deepseek4_daemon.cpp: daemon entry point

Integration:
- Register 'deepseek4' arch in backend_factory.cpp
- Add to CMakeLists.txt (include path + sources)

Tests:
- test_deepseek4_unit.cpp: CPU-only unit tests with synthetic weights (compressor pooling, MoE routing, RMSNorm, grouped output shape, hash routing lookup)
- deepseek4-vectors/: official API test vectors ported from ds4 project (greedy decode logprob fixtures for integration testing)
The DS4 Flash GGUF stores rope.scaling.original_context_length as a u32 and compress_ratios as an i32 array. Handle both type widths gracefully.
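One way to tolerate both widths is to widen every supported integer type to int64_t at the read site. The sketch below is illustrative only: `MetaValue`, `meta_as_i64`, and `meta_as_i64_array` are hypothetical names standing in for the loader's metadata access, not the real GGUF API.

```cpp
#include <cstdint>
#include <optional>
#include <variant>
#include <vector>

// Hypothetical stand-in for a decoded GGUF metadata value (not the gguf API).
using MetaValue = std::variant<uint32_t, int32_t,
                               std::vector<uint32_t>, std::vector<int32_t>>;

// Widen a scalar of either signedness to int64_t; arrays yield nullopt.
std::optional<int64_t> meta_as_i64(const MetaValue& v) {
    if (auto p = std::get_if<uint32_t>(&v)) return static_cast<int64_t>(*p);
    if (auto p = std::get_if<int32_t>(&v))  return static_cast<int64_t>(*p);
    return std::nullopt;
}

// Widen an array of either element type to int64_t; scalars yield nullopt.
std::optional<std::vector<int64_t>> meta_as_i64_array(const MetaValue& v) {
    std::vector<int64_t> out;
    if (auto p = std::get_if<std::vector<uint32_t>>(&v)) {
        out.assign(p->begin(), p->end());
        return out;
    }
    if (auto p = std::get_if<std::vector<int32_t>>(&v)) {
        out.assign(p->begin(), p->end());
        return out;
    }
    return std::nullopt;
}
```

Callers then work with a single integer type regardless of how the file encoded the key.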
The previous approach set dst->data directly but didn't associate the tensor with its backend buffer, causing 'tensor buffer not set' assert. Now uses ggml_backend_tensor_alloc (matching qwen35 loader pattern). Also keeps token_embd on CPU for embedding lookup.
TargetLoadPlan.layer_end defaults to -1 (not 0), so check for < 0.
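A minimal sketch of the sentinel check, assuming a negative `layer_end` means "through the last layer". `LoadPlanSketch` and `resolve_layer_end` are hypothetical names that only mirror the field mentioned above.

```cpp
// Hypothetical mirror of the relevant load-plan fields.
struct LoadPlanSketch {
    int layer_begin = 0;
    int layer_end   = -1;  // -1 is the "unset" default, so test for < 0, not == 0
};

// Resolve the exclusive end layer; an unset plan covers all n_layer layers
// (that interpretation is an assumption of this sketch).
int resolve_layer_end(const LoadPlanSketch& plan, int n_layer) {
    return plan.layer_end < 0 ? n_layer : plan.layer_end;
}
```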
When full model load fails (e.g., 81GB model on 24GB GPU), automatically fall back to hybrid mode (experts on CPU, core on GPU).
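The fallback order can be sketched as below. The two loader callbacks and the `Placement` enum are hypothetical placeholders for illustration; the actual backend may structure this differently.

```cpp
#include <functional>

enum class Placement { FullGpu, Hybrid, Failed };

// Try the full GPU load first; if it fails, retry in hybrid mode
// (experts on CPU, core on GPU). Both callbacks return true on success.
Placement load_with_fallback(const std::function<bool()>& load_full,
                             const std::function<bool()>& load_hybrid) {
    if (load_full())   return Placement::FullGpu;
    if (load_hybrid()) return Placement::Hybrid;
    return Placement::Failed;
}
```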
…er shapes
- Output projection now correctly uses batched 3D matmul for grouped low-rank: reshape out_a [4096,8192] to [4096,1024,8], reshape q to [4096,8,n_tok], batched matmul → [1024,8,n_tok] → out_b [8192,4096]
- Attention placeholder: use reshaped q (correct shape [32768,n_tok]) instead of broken kv×q matmul
- Disable compressed context block (shapes incompatible with placeholder)
HC build_hc_pre returns [n_embd] (1D) but the graph expects [n_embd, n_tokens]. Bypass HC entirely until proper multi-token HC state management is implemented.
The 3D matmul batch dimension (ne[2]) must match between weight and input. Use permute to put n_out_group in ne[2] for both tensors so ggml can broadcast correctly across the group dimension.
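A batched matmul over a group dimension is equivalent to one independent matmul per group. The plain C++ sketch below (no ggml) shows that equivalence for the first, grouped stage of a low-rank projection. The function name and the flat row-major layouts are assumptions chosen for illustration, not the layout ggml uses.

```cpp
#include <cstddef>
#include <vector>

// A:   [n_group][rank][d_in]   per-group weight
// q:   [n_tok][n_group][d_in]  per-token, per-group input
// out: [n_tok][n_group][rank]  per-group result, concatenated over groups
std::vector<float> grouped_lowrank_a(const std::vector<float>& A,
                                     const std::vector<float>& q,
                                     std::size_t n_group, std::size_t rank,
                                     std::size_t d_in, std::size_t n_tok) {
    std::vector<float> out(n_tok * n_group * rank, 0.0f);
    for (std::size_t t = 0; t < n_tok; ++t) {
        for (std::size_t g = 0; g < n_group; ++g) {
            const float* x = &q[(t * n_group + g) * d_in];
            for (std::size_t r = 0; r < rank; ++r) {
                const float* w = &A[(g * rank + r) * d_in];
                float acc = 0.0f;
                for (std::size_t k = 0; k < d_in; ++k) acc += w[k] * x[k];
                out[(t * n_group + g) * rank + r] = acc;
            }
        }
    }
    return out;
}
```

With the shapes quoted in the commits, this corresponds to n_group = 8, rank = 1024, and d_in = 4096; the group index must be the batch dimension of both operands for the per-group pairing to hold.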
Ratio-4 layers use comp_width = 2*head_dim (1024) with 2*ratio state rows. Ratio-128 layers use comp_width = head_dim (512). Indexer uses n_indexer_head_dim (128) as output, not full multi-head width. Pooling placeholder just takes first head_dim elements for now.
sum_rows operates on ne[0] (heads) producing [1, n_comp]. Don't transpose first or elements won't match reshape.
Without ggml_set_input, the graph allocator doesn't allocate buffers for the position tensors, causing 'tensor buffer not set' when we try to set their values before compute.
The I32 position tensors for RoPE in side-effect subgraphs (cpy to external cache buffers) don't get their buffers allocated by gallocr. Skip RoPE for now - output is placeholder anyway. Will fix properly when implementing full compressor pooling logic.
Keep only meaningful error/info prints in the backend.
…ooling
- Implement proper tail RoPE: split last n_rot=64 dims, apply rope, concat back. Per-layer freq_base (compressed vs non-compressed layers) with YaRN scaling for compressed layers.
- Replace attention placeholder with full SWA dot-product attention: Q@KV^T scaled softmax over ring buffer, weighted sum, inverse tail RoPE on output.
- Implement per-dim softmax-weighted pooling for compressor state, replacing the first-row placeholder.
- Add I32 array bindings for multi-element position tensors.
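The tail RoPE step can be illustrated on a single head vector: leave the leading dims untouched and rotate only the last n_rot. This is a simplified sketch that omits YaRN scaling and assumes the consecutive-pair layout; `rope_tail_inplace` here is a hypothetical helper, not the project's function.

```cpp
#include <cmath>
#include <vector>

// Rotate the last n_rot elements of x in consecutive pairs (i, i+1).
// inverse = true applies the opposite rotation, undoing a forward call.
// Assumes n_rot is even and n_rot <= x.size().
void rope_tail_inplace(std::vector<float>& x, int n_rot, int pos,
                       float freq_base, bool inverse = false) {
    const int start = static_cast<int>(x.size()) - n_rot;
    for (int i = 0; i < n_rot; i += 2) {
        // Standard RoPE angle for pair i/2: pos * base^(-i / n_rot).
        float theta = pos * std::pow(freq_base, -static_cast<float>(i) / n_rot);
        if (inverse) theta = -theta;
        const float c = std::cos(theta);
        const float s = std::sin(theta);
        const float a = x[start + i];
        const float b = x[start + i + 1];
        x[start + i]     = a * c - b * s;
        x[start + i + 1] = a * s + b * c;
    }
}
```

Applying the inverse rotation to the attention output is what the second bullet refers to: it removes the position-dependent phase from the rotated tail dims.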
Implement the full HC mechanism on CPU for the hybrid path:
- HC pre: RMSNorm → matmul with fn tensor → Sinkhorn normalization (20 iters on 4×4 combine matrix) → weighted sum of 4 residual streams
- HC post: update all 4 streams using post gates + combine matrix
- Output HC pre: sigmoid-weighted stream merge before final norm/logits
- Lazy-load HC weight tensors from GPU to CPU on first use (~65MB total)
- Restructure hybrid loop: separate attention and FFN into independent graphs with HC pre/post between them (eliminates incorrect residual additions)
Previously only the last token's KV was written to the ring buffer during prefill, causing decode to attend to a nearly empty cache. Now all tokens' KV entries are written to their correct ring buffer positions.
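The fix amounts to looping over every prefill token rather than only the last one. A minimal sketch, assuming a slot index of `pos % window`; the struct and function names are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sliding-window KV cache: window slots of dim floats each.
struct RingKV {
    int window;
    int dim;
    std::vector<float> data;  // size window * dim
};

// Write all n_tok prefill entries, starting at absolute position pos0.
// kv is laid out as [n_tok][dim]; later tokens overwrite older slots.
void write_prefill(RingKV& cache, const std::vector<float>& kv,
                   int n_tok, int pos0) {
    for (int t = 0; t < n_tok; ++t) {
        const int slot = (pos0 + t) % cache.window;  // assumed slot mapping
        for (int d = 0; d < cache.dim; ++d) {
            cache.data[static_cast<std::size_t>(slot) * cache.dim + d] =
                kv[static_cast<std::size_t>(t) * cache.dim + d];
        }
    }
}
```

After a prefill longer than the window, the buffer holds the most recent `window` tokens, which is what decode expects to attend to.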
DS4's rope_tail_ext_inplace rotates consecutive pairs (i, i+1), which is GGML_ROPE_TYPE_DEFAULT. NEOX mode (interleaved halves) was incorrect and caused completely wrong position encodings.
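The two layouts differ only in which elements are rotated together. The sketch below makes the index pairing explicit; the enum and function are hypothetical and do not use ggml's constants.

```cpp
#include <utility>

enum class RopePairing { Consecutive, SplitHalves };

// Indices of the two elements rotated together for pair j (0 <= j < n_rot/2).
// Consecutive: (2j, 2j+1), the layout the commit attributes to DS4.
// SplitHalves: (j, j + n_rot/2), the NEOX-style layout.
std::pair<int, int> rope_pair(int j, int n_rot, RopePairing mode) {
    if (mode == RopePairing::Consecutive) {
        return {2 * j, 2 * j + 1};
    }
    return {j, j + n_rot / 2};
}
```

With n_rot = 64, pair 3 covers elements (6, 7) in the consecutive layout but (3, 35) in the split-halves layout, so choosing the wrong mode rotates unrelated elements together.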