feat(dflash): add DeepSeek V4 Flash backend by howard0su · Pull Request #353 · Luce-Org/lucebox-hub

howard0su · 2026-06-09T09:50:45Z

Implement full DS4 Flash model backend for AR-only decode:

- deepseek4_internal.h: data structures (layer, weights, cache, config)
- deepseek4_loader.cpp: GGUF loader with all DS4 metadata/tensor binding
- deepseek4_graph.cpp: ggml compute graph (MLA attention, KV compression with ratio-4/ratio-128, indexer selective attention, MoE with sqrt(softplus) routing, hash routing, HC residual streams)
- deepseek4_backend.cpp: ModelBackend subclass with hybrid hot/cold expert placement (DFLASH_DS4_HYBRID=1)
- deepseek4_daemon.cpp: daemon entry point
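
For readers unfamiliar with the "sqrt(softplus) routing" named in the graph description, here is a minimal sketch in plain C++ (no ggml). It scores each expert as sqrt(softplus(logit)) and keeps the k best. The top-k selection, the weight normalization, and the names `sqrt_softplus` and `route_top_k` are assumptions for illustration, not code taken from this PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical helper: score = sqrt(softplus(x)), softplus(x) = log(1 + exp(x)).
inline float sqrt_softplus(float x) {
    // Numerically stable softplus: max(x, 0) + log1p(exp(-|x|)).
    float sp = std::max(x, 0.0f) + std::log1p(std::exp(-std::fabs(x)));
    return std::sqrt(sp);
}

// Returns (expert index, weight) for the k highest-scoring experts.
// Weights are normalized to sum to 1 (an assumption of this sketch).
inline std::vector<std::pair<int, float>> route_top_k(const std::vector<float>& logits, int k) {
    assert(k > 0 && k <= static_cast<int>(logits.size()));
    std::vector<float> score(logits.size());
    for (size_t i = 0; i < logits.size(); ++i) score[i] = sqrt_softplus(logits[i]);

    std::vector<int> idx(logits.size());
    std::iota(idx.begin(), idx.end(), 0);
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                      [&](int a, int b) { return score[a] > score[b]; });

    float sum = 0.0f;
    for (int i = 0; i < k; ++i) sum += score[idx[i]];

    std::vector<std::pair<int, float>> out;
    for (int i = 0; i < k; ++i) out.emplace_back(idx[i], score[idx[i]] / sum);
    return out;
}
```

Because softplus is strictly positive, the scores never reach zero, so the normalization is always well defined.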

Integration:

- Register 'deepseek4' arch in backend_factory.cpp
- Add to CMakeLists.txt (include path + sources)

Tests:

- test_deepseek4_unit.cpp: CPU-only unit tests with synthetic weights (compressor pooling, MoE routing, RMSNorm, grouped output shape, hash routing lookup)
- deepseek4-vectors/: official API test vectors ported from ds4 project (greedy decode logprob fixtures for integration testing)


The DS4 Flash GGUF stores rope.scaling.original_context_length as u32 and compress_ratios as i32 array. Handle both type widths gracefully.
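
As an illustration of handling both type widths, the sketch below widens either a u32 or an i32 value to `int64_t` before use. It deliberately does not call the real GGUF API: `MetaType`, `MetaValue`, and `meta_to_i64` are hypothetical stand-ins for whatever the loader reads from the file.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a metadata value whose stored type may vary.
enum class MetaType { U32, I32 };

struct MetaValue {
    MetaType type;
    uint32_t bits;  // raw 32-bit payload as stored in the file
};

// Widen to int64_t so both types are represented without wrap-around.
inline int64_t meta_to_i64(const MetaValue& v) {
    if (v.type == MetaType::U32) {
        return static_cast<int64_t>(v.bits);
    }
    int32_t s;
    std::memcpy(&s, &v.bits, sizeof s);  // reinterpret the payload as signed
    return static_cast<int64_t>(s);
}

// Same idea for an array such as compress_ratios.
inline std::vector<int64_t> meta_array_to_i64(const std::vector<MetaValue>& arr) {
    std::vector<int64_t> out;
    out.reserve(arr.size());
    for (const MetaValue& v : arr) out.push_back(meta_to_i64(v));
    return out;
}
```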

The previous approach set dst->data directly but didn't associate the tensor with its backend buffer, causing 'tensor buffer not set' assert. Now uses ggml_backend_tensor_alloc (matching qwen35 loader pattern). Also keeps token_embd on CPU for embedding lookup.

TargetLoadPlan.layer_end defaults to -1 (not 0), so check for < 0.
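
A minimal sketch of the check, assuming a negative `layer_end` means "load through the last layer". `LoadPlanSketch` and `resolve_layer_end` are hypothetical names that only mirror the one field mentioned above; the real TargetLoadPlan may carry more state.

```cpp
#include <cassert>

// Hypothetical mirror of the relevant fields; layer_end defaults to -1.
struct LoadPlanSketch {
    int layer_begin = 0;
    int layer_end = -1;
};

// Treat any negative layer_end as "all layers". Testing for == 0 would miss
// the -1 default and leave the layer range empty.
inline int resolve_layer_end(const LoadPlanSketch& plan, int n_layer) {
    assert(n_layer >= 0);
    return plan.layer_end < 0 ? n_layer : plan.layer_end;
}
```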

When full model load fails (e.g., 81GB model on 24GB GPU), automatically fall back to hybrid mode (experts on CPU, core on GPU).
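
The fallback can be pictured as "try the full placement, then retry in hybrid mode". The sketch below is an assumption about the control flow only: `Placement`, `load_with_fallback`, and the callback shape are hypothetical and do not come from the backend code.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical placement modes: everything on GPU, or experts on CPU with the core on GPU.
enum class Placement { FullGpu, Hybrid };

// try_load(mode) returns true if the model could be loaded with that placement.
template <class TryLoad>
Placement load_with_fallback(TryLoad try_load) {
    if (try_load(Placement::FullGpu)) return Placement::FullGpu;
    // Full load failed (for example, out of GPU memory): retry in hybrid mode.
    if (try_load(Placement::Hybrid)) return Placement::Hybrid;
    throw std::runtime_error("model load failed in both full and hybrid mode");
}
```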

…er shapes
- Output projection now correctly uses batched 3D matmul for grouped low-rank: reshape out_a [4096,8192] to [4096,1024,8], reshape q to [4096,8,n_tok], batched matmul → [1024,8,n_tok] → out_b [8192,4096]
- Attention placeholder: use reshaped q (correct shape [32768,n_tok]) instead of broken kv×q matmul
- Disable compressed context block (shapes incompatible with placeholder)

HC build_hc_pre returns [n_embd] (1D) but the graph expects [n_embd, n_tokens]. Bypass HC entirely until proper multi-token HC state management is implemented.

The 3D matmul batch dimension (ne[2]) must match between weight and input. Use permute to put n_out_group in ne[2] for both tensors so ggml can broadcast correctly across the group dimension.
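
To make the group-wise matmul concrete, here is a reference loop in plain C++ for a single token. It assumes the weight is laid out as [d_in, d_rank, n_group] with d_in varying fastest (ggml's ne[0]) and the input as [d_in, n_group]. `grouped_proj` is a hypothetical name, and the loop only shows what the batched matmul computes, not how ggml executes it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// a: [d_in, d_rank, n_group], d_in fastest. q: [d_in, n_group] for one token.
// Returns out: [d_rank, n_group]; each group uses only its own weight slice.
inline std::vector<float> grouped_proj(const std::vector<float>& a,
                                       const std::vector<float>& q,
                                       size_t d_in, size_t d_rank, size_t n_group) {
    assert(a.size() == d_in * d_rank * n_group);
    assert(q.size() == d_in * n_group);
    std::vector<float> out(d_rank * n_group, 0.0f);
    for (size_t g = 0; g < n_group; ++g) {
        for (size_t r = 0; r < d_rank; ++r) {
            float acc = 0.0f;
            for (size_t i = 0; i < d_in; ++i) {
                acc += a[(g * d_rank + r) * d_in + i] * q[g * d_in + i];
            }
            out[g * d_rank + r] = acc;
        }
    }
    return out;
}
```

With the shapes quoted in this PR, d_in = 4096, d_rank = 1024 and n_group = 8, so each token yields 1024 × 8 = 8192 values for out_b.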

Ratio-4 layers use comp_width = 2*head_dim (1024) with 2*ratio state rows. Ratio-128 layers use comp_width = head_dim (512). Indexer uses n_indexer_head_dim (128) as output, not full multi-head width. Pooling placeholder just takes first head_dim elements for now.

sum_rows operates on ne[0] (heads) producing [1, n_comp]. Don't transpose first or elements won't match reshape.
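
The axis behaviour can be checked against a plain C++ reference. This sketch reduces over the fastest-varying dimension (ne[0]) of a [ne0, ne1] buffer and returns one value per row, which is the [1, n_comp] shape described above. The function name `sum_rows_ref` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// data holds ne0 * ne1 floats with ne0 contiguous (ggml's ne[0]).
// The result has ne1 entries: out[j] = sum over i of data[j * ne0 + i].
inline std::vector<float> sum_rows_ref(const std::vector<float>& data, size_t ne0, size_t ne1) {
    assert(data.size() == ne0 * ne1);
    std::vector<float> out(ne1, 0.0f);
    for (size_t j = 0; j < ne1; ++j) {
        for (size_t i = 0; i < ne0; ++i) {
            out[j] += data[j * ne0 + i];
        }
    }
    return out;
}
```

With heads in ne[0], each output element is the sum over heads for one compressed position, so no transpose is needed before the reduction.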

Without ggml_set_input, the graph allocator doesn't allocate buffers for the position tensors, causing 'tensor buffer not set' when we try to set their values before compute.

The I32 position tensors for RoPE in side-effect subgraphs (cpy to external cache buffers) don't get their buffers allocated by gallocr. Skip RoPE for now - output is placeholder anyway. Will fix properly when implementing full compressor pooling logic.

Keep only meaningful error/info prints in the backend.

…ooling
- Implement proper tail RoPE: split last n_rot=64 dims, apply rope, concat back. Per-layer freq_base (compressed vs non-compressed layers) with YaRN scaling for compressed layers.
- Replace attention placeholder with full SWA dot-product attention: Q@KV^T scaled softmax over ring buffer, weighted sum, inverse tail RoPE on output.
- Implement per-dim softmax-weighted pooling for compressor state, replacing the first-row placeholder.
- Add I32 array bindings for multi-element position tensors.
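
One possible reading of "per-dim softmax-weighted pooling" is sketched below: for every output dimension, a softmax is taken over the state rows of a score matrix and used to average the matching column of the values. The input layout, the separate score matrix, and the name `softmax_pool` are all assumptions; the actual compressor may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// values, scores: [n_rows][width], row-major. Returns pooled[width].
// For each dim d: pooled[d] = sum_r softmax_r(scores[r][d]) * values[r][d].
inline std::vector<float> softmax_pool(const std::vector<float>& values,
                                       const std::vector<float>& scores,
                                       size_t n_rows, size_t width) {
    assert(n_rows > 0);
    assert(values.size() == n_rows * width && scores.size() == n_rows * width);
    std::vector<float> pooled(width, 0.0f);
    for (size_t d = 0; d < width; ++d) {
        // Subtract the column max before exp for numerical stability.
        float mx = scores[d];
        for (size_t r = 1; r < n_rows; ++r) mx = std::max(mx, scores[r * width + d]);
        float denom = 0.0f, acc = 0.0f;
        for (size_t r = 0; r < n_rows; ++r) {
            float w = std::exp(scores[r * width + d] - mx);
            denom += w;
            acc += w * values[r * width + d];
        }
        pooled[d] = acc / denom;
    }
    return pooled;
}
```

With equal scores this reduces to a per-dimension mean, which is a convenient sanity check against the earlier first-row placeholder.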

Implement the full HC mechanism on CPU for the hybrid path:
- HC pre: RMSNorm → matmul with fn tensor → Sinkhorn normalization (20 iters on 4×4 combine matrix) → weighted sum of 4 residual streams
- HC post: update all 4 streams using post gates + combine matrix
- Output HC pre: sigmoid-weighted stream merge before final norm/logits
- Lazy-load HC weight tensors from GPU to CPU on first use (~65MB total)
- Restructure hybrid loop: separate attention and FFN into independent graphs with HC pre/post between them (eliminates incorrect residual additions)
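
The Sinkhorn step in HC pre can be sketched as alternating row and column normalization of a 4×4 matrix. The version below assumes strictly positive entries and a row-then-column order. Both are assumptions, as is the name `sinkhorn_4x4`; only the 20 iterations and the 4×4 size come from the commit message.

```cpp
#include <array>
#include <cassert>

using Mat4 = std::array<std::array<float, 4>, 4>;

// Alternately rescale rows and columns so that each sums to 1.
// Assumes every entry is strictly positive.
inline Mat4 sinkhorn_4x4(Mat4 m, int iters = 20) {
    for (int it = 0; it < iters; ++it) {
        for (int r = 0; r < 4; ++r) {          // row normalization
            float sum = 0.0f;
            for (int c = 0; c < 4; ++c) { assert(m[r][c] > 0.0f); sum += m[r][c]; }
            for (int c = 0; c < 4; ++c) m[r][c] /= sum;
        }
        for (int c = 0; c < 4; ++c) {          // column normalization
            float sum = 0.0f;
            for (int r = 0; r < 4; ++r) sum += m[r][c];
            for (int r = 0; r < 4; ++r) m[r][c] /= sum;
        }
    }
    return m;
}
```

The result is approximately doubly stochastic, so mixing the 4 residual streams with it neither amplifies nor drops any stream overall.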

Previously only the last token's KV was written to the ring buffer during prefill, causing decode to attend to a nearly empty cache. Now all tokens' KV entries are written to their correct ring buffer positions.
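
A minimal sketch of the fixed behaviour, assuming the ring slot for a token is `pos % window`. `SwaRing` and `prefill_store` are hypothetical names, and the real cache lives in backend buffers rather than a `std::vector`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sliding-window KV ring: `window` rows of `width` floats each.
struct SwaRing {
    size_t window;
    size_t width;
    std::vector<float> rows;

    SwaRing(size_t w, size_t d) : window(w), width(d), rows(w * d, 0.0f) {}

    // Store one token's KV row at its ring slot (assumed to be pos % window).
    void write(size_t pos, const float* kv) {
        std::copy(kv, kv + width, rows.begin() + (pos % window) * width);
    }
};

// Prefill: write every token's row, not just the last one.
inline void prefill_store(SwaRing& ring, const std::vector<float>& kv,
                          size_t pos0, size_t n_tok) {
    assert(kv.size() == n_tok * ring.width);
    for (size_t t = 0; t < n_tok; ++t) {
        ring.write(pos0 + t, kv.data() + t * ring.width);
    }
}
```

When the prompt is longer than the window, later tokens overwrite the oldest slots, so decode still sees the most recent `window` entries.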

DS4's rope_tail_ext_inplace rotates consecutive pairs (i, i+1), which is GGML_ROPE_TYPE_DEFAULT. NEOX mode (interleaved halves) was incorrect and caused completely wrong position encodings.
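
The pairing difference is easiest to see in a scalar reference. The sketch below rotates only the last `n_rot` elements of a vector, pairing consecutive elements (i, i+1), whereas a NEOX-style layout would instead pair i with i + n_rot/2. YaRN scaling is left out, and `rope_tail_pairs` is a hypothetical name rather than the DS4 routine itself.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Rotate the last n_rot dims of x in place, pairing consecutive elements.
// Angle for the pair starting at offset i: pos * freq_base^(-i / n_rot).
inline void rope_tail_pairs(std::vector<float>& x, size_t n_rot, int pos, float freq_base) {
    assert(n_rot % 2 == 0 && n_rot <= x.size());
    const size_t off = x.size() - n_rot;  // leading dims are left untouched
    for (size_t i = 0; i < n_rot; i += 2) {
        const float theta = static_cast<float>(pos) *
            std::pow(freq_base, -static_cast<float>(i) / static_cast<float>(n_rot));
        const float c = std::cos(theta);
        const float s = std::sin(theta);
        const float a = x[off + i];
        const float b = x[off + i + 1];
        x[off + i]     = a * c - b * s;
        x[off + i + 1] = a * s + b * c;
    }
}
```

Calling it with the negated position undoes the rotation, which is one way to express the inverse tail RoPE applied to the attention output.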

howard0su added 22 commits June 9, 2026 10:47

fix(deepseek4): handle u32/i32 metadata types in GGUF loader

3504871


fix(deepseek4): load all layers (fix layer_end default check)

f9accaf


fix(deepseek4): auto-fallback to hybrid mode on GPU OOM

731a66c


fix(deepseek4): disable HC pre-mix to fix reshape assertion

78c51f8


fix(deepseek4): correct batched grouped output projection

ddcfd23


debug: add layer progress prints for remote debugging

c92698d

fix(deepseek4): cast APE from F16 to F32 before add

bebb91e

debug: more specific crash location prints

880495f

debug: trace MLA vs compressor crash

9ca201e

debug: trace inside MLA attention

2144c7a

fix(deepseek4): indexer score sum_rows axis fix

f0b3a2f


fix(deepseek4): mark I32 position inputs for gallocr

2bf59d0


chore(deepseek4): remove debug layer progress prints

64f72c7


fix(deepseek4): store all prefill KV rows in SWA ring buffer

2291c93


fix(deepseek4): use standard RoPE mode (sequential pairs), not NEOX

4b0d95d


howard0su force-pushed the ds4 branch from 74fb582 to 4b0d95d on June 9, 2026 22:48

howard0su wants to merge 22 commits into Luce-Org:main from howard0su:ds4
