Closed
67 commits
4fe62b5
feat: add Gemma4 target + draft model support (26B-A4B MoE & 31B dense)
dusterbloom May 7, 2026
85a1196
test: add Gemma4 TDD smoke tests (all GREEN)
dusterbloom May 7, 2026
978ca01
fix: use correct capture_layer_ids from DFlash draft config.json
dusterbloom May 7, 2026
3335ee2
feat: implement draft KV cache for Gemma4 DFlash speculative decoding
dusterbloom May 7, 2026
7ce68ac
perf: chunked batched prefill for Gemma4 target (12-16x speedup)
dusterbloom May 7, 2026
1ef975f
refactor: update tokenize_prompt.py for Gemma4 with CSV output mode
dusterbloom May 7, 2026
133017d
feat: add Q8_0 quantization script for Gemma4 DFlash draft model
dusterbloom May 7, 2026
1386690
fix: correct Gemma4 DFlash decode — BOS, EOS, and SWA mask
dusterbloom May 7, 2026
33b6e9d
feat: add GGUF draft loader for Gemma4 DFlash + parameterize quantize…
dusterbloom May 7, 2026
d2a2c04
feat: implement Gemma4 pFlash prefill — layer-by-layer block-sparse a…
dusterbloom May 7, 2026
9588c97
fix: add pFlash CLI flags, --tokens-file, and prevent draft KV overflow
dusterbloom May 7, 2026
c15f93a
refactor: remove standalone ggml_turbo_wht calls — rotation now fused…
dusterbloom May 7, 2026
f2c36bc
feat: SWA-aware KV cache allocation with ring-buffer for 64K+ context
dusterbloom May 8, 2026
f2261bd
fix: draft KV ring-buffer wrap instead of crash on overflow
dusterbloom May 8, 2026
333f4e0
perf: hybrid pFlash prefill — batched SWA groups + GRAPH_CHUNK=32K
dusterbloom May 8, 2026
488190e
feat: wire pFlash into Gemma4 chunked prefill via ggml_flash_attn_sparse
dusterbloom May 8, 2026
1017dac
feat: gate pFlash dispatch on supported KV types + buffer-NULL guards
dusterbloom May 8, 2026
5b6ba1b
fix: SWA mask coordinate frame — chunks 2+ were silently corrupted
dusterbloom May 8, 2026
9097311
feat: add daemon mode to test_gemma4_dflash + server.py routing
dusterbloom May 8, 2026
8fa5cd0
chore: point submodule to dusterbloom fork on feature/tq3-kv-cache
dusterbloom May 8, 2026
5fb516d
fix: address 11 P2 review violations + draft KV rolling window
dusterbloom May 8, 2026
3b4a4cb
Merge remote-tracking branch 'origin/main' into feature/gemma4-support
dusterbloom May 8, 2026
8ff5c77
chore: bump submodule for S-buffer probe instrumentation
dusterbloom May 8, 2026
19def9c
fix(gemma4): disable SWA ring opt + add 256-align snap for multi-chunk
dusterbloom May 8, 2026
d68e7c4
feat(gemma4): non-monotonic SWA ring restores VRAM savings
dusterbloom May 9, 2026
ce4da35
feat(gemma4): narrow asymmetric KV (TQ3 → Q8 on captured full-attn)
dusterbloom May 9, 2026
2cb6ec6
chore: bump submodule for TQ3 → f16 dequant + MMA fast path
dusterbloom May 9, 2026
cf76b73
fix(test): auto-prefer Q8 GGUF drafter over BF16 safetensors
dusterbloom May 9, 2026
7eea84b
feat(test): expose --draft-max and --ignore-eos for DFlash dTree tuning
dusterbloom May 9, 2026
1115064
feat(mtp): Phase 2 — load_gemma4_mtp_assistant() loader + 7-assertion…
dusterbloom May 9, 2026
d4659ca
feat(mtp): Phase 3a — build_mtp_step_graph() + 6-assertion shape test
dusterbloom May 9, 2026
05e36e4
feat(mtp): Phase 3b — wire --draft-method {none,dflash,mtp}, byte-ide…
dusterbloom May 9, 2026
138de4d
fix(mtp): h_prev capture site, assistant rope_freqs, KQ scale = 1.0
dusterbloom May 9, 2026
30b2b50
fix(mtp): GQA block-broadcast + KQ mask + SWA-aware KV wrap
dusterbloom May 9, 2026
c56879c
fix(mtp): preserve TQ3_0 into FA + 256-pad K view + shared mask acros…
dusterbloom May 9, 2026
7b62c07
fix(gemma4): allocate+fill SWA mask for n_tokens==1 decode + bump lla…
dusterbloom May 9, 2026
f1f811e
fix(mtp): always provide FA mask for head_dim>=512 (any K type)
dusterbloom May 9, 2026
323e0f4
docs(bench): gemma4 context-scaling plan + prompt corpus + reproducib…
dusterbloom May 9, 2026
b441587
docs(gemma4): debugging journey blog — three fixes, prompt-distributi…
dusterbloom May 9, 2026
e65eefb
docs(gemma4): amend journey blog with corrected pflash + dense ladder
dusterbloom May 9, 2026
bf8653e
docs(bench): scientific harness — 24-cell dense×MoE × code×creative ×…
dusterbloom May 9, 2026
98f72c1
feat(gemma4): port SWA truncation to draft graph + YaRN opt-in
dusterbloom May 10, 2026
e05afcd
feat(gemma4): auto-set GGML_CUDA_NO_VMM=1 on <=24 GiB GPUs
dusterbloom May 10, 2026
4b0c158
fix(gemma4): TQ3 graph-level FWHT rotation contract
dusterbloom May 10, 2026
f008033
chore(submodule): bump dflash/deps/llama.cpp to feature/tq3-kv-cache-…
dusterbloom May 10, 2026
b1dac51
Merge origin/main into feature/gemma4-support — clear PR #131 CONFLIC…
dusterbloom May 10, 2026
87722d3
fix: address 9 P2 cubic-dev-ai review violations
dusterbloom May 10, 2026
3fb16e0
fix(gemma4): replace no-op VMM setenv with runtime warning + build doc
dusterbloom May 10, 2026
4935293
refactor(dflash): unify quantize_draft_q8.py to support qwen + gemma4
dusterbloom May 10, 2026
9ccd827
refactor(dflash): remove f16_convert.cu, use ggml_cpy for type conver…
dusterbloom May 10, 2026
337cee0
refactor(test): group gemma4 tests into dflash/test/gemma4/
dusterbloom May 10, 2026
d8ebd12
feat(gemma4-mtp): γ>1 chain speculative decode (Phase 1+2+3, approach A)
dusterbloom May 10, 2026
233b469
docs(gemma4-mtp): Phase 4 sweep results — γ=2 at 64K is +29% over no-MTP
dusterbloom May 10, 2026
4bcb972
feat(gemma4-mtp): approach B — multi-row h_prev capture, +61% at 64K
dusterbloom May 11, 2026
a32d03d
docs(site): correct stale MTP claims, add γ>1 MTP section
dusterbloom May 11, 2026
30a6b90
Rigorous rewrite: regime split, per-component OVAT, walked-back infla…
dusterbloom May 11, 2026
d71290a
chore(submodule): retrack feature/tq3-kv-cache-clean (rewritten linea…
dusterbloom May 11, 2026
8ff12c2
Merge remote-tracking branch 'origin/main' into feature/gemma4-support
dusterbloom May 11, 2026
1e7612e
docs(site): fill J/tok for Dense 31B MTP γ=2 at 64K and 128K
dusterbloom May 11, 2026
4182fd6
chore(repo): untrack .sisyphus/ working dossier to slim PR
dusterbloom May 11, 2026
96bdd9f
fix(dflash): add missing endif for DFLASH27B_GPU_BACKEND="cuda" block
dusterbloom May 12, 2026
6f4d0ea
fix(dflash): MTP FA-types log reads need_mask, drop static gate
dusterbloom May 12, 2026
80881ca
fix(dflash): bump g_kq_stride_pad to 256 when head_dim >= 512
dusterbloom May 12, 2026
0e4fc06
Merge origin/main into feature/gemma4-support — post-#138/#119/#149 r…
dusterbloom May 12, 2026
44eadf5
chore(submodule): bump dflash/deps/llama.cpp — TQ3 warp-cooperative k…
dusterbloom May 12, 2026
a5a1909
feat(dflash): gate TQ3_0 KV onto pflash sparse FA via DFLASH_PFLASH_TQ3
dusterbloom May 12, 2026
5c81bc2
chore(submodule): bump dflash/deps/llama.cpp — TQ3 dispatch + chunked…
dusterbloom May 12, 2026
3 changes: 3 additions & 0 deletions .gitignore
@@ -71,3 +71,6 @@ fix-plan.md
.env.local
*.pem
*.key

# Sisyphus working dossier (preserve outside repo)
.sisyphus/
4 changes: 2 additions & 2 deletions .gitmodules
@@ -1,7 +1,7 @@
[submodule "dflash/deps/llama.cpp"]
path = dflash/deps/llama.cpp
-url = https://github.com/Luce-Org/llama.cpp-dflash-ggml.git
-branch = luce-dflash
+url = https://github.com/dusterbloom/llama-cpp-turboquant-cuda.git
+branch = feature/tq3-kv-cache-clean
[submodule "dflash/deps/Block-Sparse-Attention"]
path = dflash/deps/Block-Sparse-Attention
url = https://github.com/mit-han-lab/Block-Sparse-Attention.git
276 changes: 0 additions & 276 deletions .sisyphus/plans/20260428-1430-path-b-deltanet-wmma-scope.md

This file was deleted.

85 changes: 84 additions & 1 deletion dflash/CMakeLists.txt
@@ -205,6 +205,10 @@ add_library(dflash27b STATIC
src/qwen3/qwen3_drafter.cpp
src/qwen3/qwen3_loader.cpp
src/qwen3/qwen3_graph.cpp
src/gemma4_target_loader.cpp
src/gemma4_target_graph.cpp
src/gemma4_mtp_graph.cpp
src/gemma4_dflash_graph.cpp
src/flashprefill_q8.cpp
src/kv_cache.cpp
src/kv_quant.cpp
@@ -246,6 +250,11 @@ elseif(DFLASH27B_GPU_BACKEND STREQUAL "hip")
target_compile_definitions(dflash27b PRIVATE DFLASH27B_BACKEND_HIP=1 GGML_USE_HIP)
endif()

# Backward-compat alias for our gemma4 graph code that uses DFLASH27B_MIN_SM.
# origin/main renamed the variable to _dflash27b_cuda_min_sm; expose both names
# so dflash/src/gemma4_dflash_graph.cpp keeps building unchanged.
target_compile_definitions(dflash27b PRIVATE DFLASH27B_MIN_SM=${_dflash27b_cuda_min_sm})

# FlashPrefill custom kernels.
# CUDA: BF16 WMMA needs sm_80+; on sm_75 we fall back to ggml flash_attn_ext.
# HIP Phase 1 (default): ggml q8 fallback, no custom kernels.
@@ -283,7 +292,8 @@ elseif(DFLASH27B_GPU_BACKEND STREQUAL "cuda" AND _dflash27b_cuda_min_sm GREATER_
target_sources(dflash27b PRIVATE
src/flashprefill_kernels.cu
src/flashprefill_select.cpp
-src/flashprefill.cpp)
+src/flashprefill.cpp
+src/pflash_ggml_adapter.cpp)
target_compile_definitions(dflash27b PRIVATE DFLASH27B_HAVE_CUDA_WMMA_FLASHPREFILL=1)
endif()

@@ -525,5 +535,78 @@ if(DFLASH27B_TESTS)
target_link_libraries(${_t} PRIVATE CUDA::cudart)
endif()
endforeach()

if(EXISTS "${CMAKE_CURRENT_SOURCE_DIR}/test/gemma4/test_gemma4_dflash.cpp")
add_executable(test_gemma4_dflash test/gemma4/test_gemma4_dflash.cpp)
target_include_directories(test_gemma4_dflash PRIVATE ${CMAKE_CURRENT_SOURCE_DIR}/src)
target_link_libraries(test_gemma4_dflash PRIVATE dflash27b ggml ggml-cuda)
find_package(CUDAToolkit REQUIRED)
target_link_libraries(test_gemma4_dflash PRIVATE CUDA::cudart)
endif()

if(EXISTS "${CMAKE_CURRENT_SOURCE_DIR}/test/gemma4/smoke_load_gemma4_target.cpp")
add_executable(smoke_load_gemma4_target test/gemma4/smoke_load_gemma4_target.cpp)
target_include_directories(smoke_load_gemma4_target PRIVATE ${CMAKE_CURRENT_SOURCE_DIR}/src)
target_link_libraries(smoke_load_gemma4_target PRIVATE dflash27b ggml ggml-cuda)
find_package(CUDAToolkit REQUIRED)
target_link_libraries(smoke_load_gemma4_target PRIVATE CUDA::cudart)
endif()

if(EXISTS "${CMAKE_CURRENT_SOURCE_DIR}/test/gemma4/smoke_gemma4_target_forward.cpp")
add_executable(smoke_gemma4_target_forward test/gemma4/smoke_gemma4_target_forward.cpp)
target_include_directories(smoke_gemma4_target_forward PRIVATE ${CMAKE_CURRENT_SOURCE_DIR}/src)
target_link_libraries(smoke_gemma4_target_forward PRIVATE dflash27b ggml ggml-cuda)
find_package(CUDAToolkit REQUIRED)
target_link_libraries(smoke_gemma4_target_forward PRIVATE CUDA::cudart)
endif()

if(EXISTS "${CMAKE_CURRENT_SOURCE_DIR}/test/gemma4/smoke_load_gemma4_draft.cpp")
add_executable(smoke_load_gemma4_draft test/gemma4/smoke_load_gemma4_draft.cpp)
target_include_directories(smoke_load_gemma4_draft PRIVATE ${CMAKE_CURRENT_SOURCE_DIR}/src)
target_link_libraries(smoke_load_gemma4_draft PRIVATE dflash27b ggml ggml-cuda)
find_package(CUDAToolkit REQUIRED)
target_link_libraries(smoke_load_gemma4_draft PRIVATE CUDA::cudart)
endif()

if(EXISTS "${CMAKE_CURRENT_SOURCE_DIR}/test/gemma4/smoke_gemma4_draft_forward.cpp")
add_executable(smoke_gemma4_draft_forward test/gemma4/smoke_gemma4_draft_forward.cpp)
target_include_directories(smoke_gemma4_draft_forward PRIVATE ${CMAKE_CURRENT_SOURCE_DIR}/src)
target_link_libraries(smoke_gemma4_draft_forward PRIVATE dflash27b ggml ggml-cuda)
find_package(CUDAToolkit REQUIRED)
target_link_libraries(smoke_gemma4_draft_forward PRIVATE CUDA::cudart)
endif()

if(EXISTS "${CMAKE_CURRENT_SOURCE_DIR}/test/gemma4/test_gemma4_kv_tq3.cpp")
add_executable(test_gemma4_kv_tq3 test/gemma4/test_gemma4_kv_tq3.cpp)
target_include_directories(test_gemma4_kv_tq3 PRIVATE ${CMAKE_CURRENT_SOURCE_DIR}/src)
target_link_libraries(test_gemma4_kv_tq3 PRIVATE dflash27b ggml ggml-cuda)
find_package(CUDAToolkit REQUIRED)
target_link_libraries(test_gemma4_kv_tq3 PRIVATE CUDA::cudart)
endif()

if(EXISTS "${CMAKE_CURRENT_SOURCE_DIR}/test/test_flash_attn_sparse.cpp")
add_executable(test_flash_attn_sparse test/test_flash_attn_sparse.cpp)
target_link_libraries(test_flash_attn_sparse PRIVATE dflash27b ggml ggml-cuda ggml-base)
target_include_directories(test_flash_attn_sparse PRIVATE
${CMAKE_CURRENT_SOURCE_DIR}/deps/llama.cpp/ggml/include
${CMAKE_CURRENT_SOURCE_DIR}/deps/llama.cpp/ggml/src
${CMAKE_CURRENT_SOURCE_DIR}/src)
endif()

if(EXISTS "${CMAKE_CURRENT_SOURCE_DIR}/test/gemma4/test_mtp_loader.cpp")
add_executable(test_mtp_loader test/gemma4/test_mtp_loader.cpp)
target_include_directories(test_mtp_loader PRIVATE ${CMAKE_CURRENT_SOURCE_DIR}/src)
target_link_libraries(test_mtp_loader PRIVATE dflash27b ggml ggml-cuda)
find_package(CUDAToolkit REQUIRED)
target_link_libraries(test_mtp_loader PRIVATE CUDA::cudart)
endif()

if(EXISTS "${CMAKE_CURRENT_SOURCE_DIR}/test/gemma4/test_mtp_graph_shapes.cpp")
add_executable(test_mtp_graph_shapes test/gemma4/test_mtp_graph_shapes.cpp)
target_include_directories(test_mtp_graph_shapes PRIVATE ${CMAKE_CURRENT_SOURCE_DIR}/src)
target_link_libraries(test_mtp_graph_shapes PRIVATE dflash27b ggml ggml-cuda)
find_package(CUDAToolkit REQUIRED)
target_link_libraries(test_mtp_graph_shapes PRIVATE CUDA::cudart)
endif()
endif() # DFLASH27B_GPU_BACKEND STREQUAL "cuda"
endif()
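
The CMake comments above say that the gemma4 graph code reads `DFLASH27B_MIN_SM` and that BF16 WMMA needs sm_80+, with a fallback to ggml `flash_attn_ext` on sm_75. A minimal sketch of how C code might consume that definition follows. The enum, the function names, and the fallback default of 75 are hypothetical and are not taken from the PR.

```c
#include <assert.h>

/* Fallback so this sketch compiles outside the build. In the real build the
 * value comes from target_compile_definitions(); the default of 75 here is
 * an assumption made only for illustration. */
#ifndef DFLASH27B_MIN_SM
#define DFLASH27B_MIN_SM 75
#endif

/* Hypothetical names for the two prefill paths mentioned in the comments. */
typedef enum {
    PREFILL_GGML_FLASH_ATTN_EXT = 0, /* fallback path for sm_75 */
    PREFILL_WMMA_BF16 = 1            /* custom kernels, needs sm_80+ */
} prefill_path_t;

/* Pick a prefill path from the minimum SM the binary was compiled for. */
static prefill_path_t select_prefill_path(int min_sm) {
    assert(min_sm > 0);
    return min_sm >= 80 ? PREFILL_WMMA_BF16 : PREFILL_GGML_FLASH_ATTN_EXT;
}

/* Convenience wrapper that uses the compile-time definition. */
static prefill_path_t default_prefill_path(void) {
    return select_prefill_path(DFLASH27B_MIN_SM);
}
```

Taking the SM level as a parameter keeps the selection testable without rebuilding for a different architecture.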
8 changes: 8 additions & 0 deletions dflash/README.md
@@ -328,6 +328,14 @@ DFLASH27B_KV_TQ3=1 DFLASH27B_PREFILL_UBATCH=16 \

**Requirements:** NVIDIA sm_75+ GPU (2080 Ti, 3090, A10, A40, 4090) or Jetson AGX Thor sm_110, CUDA 12+ (CUDA 13+ required for Thor), 22+ GB VRAM, ~80 GB disk. On Turing (SM 7.5), BF16 draft weights are auto-converted to FP16 at load time for tensor core acceleration.

### Small-VRAM cards (<=24 GiB)

VMM-backed pools waste VRAM on cards under ~24 GiB. The 32 GB VMM pool reservation fragments badly on a 24 GB card and causes prefill+verify cliffs (measured ~50% throughput loss at ctx=64K). Build with:

    cmake -DGGML_CUDA_NO_VMM=ON ..

`GGML_CUDA_NO_VMM` is a **compile-time** CMake option — it cannot be set at runtime via environment variable. The dflash test binary prints a runtime warning if it detects <=24 GiB VRAM and the binary was built without this flag.
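
The runtime warning described above could be sketched roughly as follows. This is not the code from the PR: the function names, the message text, and the `built_without_vmm` parameter are hypothetical, and a real binary would obtain the total VRAM from the CUDA runtime rather than take it as an argument.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* 24 GiB threshold taken from the README text; the macro name is hypothetical. */
#define SMALL_VRAM_LIMIT_BYTES ((uint64_t)24 << 30)

/* Returns 1 when the card is small (<= 24 GiB) and the binary still uses
 * VMM-backed pools, 0 otherwise. */
static int should_warn_vmm(uint64_t total_vram_bytes, int built_without_vmm) {
    assert(total_vram_bytes > 0);
    return !built_without_vmm && total_vram_bytes <= SMALL_VRAM_LIMIT_BYTES;
}

/* Prints an illustrative warning to stderr; the wording is an assumption. */
static void maybe_warn_vmm(uint64_t total_vram_bytes, int built_without_vmm) {
    if (should_warn_vmm(total_vram_bytes, built_without_vmm)) {
        fprintf(stderr,
                "warning: %.1f GiB VRAM detected and binary built with VMM; "
                "rebuild with -DGGML_CUDA_NO_VMM=ON\n",
                (double)total_vram_bytes / (double)((uint64_t)1 << 30));
    }
}
```

Because the option is compile-time only, the check can only advise a rebuild; it cannot switch the allocator at runtime.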

## How it works

**Block-diffusion draft.** Each step, the draft sees `[last_target_token, MASK×15]` plus the last 5 captured target hidden states. It denoises the masks in a single forward, producing 16 candidate tokens conditioned on real target features. Structurally stronger than chain EAGLE: every position conditions on the same captured context, not its own noisy predictions.
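
As an illustration of the `[last_target_token, MASK×15]` layout described above, the draft input block could be filled like this. It is a sketch under assumptions: the function name is hypothetical, and the block size of 16 and mask token ID of 4 mirror `GEMMA4_DRAFT_BLOCK_SIZE` and the `GEMMA4_*_DRAFT_MASK_TOKEN_ID` defines in this PR's `gemma4.h`; other targets may use a different mask ID.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Values mirrored from gemma4.h in this PR; local names are hypothetical. */
#define DRAFT_BLOCK_SIZE 16
#define DRAFT_MASK_TOKEN_ID 4

/* Fill one draft input block: slot 0 holds the last accepted target token,
 * the remaining 15 slots hold the mask token that the draft denoises. */
static void build_draft_block(int32_t last_target_token,
                              int32_t out[DRAFT_BLOCK_SIZE]) {
    assert(out != NULL);
    out[0] = last_target_token;
    for (int i = 1; i < DRAFT_BLOCK_SIZE; ++i) {
        out[i] = DRAFT_MASK_TOKEN_ID;
    }
}
```

The captured target hidden states are a separate input to the draft and are not modeled here.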
2 changes: 1 addition & 1 deletion dflash/deps/llama.cpp

Review comment (Contributor): You may need to provide the patch for ggml.

Submodule llama.cpp updated 892 files
62 changes: 62 additions & 0 deletions dflash/include/gemma4.h
@@ -0,0 +1,62 @@
// gemma4 — standalone CUDA library for DFlash speculative decoding of
// Gemma4 models (31B Dense and 26B-A4B MoE) with a DFlash draft model.

#ifndef GEMMA4_H
#define GEMMA4_H

#include <stddef.h>
#include <stdint.h>

#ifdef __cplusplus
extern "C" {
#endif

// ─── Gemma4-31B Dense config ───────────────────────────────────────

#define GEMMA4_31B_HIDDEN 4096
#define GEMMA4_31B_LAYERS 60
#define GEMMA4_31B_N_HEADS 32
#define GEMMA4_31B_N_KV_HEADS 8
#define GEMMA4_31B_HEAD_DIM 128
#define GEMMA4_31B_INTERMEDIATE 16384
#define GEMMA4_31B_VOCAB 262144
#define GEMMA4_31B_SWA_WINDOW 1024

// ─── Gemma4-26B-A4B MoE config ────────────────────────────────────

#define GEMMA4_26B_HIDDEN 4096
#define GEMMA4_26B_LAYERS 30
#define GEMMA4_26B_N_HEADS 32
#define GEMMA4_26B_N_KV_HEADS 8
#define GEMMA4_26B_HEAD_DIM 128
#define GEMMA4_26B_INTERMEDIATE 16384
#define GEMMA4_26B_EXPERT_INTERMEDIATE 2048
#define GEMMA4_26B_N_EXPERTS 128
#define GEMMA4_26B_N_EXPERTS_USED 8
#define GEMMA4_26B_VOCAB 262144
#define GEMMA4_26B_SWA_WINDOW 1024

// ─── Shared constants ─────────────────────────────────────────────

#define GEMMA4_ROPE_THETA 1000000.0f
#define GEMMA4_RMS_EPS 1e-6f
#define GEMMA4_LOGIT_SOFTCAP 30.0f
#define GEMMA4_ATTN_SCALE 1.0f

// ─── Draft model config ───────────────────────────────────────────

#define GEMMA4_DRAFT_LAYERS 5
#define GEMMA4_DRAFT_BLOCK_SIZE 16
#define GEMMA4_DRAFT_N_TARGET_LAYERS 6
#define GEMMA4_31B_DRAFT_MASK_TOKEN_ID 4
#define GEMMA4_26B_DRAFT_MASK_TOKEN_ID 4

// ─── Diagnostics ──────────────────────────────────────────────────

const char * gemma4_last_error(void);

#ifdef __cplusplus
}
#endif

#endif // GEMMA4_H
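
To show what the head-count constants in this header imply, here is a small worked sketch of grouped-query attention arithmetic using the `GEMMA4_31B_*` values as declared above. The helper names are hypothetical, the block-style head mapping (consecutive query heads share one KV head) is an assumption, and the byte count assumes a plain 2-byte element type rather than the quantized KV formats mentioned in the commit list.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the GEMMA4_31B_* defines; enum names are hypothetical. */
enum { N_HEADS = 32, N_KV_HEADS = 8, HEAD_DIM = 128, N_LAYERS = 60 };

/* Number of query heads that share one KV head. */
static int gqa_group_size(int n_heads, int n_kv_heads) {
    assert(n_kv_heads > 0 && n_heads % n_kv_heads == 0);
    return n_heads / n_kv_heads;
}

/* Block-style mapping (assumption): query heads 0..3 use KV head 0, etc. */
static int kv_head_for_query_head(int q_head, int n_heads, int n_kv_heads) {
    assert(q_head >= 0 && q_head < n_heads);
    return q_head / gqa_group_size(n_heads, n_kv_heads);
}

/* Bytes of K plus V stored for one token in one layer. */
static uint64_t kv_bytes_per_token_layer(int n_kv_heads, int head_dim,
                                         int bytes_per_elem) {
    assert(n_kv_heads > 0 && head_dim > 0 && bytes_per_elem > 0);
    return 2ull * (uint64_t)n_kv_heads * (uint64_t)head_dim
                * (uint64_t)bytes_per_elem;
}
```

With these values the group size is 4 and one layer stores 4096 bytes per token at 2 bytes per element. How many of the 60 layers keep the full context rather than a sliding window is not stated in the header, so a whole-model total is left out.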