Gemma4 support: pFlash + DFlash + chunked prefill, daemon mode, server routing #131
Closed
67 commits
4fe62b5 feat: add Gemma4 target + draft model support (26B-A4B MoE & 31B dense) (dusterbloom)
85a1196 test: add Gemma4 TDD smoke tests (all GREEN) (dusterbloom)
978ca01 fix: use correct capture_layer_ids from DFlash draft config.json (dusterbloom)
3335ee2 feat: implement draft KV cache for Gemma4 DFlash speculative decoding (dusterbloom)
7ce68ac perf: chunked batched prefill for Gemma4 target (12-16x speedup) (dusterbloom)
1ef975f refactor: update tokenize_prompt.py for Gemma4 with CSV output mode (dusterbloom)
133017d feat: add Q8_0 quantization script for Gemma4 DFlash draft model (dusterbloom)
1386690 fix: correct Gemma4 DFlash decode — BOS, EOS, and SWA mask (dusterbloom)
33b6e9d feat: add GGUF draft loader for Gemma4 DFlash + parameterize quantize… (dusterbloom)
d2a2c04 feat: implement Gemma4 pFlash prefill — layer-by-layer block-sparse a… (dusterbloom)
9588c97 fix: add pFlash CLI flags, --tokens-file, and prevent draft KV overflow (dusterbloom)
c15f93a refactor: remove standalone ggml_turbo_wht calls — rotation now fused… (dusterbloom)
f2c36bc feat: SWA-aware KV cache allocation with ring-buffer for 64K+ context (dusterbloom)
f2261bd fix: draft KV ring-buffer wrap instead of crash on overflow (dusterbloom)
333f4e0 perf: hybrid pFlash prefill — batched SWA groups + GRAPH_CHUNK=32K (dusterbloom)
488190e feat: wire pFlash into Gemma4 chunked prefill via ggml_flash_attn_sparse (dusterbloom)
1017dac feat: gate pFlash dispatch on supported KV types + buffer-NULL guards (dusterbloom)
5b6ba1b fix: SWA mask coordinate frame — chunks 2+ were silently corrupted (dusterbloom)
9097311 feat: add daemon mode to test_gemma4_dflash + server.py routing (dusterbloom)
8fa5cd0 chore: point submodule to dusterbloom fork on feature/tq3-kv-cache (dusterbloom)
5fb516d fix: address 11 P2 review violations + draft KV rolling window (dusterbloom)
3b4a4cb Merge remote-tracking branch 'origin/main' into feature/gemma4-support (dusterbloom)
8ff5c77 chore: bump submodule for S-buffer probe instrumentation (dusterbloom)
19def9c fix(gemma4): disable SWA ring opt + add 256-align snap for multi-chunk (dusterbloom)
d68e7c4 feat(gemma4): non-monotonic SWA ring restores VRAM savings (dusterbloom)
ce4da35 feat(gemma4): narrow asymmetric KV (TQ3 → Q8 on captured full-attn) (dusterbloom)
2cb6ec6 chore: bump submodule for TQ3 → f16 dequant + MMA fast path (dusterbloom)
cf76b73 fix(test): auto-prefer Q8 GGUF drafter over BF16 safetensors (dusterbloom)
7eea84b feat(test): expose --draft-max and --ignore-eos for DFlash dTree tuning (dusterbloom)
1115064 feat(mtp): Phase 2 — load_gemma4_mtp_assistant() loader + 7-assertion… (dusterbloom)
d4659ca feat(mtp): Phase 3a — build_mtp_step_graph() + 6-assertion shape test (dusterbloom)
05e36e4 feat(mtp): Phase 3b — wire --draft-method {none,dflash,mtp}, byte-ide… (dusterbloom)
138de4d fix(mtp): h_prev capture site, assistant rope_freqs, KQ scale = 1.0 (dusterbloom)
30b2b50 fix(mtp): GQA block-broadcast + KQ mask + SWA-aware KV wrap (dusterbloom)
c56879c fix(mtp): preserve TQ3_0 into FA + 256-pad K view + shared mask acros… (dusterbloom)
7b62c07 fix(gemma4): allocate+fill SWA mask for n_tokens==1 decode + bump lla… (dusterbloom)
f1f811e fix(mtp): always provide FA mask for head_dim>=512 (any K type) (dusterbloom)
323e0f4 docs(bench): gemma4 context-scaling plan + prompt corpus + reproducib… (dusterbloom)
b441587 docs(gemma4): debugging journey blog — three fixes, prompt-distributi… (dusterbloom)
e65eefb docs(gemma4): amend journey blog with corrected pflash + dense ladder (dusterbloom)
bf8653e docs(bench): scientific harness — 24-cell dense×MoE × code×creative ×… (dusterbloom)
98f72c1 feat(gemma4): port SWA truncation to draft graph + YaRN opt-in (dusterbloom)
e05afcd feat(gemma4): auto-set GGML_CUDA_NO_VMM=1 on <=24 GiB GPUs (dusterbloom)
4b0c158 fix(gemma4): TQ3 graph-level FWHT rotation contract (dusterbloom)
f008033 chore(submodule): bump dflash/deps/llama.cpp to feature/tq3-kv-cache-… (dusterbloom)
b1dac51 Merge origin/main into feature/gemma4-support — clear PR #131 CONFLIC… (dusterbloom)
87722d3 fix: address 9 P2 cubic-dev-ai review violations (dusterbloom)
3fb16e0 fix(gemma4): replace no-op VMM setenv with runtime warning + build doc (dusterbloom)
4935293 refactor(dflash): unify quantize_draft_q8.py to support qwen + gemma4 (dusterbloom)
9ccd827 refactor(dflash): remove f16_convert.cu, use ggml_cpy for type conver… (dusterbloom)
337cee0 refactor(test): group gemma4 tests into dflash/test/gemma4/ (dusterbloom)
d8ebd12 feat(gemma4-mtp): γ>1 chain speculative decode (Phase 1+2+3, approach A) (dusterbloom)
233b469 docs(gemma4-mtp): Phase 4 sweep results — γ=2 at 64K is +29% over no-MTP (dusterbloom)
4bcb972 feat(gemma4-mtp): approach B — multi-row h_prev capture, +61% at 64K (dusterbloom)
a32d03d docs(site): correct stale MTP claims, add γ>1 MTP section (dusterbloom)
30a6b90 Rigorous rewrite: regime split, per-component OVAT, walked-back infla… (dusterbloom)
d71290a chore(submodule): retrack feature/tq3-kv-cache-clean (rewritten linea… (dusterbloom)
8ff12c2 Merge remote-tracking branch 'origin/main' into feature/gemma4-support (dusterbloom)
1e7612e docs(site): fill J/tok for Dense 31B MTP γ=2 at 64K and 128K (dusterbloom)
4182fd6 chore(repo): untrack .sisyphus/ working dossier to slim PR (dusterbloom)
96bdd9f fix(dflash): add missing endif for DFLASH27B_GPU_BACKEND="cuda" block (dusterbloom)
6f4d0ea fix(dflash): MTP FA-types log reads need_mask, drop static gate (dusterbloom)
80881ca fix(dflash): bump g_kq_stride_pad to 256 when head_dim >= 512 (dusterbloom)
0e4fc06 Merge origin/main into feature/gemma4-support — post-#138/#119/#149 r… (dusterbloom)
44eadf5 chore(submodule): bump dflash/deps/llama.cpp — TQ3 warp-cooperative k… (dusterbloom)
a5a1909 feat(dflash): gate TQ3_0 KV onto pflash sparse FA via DFLASH_PFLASH_TQ3 (dusterbloom)
5c81bc2 chore(submodule): bump dflash/deps/llama.cpp — TQ3 dispatch + chunked… (dusterbloom)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | | `@@ -71,3 +71,6 @@ fix-plan.md` |
| 71 | 71 | `.env.local` |
| 72 | 72 | `*.pem` |
| 73 | 73 | `*.key` |
| | 74 | `+` |
| | 75 | `+# Sisyphus working dossier (preserve outside repo)` |
| | 76 | `+.sisyphus/` |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | | `@@ -1,7 +1,7 @@` |
| 1 | 1 | `[submodule "dflash/deps/llama.cpp"]` |
| 2 | 2 | `path = dflash/deps/llama.cpp` |
| 3 | | `-url = https://github.com/Luce-Org/llama.cpp-dflash-ggml.git` |
| 4 | | `-branch = luce-dflash` |
| | 3 | `+url = https://github.com/dusterbloom/llama-cpp-turboquant-cuda.git` |
| | 4 | `+branch = feature/tq3-kv-cache-clean` |
| 5 | 5 | `[submodule "dflash/deps/Block-Sparse-Attention"]` |
| 6 | 6 | `path = dflash/deps/Block-Sparse-Attention` |
| 7 | 7 | `url = https://github.com/mit-han-lab/Block-Sparse-Attention.git` |
.sisyphus/plans/20260428-1430-path-b-deltanet-wmma-scope.md: 276 changes (0 additions, 276 deletions). This file was deleted.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | | `@@ -0,0 +1,62 @@` |
| | 1 | `+// gemma4 — standalone CUDA library for DFlash speculative decoding of` |
| | 2 | `+// Gemma4 models (31B Dense and 26B-A4B MoE) with a DFlash draft model.` |
| | 3 | `+` |
| | 4 | `+#ifndef GEMMA4_H` |
| | 5 | `+#define GEMMA4_H` |
| | 6 | `+` |
| | 7 | `+#include <stddef.h>` |
| | 8 | `+#include <stdint.h>` |
| | 9 | `+` |
| | 10 | `+#ifdef __cplusplus` |
| | 11 | `+extern "C" {` |
| | 12 | `+#endif` |
| | 13 | `+` |
| | 14 | `+// ─── Gemma4-31B Dense config ───────────────────────────────────────` |
| | 15 | `+` |
| | 16 | `+#define GEMMA4_31B_HIDDEN 4096` |
| | 17 | `+#define GEMMA4_31B_LAYERS 60` |
| | 18 | `+#define GEMMA4_31B_N_HEADS 32` |
| | 19 | `+#define GEMMA4_31B_N_KV_HEADS 8` |
| | 20 | `+#define GEMMA4_31B_HEAD_DIM 128` |
| | 21 | `+#define GEMMA4_31B_INTERMEDIATE 16384` |
| | 22 | `+#define GEMMA4_31B_VOCAB 262144` |
| | 23 | `+#define GEMMA4_31B_SWA_WINDOW 1024` |
| | 24 | `+` |
| | 25 | `+// ─── Gemma4-26B-A4B MoE config ────────────────────────────────────` |
| | 26 | `+` |
| | 27 | `+#define GEMMA4_26B_HIDDEN 4096` |
| | 28 | `+#define GEMMA4_26B_LAYERS 30` |
| | 29 | `+#define GEMMA4_26B_N_HEADS 32` |
| | 30 | `+#define GEMMA4_26B_N_KV_HEADS 8` |
| | 31 | `+#define GEMMA4_26B_HEAD_DIM 128` |
| | 32 | `+#define GEMMA4_26B_INTERMEDIATE 16384` |
| | 33 | `+#define GEMMA4_26B_EXPERT_INTERMEDIATE 2048` |
| | 34 | `+#define GEMMA4_26B_N_EXPERTS 128` |
| | 35 | `+#define GEMMA4_26B_N_EXPERTS_USED 8` |
| | 36 | `+#define GEMMA4_26B_VOCAB 262144` |
| | 37 | `+#define GEMMA4_26B_SWA_WINDOW 1024` |
| | 38 | `+` |
| | 39 | `+// ─── Shared constants ─────────────────────────────────────────────` |
| | 40 | `+` |
| | 41 | `+#define GEMMA4_ROPE_THETA 1000000.0f` |
| | 42 | `+#define GEMMA4_RMS_EPS 1e-6f` |
| | 43 | `+#define GEMMA4_LOGIT_SOFTCAP 30.0f` |
| | 44 | `+#define GEMMA4_ATTN_SCALE 1.0f` |
| | 45 | `+` |
| | 46 | `+// ─── Draft model config ───────────────────────────────────────────` |
| | 47 | `+` |
| | 48 | `+#define GEMMA4_DRAFT_LAYERS 5` |
| | 49 | `+#define GEMMA4_DRAFT_BLOCK_SIZE 16` |
| | 50 | `+#define GEMMA4_DRAFT_N_TARGET_LAYERS 6` |
| | 51 | `+#define GEMMA4_31B_DRAFT_MASK_TOKEN_ID 4` |
| | 52 | `+#define GEMMA4_26B_DRAFT_MASK_TOKEN_ID 4` |
| | 53 | `+` |
| | 54 | `+// ─── Diagnostics ──────────────────────────────────────────────────` |
| | 55 | `+` |
| | 56 | `+const char * gemma4_last_error(void);` |
| | 57 | `+` |
| | 58 | `+#ifdef __cplusplus` |
| | 59 | `+}` |
| | 60 | `+#endif` |
| | 61 | `+` |
| | 62 | `+#endif // GEMMA4_H` |
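
The configuration constants in the header above imply a few derived quantities that are easy to check by hand. The following is a minimal sketch and not part of the PR: the helper names `gemma4_gqa_group_size`, `gemma4_kv_bytes_per_token_layer`, and `gemma4_swa_ring_slot` are hypothetical, the constants are copied by hand instead of being included from the header, and the ring-slot rule (`pos % window`) is a generic sliding-window scheme that may differ from the allocation the SWA ring commits actually implement.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Values copied by hand from the Gemma4-31B block of the header in this PR. */
#define GEMMA4_31B_N_HEADS    32
#define GEMMA4_31B_N_KV_HEADS 8
#define GEMMA4_31B_HEAD_DIM   128
#define GEMMA4_31B_SWA_WINDOW 1024

/* Hypothetical helper: how many query heads share one KV head under GQA. */
int gemma4_gqa_group_size(int n_heads, int n_kv_heads) {
    assert(n_kv_heads > 0 && n_heads % n_kv_heads == 0);
    return n_heads / n_kv_heads;
}

/* Hypothetical helper: bytes of K plus V stored for one token in one layer.
 * Assumes a plain element type of elem_bytes bytes (2 for f16); block-quantized
 * KV types such as Q8_0 or TQ3_0 pack differently and are not modeled here. */
size_t gemma4_kv_bytes_per_token_layer(int n_kv_heads, int head_dim,
                                       size_t elem_bytes) {
    return 2u * (size_t)n_kv_heads * (size_t)head_dim * elem_bytes;
}

/* Hypothetical helper: slot of absolute position pos in a ring buffer that
 * keeps only the last `window` tokens. This is a generic modulo scheme. */
int gemma4_swa_ring_slot(int64_t pos, int window) {
    assert(pos >= 0 && window > 0);
    return (int)(pos % window);
}
```

With the 31B values this gives 4 query heads per KV head and 4096 bytes of K plus V per token per layer at 2 bytes per element, so a layer that only keeps the 1024-token window would hold 4 MiB of KV regardless of context length. The header does not say which layers use the sliding window, so the per-model total is left out.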
You may need to provide the patch for ggml.