22 commits
5ee93da
feat(common): KVFlash pager core + chunk-relevance scorer interface
davide221 Jun 12, 2026
a275c51
feat(qwen35): wire KVFlash into the daemon (--kvflash / --kvflash-tau)
davide221 Jun 12, 2026
48da33d
test: KVFlash verification suite (test_kvflash)
davide221 Jun 12, 2026
f268ffb
docs: optimizations/kvflash (README, RESULTS, DESIGN)
davide221 Jun 12, 2026
facecc1
feat(kvflash): port bounded KV residency to qwen35moe, laguna, gemma4
davide221 Jun 12, 2026
be16d30
fix(kvflash): address cubic review findings on PR #373
davide221 Jun 12, 2026
e2f4296
refactor(kvflash): consolidate per-backend duplication into common he…
davide221 Jun 12, 2026
2c9dffe
docs(kvflash): hub README card + hero + Q8_0 footnote on the 256K rows
davide221 Jun 12, 2026
5cb0606
docs(kvflash): state the KV quant in the table headers
davide221 Jun 12, 2026
7a849e0
docs: KVFlash flags in the main README server-flags reference
davide221 Jun 12, 2026
17f6cbc
docs: keep the main-README KVFlash intro model-agnostic
davide221 Jun 12, 2026
9db8472
feat(kvflash): pooled chunked prefill, --kvflash auto, drafter scorer…
davide221 Jun 12, 2026
f699376
feat(kvflash): drafter-scored residency is the default policy
davide221 Jun 12, 2026
a351091
feat(kvflash): cross-tokenizer drafter scoring for laguna/gemma4 + --…
davide221 Jun 12, 2026
5e79666
feat(kvflash): VRAM-aware auto pool sizing
davide221 Jun 12, 2026
321695c
fix(kvflash): pre-ship audit — cubic round 2 + doc refresh
davide221 Jun 12, 2026
470123b
ci: give the ROCm GPU job its own concurrency group
davide221 Jun 12, 2026
58d924d
ci: fail the ROCm job fast with a KFD diagnosis instead of hanging
davide221 Jun 12, 2026
9a17281
feat(kvflash): --ddtree runs on the pool (gate removed)
davide221 Jun 12, 2026
cc42811
ci: ROCm probe survives a D-state hang
davide221 Jun 12, 2026
abb4cf4
feat(kvflash): gemma4 spec decode on the pool + gemma draft-loader re…
davide221 Jun 12, 2026
feef3fd
fix(draft): convert_dflash_to_gguf reads arch from config.json (was 2…
davide221 Jun 13, 2026
46 changes: 41 additions & 5 deletions .github/workflows/ci.yml
@@ -128,8 +128,8 @@ jobs:
needs: [uv-workspace]
runs-on: [self-hosted, gpu, sm86]
timeout-minutes: 30
- # The box has a single physical GPU: serialize GPU jobs across PRs instead
- # of letting concurrent runs clobber each other.
+ # Serialize CUDA jobs across PRs (one RTX 3090). The ROCm job has its
+ # own group: different physical GPU, no contention.
concurrency:
group: lucebox3-gpu-runner
cancel-in-progress: false
@@ -197,15 +197,51 @@ jobs:
needs: [uv-workspace]
runs-on: [self-hosted, rocm, gfx1151]
timeout-minutes: 20
- # Same single box as gpu-tests: serialize GPU jobs across PRs.
+ # Serialize across PRs per GPU. NOT the same group as the CUDA job:
+ # the combo box has two distinct GPUs (RTX 3090 + Strix iGPU), and a
+ # shared group only holds one waiting job, so the Radeon leg was
+ # chronically displaced ("higher priority waiting request") by every
+ # new CUDA job entering the queue.
concurrency:
- group: lucebox3-gpu-runner
+ group: lucebox3-rocm-runner
cancel-in-progress: false
steps:
- uses: actions/checkout@v4

- name: KFD health (diagnose instead of hanging)
# rocminfo on a wedged KFD blocks in uninterruptible sleep and eats
# the whole 20-minute job timeout. Probe with a hard timeout first,
# and when it hangs, dump the evidence (D-state holders, dmesg) so
# the job fails in seconds with a diagnosis instead of silently.
run: |
# A wedged KFD puts rocminfo in UNINTERRUPTIBLE sleep: timeout(1)
# cannot kill it and a foreground wait blocks until the job
# timeout. Probe in the background (output to a file so no pipe
# keeps the step alive) and enforce the deadline in the shell.
/opt/rocm/bin/rocminfo > /tmp/rocminfo.out 2>&1 &
PROBE=$!
for i in $(seq 1 15); do
kill -0 $PROBE 2>/dev/null || break
sleep 1
done
if kill -0 $PROBE 2>/dev/null; then
echo "::error::rocminfo hung (likely D-state) — ROCm/KFD wedged; the box needs a reboot"
echo "--- probe state:"
ps -o pid,stat,wchan:32,comm -p $PROBE || true
echo "--- processes holding /dev/kfd:"
sudo fuser -v /dev/kfd 2>&1 || true
echo "--- D-state processes:"
ps -eo pid,user,stat,wchan:32,comm | awk '$3 ~ /D/' || true
echo "--- recent amdgpu/kfd dmesg:"
sudo dmesg 2>/dev/null | grep -iE "amdgpu|kfd" | tail -15 || true
kill -9 $PROBE 2>/dev/null || true
disown $PROBE 2>/dev/null || true
exit 1
fi
wait $PROBE && echo "KFD healthy" || { echo "::error::rocminfo exited non-zero"; cat /tmp/rocminfo.out | tail -5; exit 1; }

- name: ROCm smoke (rocminfo sees gfx1151)
- run: /opt/rocm/bin/rocminfo | grep -E "Name:|Marketing Name:" | grep -iE "gfx1151|Radeon 8060S"
+ run: cat /tmp/rocminfo.out | grep -E "Name:|Marketing Name:" | grep -iE "gfx1151|Radeon 8060S"

- name: Build + run HIP vector-add on the Radeon 8060S
# Self-contained HIP kernel correctness test (no model weights). This is
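The KFD health step in this workflow relies on one reusable idea: start the probe in the background with its output sent to a file, poll it with `kill -0`, and enforce the deadline in the calling shell instead of trusting `timeout(1)`. A minimal sketch of that pattern as a standalone function follows. The name `run_with_deadline` and the return code 124 (borrowed from the GNU `timeout` convention) are assumptions for illustration and are not part of the PR.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the PR.
# Usage: run_with_deadline <seconds> <output-file> <command> [args...]
# Returns the command's exit status, or 124 if it is still running at the deadline.
run_with_deadline() {
  local deadline="$1" out="$2" i=0 pid
  shift 2
  # Write to a file, not a pipe, so nothing keeps the caller attached to the child.
  "$@" > "$out" 2>&1 &
  pid=$!
  # Poll once per second; stop early if the child has exited.
  while [ "$i" -lt "$deadline" ]; do
    kill -0 "$pid" 2>/dev/null || break
    sleep 1
    i=$((i + 1))
  done
  if kill -0 "$pid" 2>/dev/null; then
    # Still alive at the deadline: best-effort cleanup. A process stuck in
    # uninterruptible sleep ignores SIGKILL, so do not wait on it.
    kill -9 "$pid" 2>/dev/null || true
    disown "$pid" 2>/dev/null || true
    return 124
  fi
  # The child has exited: reap it and propagate its exit status.
  wait "$pid"
}
```

A caller could treat 124 as "hung" and any other non-zero status as an ordinary failure, which is the same split the workflow step makes between a wedged KFD and a `rocminfo` that exits non-zero.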
16 changes: 16 additions & 0 deletions README.md
@@ -39,6 +39,10 @@ Each one is self-contained with setup instructions and benchmark notes.
<a href="optimizations/spark/"><img src="assets/cards/spark_card.png" alt="Luce Spark MoE expert offload" width="46%"></a>
</p>

<p align="center">
<a href="optimizations/kvflash/"><img src="assets/cards/kvflash_card.png" alt="Luce KVFlash paged KV cache" width="46%"></a>
</p>

---

## Supported Models & Drafters
@@ -276,6 +280,18 @@ DFLASH27B_KV_TQ3=1 \
| `--kv-cache-dir <path>` | — | Persist prefix cache to disk |
| `--kv-cache-budget N` | — | On-disk cache size cap |

**Bounded KV residency (KVFlash)**

Pages the attention KV cache through a fixed pool of GPU slots; cold 64-token chunks live in host RAM, bit-exact and recallable. Decode speed stops depending on context length and resident KV stays pool-sized at any context. Off by default; works on every model family. Drafter-scored residency is the default on every family: the server finds the Qwen3-0.6B drafter next to the model (or via `--prefill-drafter`) and lazy-loads it as the relevance scorer that decides which chunks stay resident — non-qwen targets (laguna, gemma4) bridge the tokenizer gap by re-tokenizing the context text for the drafter. LRU is the fallback when no drafter is present, or the explicit choice via `--kvflash-policy lru`. Per-model numbers in [Luce KVFlash →](optimizations/kvflash/README.md).

| Flag / env | Default | Effect |
|---|---|---|
| `--kvflash <tokens\|auto>` | off | Resident pool size. `auto` sizes from the GPU: half of free VRAM after weights and reserves, at the model's KV density, capped where decode speed stays near the flat optimum (default 16384, override `DFLASH_KVFLASH_MAX_POOL`) and at `--max-ctx`. Explicit values are rounded to 256, clamped to `--max-ctx`, floored at the protected minimum so eviction always has a victim. |
| `--kvflash-policy {drafter,lru}` | `drafter` | Residency policy. `lru` opts out of the drafter probe/load (recency-only paging, no extra VRAM). |
| `--kvflash-tau N` | `64` | Reselect interval floor (drafter policy only); the effective interval grows with history to cap rescore overhead. |
| `DFLASH_KVFLASH=N` | off | Env equivalent of `--kvflash`. |
| `DFLASH_KVFLASH_TAU=N` | `64` | Env equivalent of `--kvflash-tau`. |

**Thinking budget**

| Flag | Default | Effect |
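The README hunk above says explicit `--kvflash` values are rounded to 256, clamped to `--max-ctx`, and floored at a protected minimum, and that cold KV lives in 64-token chunks. The arithmetic can be sketched as below. This is an illustration only: the server's real implementation is not shown in this diff, rounding up (rather than to nearest) is an assumption, the function names are hypothetical, and the protected minimum is passed in as a parameter because the README does not state its value.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the pool-size rules described in the README text.
CHUNK_TOKENS=64   # chunk size stated in the README text

# normalize_pool <requested-tokens> <max-ctx> <protected-min>
normalize_pool() {
  local req="$1" max_ctx="$2" min_pool="$3" pool
  # Round to a multiple of 256 (rounding UP is an assumption).
  pool=$(( (req + 255) / 256 * 256 ))
  # Clamp to --max-ctx.
  if [ "$pool" -gt "$max_ctx" ]; then pool="$max_ctx"; fi
  # Floor at the protected minimum so eviction always has a victim.
  if [ "$pool" -lt "$min_pool" ]; then pool="$min_pool"; fi
  echo "$pool"
}

# pool_chunks <pool-tokens>: number of 64-token chunks that fit in the pool.
pool_chunks() {
  echo $(( $1 / CHUNK_TOKENS ))
}

normalize_pool 1000 32768 512   # prints 1024
pool_chunks 16384               # prints 256
```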
3 changes: 3 additions & 0 deletions assets/cards/kvflash_card.png