Luce-Org · davide221 · Jun 12, 2026
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -128,8 +128,8 @@ jobs:
     needs: [uv-workspace]
     runs-on: [self-hosted, gpu, sm86]
     timeout-minutes: 30
-    # The box has a single physical GPU: serialize GPU jobs across PRs instead
-    # of letting concurrent runs clobber each other.
+    # Serialize CUDA jobs across PRs (one RTX 3090). The ROCm job has its
+    # own group: different physical GPU, no contention.
     concurrency:
       group: lucebox3-gpu-runner
       cancel-in-progress: false
@@ -197,15 +197,51 @@ jobs:
     needs: [uv-workspace]
     runs-on: [self-hosted, rocm, gfx1151]
     timeout-minutes: 20
-    # Same single box as gpu-tests: serialize GPU jobs across PRs.
+    # Serialize across PRs per GPU. NOT the same group as the CUDA job:
+    # the combo box has two distinct GPUs (RTX 3090 + Strix iGPU), and a
+    # shared group only holds one waiting job, so the Radeon leg was
+    # chronically displaced ("higher priority waiting request") by every
+    # new CUDA job entering the queue.
     concurrency:
-      group: lucebox3-gpu-runner
+      group: lucebox3-rocm-runner
       cancel-in-progress: false
     steps:
       - uses: actions/checkout@v4
 
+      - name: KFD health (diagnose instead of hanging)
+        # rocminfo on a wedged KFD blocks in uninterruptible sleep and eats
+        # the whole 20-minute job timeout. Probe it under a hard deadline
+        # first; if it hangs, dump the evidence (D-state holders, dmesg) so
+        # the job fails in seconds with a diagnosis instead of timing out.
+        run: |
+          # A wedged KFD puts rocminfo in UNINTERRUPTIBLE sleep: timeout(1)
+          # cannot kill it and a foreground wait blocks until the job
+          # timeout. Probe in the background (output to a file so no pipe
+          # keeps the step alive) and enforce the deadline in the shell.
+          /opt/rocm/bin/rocminfo > /tmp/rocminfo.out 2>&1 &
+          PROBE=$!
+          for i in $(seq 1 15); do
+            kill -0 $PROBE 2>/dev/null || break
+            sleep 1
+          done
+          if kill -0 $PROBE 2>/dev/null; then
+            echo "::error::rocminfo hung (likely D-state) — ROCm/KFD wedged; the box needs a reboot"
+            echo "--- probe state:"
+            ps -o pid,stat,wchan:32,comm -p $PROBE || true
+            echo "--- processes holding /dev/kfd:"
+            sudo fuser -v /dev/kfd 2>&1 || true
+            echo "--- D-state processes:"
+            ps -eo pid,user,stat,wchan:32,comm | awk '$3 ~ /D/' || true
+            echo "--- recent amdgpu/kfd dmesg:"
+            sudo dmesg 2>/dev/null | grep -iE "amdgpu|kfd" | tail -15 || true
+            kill -9 $PROBE 2>/dev/null || true
+            disown $PROBE 2>/dev/null || true
+            exit 1
+          fi
+          if wait $PROBE; then echo "KFD healthy"; else echo "::error::rocminfo exited non-zero"; tail -5 /tmp/rocminfo.out; exit 1; fi
+
       - name: ROCm smoke (rocminfo sees gfx1151)
-        run: /opt/rocm/bin/rocminfo | grep -E "Name:|Marketing Name:" | grep -iE "gfx1151|Radeon 8060S"
+        run: grep -E "Name:|Marketing Name:" /tmp/rocminfo.out | grep -iE "gfx1151|Radeon 8060S"
 
       - name: Build + run HIP vector-add on the Radeon 8060S
         # Self-contained HIP kernel correctness test (no model weights). This is

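Review note on the KFD health step: the background-probe pattern it uses can be factored into a small reusable function. The sketch below is only an illustration under assumptions, not part of the workflow. The function name `probe_with_deadline`, its argument order, and the exit code 124 (borrowed from the convention of `timeout(1)`) are all hypothetical choices, and it assumes bash, which reaps exited background children on its own so that `kill -0` stops succeeding once the probe is done.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the workflow): run a command in the background
# and enforce a deadline in the shell instead of blocking on the child.
# Usage: probe_with_deadline SECONDS OUTFILE CMD [ARGS...]
# Returns the command's exit status, or 124 if it is still running after
# SECONDS (124 is an assumed convention, mirroring timeout(1)).
probe_with_deadline() {
  local deadline=$1 outfile=$2
  shift 2
  # Send output to a file so no pipe keeps the caller attached to the child.
  "$@" > "$outfile" 2>&1 &
  local pid=$!
  local i=0
  while [ "$i" -lt "$deadline" ]; do
    kill -0 "$pid" 2>/dev/null || break   # child has already exited
    sleep 1
    i=$((i + 1))
  done
  if kill -0 "$pid" 2>/dev/null; then
    # Still alive at the deadline. SIGKILL is best effort only: it has no
    # effect on a process stuck in uninterruptible (D-state) sleep, so we
    # drop the job and return without waiting on it.
    kill -9 "$pid" 2>/dev/null || true
    disown "$pid" 2>/dev/null || true
    return 124
  fi
  wait "$pid"   # collect and propagate the child's exit status
}
```

With such a helper, the step body would reduce to a call like `probe_with_deadline 15 /tmp/rocminfo.out /opt/rocm/bin/rocminfo` followed by the diagnostics branch when the status is 124. Polling once per second means a fast probe may still take up to one second to be noticed, which matches the loop in the diff.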
diff --git a/README.md b/README.md
@@ -39,6 +39,10 @@ Each one is self-contained with setup instructions and benchmark notes.
   <a href="optimizations/spark/"><img src="assets/cards/spark_card.png" alt="Luce Spark MoE expert offload" width="46%"></a>
 </p>
 
+<p align="center">
+  <a href="optimizations/kvflash/"><img src="assets/cards/kvflash_card.png" alt="Luce KVFlash paged KV cache" width="46%"></a>
+</p>
+
 ---
 
 ## Supported Models & Drafters
@@ -276,6 +280,18 @@ DFLASH27B_KV_TQ3=1 \
 | `--kv-cache-dir <path>` | — | Persist prefix cache to disk |
 | `--kv-cache-budget N` | — | On-disk cache size cap |
 
+**Bounded KV residency (KVFlash)**
+
+Pages the attention KV cache through a fixed pool of GPU slots; cold 64-token chunks live in host RAM, bit-exact and recallable. Decode speed stops depending on context length, and resident KV stays pool-sized at any context. Off by default; works on every model family. The default residency policy is drafter-scored: the server finds the Qwen3-0.6B drafter next to the model (or via `--prefill-drafter`) and lazy-loads it as the relevance scorer that decides which chunks stay resident. Non-Qwen targets (laguna, gemma4) bridge the tokenizer gap by re-tokenizing the context text for the drafter. LRU is the fallback when no drafter is present, or the explicit choice via `--kvflash-policy lru`. Per-model numbers are in [Luce KVFlash →](optimizations/kvflash/README.md).
+
+| Flag / env | Default | Effect |
+|---|---|---|
+| `--kvflash <tokens\|auto>` | off | Resident pool size. `auto` sizes from the GPU: half of the VRAM left free after weights and reserves, at the model's KV density, capped both where decode speed stays near the flat optimum (default 16384, override with `DFLASH_KVFLASH_MAX_POOL`) and at `--max-ctx`. Explicit values are rounded to a multiple of 256, clamped to `--max-ctx`, and floored at the protected minimum so eviction always has a victim. |
+| `--kvflash-policy {drafter,lru}` | `drafter` | Residency policy. `lru` opts out of the drafter probe/load (recency-only paging, no extra VRAM). |
+| `--kvflash-tau N` | `64` | Reselect interval floor (drafter policy only); the effective interval grows with history to cap rescore overhead. |
+| `DFLASH_KVFLASH=N` | off | Env equivalent of `--kvflash`. |
+| `DFLASH_KVFLASH_TAU=N` | `64` | Env equivalent of `--kvflash-tau`. |
+
 **Thinking budget**
 
 | Flag | Default | Effect |

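Review note on the `--kvflash` row: the handling of explicit values (round to 256, clamp to `--max-ctx`, floor at the protected minimum) can be pictured as three arithmetic steps. The sketch below is an illustration only and does not come from the server source. The function name `normalize_pool` is hypothetical, rounding *up* to the next multiple of 256 is an assumption (the README does not state the direction), and the protected minimum is passed in as a parameter because its real value is not given here.

```shell
#!/usr/bin/env bash
# Hypothetical illustration of the explicit-value handling for --kvflash.
# Usage: normalize_pool REQUESTED_TOKENS MAX_CTX PROTECTED_MIN
normalize_pool() {
  local requested=$1 max_ctx=$2 min_pool=$3
  # Assumption: round UP to the next multiple of 256.
  local pool=$(( (requested + 255) / 256 * 256 ))
  # Clamp to the context limit.
  if [ "$pool" -gt "$max_ctx" ]; then pool=$max_ctx; fi
  # Floor at the protected minimum so eviction always has a victim.
  if [ "$pool" -lt "$min_pool" ]; then pool=$min_pool; fi
  echo "$pool"
}
```

For example, with an assumed protected minimum of 512 and `--max-ctx` of 32768, `normalize_pool 1000 32768 512` prints `1024`, while a request of 100000 is clamped to 32768 and a request of 10 is raised to 512.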
diff --git a/assets/cards/kvflash_card.png b/assets/cards/kvflash_card.png