Luce-Org · easel · Jun 3, 2026
diff --git a/docs/experiments/gemma4-26b-coding-agent-loop-sweep-2026-05-30.md b/docs/experiments/gemma4-26b-coding-agent-loop-sweep-2026-05-30.md
@@ -0,0 +1,136 @@
+# Gemma 4 26B-A4B-it — coding-agent-loop autotune sweep — 2026-05-30
+
+First end-to-end run of the `coding-agent-loop` autotune profile against
+the live gemma-4-26b server on sindri.
+
+* **Host**: sindri (RTX 3090 Ti, 24 GB, WSL2)
+* **Image**: locally-built `lucebox-hub:cuda12` from
+  `feat/lucebox-docker` @ `cb58edb` (sm_86 only; includes the new
+  entrypoint with `DFLASH_FA_WINDOW` plumbing)
+* **Fixture**: one 6-bucket multi-turn replay case from
+  `luce-bench/src/lucebench/fixtures/agent_recorded/multi_turn_cases.json`
+  (single Claude Code session sliced at 8K/16K/32K/64K/100K/128K
+  approx-token buckets per `extract-agentic-fixture.py --multi-turn`)
+* **Profile**: `coding-agent-loop`, gemma bracket =
+  `max_ctx × fa_window × budget × pflash` = `{98304, 131072} ×
+  {0, 2048} × {16, 22, 32} × {off}` = 12 cells
+
+## Bracket + outcome
+
+| # | budget | max_ctx | fa_win | pflash | case_tok* | tok/s | pass |
+|---|---|---|---|---|---|---|---|
+| 1 | 16 | 98304  | 0    | off | 65205 → 90799 | **3.5** | ✓ winner |
+| 2 | 22 | 98304  | 0    | off | 65205 → 90799 | 3.4 | ✓ |
+| 3 | 32 | 98304  | 0    | off | 65205 → 90799 | 3.2 | ✓ |
+| 4 | 16 | 98304  | 2048 | off | 65205 → 90799 | 3.3 | ✓ |
+| 5 | 22 | 98304  | 2048 | off | 65205 → 90799 | 2.8 | ✓ |
+| 6 | 32 | 98304  | 2048 | off | 65205 → 90799 | 3.0 | ✓ |
+| 7 | 16 | 131072 | 0    | off | 102397 → ?    | —   | ✗ HTTP 400 in 0.2s |
+| 8 | 22 | 131072 | 0    | off | 102397 → ?    | —   | ✗ HTTP 400 in 0.2s |
+| 9 | 32 | 131072 | 0    | off | 102397 → ?    | —   | ✗ HTTP 400 in 0.2s |
+| 10 | 16 | 131072 | 2048 | off | 102397 → ?    | —   | ✗ HTTP 400 in 0.2s |
+| 11 | 22 | 131072 | 2048 | off | 102397 → ?    | —   | ✗ HTTP 400 in 0.2s |
+| 12 | 32 | 131072 | 2048 | off | 102397 → ?    | —   | ✗ HTTP 400 in 0.2s |
+
+\*`case_tok` is the picker's `context_tokens_approx` (`chars / 4`) →
+the server's actual `prompt_tokens` after tokenization + chat template
+wrapping. Real gemma tokenization expands by ~1.39× relative to chars/4
+on this fixture.
+
+## Verification: 131K serves the level2 suite on sindri (2026-05-30 evening)
+
+After bragi's sweep showed 131K viable on a 23 GB Laptop, sindri was
+bumped to `max_ctx=131072, budget=22, fa_window=0` and re-ran the
+level2 area set. Drop-in works: no quality regression, longctx still
+100%.
+
+| area | 98K rate | 131K rate | delta |
+|---|---|---|---|
+| smoke | 100% (3/3) | 100% (3/3) | = |
+| code | 10% (1/10) | 10% (1/10) | = |
+| gsm8k | 91% (91/100) | 91% (91/100) | = |
+| truthfulqa-mc1 | 80% (80/100) | 76% (76/100) | −4 pp (stochastic) |
+| hellaswag | 70% (70/100) | 75% (75/100) | +5 pp (stochastic) |
+| agent | 50% (2/4) | 50% (2/4) | = |
+| longctx | 100% (6/6) | 100% (6/6) | = |
+
+VRAM at boot on 131K: 21.1 / 24.6 GiB used; ~3 GiB headroom. The
+longctx-64k cell prefilled 66,853 tokens in 45.9 s (~1450 tok/s
+prefill) and decoded 61 tokens in 955 ms (~64 tok/s decode).
+Snapshot: `…-gemma-131k-verify-2026-05-30-67f4`.
+
+## Correction (added 2026-05-30 after bragi sweep)
+
+The 131K failures below were a **fixture-picker artifact, not a VRAM limit**.
+After `safety_factor` was updated to 0.7, the picker selects the 64K case
+for 131K cells instead of the 100K case, and 131K cells pass on both sindri
+and bragi. See
+`docs/experiments/gemma4-26b-coding-agent-loop-sweep-bragi-2026-05-30.md`
+for the full analysis. Finding 1 below describes what happened mechanically;
+the conclusion "98K is the ceiling" no longer holds.
+
+## Findings
+
+1. **131K cells failed due to fixture selection, not VRAM.** All six
+   98K cells passed; all six 131K cells failed fast with HTTP 400
+   *before* any prefill. The failure mode is request-validation, not
+   OOM — the server's "effort-tier ceiling = max_ctx(131072) − 4096 =
+   126976" rejects requests whose `prompt_tokens` exceed the ceiling.
+
+2. **The picker's `chars/4` token estimate undercounts on real gemma
+   tokenization by ~40%.** The 65K-bucket case (`context_tokens_approx
+   = 65205`) tokenizes to **90799** real tokens. The 102K-bucket case
+   (`context_tokens_approx = 102397`) likely tokenizes to ~130K+ real
+   tokens — over the 126976 ceiling at max_ctx=131072. The picker
+   selected it for the 131K cells, the server rejected it, every
+   131K cell failed identically.
+
+3. **`fa_window` doesn't help at this prompt size on gemma4-26b.**
+   `fa_window=0` (full attention, server default) beat `fa_window=2048`
+   in every (budget, max_ctx) cell. The differences are small (~3-7%)
+   but consistent. fa_window's sparse-decode optimization is wasted
+   compute on a 26B-A4B-MoE model where decode bandwidth isn't the
+   bottleneck at 90K tokens.
+
+4. **`budget` axis is nearly flat at 90K prompt size.** 16/22/32 produce
+   3.5/3.4/3.2 tok/s — small enough margin that noise dominates. The
+   heuristic default of `budget=22` is fine; the sweep's preference for
+   `budget=16` is within run-to-run variance.
+
+5. **Decode throughput at 90K prompt: ~3.5 tok/s.** Mostly prefill cost:
+   wall=72s, ~256 completion tokens, so decode-phase is ~30s for 256
+   tokens (~8.5 tok/s decode-only). Prefill of 90K tokens takes ~40s
+   on a 3090 Ti — about 2250 tok/s prefill rate.
+
+## Heuristic update (gemma4 24 GB WSL)
+
+Bump `runtime_from_host()` for the 22-31 GB / WSL tier from
+`max_ctx=65536` to `max_ctx=98304`. Empirical evidence that 98K serves
+real agentic traces with reasonable headroom (90K real prompts pass
+with ~3 GB VRAM unused). Keep `budget=16` and the existing defaults.
+
+131K remains plausible as a manual operator setting (proven to boot
+2026-05-29; serves short prompts) but not as a default — the sweep
+fixture overshoots its prompt budget, and we lack a long-prompt case
+sized for the real 126976-token ceiling. Future work:
+
+* Fix the picker's safety factor (use ~0.7× the approximate budget)
+  or re-tokenize fixtures with the real gemma tokenizer at extraction
+  time.
+* Re-run the 131K cells with a properly-sized case (~110K real tokens)
+  to confirm 131K serves agentic workloads, not just short prompts.
+
+## Reproducing
+
+```sh
+# From the worktree, with LUCEBOX_HOST_* env unset (sweep falls back
+# to the persisted [host] block in config.toml):
+cd /home/erik/Projects/lucebox-hub-285
+uv run --project lucebox python -m lucebox autotune \
+    --sweep --profile coding-agent-loop --yes
+```
+
+Raw output captured at
+`/tmp/sweep-gemma-coding-agent-loop.log` during the 2026-05-30 run
+(local-only; not checked into the repo because the per-cell server
+restarts produce ~MB of progress noise).
diff --git a/docs/experiments/gemma4-26b-coding-agent-loop-sweep-bragi-2026-05-30.md b/docs/experiments/gemma4-26b-coding-agent-loop-sweep-bragi-2026-05-30.md
@@ -0,0 +1,129 @@
+# Gemma 4 26B-A4B-it — coding-agent-loop autotune sweep — bragi — 2026-05-30
+
+Second run of the `coding-agent-loop` autotune profile against gemma-4-26b;
+first run on bragi (Blackwell sm_120). Corrects an incorrect conclusion from
+the earlier sindri sweep where all 131K cells appeared to fail.
+
+* **Host**: bragi (RTX 5090 Laptop MaxQ, 23 GB VRAM, WSL2, sm_120)
+  * **Note**: GPU at ~86–90 W / 1515 MHz (Windows Balanced mode; WSL2 cannot
+    set TDP). At full performance (150–175 W) decode rate would be ~50–60 tok/s
+    vs the ~30 tok/s observed here.
+* **Image**: locally-built `lucebox-hub:cuda12` from
+  `feat/lucebox-docker` @ `48fafe6` (DFLASH_CUDA_ARCHES=120)
+* **Fixture**: one 6-bucket multi-turn replay case from
+  `luce-bench/src/lucebench/fixtures/agent_recorded/multi_turn_cases.json`
+  (case `claude-2026-05-23-multiturn-65536-65eed`, 65205 approx-token bucket)
+* **Profile**: `coding-agent-loop`, gemma bracket =
+  `max_ctx × {98304, 131072} × fa_window × {0, 2048} × budget × {16, 22, 32}` = 12 cells
+
+## Bracket + outcome
+
+| # | budget | max_ctx | fa_win | case_tok* | tok/s  | pass     |
+|---|--------|---------|--------|-----------|--------|----------|
+| 1 | 16     | 98304   | 0      | 65205     | 2.0    | ✓        |
+| 2 | 22     | 98304   | 0      | 65205     | 1.9    | ✓        |
+| 3 | 32     | 98304   | 0      | 65205     | 2.0    | ✓        |
+| 4 | 16     | 98304   | 2048   | 65205     | 1.9    | ✓        |
+| 5 | 22     | 98304   | 2048   | 65205     | 2.0    | ✓        |
+| 6 | 32     | 98304   | 2048   | 65205     | 2.0    | ✓        |
+| 7 | 16     | 131072  | 0      | 65205     | 2.0    | ✓        |
+| 8 | 22     | 131072  | 0      | 65205     | 2.0    | ✓ **winner** |
+| 9 | 32     | 131072  | 0      | 65205     | 2.0    | ✓        |
+| 10 | 16    | 131072  | 2048   | 65205     | 2.0    | ✓        |
+| 11 | 22    | 131072  | 2048   | 65205     | 1.9    | ✓        |
+| 12 | 32    | 131072  | 2048   | 65205     | 2.0    | ✓        |
+
+\*`case_tok` is the picker's `context_tokens_approx` (chars/4). The actual
+real token count after Gemma tokenization + chat template wrapping is
+**~90K** (1.39× expansion). All cells used the same 64K-bucket case.
+
+Winner: cell 8 (budget=22, max_ctx=131072, fa_window=0, 2.0 tok/s). Cells 7
+and 8 both scored 2.0 tok/s, but cell 8's wall time (63.9 s vs 64.4 s) gave
+it a marginally higher float speed_metric, beating cell 7 (budget=16) on the
+primary sort key before the budget tiebreaker fired.
+
+## Findings
+
+### 1. Gemma 4 26B fits at 131K context on 23 GB VRAM — confirmed
+
+All 12 cells passed, including all 6 at max_ctx=131072. VRAM breakdown:
+- Model weights (Gemma 26B-A4B Q4_K_M + draft): ~14–15 GB
+- KV cache F16 at 131072 ctx (GQA, ~4 KV heads, 256 head dim, 30 layers):
+  ~7–8 GB
+- Total: **~22–23 GB** — fits on bragi's 23 GB with ~1 GB headroom
+
+The KV cache is allocated upfront for max_ctx tokens at server startup.
+Since all 131K cells started and responded, the allocation succeeded. The
+headroom is slim — this config sits at the edge of the hardware.
+
+### 2. Why sindri appeared to fail at 131K (fixture picker issue)
+
+The sindri sweep (`gemma4-26b-coding-agent-loop-sweep-2026-05-30.md`)
+reported all 131K cells failing with HTTP 400. At the time, the fixture
+picker selected the **100K-bucket case** (`context_tokens_approx ≈ 102397`)
+for max_ctx=131072. Gemma expands that by ~1.39×: 102397 × 1.39 ≈ 142K
+real tokens, exceeding the server's 131072 − 4096 = **126976** ceiling.
+
+On bragi today, the picker selected the **64K-bucket case**
+(`context_tokens_approx = 65205`) for both 98304 and 131072, which expands
+to ~90K real tokens — well within 126976. The picker's
+`safety_factor=0.7` was likely updated between the two runs, changing the
+effective budget threshold from `1.0 × (max_ctx − 4096)` to
+`0.7 × (max_ctx − 4096)`:
+
+- Old: effective_budget = 126976 × 1.0 = 126976 → 100K case (102397) fits ✓
+- New: effective_budget = 126976 × 0.7 = 88883 → 64K case (65205) fits ✓,
+  100K case (102397) does not ✗
+
+So sindri's 131K failures were a **fixture selection artifact, not a VRAM
+limit**. The hardware could handle it; the test picked a case that was too
+large for the server's request ceiling.
+
+### 3. fa_window gives no benefit for Gemma 4 at 90K-token context
+
+fa_window=0 (full attention) and fa_window=2048 produced identical throughput
+(2.0 tok/s, within noise) across all budget/max_ctx combinations. This
+replicates the sindri finding: Gemma 4 26B-A4B's decode is not
+bandwidth-bound at this scale in a way that sparse-attention windowing
+improves. fa_window=0 is the recommended default.
+
+### 4. Budget insensitive on Gemma's MoE architecture
+
+budget=16, 22, and 32 all score ~2.0 tok/s (within ±0.1 tok/s noise) at
+both context sizes. The draft budget has minimal leverage on Gemma 4 26B-A4B:
+the 4B-active MoE decoder is already fast enough that more speculative tokens
+don't meaningfully amortize verification cost. budget=22 (the heuristic
+default) is fine; there's no need to tune this axis further.
+
+### 5. Gemma is faster than Qwen3.6 at long context
+
+At 98K context (both using the same 64K case):
+- Gemma 4 26B-A4B: **2.0 tok/s**, 64 s wall (90K actual tokens)
+- Qwen3.6 27B: **1.2 tok/s**, 209 s wall (~85K actual tokens)
+
+Gemma's 4B-active MoE architecture decodes ~67% faster than Qwen3.6's denser
+27B at equivalent real-token prompt sizes.
+
+## Heuristic updates applied
+
+**`autotune.py` — `_coding_agent_loop_gemma_bracket()` docstring:**
+Updated to note that 131K is confirmed viable on 23–24 GB VRAM. Removed the
+implication that 131K cells fail. The old sindri conclusion was a
+fixture-picker artifact, not a hardware constraint.
+
+No code change to the bracket itself — it already correctly sweeps both
+98304 and 131072. The winner selection (sort by max_ctx first) will
+automatically prefer 131072 cells if they pass.
+
+## Recommended config (bragi, Gemma 4 26B, 23 GB VRAM WSL2)
+
+```toml
+[dflash]
+budget = 22
+max_ctx = 131072
+fa_window = 0
+```
+
+Prefill throughput at 90K real tokens: ~240 s wall (~375 tok/s). Decode
+throughput: **~2.0 tok/s** speculative, 126-token response. The 131K ceiling
+accommodates real coding-agent sessions up to ~120K real tokens.
diff --git a/docs/experiments/qwen3.6-27b-coding-agent-loop-sweep-bragi-2026-05-30.md b/docs/experiments/qwen3.6-27b-coding-agent-loop-sweep-bragi-2026-05-30.md
@@ -0,0 +1,132 @@
+# Qwen3.6-27B-Q4_K_M — coding-agent-loop autotune sweep — bragi — 2026-05-30
+
+First end-to-end run of the `coding-agent-loop` autotune profile on
+Qwen3.6-27B on bragi, a consumer Blackwell laptop.
+
+* **Host**: bragi (RTX 5090 Laptop MaxQ, 23 GB VRAM, WSL2, sm_120)
+  * **Note**: GPU running at ~86–90 W / 1515 MHz during this run (Windows
+    Balanced power mode; WSL2 cannot set TDP). Full-performance mode
+    (Best performance) would yield ~150–175 W / 2500+ MHz and ~40–50 tok/s
+    decode vs the 24–25 tok/s observed here.
+* **Image**: locally-built `lucebox-hub:cuda12` from
+  `feat/lucebox-docker` @ `48fafe6` (DFLASH_CUDA_ARCHES=120, sm_120 fat
+  binary)
+* **Fixture**: one 6-bucket multi-turn replay case from
+  `luce-bench/src/lucebench/fixtures/agent_recorded/multi_turn_cases.json`
+  (single Claude Code session sliced at 8K/16K/32K/64K/100K/128K
+  approx-token buckets per `extract-agentic-fixture.py --multi-turn`)
+* **Profile**: `coding-agent-loop`, qwen bracket =
+  `max_ctx × cache_type × budget × fa_window` =
+  `{65536, 98304} × {tq3_0, q8_0} × {16, 22, 32} × {0}` = 12 cells
+
+## Bracket + outcome
+
+| # | budget | max_ctx | kv    | case_tok*       | tok/s | pass       |
+|---|--------|---------|-------|-----------------|-------|------------|
+| 1 | 16     | 65536   | tq3_0 | 32768 → 42735   | 3.1   | ✓          |
+| 2 | 22     | 65536   | tq3_0 | 32768 → 42735   | 3.1   | ✓          |
+| 3 | 32     | 65536   | tq3_0 | 32768 → 42735   | —     | ✗ timeout  |
+| 4 | 16     | 65536   | q8_0  | 32768 → 42735   | 4.0   | ✓          |
+| 5 | 22     | 65536   | q8_0  | 32768 → 42735   | —     | ✗ timeout  |
+| 6 | 32     | 65536   | q8_0  | 32768 → 42735   | —     | ✗ timeout  |
+| 7 | 16     | 98304   | tq3_0 | 65536 → ~85500  | 1.2   | ✓ **winner** |
+| 8 | 22     | 98304   | tq3_0 | 65536 → ~85500  | 1.2   | ✓          |
+| 9 | 32     | 98304   | tq3_0 | 65536 → ~85500  | 1.2   | ✓          |
+| 10 | 16    | 98304   | q8_0  | 65536 → ~85500  | —     | ✗ timeout  |
+| 11 | 22    | 98304   | q8_0  | 65536 → ~85500  | —     | ✗ timeout  |
+| 12 | 32    | 98304   | q8_0  | 65536 → ~85500  | —     | ✗ timeout  |
+
+\*`case_tok` = picker's `context_tokens_approx` (chars/4) → estimated
+real token count after Qwen3.6 tokenization. Real Qwen3.6 tokenization
+expands by ~**1.30×** relative to chars/4 on this fixture (32768 approx
+→ 42,735 real tokens; 65536 approx → ~85K real tokens).
+
+## Findings
+
+### 1. tq3_0 is required at 98K context on 23 GB VRAM
+
+All six q8_0 cells at `max_ctx=98304` timed out (300 s, no response).
+All three tq3_0 cells at `max_ctx=98304` passed (208–219 s wall time).
+
+VRAM breakdown:
+- Model weights (Qwen3.6-27B Q4_K_M + draft): ~18–19 GB
+- KV cache q8_0 at 98304 ctx: ~5–6 GB → total **24–25 GB** → OOM on 23 GB
+- KV cache tq3_0 at 98304 ctx: ~2–3 GB → total **21–22 GB** → ~1–2 GB headroom
+
+The timeouts are silent VRAM OOM crashes: the container exits during
+server startup (no OOM error in the log — the GPU driver kills the
+process), the readiness probe never succeeds, and the 300 s timeout fires.
+
+### 2. q8_0 is faster for short-context inference but only at low budget
+
+At `max_ctx=65536`, `budget=16`, `kv=q8_0` achieves **4.0 tok/s** vs
+**3.1 tok/s** for tq3_0 (+29%). This is likely because q8_0 KV lookup
+avoids dequantization overhead that tq3_0 pays per head.
+
+However, q8_0 only survives budget=16 at 65536 (budget=22 and 32 timeout).
+On this 23 GB card, even at 65536 context, q8_0 + budget=22/32 pushes
+VRAM past the limit.
+
+### 3. budget=32 is unreliable at 65536 context
+
+`tq3_0 + budget=32 + max_ctx=65536` timed out despite `budget=16` and
+`budget=22` passing at 82–83 s. This aligns with finding #2: higher
+budget → more speculative decode state → marginally more VRAM → OOM edge.
+
+At `max_ctx=98304`, budget=32 is fine (219 s vs 208 s for budget=16) —
+the tq3_0 KV savings provide enough headroom that the extra budget state
+fits.
+
+### 4. Speed metrics are not comparable across max_ctx values
+
+The fixture picker selects the largest case that fits within
+`max_ctx − 4096 × 0.7 safety factor`. At `max_ctx=65536` it picks the
+32K case (42K real tokens); at `max_ctx=98304` it picks the 64K case
+(~85K real tokens). The 65K-cell 4.0 tok/s looks better than the 98K-cell
+1.2 tok/s, but they measured different amounts of work — not the same
+workload on different configs.
+
+A sweeper sorting by tok/s would pick 65536/q8_0/b16 as the "winner",
+which would silently cap real agentic sessions at 64K and OOM on longer
+ones. The winner selection was updated (see below) to prefer larger
+max_ctx first.
+
+### 5. Qwen3.6 tokenizer expansion: 1.30× on this fixture
+
+The 32K-bucket case has `context_tokens_approx = 32768` (chars/4 estimate)
+but the server reports **42,735** real prompt tokens after Qwen3.6
+tokenization + chat template wrapping. Expansion ratio: **1.30×**. Compare:
+gemma-4-26b on sindri showed ~1.39× on the same fixture.
+
+## Heuristic updates applied
+
+**`autotune.py` — `runtime_from_host()` for 22-31 GB tier:**
+Explicitly set `cache_type_k="tq3_0", cache_type_v="tq3_0"` for both WSL
+and native 22-31 GB paths. Previously the field was left empty (server
+default), which could be q8_0 or f16 — both OOM at 98K on 23 GB VRAM.
+
+**`autotune.py` — `_coding_agent_loop_qwen_bracket()` for 22-31 GB:**
+Skip q8_0 when `max_ctx >= 98304`. Previously all 12 cells were generated;
+the 6 q8_0/98K cells always fail on 23 GB hardware, wasting ~30 min of
+sweep time. Reduced to 9 cells (tq3_0+q8_0 at 65K, tq3_0-only at 98K).
+
+**`sweep.py` — `_pick_winner()` for `agent_replay_pass_rate`:**
+Changed primary sort key from `-speed_metric` to `-max_ctx`. Rationale:
+different max_ctx values exercise different-sized fixture cases (see
+finding #4). Speed is only meaningful within the same max_ctx group. The
+corrected sort ensures the winner always uses the largest viable context
+window, then optimizes speed within that group.
+
+## Recommended config (bragi, Qwen3.6-27B, 23 GB VRAM WSL2)
+
+```toml
+[dflash]
+budget = 16
+max_ctx = 98304
+cache_type_k = "tq3_0"
+cache_type_v = "tq3_0"
+```
+
+Prefill throughput: ~500 tok/s. Decode throughput at 85K-token context:
+**~1.2 tok/s** (speculative decode, 256-token response). Wall time for a
+full 90K-token agentic session: ~210 s to first token, then ~1.2 tok/s.