136 changes: 136 additions & 0 deletions docs/experiments/gemma4-26b-coding-agent-loop-sweep-2026-05-30.md
@@ -0,0 +1,136 @@
# Gemma 4 26B-A4B-it — coding-agent-loop autotune sweep — 2026-05-30

First end-to-end run of the `coding-agent-loop` autotune profile against
the live gemma-4-26b server on sindri.

* **Host**: sindri (RTX 3090 Ti, 24 GB, WSL2)
* **Image**: locally-built `lucebox-hub:cuda12` from
`feat/lucebox-docker` @ `cb58edb` (sm_86 only; includes the new
entrypoint with `DFLASH_FA_WINDOW` plumbing)
* **Fixture**: one 6-bucket multi-turn replay case from
`luce-bench/src/lucebench/fixtures/agent_recorded/multi_turn_cases.json`
(single Claude Code session sliced at 8K/16K/32K/64K/100K/128K
approx-token buckets per `extract-agentic-fixture.py --multi-turn`)
* **Profile**: `coding-agent-loop`, gemma bracket =
`max_ctx × fa_window × budget × pflash` = `{98304, 131072} ×
{0, 2048} × {16, 22, 32} × {off}` = 12 cells
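
The bracket above is a plain Cartesian product of the four axes. A minimal sketch of the enumeration follows; `gemma_bracket` is a hypothetical helper written for illustration, not the project's actual function, and the cell ordering is only assumed to match the outcome table.

```python
from itertools import product

# Axis values copied from the profile description above.
MAX_CTX = (98304, 131072)
FA_WINDOW = (0, 2048)
BUDGET = (16, 22, 32)
PFLASH = ("off",)


def gemma_bracket():
    """Return one dict per sweep cell (hypothetical helper, not the project's API)."""
    return [
        {"max_ctx": m, "fa_window": f, "budget": b, "pflash": p}
        for m, f, b, p in product(MAX_CTX, FA_WINDOW, BUDGET, PFLASH)
    ]


cells = gemma_bracket()
print(len(cells))
```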

## Bracket + outcome

| # | budget | max_ctx | fa_win | pflash | case_tok* | tok/s | pass |
|---|---|---|---|---|---|---|---|
| 1 | 16 | 98304 | 0 | off | 65205 → 90799 | **3.5** | ✓ winner |
| 2 | 22 | 98304 | 0 | off | 65205 → 90799 | 3.4 | ✓ |
| 3 | 32 | 98304 | 0 | off | 65205 → 90799 | 3.2 | ✓ |
| 4 | 16 | 98304 | 2048 | off | 65205 → 90799 | 3.3 | ✓ |
| 5 | 22 | 98304 | 2048 | off | 65205 → 90799 | 2.8 | ✓ |
| 6 | 32 | 98304 | 2048 | off | 65205 → 90799 | 3.0 | ✓ |
| 7 | 16 | 131072 | 0 | off | 102397 → ? | — | ✗ HTTP 400 in 0.2s |
| 8 | 22 | 131072 | 0 | off | 102397 → ? | — | ✗ HTTP 400 in 0.2s |
| 9 | 32 | 131072 | 0 | off | 102397 → ? | — | ✗ HTTP 400 in 0.2s |
| 10 | 16 | 131072 | 2048 | off | 102397 → ? | — | ✗ HTTP 400 in 0.2s |
| 11 | 22 | 131072 | 2048 | off | 102397 → ? | — | ✗ HTTP 400 in 0.2s |
| 12 | 32 | 131072 | 2048 | off | 102397 → ? | — | ✗ HTTP 400 in 0.2s |

\*`case_tok` is the picker's `context_tokens_approx` (`chars / 4`) →
the server's actual `prompt_tokens` after tokenization + chat template
wrapping. Real gemma tokenization expands by ~1.39× relative to chars/4
on this fixture.
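
The expansion ratio can be recomputed from the two numbers in the footnote. The sketch below also projects the 100K-bucket case at the same ratio; that projection is an extrapolation, since the server rejected the request before reporting a real `prompt_tokens` count. The variable names are illustrative only.

```python
# Figures from the table footnote: chars/4 estimate vs. server-reported prompt_tokens.
approx_tokens = 65205
real_tokens = 90799

expansion = real_tokens / approx_tokens  # about 1.39

# Extrapolate the 100K-bucket case under the same ratio (assumption:
# the ratio measured on the 64K case also holds for the larger slice).
projected_100k = round(102397 * expansion)

# Request ceiling the server reports at max_ctx=131072.
ceiling = 131072 - 4096

print(f"expansion={expansion:.2f}x projected={projected_100k} ceiling={ceiling}")
```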

## Verification: 131K serves the level2 suite on sindri (2026-05-30 evening)

After bragi's sweep showed 131K viable on a 23 GB laptop, sindri was
bumped to `max_ctx=131072, budget=22, fa_window=0` and the level2 area
set was re-run. The change is a drop-in: no quality regression, and
longctx is still 100%.

| area | 98K rate | 131K rate | delta |
|---|---|---|---|
| smoke | 100% (3/3) | 100% (3/3) | = |
| code | 10% (1/10) | 10% (1/10) | = |
| gsm8k | 91% (91/100) | 91% (91/100) | = |
| truthfulqa-mc1 | 80% (80/100) | 76% (76/100) | −4 pp (stochastic) |
| hellaswag | 70% (70/100) | 75% (75/100) | +5 pp (stochastic) |
| agent | 50% (2/4) | 50% (2/4) | = |
| longctx | 100% (6/6) | 100% (6/6) | = |
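
As a rough check on the "(stochastic)" labels, a two-proportion z-score can be computed for the two areas that moved. This is a sketch under a simplifying assumption: it treats the 98K and 131K runs as independent samples of 100 items each, which ignores that both runs likely saw the same items.

```python
import math


def two_proportion_z(pass_a, pass_b, n):
    """z-score for the difference of two pass rates, each measured on n items.

    Assumes the two runs are independent samples (a simplification).
    """
    pa, pb = pass_a / n, pass_b / n
    se = math.sqrt(pa * (1 - pa) / n + pb * (1 - pb) / n)
    return (pb - pa) / se


z_truthfulqa = two_proportion_z(80, 76, 100)
z_hellaswag = two_proportion_z(70, 75, 100)
print(f"truthfulqa z={z_truthfulqa:.2f}, hellaswag z={z_hellaswag:.2f}")
```

Both scores have magnitude below 1, so under this assumption neither shift is distinguishable from sampling noise at n=100.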

VRAM at boot on 131K: 21.1 / 24.6 GiB used; ~3 GiB headroom. The
longctx-64k cell prefilled 66,853 tokens in 45.9 s (~1450 tok/s
prefill) and decoded 61 tokens in 955 ms (~64 tok/s decode).
Snapshot: `…-gemma-131k-verify-2026-05-30-67f4`.

## Correction (added 2026-05-30 after bragi sweep)

The 131K failures below were a **fixture-picker artifact, not a VRAM limit**.
After `safety_factor` was updated to 0.7, the picker selects the 64K case
for 131K cells instead of the 100K case, and 131K cells pass on both sindri
and bragi. See
`docs/experiments/gemma4-26b-coding-agent-loop-sweep-bragi-2026-05-30.md`
for the full analysis. Finding 1 below describes what happened mechanically;
the conclusion "98K is the ceiling" no longer holds.

## Findings

1. **131K cells failed due to fixture selection, not VRAM.** All six
98K cells passed; all six 131K cells failed fast with HTTP 400
*before* any prefill. The failure mode is request-validation, not
OOM — the server's "effort-tier ceiling = max_ctx(131072) − 4096 =
126976" rejects requests whose `prompt_tokens` exceed the ceiling.

2. **The picker's `chars/4` token estimate undercounts real gemma
tokenization: real counts run ~39% higher.** The 64K-bucket case
(`context_tokens_approx = 65205`) tokenizes to **90799** real tokens.
The 100K-bucket case (`context_tokens_approx = 102397`) would tokenize
to roughly 142K real tokens at the same ratio, over the 126976 ceiling
at max_ctx=131072. The picker selected it for the 131K cells, the
server rejected it, and every 131K cell failed identically.

3. **`fa_window` doesn't help at this prompt size on gemma4-26b.**
`fa_window=0` (full attention, server default) beat `fa_window=2048`
at every budget in the 98K cells, the only cells that ran. The gap is
~6% at budget 16 and 32 and ~18% at budget 22: small in absolute terms
(0.2-0.6 tok/s) but consistent. fa_window's sparse-decode optimization
appears to be wasted compute on a 26B-A4B-MoE model where decode
bandwidth isn't the bottleneck at 90K tokens.

4. **`budget` axis is nearly flat at 90K prompt size.** 16/22/32 produce
3.5/3.4/3.2 tok/s — small enough margin that noise dominates. The
heuristic default of `budget=22` is fine; the sweep's preference for
`budget=16` is within run-to-run variance.

5. **End-to-end throughput at 90K prompt: ~3.5 tok/s.** Prefill
dominates: wall=72s for ~256 completion tokens. Prefill of the 90K-token
prompt takes ~40s on a 3090 Ti (about 2250 tok/s), and the decode phase
takes ~30s for 256 tokens (~8.5 tok/s decode-only).
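
The arithmetic in finding 5 can be sketched as a small helper that splits one request's wall time into prefill and decode phases. `throughput_breakdown` is a hypothetical name, and the inputs are the rounded figures quoted above, so the results land near the quoted rates rather than exactly on them.

```python
def throughput_breakdown(prompt_tokens, completion_tokens, wall_s, prefill_s):
    """Split one request's wall time into prefill and decode rates (tok/s)."""
    decode_s = wall_s - prefill_s
    return {
        "end_to_end": completion_tokens / wall_s,
        "prefill": prompt_tokens / prefill_s,
        "decode_only": completion_tokens / decode_s,
    }


# Rounded inputs from finding 5: 90799 prompt tokens, ~256 completion
# tokens, 72 s wall, ~40 s of which is prefill.
rates = throughput_breakdown(
    prompt_tokens=90799, completion_tokens=256, wall_s=72, prefill_s=40
)
for name, value in rates.items():
    print(f"{name}: {value:.1f} tok/s")
```

The end-to-end figure is the one the sweep table reports; the decode-only rate is several times higher because most of the wall time is spent before the first token.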

## Heuristic update (gemma4 24 GB WSL)

Bump `runtime_from_host()` for the 22-31 GB / WSL tier from
`max_ctx=65536` to `max_ctx=98304`. The sweep gives empirical evidence
that 98K serves real agentic traces with reasonable headroom (90K real
prompts pass with ~3 GB VRAM unused). Keep `budget=16` and the existing
defaults.

131K remains plausible as a manual operator setting (proven to boot
2026-05-29; serves short prompts) but not as a default — the sweep
fixture overshoots its prompt budget, and we lack a long-prompt case
sized for the real 126976-token ceiling. Future work:

* Fix the picker's safety factor (use ~0.7× the approximate budget)
or re-tokenize fixtures with the real gemma tokenizer at extraction
time.
* Re-run the 131K cells with a properly-sized case (~110K real tokens)
to confirm 131K serves agentic workloads, not just short prompts.

## Reproducing

```sh
# From the worktree, with LUCEBOX_HOST_* env unset (sweep falls back
# to the persisted [host] block in config.toml):
cd /home/erik/Projects/lucebox-hub-285
uv run --project lucebox python -m lucebox autotune \
--sweep --profile coding-agent-loop --yes
```

Raw output captured at
`/tmp/sweep-gemma-coding-agent-loop.log` during the 2026-05-30 run
(local-only; not checked into the repo because the per-cell server
restarts produce ~MB of progress noise).
@@ -0,0 +1,129 @@
# Gemma 4 26B-A4B-it — coding-agent-loop autotune sweep — bragi — 2026-05-30

Second run of the `coding-agent-loop` autotune profile against gemma-4-26b;
first run on bragi (Blackwell sm_120). Corrects an incorrect conclusion from
the earlier sindri sweep where all 131K cells appeared to fail.

* **Host**: bragi (RTX 5090 Laptop MaxQ, 23 GB VRAM, WSL2, sm_120)
* **Note**: GPU at ~86–90 W / 1515 MHz (Windows Balanced mode; WSL2 cannot
set TDP). At full performance (150–175 W) decode rate would be ~50–60 tok/s
vs the ~30 tok/s observed here.
* **Image**: locally-built `lucebox-hub:cuda12` from
`feat/lucebox-docker` @ `48fafe6` (DFLASH_CUDA_ARCHES=120)
* **Fixture**: one 6-bucket multi-turn replay case from
`luce-bench/src/lucebench/fixtures/agent_recorded/multi_turn_cases.json`
(case `claude-2026-05-23-multiturn-65536-65eed`, 65205 approx-token bucket)
* **Profile**: `coding-agent-loop`, gemma bracket =
`max_ctx × fa_window × budget` = `{98304, 131072} × {0, 2048} × {16, 22, 32}` = 12 cells

## Bracket + outcome

| # | budget | max_ctx | fa_win | case_tok* | tok/s | pass |
|---|--------|---------|--------|-----------|--------|----------|
| 1 | 16 | 98304 | 0 | 65205 | 2.0 | ✓ |
| 2 | 22 | 98304 | 0 | 65205 | 1.9 | ✓ |
| 3 | 32 | 98304 | 0 | 65205 | 2.0 | ✓ |
| 4 | 16 | 98304 | 2048 | 65205 | 1.9 | ✓ |
| 5 | 22 | 98304 | 2048 | 65205 | 2.0 | ✓ |
| 6 | 32 | 98304 | 2048 | 65205 | 2.0 | ✓ |
| 7 | 16 | 131072 | 0 | 65205 | 2.0 | ✓ |
| 8 | 22 | 131072 | 0 | 65205 | 2.0 | ✓ **winner** |
| 9 | 32 | 131072 | 0 | 65205 | 2.0 | ✓ |
| 10 | 16 | 131072 | 2048 | 65205 | 2.0 | ✓ |
| 11 | 22 | 131072 | 2048 | 65205 | 1.9 | ✓ |
| 12 | 32 | 131072 | 2048 | 65205 | 2.0 | ✓ |

\*`case_tok` is the picker's `context_tokens_approx` (chars/4). The actual
real token count after Gemma tokenization + chat template wrapping is
**~90K** (1.39× expansion). All cells used the same 64K-bucket case.

Winner: cell 8 (budget=22, max_ctx=131072, fa_window=0, 2.0 tok/s). Cells 7
and 8 both scored 2.0 tok/s, but cell 8's wall time (63.9 s vs 64.4 s) gave
it a marginally higher float speed_metric, beating cell 7 (budget=16) on the
primary sort key before the budget tiebreaker fired.
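
The tie between cells 7 and 8 is only a tie after rounding. The sketch below assumes `speed_metric` is completion tokens divided by wall seconds and that both cells produced the 126-token response quoted elsewhere in this report; neither assumption is confirmed by the sweep code.

```python
# Assumption: speed_metric = completion tokens / wall seconds.
COMPLETION_TOKENS = 126  # response length quoted in this report


def speed_metric(wall_s, completion_tokens=COMPLETION_TOKENS):
    """Unrounded throughput for one cell, in tok/s."""
    return completion_tokens / wall_s


cell7 = speed_metric(64.4)  # budget=16
cell8 = speed_metric(63.9)  # budget=22

# Both round to the same one-decimal value shown in the table,
# but the unrounded floats still order the cells.
print(round(cell7, 1), round(cell8, 1), cell8 > cell7)
```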

## Findings

### 1. Gemma 4 26B fits at 131K context on 23 GB VRAM — confirmed

All 12 cells passed, including all 6 at max_ctx=131072. VRAM breakdown:
- Model weights (Gemma 26B-A4B Q4_K_M + draft): ~14–15 GB
- KV cache F16 at 131072 ctx (GQA, ~4 KV heads, 256 head dim, 30 layers):
~7–8 GB
- Total: **~22–23 GB** — fits on bragi's 23 GB with ~1 GB headroom

The KV cache is allocated upfront for max_ctx tokens at server startup.
Since all 131K cells started and responded, the allocation succeeded. The
headroom is slim — this config sits at the edge of the hardware.

### 2. Why sindri appeared to fail at 131K (fixture picker issue)

The sindri sweep (`gemma4-26b-coding-agent-loop-sweep-2026-05-30.md`)
reported all 131K cells failing with HTTP 400. At the time, the fixture
picker selected the **100K-bucket case** (`context_tokens_approx ≈ 102397`)
for max_ctx=131072. Gemma expands that by ~1.39×: 102397 × 1.39 ≈ 142K
real tokens, exceeding the server's 131072 − 4096 = **126976** ceiling.

On bragi today, the picker selected the **64K-bucket case**
(`context_tokens_approx = 65205`) for both 98304 and 131072, which expands
to ~90K real tokens, well within 126976. The picker's `safety_factor`
was likely changed to 0.7 between the two runs, lowering the effective
budget threshold from `1.0 × (max_ctx − 4096)` to
`0.7 × (max_ctx − 4096)`:

- Old: effective_budget = 126976 × 1.0 = 126976 → 100K case (102397) fits ✓
- New: effective_budget = 126976 × 0.7 = 88883 → 64K case (65205) fits ✓,
100K case (102397) does not ✗

So sindri's 131K failures were a **fixture selection artifact, not a VRAM
limit**. The hardware could handle it; the test picked a case that was too
large for the server's request ceiling.
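
The selection rule described in this finding can be sketched as follows. `pick_case` is a hypothetical reimplementation based only on the description here, not the picker's real code; the 65205 and 102397 sizes come from the reports, while the other entries in `CASES` are nominal bucket sizes used as placeholders.

```python
RESERVE = 4096  # the server's ceiling is max_ctx - 4096


def pick_case(case_sizes, max_ctx, safety_factor):
    """Return the largest approx-token case that fits the effective budget, or None."""
    budget = (max_ctx - RESERVE) * safety_factor
    fitting = [size for size in case_sizes if size <= budget]
    return max(fitting) if fitting else None


# 65205 and 102397 are from the reports; the rest are placeholder bucket sizes.
CASES = [8192, 16384, 32768, 65205, 102397, 131072]

old_pick = pick_case(CASES, max_ctx=131072, safety_factor=1.0)
new_pick = pick_case(CASES, max_ctx=131072, safety_factor=0.7)
print(old_pick, new_pick)
```

With the factor at 1.0 the picker takes the 100K case, which overshoots the real-token ceiling once tokenizer expansion is applied; at 0.7 it falls back to the 64K case for both 98304 and 131072.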

### 3. fa_window gives no benefit for Gemma 4 at 90K-token context

fa_window=0 (full attention) and fa_window=2048 produced identical throughput
(2.0 tok/s, within noise) across all budget/max_ctx combinations. This
replicates the sindri finding: Gemma 4 26B-A4B's decode is not
bandwidth-bound at this scale in a way that sparse-attention windowing
improves. fa_window=0 is the recommended default.

### 4. Budget insensitive on Gemma's MoE architecture

budget=16, 22, and 32 all score ~2.0 tok/s (within ±0.1 tok/s noise) at
both context sizes. The draft budget has minimal leverage on Gemma 4 26B-A4B:
the 4B-active MoE decoder is already fast enough that more speculative tokens
don't meaningfully amortize verification cost. budget=22 (the heuristic
default) is fine; there's no need to tune this axis further.

### 5. Gemma is faster than Qwen3.6 at long context

At 98K context (both using the same 64K case):
- Gemma 4 26B-A4B: **2.0 tok/s**, 64 s wall (90K actual tokens)
- Qwen3.6 27B: **1.2 tok/s**, 209 s wall (~85K actual tokens)

Gemma's 4B-active MoE architecture delivers ~67% higher end-to-end
throughput than Qwen3.6's dense 27B at comparable real-token prompt sizes
(90K vs ~85K).

## Heuristic updates applied

**`autotune.py` — `_coding_agent_loop_gemma_bracket()` docstring:**
Updated to note that 131K is confirmed viable on 23–24 GB VRAM. Removed the
implication that 131K cells fail. The old sindri conclusion was a
fixture-picker artifact, not a hardware constraint.

No code change to the bracket itself — it already correctly sweeps both
98304 and 131072. The winner selection (sort by max_ctx first) will
automatically prefer 131072 cells if they pass.

## Recommended config (bragi, Gemma 4 26B, 23 GB VRAM WSL2)

```toml
[dflash]
budget = 22
max_ctx = 131072
fa_window = 0
```

End-to-end throughput at 90K real tokens: **~2.0 tok/s** (speculative
decode, 126-token response, ~64 s wall). Prefill accounts for most of that
wall time, which puts the prefill rate at roughly 1400 tok/s or better.
The 131K ceiling accommodates real coding-agent sessions up to ~120K real
tokens.
@@ -0,0 +1,132 @@
# Qwen3.6-27B-Q4_K_M — coding-agent-loop autotune sweep — bragi — 2026-05-30

First end-to-end run of the `coding-agent-loop` autotune profile on
Qwen3.6-27B on bragi, a consumer Blackwell laptop.

* **Host**: bragi (RTX 5090 Laptop MaxQ, 23 GB VRAM, WSL2, sm_120)
* **Note**: GPU running at ~86–90 W / 1515 MHz during this run (Windows
Balanced power mode; WSL2 cannot set TDP). Full-performance mode
(Best performance) would yield ~150–175 W / 2500+ MHz and ~40–50 tok/s
decode vs the 24–25 tok/s observed here.
* **Image**: locally-built `lucebox-hub:cuda12` from
`feat/lucebox-docker` @ `48fafe6` (DFLASH_CUDA_ARCHES=120, sm_120 fat
binary)
* **Fixture**: one 6-bucket multi-turn replay case from
`luce-bench/src/lucebench/fixtures/agent_recorded/multi_turn_cases.json`
(single Claude Code session sliced at 8K/16K/32K/64K/100K/128K
approx-token buckets per `extract-agentic-fixture.py --multi-turn`)
* **Profile**: `coding-agent-loop`, qwen bracket =
`max_ctx × cache_type × budget × fa_window` =
`{65536, 98304} × {tq3_0, q8_0} × {16, 22, 32} × {0}` = 12 cells

## Bracket + outcome

| # | budget | max_ctx | kv | case_tok* | tok/s | pass |
|---|--------|---------|-------|-----------------|-------|------------|
| 1 | 16 | 65536 | tq3_0 | 32768 → 42735 | 3.1 | ✓ |
| 2 | 22 | 65536 | tq3_0 | 32768 → 42735 | 3.1 | ✓ |
| 3 | 32 | 65536 | tq3_0 | 32768 → 42735 | — | ✗ timeout |
| 4 | 16 | 65536 | q8_0 | 32768 → 42735 | 4.0 | ✓ |
| 5 | 22 | 65536 | q8_0 | 32768 → 42735 | — | ✗ timeout |
| 6 | 32 | 65536 | q8_0 | 32768 → 42735 | — | ✗ timeout |
| 7 | 16 | 98304 | tq3_0 | 65536 → ~85500 | 1.2 | ✓ **winner** |
| 8 | 22 | 98304 | tq3_0 | 65536 → ~85500 | 1.2 | ✓ |
| 9 | 32 | 98304 | tq3_0 | 65536 → ~85500 | 1.2 | ✓ |
| 10 | 16 | 98304 | q8_0 | 65536 → ~85500 | — | ✗ timeout |
| 11 | 22 | 98304 | q8_0 | 65536 → ~85500 | — | ✗ timeout |
| 12 | 32 | 98304 | q8_0 | 65536 → ~85500 | — | ✗ timeout |

\*`case_tok` = picker's `context_tokens_approx` (chars/4) → estimated
real token count after Qwen3.6 tokenization. Real Qwen3.6 tokenization
expands by ~**1.30×** relative to chars/4 on this fixture (32768 approx
→ 42,735 real tokens; 65536 approx → ~85K real tokens).

## Findings

### 1. tq3_0 is required at 98K context on 23 GB VRAM

All three q8_0 cells at `max_ctx=98304` timed out (300 s, no response).
All three tq3_0 cells at `max_ctx=98304` passed (208–219 s wall time).

VRAM breakdown:
- Model weights (Qwen3.6-27B Q4_K_M + draft): ~18–19 GB
- KV cache q8_0 at 98304 ctx: ~5–6 GB → total **24–25 GB** → OOM on 23 GB
- KV cache tq3_0 at 98304 ctx: ~2–3 GB → total **21–22 GB** → ~1–2 GB headroom

The timeouts are silent VRAM OOM crashes: the container exits during
server startup (no OOM error in the log — the GPU driver kills the
process), the readiness probe never succeeds, and the 300 s timeout fires.

### 2. q8_0 is faster for short-context inference but only at low budget

At `max_ctx=65536`, `budget=16`, `kv=q8_0` achieves **4.0 tok/s** vs
**3.1 tok/s** for tq3_0 (+29%). This is likely because q8_0 KV lookup
avoids dequantization overhead that tq3_0 pays per head.

However, q8_0 only survives budget=16 at 65536 (budget=22 and 32 time out).
On this 23 GB card, even at 65536 context, q8_0 + budget=22/32 pushes
VRAM past the limit.

### 3. budget=32 is unreliable at 65536 context

`tq3_0 + budget=32 + max_ctx=65536` timed out despite `budget=16` and
`budget=22` passing at 82–83 s. This aligns with finding #2: higher
budget → more speculative decode state → marginally more VRAM → OOM edge.

At `max_ctx=98304`, budget=32 is fine (219 s vs 208 s for budget=16) —
the tq3_0 KV savings provide enough headroom that the extra budget state
fits.

### 4. Speed metrics are not comparable across max_ctx values

The fixture picker selects the largest case that fits within
`(max_ctx − 4096) × 0.7`, where 0.7 is the safety factor. At `max_ctx=65536` it picks the
32K case (42K real tokens); at `max_ctx=98304` it picks the 64K case
(~85K real tokens). The 65K-cell 4.0 tok/s looks better than the 98K-cell
1.2 tok/s, but they measured different amounts of work — not the same
workload on different configs.

A sweeper sorting by tok/s would pick 65536/q8_0/b16 as the "winner",
which would silently cap real agentic sessions at 64K and OOM on longer
ones. The winner selection was updated (see below) to prefer larger
max_ctx first.

### 5. Qwen3.6 tokenizer expansion: 1.30× on this fixture

The 32K-bucket case has `context_tokens_approx = 32768` (chars/4 estimate)
but the server reports **42,735** real prompt tokens after Qwen3.6
tokenization + chat template wrapping. Expansion ratio: **1.30×**. Compare:
gemma-4-26b on sindri showed ~1.39× on the same fixture.

## Heuristic updates applied

**`autotune.py` — `runtime_from_host()` for 22-31 GB tier:**
Explicitly set `cache_type_k="tq3_0", cache_type_v="tq3_0"` for both WSL
and native 22-31 GB paths. Previously the field was left empty (server
default), which could be q8_0 or f16 — both OOM at 98K on 23 GB VRAM.

**`autotune.py` — `_coding_agent_loop_qwen_bracket()` for 22-31 GB:**
Skip q8_0 when `max_ctx >= 98304`. Previously all 12 cells were generated;
the 3 q8_0/98K cells always fail on 23 GB hardware, wasting ~15 min of
sweep time (3 × 300 s timeouts). Reduced to 9 cells (tq3_0+q8_0 at 65K,
tq3_0-only at 98K).
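
A minimal sketch of the pruned bracket, assuming the filter is a simple skip inside the product loop. `qwen_bracket` and its `skip_q8_at` parameter are hypothetical names for illustration; the real `_coding_agent_loop_qwen_bracket()` may be structured differently.

```python
from itertools import product


def qwen_bracket(skip_q8_at=98304):
    """Enumerate (max_ctx, kv, budget) cells; drop q8_0 at or above skip_q8_at.

    Pass skip_q8_at=None to get the original unfiltered bracket.
    """
    cells = []
    for max_ctx, kv, budget in product((65536, 98304), ("tq3_0", "q8_0"), (16, 22, 32)):
        if skip_q8_at is not None and kv == "q8_0" and max_ctx >= skip_q8_at:
            continue  # these cells timed out on the 23 GB card
        cells.append({"max_ctx": max_ctx, "kv": kv, "budget": budget, "fa_window": 0})
    return cells


print(len(qwen_bracket(skip_q8_at=None)), len(qwen_bracket()))
```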

**`sweep.py` — `_pick_winner()` for `agent_replay_pass_rate`:**
Changed primary sort key from `-speed_metric` to `-max_ctx`. Rationale:
different max_ctx values exercise different-sized fixture cases (see
finding #4). Speed is only meaningful within the same max_ctx group. The
corrected sort ensures the winner always uses the largest viable context
window, then optimizes speed within that group.
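
The effect of the new sort order can be sketched on the passing cells from this sweep. `CellResult`, `pick_winner`, and `pick_by_speed_only` are hypothetical stand-ins for the real `_pick_winner()`; the lower-budget tiebreaker is an assumption, and the speeds are the rounded table values.

```python
from dataclasses import dataclass


@dataclass
class CellResult:
    cell: int
    budget: int
    max_ctx: int
    speed_metric: float  # tok/s, rounded values from the outcome table


def pick_winner(results):
    """New order: largest max_ctx first, then fastest, then lowest budget."""
    return min(results, key=lambda r: (-r.max_ctx, -r.speed_metric, r.budget))


def pick_by_speed_only(results):
    """Old order: fastest first, ignoring max_ctx."""
    return min(results, key=lambda r: (-r.speed_metric, r.budget))


# Passing cells only; timed-out cells are excluded before sorting.
passed = [
    CellResult(1, 16, 65536, 3.1),
    CellResult(2, 22, 65536, 3.1),
    CellResult(4, 16, 65536, 4.0),
    CellResult(7, 16, 98304, 1.2),
    CellResult(8, 22, 98304, 1.2),
    CellResult(9, 32, 98304, 1.2),
]

print(pick_by_speed_only(passed).cell, pick_winner(passed).cell)
```

Sorting by speed alone selects the 65536/q8_0 cell, while the max_ctx-first order selects the 98304/tq3_0 cell reported as the winner.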

## Recommended config (bragi, Qwen3.6-27B, 23 GB VRAM WSL2)

```toml
[dflash]
budget = 16
max_ctx = 98304
cache_type_k = "tq3_0"
cache_type_v = "tq3_0"
```

Prefill throughput: ~500 tok/s. End-to-end throughput at 85K-token context:
**~1.2 tok/s** (speculative decode, 256-token response). Wall time for a
full ~85K-token agentic request: ~210 s, most of it prefill.