gemma4_31b: stage decode positions on device (pos-array + per-step D2D)
Eliminate the per-decode-round position H2D, the last per-round host->device
copy left after Option A. Upload the full decode position array to the device
once (a single H2D); then, at each step, copy that step's position from the
array into the fixed position input slot with an on-device D2D. The token stays
aliased on device (Option A). Per-round H2D is now 0, independent of decode
length. The fixed input slot keeps this CUDA-graph-safe: with CUDA graphs on,
the D2D becomes a captured cudaMemcpyAsync on the decode stream into the same
slot.
Measured (int6/gguf, CUDA graph OFF, p19/d128): post-load HtoD 132->5
(per-round H2D = 0); DtoD 129->257 (+128, one position D2D per round: the
intended H2D->D2D trade); DtoH unchanged (129). Greedy output is byte-identical
to prior runs. Runner-only change; reuses the int64-output export (no
re-export).
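
For readers checking the copy counts, here is a toy host-side model of the
trade described above. It is not the runner's code and makes no CUDA calls:
plain Python lists stand in for device buffers, and every name in it
(CopyCounter, decode_per_round_h2d, decode_staged) is hypothetical. It assumes
p19/d128 means a 19-token prompt followed by 128 decode steps, and that decode
positions run consecutively from the prompt length; neither is stated in the
message.

```python
from dataclasses import dataclass


@dataclass
class CopyCounter:
    """Counts simulated copies; lists stand in for device memory."""
    h2d: int = 0
    d2d: int = 0


def decode_per_round_h2d(positions, counter):
    """Old scheme: one host->device position copy per decode round."""
    slot = [0]                       # fixed position input slot
    seen = []
    for pos in positions:
        slot[0] = pos                # simulated H2D into the slot
        counter.h2d += 1
        seen.append(slot[0])         # what the decode step would read
    return seen


def decode_staged(positions, counter):
    """New scheme: upload the array once, then one D2D per round."""
    device_positions = list(positions)   # simulated single H2D upload
    counter.h2d += 1
    slot = [0]                           # same fixed slot, same address
    seen = []
    for step in range(len(device_positions)):
        slot[0] = device_positions[step]  # simulated D2D into the slot
        counter.d2d += 1
        seen.append(slot[0])
    return seen


# Assumed reading of p19/d128: prompt length 19, 128 decode steps.
prompt_len, decode_len = 19, 128
positions = list(range(prompt_len, prompt_len + decode_len))

before, after = CopyCounter(), CopyCounter()
old_seen = decode_per_round_h2d(positions, before)
new_seen = decode_staged(positions, after)

print("positions identical:", old_seen == new_seen)
print("H2D delta:", after.h2d - before.h2d)
print("D2D delta:", after.d2d - before.d2d)
```

Under those assumptions the model's deltas line up with the measured totals:
H2D drops by 127 (128 per-round copies removed, one array upload added), which
matches 132->5, and D2D rises by 128, which matches 129->257.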