perf(prefill): MoE prefill CUDA-graph capture — +9-27% pp512 on NVFP4 #179