chore(moe): remove dead NVFP4 prefill paths (IMP_NVFP4_FUSED_GATEUP + Phase 3c-MVP)#170

Merged: github-actions[bot] merged 1 commit into `main` from `perf/dead-code-cleanup` on May 14, 2026

@kekzl (Owner) commented May 14, 2026

Summary

This PR removes two stale dispatch lambdas from the NVFP4 MoE prefill fallback block. Both were superseded by the default-on device-args path (PR #164, default-flipped 2026-05-14 via commit `6e2d402`) but were kept as opt-in escape hatches that nothing exercises in production.

1. IMP_NVFP4_FUSED_GATEUP — self-documented as dead

Old comment (`executor_forward_moe.cu:1955`):

Currently flat-perf vs default (GrpGemm cache in `769effe` already absorbs the per-dispatch overhead). Preserved for future scenarios.

Removed: ~80 LoC fused-dispatch attempt + its sole consumer module (`executor_forward_moe_cutlass3x.{cu,h}`, 91 LoC).

2. Phase 3c-MVP `IMP_NVFP4_DEVICE_ARGS=1` opt-in

Old comment (`executor_forward_moe.cu:1840-1848`):

Phase 3c-MVP wire ... MVP does NOT yet enable CUDA-graph capture of the prefill path — Phase 3c-full removes that residual sync.

Phase 3c-full landed via PR #164 (Step 3 = pre-cached per-layer device-args ptr arrays) and was default-flipped. The MVP path is unreachable from production:

  • It is gated by `IMP_NVFP4_DEVICE_ARGS=1` (default OFF for this code path)
  • Its preconditions (`d_M_per`, `d_sfa_offsets`, `d_B_ptrs_cache`, ...) overlap with those of the default-on path, which runs first and sets `device_args_done=true` when those buffers exist
  • So either the default-on path has already handled the dispatch, or the workspace is unpopulated and the MVP path would fail as well

Removed: ~70 LoC inside `grouped_gemm` lambda.
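
The unreachability argument above can be sketched as a small host-side model. This is only an illustration under stated assumptions, not the code from `executor_forward_moe.cu`: the `Workspace` struct, the `Path` enum, and all function names are hypothetical, and the sketch simplifies "overlapping" preconditions to "identical" ones.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical stand-in for the per-layer workspace. The three flags mirror the
// buffers named in the PR text (d_M_per, d_sfa_offsets, d_B_ptrs_cache).
struct Workspace {
    bool has_M_per = false;
    bool has_sfa_offsets = false;
    bool has_B_ptrs_cache = false;

    // Simplifying assumption: both device-args paths need exactly these buffers.
    bool populated() const {
        return has_M_per && has_sfa_offsets && has_B_ptrs_cache;
    }
};

enum class Path { DefaultDeviceArgs, MvpDeviceArgs, HostArgsFallback };

// Reads an opt-in flag such as IMP_NVFP4_DEVICE_ARGS=1 from the environment.
bool env_flag_set(const char* name) {
    const char* value = std::getenv(name);
    return value != nullptr && std::strcmp(value, "1") == 0;
}

// Modeled dispatch order before the cleanup.
Path dispatch_before_cleanup(const Workspace& ws, bool mvp_opt_in) {
    bool device_args_done = false;
    Path taken = Path::HostArgsFallback;

    // 1. The default-on device-args path runs first.
    if (ws.populated()) {
        taken = Path::DefaultDeviceArgs;
        device_args_done = true;
    }

    // 2. The MVP opt-in only runs if step 1 did not, yet it needs the same
    //    buffers, so in this model the condition can never hold.
    if (!device_args_done && mvp_opt_in && ws.populated()) {
        taken = Path::MvpDeviceArgs;
        device_args_done = true;
    }

    // 3. Otherwise the legacy host-args fallback handles the dispatch.
    return taken;
}

// Modeled dispatch order after the cleanup: the MVP branch is simply gone.
Path dispatch_after_cleanup(const Workspace& ws) {
    return ws.populated() ? Path::DefaultDeviceArgs : Path::HostArgsFallback;
}
```

In this model, calling `dispatch_before_cleanup(ws, env_flag_set("IMP_NVFP4_DEVICE_ARGS"))` gives the same result as `dispatch_after_cleanup(ws)` for every workspace state, which is the sense in which the removed branch was dead.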

Production impact

None. The legacy host-args fallback path remains as the safety net for the case where the default-on path's precondition check fails.

| Test | Result |
|---|---|
| verify-fast | ✅ decode +3.12%, prefill +3.97%, graphs 1.49× |
| Qwen3-Coder-30B-NVFP4 pp512 | 16773 tok/s (matches pre-cleanup) |
| Qwen3-Coder-30B-NVFP4 tg32 | 252 tok/s (matches pre-cleanup) |

Diff

```
4 files changed, 15 insertions(+), 243 deletions(-)
```
-228 LoC, -1 TU, -1 header. No public API change.

🤖 Generated with Claude Code

… Phase 3c-MVP)

Remove two stale dispatch lambdas from the NVFP4 MoE prefill fallback block. Both
were superseded by the default-on device-args path (PR #164, default-flipped
2026-05-14 via commit 6e2d402) but kept as opt-in escape hatches that nothing
exercises in production.

## 1. IMP_NVFP4_FUSED_GATEUP — self-documented as dead
Old comment (executor_forward_moe.cu:1955):
  "Currently flat-perf vs default (GrpGemm cache in 769effe already absorbs
   the per-dispatch overhead). Preserved for future scenarios."
Removed: ~80 LoC fused-dispatch attempt + its sole consumer module
(executor_forward_moe_cutlass3x.{cu,h}, 91 LoC).

## 2. Phase 3c-MVP IMP_NVFP4_DEVICE_ARGS=1 opt-in
Old comment (executor_forward_moe.cu:1840-1848):
  "Phase 3c-MVP wire ... MVP does NOT yet enable CUDA-graph capture of the
   prefill path — Phase 3c-full removes that residual sync."
Phase 3c-full landed via PR #164 (Step 3 = pre-cached per-layer device-args
ptr arrays) and was default-flipped. The MVP path is unreachable from production
because:
  - It is gated by IMP_NVFP4_DEVICE_ARGS=1 (default OFF for this code path).
  - Its preconditions (d_M_per, d_sfa_offsets, d_B_ptrs_cache, etc.) overlap
    with the default-on path's preconditions, which already runs first and
    sets device_args_done=true when those buffers exist.
  - So either default-on already handled it, or the workspace is unpopulated
    and MVP would also fail.
Removed: ~70 LoC inside grouped_gemm lambda.

## Production impact
None. The legacy host-args fallback path remains as the safety net for the
default-on precondition-check failure case. verify-fast green (decode +3.12%,
prefill +3.97%, graphs 1.49×). Qwen3-Coder-30B-NVFP4 pp512 = 16773 tok/s,
tg32 = 252 tok/s — matches pre-cleanup measurements.

## Net diff
-228 LoC, -1 TU, -1 header. No public API change.
@github-actions github-actions Bot enabled auto-merge (squash) May 14, 2026 11:10
@github-actions github-actions Bot merged commit 5153d6a into main May 14, 2026
3 checks passed
@kekzl kekzl deleted the perf/dead-code-cleanup branch May 14, 2026 21:16