chore(moe): remove dead NVFP4 prefill paths (IMP_NVFP4_FUSED_GATEUP + Phase 3c-MVP) #170
Merged
… Phase 3c-MVP)

Two stale dispatch lambdas inside the NVFP4 MoE prefill fallback block. Both were superseded by the default-on device-args path (PR #164, default-flipped 2026-05-14 via commit 6e2d402) but kept as opt-in escape hatches that nothing exercises in production.

## 1. IMP_NVFP4_FUSED_GATEUP: self-documented as dead

Old comment (executor_forward_moe.cu:1955): "Currently flat-perf vs default (GrpGemm cache in 769effe already absorbs the per-dispatch overhead). Preserved for future scenarios."

Removed: ~80 LoC fused-dispatch attempt + its sole consumer module (executor_forward_moe_cutlass3x.{cu,h}, 91 LoC).

## 2. Phase 3c-MVP IMP_NVFP4_DEVICE_ARGS=1 opt-in

Old comment (executor_forward_moe.cu:1840-1848): "Phase 3c-MVP wire ... MVP does NOT yet enable CUDA-graph capture of the prefill path — Phase 3c-full removes that residual sync."

Phase 3c-full landed via PR #164 (Step 3 = pre-cached per-layer device-args ptr arrays) and was default-flipped. The MVP path is unreachable from production because:

- It is gated by IMP_NVFP4_DEVICE_ARGS=1 (default OFF for this code path).
- Its preconditions (d_M_per, d_sfa_offsets, d_B_ptrs_cache, etc.) overlap with the default-on path's preconditions, which already runs first and sets device_args_done=true when those buffers exist.
- So either default-on already handled it, or the workspace is unpopulated and MVP would also fail.

Removed: ~70 LoC inside grouped_gemm lambda.

## Production impact

None. The legacy host-args fallback path remains as the safety net for the default-on precondition-check failure case.

verify-fast green (decode +3.12%, prefill +3.97%, graphs 1.49×). Qwen3-Coder-30B-NVFP4 pp512 = 16773 tok/s, tg32 = 252 tok/s, matching pre-cleanup measurements.

## Net diff

-228 LoC, -1 TU, -1 header. No public API change.
Summary
This PR removes two stale dispatch lambdas inside the NVFP4 MoE prefill fallback block. Both were superseded by the default-on device-args path (PR #164, default-flipped 2026-05-14 via commit `6e2d402`) but were kept as opt-in escape hatches that nothing exercises in production.
1. IMP_NVFP4_FUSED_GATEUP — self-documented as dead
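Old comment (`executor_forward_moe.cu`, line 1955):

> Currently flat-perf vs default (GrpGemm cache in 769effe already absorbs the per-dispatch overhead). Preserved for future scenarios.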
Removed: ~80 LoC fused-dispatch attempt + its sole consumer module (`executor_forward_moe_cutlass3x.{cu,h}`, 91 LoC).
2. Phase 3c-MVP `IMP_NVFP4_DEVICE_ARGS=1` opt-in
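Old comment (`executor_forward_moe.cu`, lines 1840-1848):

> Phase 3c-MVP wire ... MVP does NOT yet enable CUDA-graph capture of the prefill path — Phase 3c-full removes that residual sync.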
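Phase 3c-full landed via PR #164 (Step 3 = pre-cached per-layer device-args ptr arrays) and was default-flipped. The MVP path is unreachable from production because:

- It is gated by `IMP_NVFP4_DEVICE_ARGS=1` (default OFF for this code path).
- Its preconditions (`d_M_per`, `d_sfa_offsets`, `d_B_ptrs_cache`, etc.) overlap with those of the default-on path, which runs first and sets `device_args_done=true` when those buffers exist.
- So either the default-on path has already handled the dispatch, or the workspace is unpopulated and the MVP path would fail as well.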
Removed: ~70 LoC inside `grouped_gemm` lambda.
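
The unreachability argument can be illustrated with a small host-side model. This is only a sketch under stated assumptions, not the code from `executor_forward_moe.cu`: `Workspace`, `Path`, `dispatch_before`, and `dispatch_after` are hypothetical names, and the model simplifies "overlapping preconditions" to a single shared flag, i.e. it assumes the MVP path cannot have its buffers ready when the default-on path does not.

```cpp
// Hypothetical stand-in for the per-layer workspace; the real struct is not
// shown in this PR. One flag models "d_M_per, d_sfa_offsets, d_B_ptrs_cache,
// etc. are all populated" (an assumption made for this sketch).
struct Workspace {
    bool device_buffers_ready;
};

// Which dispatch path ends up running the grouped GEMM.
enum class Path { DeviceArgsDefault, DeviceArgsMvp, HostArgsFallback };

// Dispatch order before the cleanup, as described in the PR text:
// the default-on device-args path is tried first, then the opt-in MVP path,
// then the legacy host-args fallback.
constexpr Path dispatch_before(const Workspace& ws, bool mvp_opt_in) {
    bool device_args_done = false;
    Path taken = Path::HostArgsFallback;

    // Default-on path: runs whenever the shared buffers exist.
    if (ws.device_buffers_ready) {
        taken = Path::DeviceArgsDefault;
        device_args_done = true;
    }

    // MVP path: needs the opt-in AND the same buffers, but only runs when the
    // default-on path has not already handled the dispatch. Under the shared
    // precondition these two requirements contradict each other.
    if (!device_args_done && mvp_opt_in && ws.device_buffers_ready) {
        taken = Path::DeviceArgsMvp;
    }

    return taken;
}

// Dispatch order after the cleanup: the MVP branch is gone.
constexpr Path dispatch_after(const Workspace& ws) {
    return ws.device_buffers_ready ? Path::DeviceArgsDefault
                                   : Path::HostArgsFallback;
}
```

In this model, `dispatch_before` and `dispatch_after` agree for every combination of inputs, which is the sense in which removing the MVP branch has no production impact. If the real MVP preconditions were satisfiable without the default-on ones, the model would not apply.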
Production impact
None. The legacy host-args fallback path remains as the safety net for the default-on precondition-check failure case.
Diff
```
4 files changed, 15 insertions(+), 243 deletions(-)
```
-228 LoC, -1 TU, -1 header. No public API change.
🤖 Generated with Claude Code