perf(dsv4 prefill): avoid indexer pool token scan #618
No actionable comments were generated in the recent review.
Walkthrough
The compressor now scatters KV and score projection outputs into flattened inner state before pooling. The pool stage reads those stored values instead of scanning tokens, and the later state-update loop is removed.
Changes: Inner state scatter and pool reuse
Estimated code review effort: 3 (Moderate), ~20 minutes
Pre-merge checks: 4 passed, 1 failed (1 warning)
Code Review
This pull request refactors the prefill_indexer_compressor for DeepSeek V4. It introduces a new pre-pass SPMD loop (prefill_idx_c4_state_scatter_pre) to handle state updates before the softmax pooling step, and removes the inline pooling loop over tokens as well as the post-pooling state update loop (prefill_idx_c4_state_update). I have no feedback to provide as there are no review comments.
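The reordering described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the kernel in this PR: state is a flat Python list of `(kv, score)` pairs, values are scalars rather than vectors, and the names `pool_by_token_scan` and `pool_after_scatter` are hypothetical. It shows that scattering current tokens into state first lets the pool read slots directly, with the same result as scanning tokens per slot.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def pool(entries):
    """Softmax-weighted sum over (kv, score) pairs."""
    weights = softmax([score for _, score in entries])
    return sum(w * kv for w, (kv, _) in zip(weights, entries))

def pool_by_token_scan(prev_state, tokens, slot_mapping, window_slots):
    """Scan-style shape: every window slot scans the current tokens
    for a write that should override the stored entry."""
    picked = []
    for slot in window_slots:
        entry = prev_state[slot]
        for tok, dst in zip(tokens, slot_mapping):
            if dst == slot:
                entry = tok
        picked.append(entry)
    return pool(picked)

def pool_after_scatter(prev_state, tokens, slot_mapping, window_slots):
    """Scatter-first shape: write current tokens into state once,
    then pool straight from the stored slots. No later state update
    is needed because the scatter already performed it."""
    state = list(prev_state)
    for tok, dst in zip(tokens, slot_mapping):
        state[dst] = tok
    return pool([state[slot] for slot in window_slots]), state

# Hypothetical data: four state slots, two current tokens landing in slots 2 and 3.
prev_state = [(1.0, 0.0), (2.0, 0.5), (3.0, -1.0), (4.0, 0.2)]
tokens = [(10.0, 1.0), (20.0, 0.3)]
slot_mapping = [2, 3]
window_slots = [0, 1, 2, 3]

scan_result = pool_by_token_scan(prev_state, tokens, slot_mapping, window_slots)
scatter_result, new_state = pool_after_scatter(prev_state, tokens, slot_mapping, window_slots)
```

In this toy, the scan variant does work proportional to window size times token count per pool, while the scatter variant touches each token once and each window slot once.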
Summary
Validation
- python3 -m py_compile models/deepseek/v4/prefill_indexer_compressor.py
- git diff --check origin/main...HEAD -- models/deepseek/v4/prefill_indexer_compressor.py
- python models/deepseek/v4/prefill_indexer_compressor.py -p a2a3
- --enable-pmu 2: 3.82s -> 3.16s
- prefill_idx_c4_softmax_pool PMU sum cycles: 2829455 -> 1540161 (-45.6%)
- 6701356 -> 4444453 (-33.7%)
- start_pos=0,1,2,3,4,5,63,64,127,128 all pass after rerunning one transient device-side 507018 on start_pos=0.
Notes
This follows the DeepSeek V4 decode-state ordering: current tokens are written into state before a compression boundary pools from the overlap window. If a future scheduler allows a valid current token to participate in pooling without a valid inner_state_slot_mapping, this path would need an overlay fallback.
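The precondition in this note can be written as a small guard. This is a hedged sketch only: the sentinel `INVALID_SLOT = -1`, the `participates_in_pool` flags, and both function names are assumptions made for illustration, since the PR does not state how an invalid inner_state_slot_mapping entry is encoded.

```python
INVALID_SLOT = -1  # assumed sentinel; the real encoding is not given in the PR

def tokens_missing_slot(slot_mapping, participates_in_pool):
    """Indices of tokens that would take part in pooling but have no state slot."""
    return [
        i
        for i, (slot, used) in enumerate(zip(slot_mapping, participates_in_pool))
        if used and slot == INVALID_SLOT
    ]

def can_pool_from_state(slot_mapping, participates_in_pool):
    """True when every pooling token was scattered into state, so the
    pool may read stored slots only. False means an overlay fallback
    (reading the token directly) would be required."""
    return not tokens_missing_slot(slot_mapping, participates_in_pool)
```

For example, `can_pool_from_state([4, 5, -1], [True, True, False])` holds because the token without a slot does not pool, whereas `[4, -1]` with both tokens pooling would require the fallback.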