Add basic EP support (no overlapping) #54
Open
GarlGuo wants to merge 15 commits into
@codex review
This PR uses Triton and symmetric memory to provide basic expert-parallel (EP) support for SonicMoE. We implement the collectives at close to peak network bandwidth and change the metadata accordingly.
Co-authored by Claude Code
The forward dispatches each rank's `T_local` tokens to the experts that hold them via NVLink symmetric memory, runs the grouped GEMMs locally, and combines back across NVLink. A runtime `NetworkProfiler` benchmarks the three dispatch and three combine primitives on the local hardware and picks the fastest pair per workload.

EP world size 8:
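The profiler's selection step described above can be sketched roughly as follows. This is a minimal illustration and not the PR's `NetworkProfiler`: the names `time_call`, `pick_fastest_pair`, and `PairCache`, the workload key, and the choice to time dispatch and combine independently are all assumptions made for the example.

```python
import time
from typing import Callable, Dict, Hashable, Tuple

Impl = Callable[[], object]


def time_call(fn: Impl, warmup: int = 1, iters: int = 5) -> float:
    """Average wall-clock seconds per call (hypothetical helper).

    A real GPU benchmark would also need device synchronization
    around the timed region; that is omitted in this host-only sketch.
    """
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters


def pick_fastest_pair(
    dispatch_impls: Dict[str, Impl],
    combine_impls: Dict[str, Impl],
    timer: Callable[[Impl], float] = time_call,
) -> Tuple[str, str]:
    """Time every candidate and return the fastest (dispatch, combine) names.

    Assumption: the two phases are timed independently, so the best pair
    is simply the best of each phase.
    """
    dispatch_times = {name: timer(fn) for name, fn in dispatch_impls.items()}
    combine_times = {name: timer(fn) for name, fn in combine_impls.items()}
    best_dispatch = min(dispatch_times, key=dispatch_times.get)
    best_combine = min(combine_times, key=combine_times.get)
    return best_dispatch, best_combine


class PairCache:
    """Remember the chosen pair per workload key (hypothetical)."""

    def __init__(self) -> None:
        self._best: Dict[Hashable, Tuple[str, str]] = {}

    def get(self, workload: Hashable, dispatch_impls, combine_impls,
            timer: Callable[[Impl], float] = time_call) -> Tuple[str, str]:
        # Benchmark only the first time a workload shape is seen.
        if workload not in self._best:
            self._best[workload] = pick_fastest_pair(
                dispatch_impls, combine_impls, timer
            )
        return self._best[workload]
```

Keying the cache on the workload (for example, a tuple of token count and hidden size) means each shape is benchmarked once and later forwards reuse the stored choice.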
The EP forward exposes two optional flags that trade off activation memory, NVLink bandwidth in the backward, and a host stall in the forward.
- `--redispatch_x_in_backward` (default: False): instead of saving the post-dispatch `x_compute` for the backward, save only the pre-dispatch `x_local` and re-dispatch in the backward via a Copy-Engine all-gather on a side stream.
- `--CPU_sync_on_runtime` (default: False): initiate a D2H sync to shrink the saved activation cache. The trade-off is a single host stall per forward. Inference mode skips this since no cache is saved.

Example usage:
Example output: