Tensor-parallel: Fix delayed AllReduce on Gemma-4 MoE #22129
Skip forward past nodes that don't consume the current one, and allow a chain of MULs.
What's special about the MUL mode? Can't it skip over nodes more generally by looking at the src pointers?
@JohannesGaessler's original code was trying to match the common pattern found in MoE models. I have tried to extend it to cover Gemma. This check can be extended to include other ops that will work with
    // Chain of MULs with MIRRORED src[1]
    while (true) {
        skip_unrelated();
        if (id + 1 >= cgraph->n_nodes) {
            return idr;
        }
        ggml_tensor * next = cgraph->nodes[id+1];
        if (next->op == GGML_OP_MUL && next->src[0] == node &&
                ggml_backend_meta_get_split_state(next->src[1], false).axis == GGML_BACKEND_SPLIT_AXIS_MIRRORED) {
            node = next;
            id++;
            idr = id;
            n_used = ggml_node_get_use_count(cgraph, id);
        } else {
            break;
        }
    }
This code is effectively a roundabout way to check condition 2 that I outlined previously. The PR is fine like this, but I'll maybe refactor and simplify this in the future when I touch this part of the code again.
* Fix delayed AllReduce on Gemma-4 MoE
  Skip forward past nodes that don't consume the current one, and allow a chain of MULs.
* Check for all sources before skipping nodes
* Address review comments
Skip forward past nodes that don't consume the current node, and allow a chain of MULs.
When down_exps_s is set, build_moe_ffn pulls the scale tensor in via reshape/repeat/get_rows. Topological sort places those between mul_mat_id and the MUL that consumes it, so the existing nodes[id+1] check never sees an ADD_ID or MUL and fails.

The scale MUL is followed by a second MUL; the old code only accepted one.
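
To make the two changes concrete, here is a minimal, self-contained sketch on a toy graph. It is not the ggml implementation: `Node`, `Op`, `consumes`, `delayed_allreduce_index` and `make_toy_graph` are hypothetical names invented for illustration, sources are stored as indices rather than pointers, and split state is reduced to a single `mirrored` flag. The idea it illustrates is that an AllReduce sums per-device partial results, and multiplying by a factor that is identical on every device commutes with that sum, so the reduction can be pushed past a chain of such MULs.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Toy stand-ins for graph nodes; these are NOT ggml types.
enum class Op { MulMatId, Reshape, Repeat, GetRows, Mul, Other };

struct Node {
    Op                 op;
    std::array<int, 2> src;      // indices of source nodes, -1 if unused
    bool               mirrored; // true if the tensor is identical on every device
};

// True if nodes[i] reads nodes[cur] through any of its sources.
static bool consumes(const std::vector<Node> & nodes, size_t i, size_t cur) {
    for (int s : nodes[i].src) {
        if (s == static_cast<int>(cur)) {
            return true;
        }
    }
    return false;
}

// Starting from a node holding partial results, return the index of the last
// node after which the AllReduce may still be placed.
static size_t delayed_allreduce_index(const std::vector<Node> & nodes, size_t start) {
    size_t cur = start;
    while (true) {
        // Skip forward past nodes that do not consume the current node.
        size_t next = cur + 1;
        while (next < nodes.size() && !consumes(nodes, next, cur)) {
            next++;
        }
        if (next >= nodes.size()) {
            return cur;
        }
        // Accept a MUL whose first source is the current node and whose
        // second source is mirrored; anything else ends the chain.
        const Node & n = nodes[next];
        const bool ok = n.op == Op::Mul &&
                        n.src[0] == static_cast<int>(cur) &&
                        n.src[1] >= 0 && nodes[n.src[1]].mirrored;
        if (!ok) {
            return cur;
        }
        cur = next;
    }
}

// Toy graph in topological order, loosely modelled on the description above:
// scale preparation sits between the matmul and the MUL that consumes it.
static std::vector<Node> make_toy_graph() {
    return {
        { Op::MulMatId, { -1, -1 }, false }, // 0: partial result per device
        { Op::Reshape,  { -1, -1 }, true  }, // 1: scale preparation
        { Op::Repeat,   {  1, -1 }, true  }, // 2
        { Op::GetRows,  {  2, -1 }, true  }, // 3
        { Op::Mul,      {  0,  3 }, false }, // 4: scale MUL
        { Op::Other,    { -1, -1 }, true  }, // 5: unrelated mirrored tensor
        { Op::Mul,      {  4,  5 }, false }, // 6: second MUL
        { Op::Other,    {  6, -1 }, false }, // 7: consumer that ends the chain
    };
}
```

On this toy graph, a check that only inspects the node at index 1 would see a reshape and give up, while the walk above passes both MULs and stops at index 6. The real code additionally tracks use counts and handles ADD_ID, which this sketch leaves out.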
Performance on 2x 5090:
Requirements