Tensor-parallel: Fix delayed AllReduce on Gemma-4 MoE #22129

Merged: JohannesGaessler merged 3 commits into ggml-org:master from gaugarg-nv:gemma4_perf on Apr 20, 2026

@gaugarg-nv (Contributor) commented Apr 19, 2026

Skip forward past nodes that don't consume the current node, and allow a chain of MULs.

When `down_exps_s` is set, `build_moe_ffn` pulls the scale tensor in via reshape/repeat/get_rows. The topological sort places those nodes between `mul_mat_id` and the MUL that consumes its result, so the existing `nodes[id+1]` check never sees an ADD_ID or MUL and fails.

The scale MUL is followed by a second MUL; the old code only accepted one.
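The two changes described above can be illustrated on a toy graph. This is a minimal sketch, not the code from the PR: `Node`, `Op`, `consumes`, `use_count`, `delayed_allreduce_index` and `make_example_graph` are hypothetical names, and the `mirrored` flag is an assumed stand-in for the MIRRORED split state of a tensor.

```cpp
#include <vector>

// Hypothetical, simplified stand-ins for ggml graph nodes; not the real ggml API.
enum class Op { MulMatId, Reshape, Repeat, GetRows, Mul, Add, Leaf };

struct Node {
    Op               op;
    std::vector<int> srcs;      // indices of the nodes this node reads
    bool             mirrored;  // assumption: tensor is identical on every device
};

// True if node `n` reads the output of node index `producer` through any source.
static bool consumes(const Node & n, int producer) {
    for (int s : n.srcs) {
        if (s == producer) {
            return true;
        }
    }
    return false;
}

// Number of nodes in the graph that read the output of `producer`.
static int use_count(const std::vector<Node> & g, int producer) {
    int count = 0;
    for (const Node & n : g) {
        if (consumes(n, producer)) {
            count++;
        }
    }
    return count;
}

// Starting from node `id` (whose result still needs an AllReduce), return the index
// of the last node after which the reduction can be placed. Assumes `g` is
// topologically sorted.
int delayed_allreduce_index(const std::vector<Node> & g, int id) {
    const int n_nodes = (int) g.size();
    int cur = id;
    while (true) {
        // Delaying is only safe if exactly one node reads the partial result.
        if (use_count(g, cur) != 1) {
            return cur;
        }
        // Skip forward past nodes that do not consume the current node.
        int next = cur + 1;
        while (next < n_nodes && !consumes(g[next], cur)) {
            next++;
        }
        if (next >= n_nodes) {
            return cur;
        }
        // Accept a chain of MULs whose second operand is mirrored.
        const Node & n = g[next];
        const bool is_scale_mul = n.op == Op::Mul && n.srcs.size() == 2 &&
                                  n.srcs[0] == cur && g[n.srcs[1]].mirrored;
        if (!is_scale_mul) {
            return cur;
        }
        cur = next;
    }
}

// Toy graph loosely shaped like the description: the scale tensor is prepared
// between the expert matmul and the MUL that uses it, and a second MUL follows.
std::vector<Node> make_example_graph() {
    return {
        {Op::MulMatId, {},     false},  // 0: partial result
        {Op::Leaf,     {},     true },  // 1: scale tensor
        {Op::Reshape,  {1},    true },  // 2
        {Op::Repeat,   {2},    true },  // 3
        {Op::GetRows,  {3},    true },  // 4
        {Op::Mul,      {0, 4}, false},  // 5: scale MUL
        {Op::Leaf,     {},     true },  // 6: another mirrored operand
        {Op::Mul,      {5, 6}, false},  // 7: second MUL
        {Op::Add,      {7, 6}, false},  // 8: not accepted, the chain stops before it
    };
}
```

In this toy graph, a check that only looks at the node directly after index 0 sees a `Leaf` and gives up, while `delayed_allreduce_index(make_example_graph(), 0)` walks past nodes 1 to 4 and through both MULs.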

Performance on 2x 5090:

| model | test | t/s - d5b780a | t/s - PR | Speed-up |
| --- | --- | ---: | ---: | ---: |
| gemma4 26B.A4B Q8_0 | pp512 | 3473.35 | 7202.7 | 2.07 |
| gemma4 26B.A4B Q8_0 | tg128 | 164.29 | 202.45 | 1.23 |

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: Yes, took help in identifying the root cause

@github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Apr 19, 2026
@am17an (Contributor) commented Apr 20, 2026

What's special about the MUL mode, can't it skip over more generally looking at the src pointers?

@gaugarg-nv (Contributor, Author) commented

> What's special about the MUL mode, can't it skip over more generally looking at the src pointers?

@JohannesGaessler's original code was trying to match the common pattern found in MoE models. I have tried to extend it to cover Gemma.

This check can be extended to other ops that work with PARTIAL and MIRRORED sources, such as Scale, Div, and Cpy, but it won't work for ops like Add or for non-linear ops.
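One way to read this distinction, assuming PARTIAL means each device holds a partial sum that the AllReduce later adds up and MIRRORED means every device holds the same value: an op can run before the reduction only if applying it per device and then summing gives the same result as summing first. The sketch below checks that property numerically; `all_reduce_sum` and `commutes_with_all_reduce` are hypothetical helpers, not ggml functions.

```cpp
#include <functional>
#include <vector>

// Sum of the per-device values: a stand-in for an AllReduce with a sum reduction.
double all_reduce_sum(const std::vector<double> & partials) {
    double total = 0.0;
    for (double p : partials) {
        total += p;
    }
    return total;
}

// True if applying `op` on every device before the reduction gives the same
// result as applying it once after the reduction.
bool commutes_with_all_reduce(const std::vector<double> & partials,
                              const std::function<double(double)> & op) {
    std::vector<double> applied;
    for (double p : partials) {
        applied.push_back(op(p));
    }
    return all_reduce_sum(applied) == op(all_reduce_sum(partials));
}
```

With partials 1.5 and 2.5 and a mirrored operand of 2.0 (values chosen so the floating-point comparison is exact), multiplying commutes: 3.0 + 5.0 equals 4.0 * 2.0. Adding does not: 3.5 + 4.5 is 8.0, but 4.0 + 2.0 is 6.0, because the mirrored operand gets counted once per device. Squaring, as an example of a non-linear op, also fails: 2.25 + 6.25 is not 16.0.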

Comment thread ggml/src/ggml-backend-meta.cpp (outdated)
Comment on lines +1707 to 1723
    // Chain of MULs with MIRRORED src[1]
    while (true) {
        skip_unrelated();
        if (id + 1 >= cgraph->n_nodes) {
            return idr;
        }
        ggml_tensor * next = cgraph->nodes[id+1];
        if (next->op == GGML_OP_MUL && next->src[0] == node &&
                ggml_backend_meta_get_split_state(next->src[1], false).axis == GGML_BACKEND_SPLIT_AXIS_MIRRORED) {
            node = next;
            id++;
            idr = id;
            n_used = ggml_node_get_use_count(cgraph, id);
        } else {
            break;
        }
    }

Review comment (Contributor):

This code is effectively a roundabout way to check condition 2 that I outlined previously. The PR is fine like this, but I'll maybe refactor and simplify this in the future when I touch this part of the code again.

@JohannesGaessler JohannesGaessler merged commit fd6ae4c into ggml-org:master Apr 20, 2026
50 of 51 checks passed
ArberSephirotheca pushed a commit to ArberSephirotheca/llama.cpp that referenced this pull request Apr 21, 2026
* Fix delayed AllReduce on Gemma-4 MoE

Skip forward past nodes that don't consume the current one, and allow a chain of MULs.

* Check for all sources before skipping nodes

* Address review comments
@gaugarg-nv gaugarg-nv deleted the gemma4_perf branch April 21, 2026 04:40