Tensor-parallel: Fix delayed AllReduce on Gemma-4 MoE by gaugarg-nv · Pull Request #22129 · ggml-org/llama.cpp

gaugarg-nv · 2026-04-19T18:14:21Z

Skip forward past nodes that don't consume the current node, and allow a chain of MULs.

When down_exps_s is set, build_moe_ffn pulls the scale tensor in via reshape/repeat/get_rows. Topological sort places those nodes between mul_mat_id and the MUL that consumes it, so the existing nodes[id+1] check never sees an ADD_ID or MUL and the pattern match fails.

The scale MUL is followed by a second MUL; the old code only accepted one.
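
The two changes can be illustrated on a toy graph. The following is a minimal sketch, not the ggml implementation: `Node`, `Op`, `consumes`, `find_allreduce_point`, and `demo_gemma_like_graph` are hypothetical names invented for illustration, and the real code also checks split states and use counts, which are omitted here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-ins for graph nodes; hypothetical, not ggml types.
enum class Op { MUL_MAT_ID, RESHAPE, REPEAT, GET_ROWS, MUL, OTHER };

struct Node {
    Op op;
    std::vector<const Node *> src; // inputs of this node
};

// True if `n` reads `producer` through any of its sources.
static bool consumes(const Node & n, const Node * producer) {
    for (const Node * s : n.src) {
        if (s == producer) {
            return true;
        }
    }
    return false;
}

// Starting at nodes[id], follow a chain of MULs whose src[0] is the current
// chain node, skipping interleaved nodes that do not read it. Returns the
// index of the last MUL in the chain, or `id` if no MUL follows.
static size_t find_allreduce_point(const std::vector<const Node *> & nodes, size_t id) {
    const Node * node = nodes[id];
    size_t last = id;
    while (true) {
        size_t next = last + 1;
        // Skip forward past nodes that don't consume the current node.
        while (next < nodes.size() && !consumes(*nodes[next], node)) {
            next++;
        }
        if (next >= nodes.size()) {
            return last;
        }
        if (nodes[next]->op != Op::MUL || nodes[next]->src[0] != node) {
            return last; // the consumer is not a MUL in the chain
        }
        node = nodes[next];
        last = next;
    }
}

// Gemma-like ordering: three scale-preparation nodes sit between MUL_MAT_ID
// and the first MUL, which is then followed by a second MUL.
static size_t demo_gemma_like_graph() {
    Node scale_w {Op::OTHER, {}};
    Node weights {Op::OTHER, {}};
    Node mm      {Op::MUL_MAT_ID, {}};
    Node reshape {Op::RESHAPE,  {&scale_w}};
    Node repeat  {Op::REPEAT,   {&reshape}};
    Node rows    {Op::GET_ROWS, {&repeat}};
    Node mul1    {Op::MUL, {&mm, &rows}};
    Node mul2    {Op::MUL, {&mul1, &weights}};
    Node out     {Op::OTHER, {&mul2}};
    std::vector<const Node *> nodes = {&mm, &reshape, &repeat, &rows, &mul1, &mul2, &out};
    return find_allreduce_point(nodes, 0);
}
```

In this toy ordering, a check that only looks at `nodes[id+1]` would see the RESHAPE and stop at index 0, while the skip-and-chain walk reaches the second MUL at index 5.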

Performance on 2x 5090:

| model | test | t/s - `d5b780a` | t/s - PR | Speed-up |
| --- | --- | ---: | ---: | ---: |
| gemma4 26B.A4B Q8_0 | pp512 | 3473.35 | 7202.7 | 2.07 |
| gemma4 26B.A4B Q8_0 | tg128 | 164.29 | 202.45 | 1.23 |

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: Yes, took help in identifying the root cause

am17an · 2026-04-20T02:48:36Z

What's special about the MUL mode, can't it skip over more generally looking at the src pointers?

gaugarg-nv · 2026-04-20T04:19:15Z

> What's special about the MUL mode, can't it skip over more generally looking at the src pointers?

@JohannesGaessler's original code was trying to match the common pattern found in MoE models. I have tried to extend it to cover Gemma.

This check can be extended to include other ops that work with PARTIAL and MIRRORED src, such as SCALE, DIV, and CPY, but it won't work for ops like ADD or for non-linear ops.
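
The distinction comes down to linearity. Assuming PARTIAL means each device holds a partial sum of the full result, an op can be moved ahead of the AllReduce only if applying it to every partial and then summing gives the same value as summing first. A small numeric sketch, using plain doubles in place of tensors (`all_reduce_sum` and the other helpers are hypothetical and not part of ggml):

```cpp
#include <cassert>
#include <vector>

// Stand-in for AllReduce: sums the per-device partial results.
static double all_reduce_sum(const std::vector<double> & partials) {
    double sum = 0.0;
    for (double p : partials) {
        sum += p;
    }
    return sum;
}

// MUL by a mirrored value: multiplying each partial before the reduction
// equals multiplying the reduced value, so the AllReduce can be delayed.
static bool mul_commutes_with_allreduce() {
    const std::vector<double> partials = {1.5, 2.5}; // one value per device
    const double mirrored = 3.0;                     // same on every device

    const double reduce_first = all_reduce_sum(partials) * mirrored;

    std::vector<double> scaled = partials;
    for (double & p : scaled) {
        p *= mirrored;
    }
    const double op_first = all_reduce_sum(scaled);

    return reduce_first == op_first; // 12.0 on both paths
}

// ADD of a mirrored value: every device adds it once, so the reduced
// result contains it once per device instead of once in total.
static bool add_commutes_with_allreduce() {
    const std::vector<double> partials = {1.5, 2.5};
    const double mirrored = 3.0;

    const double reduce_first = all_reduce_sum(partials) + mirrored; // 7.0

    std::vector<double> shifted = partials;
    for (double & p : shifted) {
        p += mirrored;
    }
    const double op_first = all_reduce_sum(shifted); // 10.0

    return reduce_first == op_first;
}
```

The values are chosen to be exactly representable, so the comparisons are exact. Scaling and division by a mirrored value behave like the MUL case; a non-linear op fails for the same reason ADD does, because applying it to each partial does not give the partials of the final result.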

JohannesGaessler · 2026-04-20T12:31:46Z

+                // Chain of MULs with MIRRORED src[1]
+                while (true) {
+                    skip_unrelated();
+                    if (id + 1 >= cgraph->n_nodes) {
+                        return idr;
+                    }
                    ggml_tensor * next = cgraph->nodes[id+1];
                    if (next->op == GGML_OP_MUL && next->src[0] == node &&
                            ggml_backend_meta_get_split_state(next->src[1], false).axis == GGML_BACKEND_SPLIT_AXIS_MIRRORED) {
                        node = next;
                        id++;
                        idr = id;
                        n_used = ggml_node_get_use_count(cgraph, id);
+                    } else {
+                        break;
                    }
                }


This code is effectively a roundabout way to check condition 2 that I outlined previously. The PR is fine like this, but I'll maybe refactor and simplify this in the future when I touch this part of the code again.

* Fix delayed AllReduce on Gemma-4 MoE

  Skip forward past nodes that don't consume the current one, and allow a chain of MULs.
* Check for all sources before skipping nodes
* Address review comments

Fix delayed AllReduce on Gemma-4 MoE

4ce8fde

Skip forward past nodes that don't consume the current one, and allow a chain of MULs.

github-actions bot added the `ggml` label (changes relating to the ggml tensor library for machine learning) on Apr 19, 2026

am17an approved these changes Apr 20, 2026

JohannesGaessler reviewed Apr 20, 2026

Comment thread ggml/src/ggml-backend-meta.cpp

Check for all sources before skipping nodes

07a1585

JohannesGaessler approved these changes Apr 20, 2026

Address review comments

63c7607

JohannesGaessler approved these changes Apr 20, 2026

JohannesGaessler merged commit fd6ae4c into ggml-org:master Apr 20, 2026
50 of 51 checks passed

gaugarg-nv deleted the gemma4_perf branch April 21, 2026 04:40

cgarwood82 mentioned this pull request Apr 24, 2026

Eval bug: --split-mode tensor aborts in ggml_backend_meta_buffer_get_tensor with Qwen3 MoE Q8_K_XL on ROCm #22307

Closed

JohannesGaessler merged 3 commits into ggml-org:master from gaugarg-nv:gemma4_perf
