UPSTREAM PR #22129: Tensor-parallel: Fix delayed AllReduce on Gemma-4 MoE by loci-dev · Pull Request #1361 · auroralabs-loci/llama.cpp

loci-dev · 2026-04-20T03:11:59Z

Note

Source pull request: ggml-org/llama.cpp#22129

Skip forward past nodes that don't consume the current node, and allow a chain of MULs.

When down_exps_s is set, build_moe_ffn pulls the scale tensor in via reshape/repeat/get_rows. Topological sort places those between mul_mat_id and the MUL that consumes it, so the existing nodes[id+1] check never sees an ADD_ID or MUL and fails.

The scale MUL is followed by a second MUL; the old code only accepted one.
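The two changes described above can be illustrated with a minimal, self-contained sketch. It is not the code from this PR: `Node`, `Op`, `consumes`, `chain_end_adjacent_only`, `find_chain_end`, and `gemma4_like_graph` are hypothetical names invented for illustration, and the graph is a toy stand-in for what `build_moe_ffn` produces when `down_exps_s` is set. The sketch assumes a node "consumes" another when it lists that node's index among its sources, and uses `-1` for an input that lives outside the toy graph.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, simplified stand-ins for graph ops and nodes (not ggml types).
enum class Op { MUL_MAT_ID, RESHAPE, REPEAT, GET_ROWS, ADD_ID, MUL, OTHER };

struct Node {
    Op op;
    std::vector<int> src; // indices of producer nodes; -1 = input outside the toy graph
};

// True if node n reads the output of node `producer`.
inline bool consumes(const Node & n, int producer) {
    for (int s : n.src) {
        if (s == producer) return true;
    }
    return false;
}

// Approximation of the old behaviour: only nodes[id + 1] is inspected,
// and at most one ADD_ID/MUL is accepted.
inline int chain_end_adjacent_only(const std::vector<Node> & nodes, int id) {
    const int next = id + 1;
    if (next < (int) nodes.size() && consumes(nodes[next], id) &&
        (nodes[next].op == Op::ADD_ID || nodes[next].op == Op::MUL)) {
        return next;
    }
    return id;
}

// Sketch of the new behaviour: skip nodes that do not consume the current
// chain tail, and keep extending the chain across consecutive ADD_ID/MUL consumers.
inline int find_chain_end(const std::vector<Node> & nodes, int id) {
    int cur = id;
    for (int i = id + 1; i < (int) nodes.size(); ++i) {
        const Node & n = nodes[i];
        if (!consumes(n, cur)) {
            continue; // unrelated node (e.g. scale preparation): skip forward
        }
        if (n.op == Op::ADD_ID || n.op == Op::MUL) {
            cur = i;  // consumer is part of the chain: extend it
            continue;
        }
        break;        // any other consumer ends the chain
    }
    return cur;
}

// Toy graph in topological order, mimicking the situation in the description.
inline std::vector<Node> gemma4_like_graph() {
    return {
        { Op::MUL_MAT_ID, { -1, -1 } }, // 0: expert matmul
        { Op::RESHAPE,    { -1 } },     // 1: scale tensor preparation
        { Op::REPEAT,     { 1 } },      // 2
        { Op::GET_ROWS,   { 2 } },      // 3
        { Op::MUL,        { 0, 3 } },   // 4: scale MUL, consumes node 0
        { Op::MUL,        { 4, -1 } },  // 5: second MUL, consumes node 4
        { Op::OTHER,      { 5 } },      // 6: first node after the chain
    };
}

// Usage example: the adjacent-only check stops at node 0, the skipping check reaches node 5.
inline void check_example() {
    const std::vector<Node> g = gemma4_like_graph();
    assert(chain_end_adjacent_only(g, 0) == 0);
    assert(find_chain_end(g, 0) == 5);
}
```

In this toy graph the adjacent-only check sees the RESHAPE at index 1 and gives up, while the skipping variant walks past the three scale-preparation nodes and follows both MULs. Where the AllReduce is then placed relative to the chain end is outside the scope of this sketch.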

Performance on 2x 5090:

| model | test | t/s - `d5b780a` | t/s - PR | Speed-up |
| --- | --- | ---: | ---: | ---: |
| gemma4 26B.A4B Q8_0 | pp512 | 3473.35 | 7202.7 | 2.07 |
| gemma4 26B.A4B Q8_0 | tg128 | 164.29 | 202.45 | 1.23 |

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: Yes, took help in identifying the root cause

loci-review · 2026-04-20T03:54:39Z

Target version:

The base version shows minimal call depth (stub implementation), while the target version reveals the complete training pipeline with deep call stacks through ggml_opt_epoch (87.7% of time), ggml_opt_alloc, ggml_backend_sched_alloc_graph, and ggml_opt_eval.

Additional Findings

Multi-GPU MoE Impact: The meta-backend changes improve correctness for distributed tensor-parallel inference. The 142.7 µs local overhead is negligible compared to milliseconds saved per avoided AllReduce operation in multi-GPU Gemma-4/Mixtral deployments. Single-GPU and CPU inference paths are unaffected.

Fix delayed AllReduce on Gemma-4 MoE

4ce8fde

Skip forward past nodes that don't consume the current one, and allow a chain of MULs.

loci-dev deployed to PROD__AL_DEMO April 20, 2026 03:12 — with GitHub Actions Active

loci-dev mentioned this pull request Apr 20, 2026

Tensor-parallel: Fix delayed AllReduce on Gemma-4 MoE ggml-org/llama.cpp#22129

Merged

loci-dev wants to merge 1 commit into main from loci/pr-22129-gemma4_perf
