UPSTREAM PR #22129: Tensor-parallel: Fix delayed AllReduce on Gemma-4 MoE (#1361)

Open
loci-dev wants to merge 1 commit into main from loci/pr-22129-gemma4_perf



@loci-dev

Note

Source pull request: ggml-org/llama.cpp#22129

Skip forward past nodes that don't consume the current node, and allow a chain of MULs.

When down_exps_s is set, build_moe_ffn pulls the scale tensor in via reshape/repeat/get_rows. The topological sort places those nodes between mul_mat_id and the MUL that consumes the scale, so the existing nodes[id+1] check never sees an ADD_ID or MUL and fails.

The scale MUL is followed by a second MUL; the old code only accepted one.
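The two changes described above can be sketched on a toy graph. This is a minimal illustration, not the PR's code: `Node`, `Op`, `consumes`, `chain_end_adjacent`, `chain_end_skipping`, and `make_example_graph` are hypothetical names, and the node list only mimics the ordering described above (scale-related nodes placed between mul_mat_id and its consuming MUL, followed by a second MUL).

```cpp
#include <cassert>
#include <vector>

// Toy op set and node type for illustration only; these are NOT the ggml structs.
enum class Op { MUL_MAT_ID, RESHAPE, REPEAT, GET_ROWS, MUL, ADD_ID, OTHER };

struct Node {
    Op               op;
    std::vector<int> src; // indices of input nodes (always earlier in the list)
};

// True if node n reads the output of node `id`.
static bool consumes(const Node & n, int id) {
    for (int s : n.src) {
        if (s == id) {
            return true;
        }
    }
    return false;
}

// Adjacent-only check: looks at nodes[id + 1] and accepts at most one ADD_ID/MUL.
static int chain_end_adjacent(const std::vector<Node> & nodes, int id) {
    assert(id >= 0 && id < (int) nodes.size());
    const int next = id + 1;
    if (next < (int) nodes.size() && consumes(nodes[next], id) &&
        (nodes[next].op == Op::MUL || nodes[next].op == Op::ADD_ID)) {
        return next;
    }
    return id;
}

// Skip-forward check: ignores nodes that do not consume the current node,
// and keeps extending the chain while the consumer is an ADD_ID or MUL.
static int chain_end_skipping(const std::vector<Node> & nodes, int id) {
    assert(id >= 0 && id < (int) nodes.size());
    int cur = id;
    for (int i = id + 1; i < (int) nodes.size(); ++i) {
        if (!consumes(nodes[i], cur)) {
            continue; // unrelated node, e.g. part of the scale path
        }
        if (nodes[i].op != Op::MUL && nodes[i].op != Op::ADD_ID) {
            break; // first consumer that cannot join the chain
        }
        cur = i;
    }
    return cur;
}

// Topologically sorted toy graph mirroring the ordering described in the text.
static std::vector<Node> make_example_graph() {
    return {
        { Op::OTHER,      {}     }, // 0: some other input
        { Op::MUL_MAT_ID, {}     }, // 1: expert matmul
        { Op::RESHAPE,    {}     }, // 2: scale path
        { Op::REPEAT,     {2}    }, // 3: scale path
        { Op::GET_ROWS,   {3}    }, // 4: scale path
        { Op::MUL,        {1, 4} }, // 5: scale MUL, consumes node 1
        { Op::MUL,        {5, 0} }, // 6: second MUL, consumes node 5
        { Op::OTHER,      {6}    }, // 7: first consumer outside the chain
    };
}
```

On this toy graph, starting from node 1, the adjacent-only check stops at node 1 because node 2 is a RESHAPE, while the skip-forward check walks past nodes 2 to 4 and ends at node 6, covering both MULs.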

Performance on 2x 5090:

| model | test | t/s - d5b780a | t/s - PR | Speed-up |
| --- | --- | ---: | ---: | ---: |
| gemma4 26B.A4B Q8_0 | pp512 | 3473.35 | 7202.7 | 2.07 |
| gemma4 26B.A4B Q8_0 | tg128 | 164.29 | 202.45 | 1.23 |

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: Yes, took help in identifying the root cause

Skip forward past nodes that don't consume the current one, and allow a chain of MULs.
@loci-dev loci-dev deployed to PROD__AL_DEMO April 20, 2026 03:12 — with GitHub Actions Active
@loci-review

loci-review Bot commented Apr 20, 2026

[Flame graph: build.bin.libggml-base.so::ggml_opt_fit, shown for the base version and the target version]

The base version shows minimal call depth (stub implementation), while the target version reveals the complete training pipeline with deep call stacks through ggml_opt_epoch (87.7% of time), ggml_opt_alloc, ggml_backend_sched_alloc_graph, and ggml_opt_eval.

Additional Findings

Multi-GPU MoE Impact: The meta-backend changes improve correctness for distributed tensor-parallel inference. The 142.7 µs local overhead is negligible compared to milliseconds saved per avoided AllReduce operation in multi-GPU Gemma-4/Mixtral deployments. Single-GPU and CPU inference paths are unaffected.

💬 Questions? Tag @loci-dev
