Skip to content

Gemma-4 26B-A4B PARO: Metal GPU command-buffer OOM during MoE forward pass on M1 Max (not VLM-path SIGKILL) #46

@sangemaru

Description

@sangemaru

Summary

z-lab/gemma-4-26B-A4B-it-PARO (PARO INT4, MoE 256 experts, ~4B active) crashes with a Metal GPU command-buffer OOM during the MoE forward pass on the very first inference request on M1 Max 64 GB. The model loads cleanly (16 GB weights, ~27 GB unified-memory headroom available at crash time). The crash signature is:

[METAL] Command buffer execution failed: Insufficient Memory
        (00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory)
libc++abi: terminating due to uncaught exception of type std::runtime_error

The process aborts and the serving port becomes unusable.

Why this is a paroquant bug, not an oMLX (or VLM-path) bug

This was previously misattributed to oMLX's VLM inference path (the original symptom was "oMLX SIGKILL on inference"). To rule that out, I served the same weights via paro_serve.py (the LM-only wrapper used for dense E2B/E4B/31B PARO variants on ports 1290-1293, which loads via paroquant.inference.backends.mlx.load.load(..., force_text=True)), bypassing oMLX entirely on port 1294. The crash reproduces identically, same Metal error message, same libc++abi terminate.

So the bug is in paroquant's MLX MoE forward pass, not in oMLX or the VLM dispatcher.

Why MoE-specific

The three dense PARO variants on the same hardware and runtime all work cleanly through paro_serve.py:

  • z-lab/gemma-4-E2B-it-PARO (~7.3 GB) ✓ works
  • z-lab/gemma-4-E4B-it-PARO (~10.1 GB) ✓ works (43 t/s warm)
  • z-lab/gemma-4-31B-it-PARO (~19.3 GB) ✓ works (14 t/s warm at 32K)
  • z-lab/gemma-4-26B-A4B-it-PARO (~16.3 GB) ✗ crashes on first inference

The 26B-A4B is the only MoE PARO in the lineup. The Unsloth unsloth/gemma-4-26b-a4b-it-UD-MLX-4bit (~15.75 GB) works on the same hardware and oMLX runtime — different MoE dispatch path. So the MoE expert dispatch in paroquant's MLX backend is hitting a Metal command-buffer ceiling that the per-token dense path doesn't approach.

Hardware / environment

  • Apple M1 Max, 64 GB unified
  • macOS 25.4
  • paroquant[mlx] 0.1.15 via paroquant.inference.backends.mlx.load.load
  • mlx-lm 0.31.3
  • oMLX 0.3.9.dev2 (also reproduces under raw paro_serve.py)
  • Python 3.13

Reproducer

  1. Pull z-lab/gemma-4-26B-A4B-it-PARO.
  2. Either:
    • mtp_enabled=false in oMLX model_settings.json and hit /v1/chat/completions with model_id gemma-4-26B-A4B-it-PARO on port 1234, OR
    • Run python3.13 paro_serve.py --model <weights> --host 127.0.0.1 --port 1294 --temp 1.0 --top-p 0.95 --top-k 64 and hit port 1294.
  3. Send any non-trivial completion request (even a 20-token "say hi" prompt is enough; doesn't need long context).
  4. Server dies with the Metal error above. Port no longer accepting connections.

Workaround in use

For Mac MoE Gemma serving: unsloth/gemma-4-26b-a4b-it-UD-MLX-4bit (different MoE dispatch). Tested and working on the same setup.

Hypothesis on root cause

Per-expert command buffer dispatch in the MoE forward path is allocating beyond what Metal's command-buffer ceiling permits on M1 Max's GPU (32 GB shared GPU pool out of 64 GB unified, give or take). The pattern likely shows up at 256 experts (Gemma-4 26B-A4B configuration). Possible fix directions:

  • Batch expert dispatches into fewer command buffers.
  • Stream the per-expert buffers (recycle the buffer after dispatch).
  • Cap the in-flight expert-dispatch count.

Companion data

This is captured in our test fleet's per-model facts card. Happy to share the full server-log fragment, the paro_serve.py repro log, the paroquant.cli.serve repro log, or test on alternative paroquant builds if useful.

Related issue I'll file separately: the PARO HF config declares mtp_num_hidden_layers=1 without shipping the MTP head, causing 0% accept + ~50% overhead in any runtime that takes the config at face value.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions