Summary
z-lab/gemma-4-26B-A4B-it-PARO (PARO INT4, MoE 256 experts, ~4B active) crashes with a Metal GPU command-buffer OOM during the MoE forward pass on the very first inference request on M1 Max 64 GB. The model loads cleanly (16 GB weights, ~27 GB unified-memory headroom available at crash time). The crash signature is:
[METAL] Command buffer execution failed: Insufficient Memory
(00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory)
libc++abi: terminating due to uncaught exception of type std::runtime_error
The process aborts and the serving port becomes unusable.
Why this is a paroquant bug, not an oMLX (or VLM-path) bug
This was previously misattributed to oMLX's VLM inference path (the original symptom was "oMLX SIGKILL on inference"). To rule that out, I served the same weights via paro_serve.py (the LM-only wrapper used for dense E2B/E4B/31B PARO variants on ports 1290-1293, which loads via paroquant.inference.backends.mlx.load.load(..., force_text=True)), bypassing oMLX entirely on port 1294. The crash reproduces identically, same Metal error message, same libc++abi terminate.
So the bug is in paroquant's MLX MoE forward pass, not in oMLX or the VLM dispatcher.
Why MoE-specific
The three dense PARO variants on the same hardware and runtime all work cleanly through paro_serve.py:
z-lab/gemma-4-E2B-it-PARO (~7.3 GB) ✓ works
z-lab/gemma-4-E4B-it-PARO (~10.1 GB) ✓ works (43 t/s warm)
z-lab/gemma-4-31B-it-PARO (~19.3 GB) ✓ works (14 t/s warm at 32K)
z-lab/gemma-4-26B-A4B-it-PARO (~16.3 GB) ✗ crashes on first inference
The 26B-A4B is the only MoE PARO in the lineup. The Unsloth unsloth/gemma-4-26b-a4b-it-UD-MLX-4bit (~15.75 GB) works on the same hardware and oMLX runtime — different MoE dispatch path. So the MoE expert dispatch in paroquant's MLX backend is hitting a Metal command-buffer ceiling that the per-token dense path doesn't approach.
Hardware / environment
- Apple M1 Max, 64 GB unified
- macOS 25.4
paroquant[mlx] 0.1.15 via paroquant.inference.backends.mlx.load.load
- mlx-lm 0.31.3
- oMLX 0.3.9.dev2 (also reproduces under raw
paro_serve.py)
- Python 3.13
Reproducer
- Pull
z-lab/gemma-4-26B-A4B-it-PARO.
- Either:
mtp_enabled=false in oMLX model_settings.json and hit /v1/chat/completions with model_id gemma-4-26B-A4B-it-PARO on port 1234, OR
- Run
python3.13 paro_serve.py --model <weights> --host 127.0.0.1 --port 1294 --temp 1.0 --top-p 0.95 --top-k 64 and hit port 1294.
- Send any non-trivial completion request (even a 20-token "say hi" prompt is enough; doesn't need long context).
- Server dies with the Metal error above. Port no longer accepting connections.
Workaround in use
For Mac MoE Gemma serving: unsloth/gemma-4-26b-a4b-it-UD-MLX-4bit (different MoE dispatch). Tested and working on the same setup.
Hypothesis on root cause
Per-expert command buffer dispatch in the MoE forward path is allocating beyond what Metal's command-buffer ceiling permits on M1 Max's GPU (32 GB shared GPU pool out of 64 GB unified, give or take). The pattern likely shows up at 256 experts (Gemma-4 26B-A4B configuration). Possible fix directions:
- Batch expert dispatches into fewer command buffers.
- Stream the per-expert buffers (recycle the buffer after dispatch).
- Cap the in-flight expert-dispatch count.
Companion data
This is captured in our test fleet's per-model facts card. Happy to share the full server-log fragment, the paro_serve.py repro log, the paroquant.cli.serve repro log, or test on alternative paroquant builds if useful.
Related issue I'll file separately: the PARO HF config declares mtp_num_hidden_layers=1 without shipping the MTP head, causing 0% accept + ~50% overhead in any runtime that takes the config at face value.
Summary
z-lab/gemma-4-26B-A4B-it-PARO(PARO INT4, MoE 256 experts, ~4B active) crashes with a Metal GPU command-buffer OOM during the MoE forward pass on the very first inference request on M1 Max 64 GB. The model loads cleanly (16 GB weights, ~27 GB unified-memory headroom available at crash time). The crash signature is:The process aborts and the serving port becomes unusable.
Why this is a paroquant bug, not an oMLX (or VLM-path) bug
This was previously misattributed to oMLX's VLM inference path (the original symptom was "oMLX SIGKILL on inference"). To rule that out, I served the same weights via
paro_serve.py(the LM-only wrapper used for dense E2B/E4B/31B PARO variants on ports 1290-1293, which loads viaparoquant.inference.backends.mlx.load.load(..., force_text=True)), bypassing oMLX entirely on port 1294. The crash reproduces identically, same Metal error message, same libc++abi terminate.So the bug is in paroquant's MLX MoE forward pass, not in oMLX or the VLM dispatcher.
Why MoE-specific
The three dense PARO variants on the same hardware and runtime all work cleanly through
paro_serve.py:z-lab/gemma-4-E2B-it-PARO(~7.3 GB) ✓ worksz-lab/gemma-4-E4B-it-PARO(~10.1 GB) ✓ works (43 t/s warm)z-lab/gemma-4-31B-it-PARO(~19.3 GB) ✓ works (14 t/s warm at 32K)z-lab/gemma-4-26B-A4B-it-PARO(~16.3 GB) ✗ crashes on first inferenceThe 26B-A4B is the only MoE PARO in the lineup. The Unsloth
unsloth/gemma-4-26b-a4b-it-UD-MLX-4bit(~15.75 GB) works on the same hardware and oMLX runtime — different MoE dispatch path. So the MoE expert dispatch in paroquant's MLX backend is hitting a Metal command-buffer ceiling that the per-token dense path doesn't approach.Hardware / environment
paroquant[mlx]0.1.15 viaparoquant.inference.backends.mlx.load.loadparo_serve.py)Reproducer
z-lab/gemma-4-26B-A4B-it-PARO.mtp_enabled=falsein oMLXmodel_settings.jsonand hit/v1/chat/completionswith model_idgemma-4-26B-A4B-it-PAROon port 1234, ORpython3.13 paro_serve.py --model <weights> --host 127.0.0.1 --port 1294 --temp 1.0 --top-p 0.95 --top-k 64and hit port 1294.Workaround in use
For Mac MoE Gemma serving:
unsloth/gemma-4-26b-a4b-it-UD-MLX-4bit(different MoE dispatch). Tested and working on the same setup.Hypothesis on root cause
Per-expert command buffer dispatch in the MoE forward path is allocating beyond what Metal's command-buffer ceiling permits on M1 Max's GPU (32 GB shared GPU pool out of 64 GB unified, give or take). The pattern likely shows up at 256 experts (Gemma-4 26B-A4B configuration). Possible fix directions:
Companion data
This is captured in our test fleet's per-model facts card. Happy to share the full server-log fragment, the
paro_serve.pyrepro log, theparoquant.cli.serverepro log, or test on alternative paroquant builds if useful.Related issue I'll file separately: the PARO HF config declares
mtp_num_hidden_layers=1without shipping the MTP head, causing 0% accept + ~50% overhead in any runtime that takes the config at face value.