Gemma-4 26B-A4B PARO: Metal GPU command-buffer OOM during MoE forward pass on M1 Max (not VLM-path SIGKILL)

## Summary

`z-lab/gemma-4-26B-A4B-it-PARO` (PARO INT4, MoE 256 experts, ~4B active) crashes with a **Metal GPU command-buffer OOM during the MoE forward pass** on the very first inference request on M1 Max 64 GB. The model loads cleanly (16 GB weights, ~27 GB unified-memory headroom available at crash time). The crash signature is:

```
[METAL] Command buffer execution failed: Insufficient Memory
        (00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory)
libc++abi: terminating due to uncaught exception of type std::runtime_error
```

The process aborts and the serving port becomes unusable.

## Why this is a paroquant bug, not an oMLX (or VLM-path) bug

This was previously misattributed to oMLX's VLM inference path (the original symptom was "oMLX SIGKILL on inference"). To rule that out, I served the same weights via `paro_serve.py` (the LM-only wrapper used for dense E2B/E4B/31B PARO variants on ports 1290-1293, which loads via `paroquant.inference.backends.mlx.load.load(..., force_text=True)`), bypassing oMLX entirely on port 1294. The crash reproduces **identically**, same Metal error message, same libc++abi terminate.

So the bug is in paroquant's MLX MoE forward pass, not in oMLX or the VLM dispatcher.

## Why MoE-specific

The three dense PARO variants on the same hardware and runtime all work cleanly through `paro_serve.py`:
- `z-lab/gemma-4-E2B-it-PARO` (~7.3 GB) ✓ works
- `z-lab/gemma-4-E4B-it-PARO` (~10.1 GB) ✓ works (43 t/s warm)
- `z-lab/gemma-4-31B-it-PARO` (~19.3 GB) ✓ works (14 t/s warm at 32K)
- `z-lab/gemma-4-26B-A4B-it-PARO` (~16.3 GB) ✗ **crashes on first inference**

The 26B-A4B is the only MoE PARO in the lineup. The Unsloth `unsloth/gemma-4-26b-a4b-it-UD-MLX-4bit` (~15.75 GB) works on the same hardware and oMLX runtime — different MoE dispatch path. So the MoE expert dispatch in paroquant's MLX backend is hitting a Metal command-buffer ceiling that the per-token dense path doesn't approach.

## Hardware / environment

- Apple M1 Max, 64 GB unified
- macOS 25.4
- `paroquant[mlx]` 0.1.15 via `paroquant.inference.backends.mlx.load.load`
- mlx-lm 0.31.3
- oMLX 0.3.9.dev2 (also reproduces under raw `paro_serve.py`)
- Python 3.13

## Reproducer

1. Pull `z-lab/gemma-4-26B-A4B-it-PARO`.
2. Either:
   - `mtp_enabled=false` in oMLX `model_settings.json` and hit `/v1/chat/completions` with model_id `gemma-4-26B-A4B-it-PARO` on port 1234, **OR**
   - Run `python3.13 paro_serve.py --model <weights> --host 127.0.0.1 --port 1294 --temp 1.0 --top-p 0.95 --top-k 64` and hit port 1294.
3. Send any non-trivial completion request (even a 20-token "say hi" prompt is enough; doesn't need long context).
4. Server dies with the Metal error above. Port no longer accepting connections.

## Workaround in use

For Mac MoE Gemma serving: `unsloth/gemma-4-26b-a4b-it-UD-MLX-4bit` (different MoE dispatch). Tested and working on the same setup.

## Hypothesis on root cause

Per-expert command buffer dispatch in the MoE forward path is allocating beyond what Metal's command-buffer ceiling permits on M1 Max's GPU (32 GB shared GPU pool out of 64 GB unified, give or take). The pattern likely shows up at 256 experts (Gemma-4 26B-A4B configuration). Possible fix directions:

- Batch expert dispatches into fewer command buffers.
- Stream the per-expert buffers (recycle the buffer after dispatch).
- Cap the in-flight expert-dispatch count.

## Companion data

This is captured in our test fleet's per-model facts card. Happy to share the full server-log fragment, the `paro_serve.py` repro log, the `paroquant.cli.serve` repro log, or test on alternative paroquant builds if useful.

Related issue I'll file separately: the PARO HF config declares `mtp_num_hidden_layers=1` without shipping the MTP head, causing 0% accept + ~50% overhead in any runtime that takes the config at face value.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Gemma-4 26B-A4B PARO: Metal GPU command-buffer OOM during MoE forward pass on M1 Max (not VLM-path SIGKILL) #46

Summary

Why this is a paroquant bug, not an oMLX (or VLM-path) bug

Why MoE-specific

Hardware / environment

Reproducer

Workaround in use

Hypothesis on root cause

Companion data

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Gemma-4 26B-A4B PARO: Metal GPU command-buffer OOM during MoE forward pass on M1 Max (not VLM-path SIGKILL) #46

Description

Summary

Why this is a paroquant bug, not an oMLX (or VLM-path) bug

Why MoE-specific

Hardware / environment

Reproducer

Workaround in use

Hypothesis on root cause

Companion data

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions