A faster Mamba selective scan for Apple Silicon (custom Metal kernel) #1278

createcentury · 2026-05-15T18:24:58Z

Hi mlx-lm team & community,

`mlx_lm/models/mamba.py` currently implements selective scan as a Python `for t in range(T)` loop. This is correct but sequential: every timestep launches new MLX ops and depends on the previous one, so prefill takes T dependent steps and its wall-clock time grows linearly with the prompt length T.
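For reference, the recurrence being computed can be sketched in plain Python. This is a minimal single-channel illustration that assumes the standard Mamba discretization (`exp(dt * A)` for the state decay and `dt * B * x` for the input term); it is not the code in `mlx_lm/models/mamba.py`, and the function and argument names are hypothetical.

```python
import math


def softplus(v):
    # Numerically stable log(1 + exp(v)).
    return max(v, 0.0) + math.log1p(math.exp(-abs(v)))


def selective_scan_sequential(x, delta, A, B, C, D):
    """Sequential selective scan for one channel (illustrative only).

    x, delta: lists of length T
    A:        list of length N (state decay rates, usually negative)
    B, C:     T lists of length N (input-dependent, i.e. "variable B / C")
    D:        scalar skip-connection weight
    Returns (outputs of length T, final state of length N).
    """
    N = len(A)
    h = [0.0] * N
    ys = []
    for t in range(len(x)):  # each step depends on the previous state
        dt = softplus(delta[t])  # assumption: delta_softplus is enabled
        for n in range(N):
            a = math.exp(dt * A[n])      # discretized decay
            b = dt * B[t][n] * x[t]      # discretized input
            h[n] = a * h[n] + b          # h_t = a_t * h_{t-1} + b_t
        ys.append(sum(C[t][n] * h[n] for n in range(N)) + D * x[t])
    return ys, h
```

The point to note is the inner update `h = a * h + b`: it is a first-order linear recurrence, which is what makes a parallel formulation possible.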

I wrote a Metal Shading Language kernel for the same recurrence that uses a parallel prefix scan over the associative (a, b) pair operator — the same approach Mamba's CUDA selective_scan_fwd_kernel.cuh takes. It runs through mx.fast.metal_kernel and supports:

- variable B / C
- D skip connection
- delta_softplus
- z gate (SiLU)
- fp16 inputs with fp32 scan accumulation
- arbitrary seqlen via chunked SRAM running prefix
- a ssm_state_out output for inference state caching (mirrors Mamba CUDA's params.x_ptr)
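To make the "(a, b) pair operator" and the chunked running prefix concrete, here is a small pure-Python sketch. It assumes a scalar recurrence `h_t = a_t * h_{t-1} + b_t` and uses a Hillis-Steele style scan purely for illustration; the actual Metal kernel may organize the scan differently, and all names below are hypothetical.

```python
def combine(left, right):
    """Compose two steps of h -> a*h + b; `left` is the earlier step."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)


def scan_sequential(pairs, h0=0.0):
    # Reference: one dependent step per timestep.
    h, out = h0, []
    for a, b in pairs:
        h = a * h + b
        out.append(h)
    return out


def scan_parallel(pairs):
    """Inclusive prefix scan under `combine` in O(log T) rounds.

    Every iteration of the inner loop is independent of the others,
    so on a GPU each round could run across threads at once.
    """
    cur = list(pairs)
    step = 1
    while step < len(cur):
        nxt = list(cur)
        for i in range(step, len(cur)):
            nxt[i] = combine(cur[i - step], cur[i])
        cur = nxt
        step *= 2
    return cur  # cur[i] = (A_i, B_i) with h_i = A_i * h_in + B_i


def scan_chunked(pairs, chunk, h0=0.0):
    """Scan fixed-size chunks, carrying the running state between them."""
    out, carry = [], h0
    for start in range(0, len(pairs), chunk):
        prefixes = scan_parallel(pairs[start:start + chunk])
        out.extend(a * carry + b for a, b in prefixes)
        carry = out[-1]  # running prefix handed to the next chunk
    return out, carry    # final carry plays the role of a cached state
```

Because `combine` is associative, the grouping of steps does not affect the result, which is why the chunked version matches the sequential loop. The final `carry` is the analogue of a state output that decode could resume from.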

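Benchmark: both implementations run on the same M4 Max with the same mamba-130m-hf checkpoint, processing the same prompt and then decoding 50 tokens greedily. Times are end-to-end wall-clock seconds: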

| prompt tokens | mlx-lm | mamba-metal | speedup |
|---|---|---|---|
| 71 | 0.26 s | 0.21 s | 1.22× |
| 351 | 0.65 s | 0.34 s | 1.90× |
| 1,401 | 2.41 s | 0.30 s | 8.14× |
| 5,601 | 9.01 s | 0.82 s | 11.03× |

All five state-spaces/mamba-{130m, 370m, 790m, 1.4b, 2.8b}-hf checkpoints load and generate end-to-end.

Repo: https://github.com/createcentury/mamba-metal — kernels are first-class .metal files under mamba_metal/kernels/.

Happy to discuss integration approaches if there's interest — the parallel-scan kernel could drop into mlx-lm's Mamba path for prefill while keeping the existing per-step loop for decode (which is already optimal at T=1).
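One possible shape for that split, as a rough sketch only: route single-token calls to the existing loop and longer sequences to the scan. The names `run_ssm`, `per_step_loop`, and `scan_fn` are hypothetical stand-ins, not mlx-lm or mamba-metal APIs, and the toy scalar recurrence replaces the real SSM update.

```python
def per_step_loop(pairs, h0):
    """Stand-in for the existing per-timestep path (toy scalar recurrence)."""
    h, out = h0, []
    for a, b in pairs:
        h = a * h + b
        out.append(h)
    return out, h


def run_ssm(pairs, h0, scan_fn, loop_fn=per_step_loop):
    """Hypothetical dispatcher between decode and prefill paths.

    pairs:   list of (a_t, b_t) steps for h_t = a_t * h_{t-1} + b_t
    h0:      cached state from the previous call
    scan_fn: parallel-scan implementation with the same signature as loop_fn
    """
    if len(pairs) <= 1:
        # Decode (T == 1): nothing to parallelize over time.
        return loop_fn(pairs, h0)
    # Prefill (T > 1): hand the whole sequence to the scan.
    return scan_fn(pairs, h0)
```

Both paths take and return the state, so a cache filled by the prefill path can be passed to the decode path on the next call.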
