Add Flash-MoE-style SSD-backed expert streaming for large MoE models #986

@ehd4476

Description

Is your feature request related to a problem? Please describe.

Yes. I would like to run very large Mixture-of-Experts (MoE) models on Apple Silicon machines with limited unified memory, but current MLX-based inference workflows generally assume that most model weights, including expert weights, are resident in unified memory or loaded up front through conventional model-loading paths.

Projects such as flash-moe show that large MoE models can run on Apple Silicon by streaming expert weights from SSD instead of keeping the entire model in unified memory. Its README describes running Qwen3.5-397B-A17B on an M3 Max MacBook Pro with 48 GB RAM by streaming a 209 GB model from SSD through a custom Metal pipeline.

There are also MLX-related experiments in this direction. mlx-flash focuses on flash weight streaming for MLX and describes running models larger than RAM by streaming weights directly from SSD. anemll-flash-mlx is closer to the MoE-specific problem and emphasizes clean boundaries between dense execution, expert storage, expert selection, and expert consumption, using a stable bank / slot-bank approach with expert IDs as data.

However, there is still no clean, first-class MLX feature for Flash-MoE-style execution of large sparse MoE models where expert weights can be streamed, cached, or swapped efficiently without fully materializing all experts in unified memory.

Describe the solution you'd like

I would like MLX to support a Flash-MoE-style execution path for large MoE models on Apple Silicon.

Ideally, this would include:

• SSD-backed or memory-mapped expert weight storage
• Loading, streaming, or swapping only the selected experts needed for each token / layer
• Keeping the dense path inside the normal MLX execution flow
• Treating router outputs / expert IDs as data that can drive sparse expert execution
• A stable per-layer expert bank or slot-bank abstraction
• Efficient hit-path and miss-path handling for expert cache behavior
• Predictive prefetching or lookahead support where possible
• Support for MLX-compatible quantized checkpoints and mixed-precision expert formats
• Reference examples for large MoE models such as Qwen-style A3B / A17B variants
• A Python-accessible API so that users can experiment without maintaining a fully custom C / Metal inference engine

The goal is not necessarily to copy the exact implementation of flash-moe, but to make the underlying idea available in the MLX ecosystem: keep dense execution efficient in MLX, while making sparse expert execution streamable, cacheable, and memory-efficient.
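To make the Python-accessible part concrete, below is a rough sketch of what such an API could look like. Everything here is hypothetical: `ExpertBank`, `moe_layer`, the weight keys, and the file layout are illustrative names, not existing MLX, flash-moe, or anemll-flash-mlx APIs; only the mx.* calls used (mx.load, mx.softmax, mx.argsort, mx.maximum, mx.zeros, mx.stack) are existing mlx.core functions.

```python
# Hypothetical sketch only: ExpertBank and moe_layer do not exist in MLX today.
# Assumed file layout: one .safetensors file per expert, e.g.
#   weights/layer_00/expert_3.safetensors  with keys "w1" and "w2".
import mlx.core as mx


class ExpertBank:
    """Per-layer bank that materializes only the experts that are requested."""

    def __init__(self, layer_dir: str):
        self.layer_dir = layer_dir

    def fetch(self, expert_id: int) -> dict:
        # Stream a single expert's weights from SSD; mx.load reads
        # .safetensors / .npz files into MLX arrays on demand.
        return mx.load(f"{self.layer_dir}/expert_{expert_id}.safetensors")


def moe_layer(x: mx.array, router_w: mx.array, bank: ExpertBank, top_k: int = 2) -> mx.array:
    # Router outputs are plain data: top-k expert IDs per token decide
    # which experts get loaded and executed.
    logits = x @ router_w                              # (tokens, n_experts)
    probs = mx.softmax(logits, axis=-1)
    ids = mx.argsort(logits, axis=-1)[:, -top_k:]      # top-k expert indices

    rows = []
    for t in range(x.shape[0]):
        acc = mx.zeros((x.shape[-1],))
        for e in ids[t].tolist():
            w = bank.fetch(e)                          # only selected experts are touched
            h = mx.maximum(x[t] @ w["w1"], 0.0) @ w["w2"]
            acc = acc + probs[t, e] * h
        rows.append(acc)
    return mx.stack(rows)
```

In a real implementation the per-token loops would be batched and the fetch call backed by a resident cache and asynchronous prefetch, but the sketch shows the boundary this request is about: the dense path stays in ordinary MLX code, while expert weights are pulled in by ID only when the router selects them.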

Describe alternatives you've considered

I have considered using the original flash-moe project directly. It demonstrates the core idea very well, but it is a custom C / Objective-C / Metal inference engine rather than a general MLX feature, which makes it harder to integrate with existing MLX workflows, model conversion tools, and Python experimentation.

I have also considered mlx-flash-style weight streaming. This is promising for running models larger than available RAM, but MoE models have a more specific execution pattern: only a subset of experts is needed for each token. An MoE-aware slot-bank or expert-cache design could therefore be more efficient than generic layer or tensor streaming, as sketched below.
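As a rough illustration of what MoE-aware caching means in practice, here is a minimal sketch of a per-layer expert cache with an explicit hit path and miss path, plus naive lookahead prefetch. The slot count, file layout, and class name are assumptions for illustration only and are not how flash-moe or anemll-flash-mlx implement this; mx.load is the only MLX call used.

```python
# Minimal, hypothetical sketch of a per-layer expert cache for MLX.
# Hit path: reuse an expert already resident in a slot (no I/O).
# Miss path: evict the least-recently-used expert and stream a new one from SSD.
# Assumed file layout: experts/layer_{l}/expert_{e}.safetensors
from collections import OrderedDict

import mlx.core as mx


class ExpertCache:
    def __init__(self, layer: int, max_slots: int = 8):
        self.layer = layer
        self.max_slots = max_slots      # number of experts kept resident per layer
        self.slots = OrderedDict()      # expert_id -> dict of weight arrays
        self.hits = 0
        self.misses = 0

    def get(self, expert_id: int) -> dict:
        if expert_id in self.slots:
            # Hit path: no I/O, just LRU bookkeeping.
            self.slots.move_to_end(expert_id)
            self.hits += 1
            return self.slots[expert_id]
        # Miss path: make room, then stream the expert's weights from storage.
        self.misses += 1
        if len(self.slots) >= self.max_slots:
            self.slots.popitem(last=False)          # evict least-recently-used expert
        weights = mx.load(f"experts/layer_{self.layer}/expert_{expert_id}.safetensors")
        self.slots[expert_id] = weights
        return weights

    def prefetch(self, expert_ids) -> None:
        # Naive lookahead: warm the slots with experts predicted for upcoming tokens.
        for e in expert_ids:
            self.get(e)
```

Compared with generic whole-layer streaming, this kind of cache can exploit temporal locality in routing: experts that were recently selected tend to be selected again, so most lookups can stay on the cheap hit path and only cold experts pay SSD latency.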

Another alternative is using anemll-flash-mlx as an external toolkit. Its design is closely aligned with the desired direction, especially around stable banks, routed expert IDs, and separating hit-path from miss-path behavior. However, having official or first-class MLX primitives for this would make the approach easier to maintain, optimize, and reuse across different MoE architectures.

Finally, I could simply use smaller quantized MoE models that fit fully in unified memory, but that does not solve the main problem: running large sparse MoE models whose total expert weight size exceeds available memory.

Additional context

This feature request is motivated by recent Flash-MoE and MLX-based streaming experiments:

  • danveloper/flash-moe: demonstrates SSD-streamed expert execution for a very large MoE model on Apple Silicon, using a custom Metal inference engine.
  • matt-k-wong/mlx-flash: explores flash weight streaming for MLX, including SSD-backed execution for models larger than available RAM.
  • Anemll/anemll-flash-mlx: explores Flash-MoE inference for large MoE models on Apple Silicon using MLX, with emphasis on stable slot-banks and expert IDs as data.

A first-class Flash-MoE capability in MLX would be very useful for Apple Silicon users who want to experiment with large sparse MoE models locally, without abandoning the MLX ecosystem or maintaining a separate custom inference stack.

References

  • danveloper/flash-moe
    https://github.com/danveloper/flash-moe
    Demonstrates SSD-streamed expert execution for very large MoE models on Apple Silicon using a custom Metal-based inference engine.

  • matt-k-wong/mlx-flash
    https://github.com/matt-k-wong/mlx-flash
    Explores SSD-backed / flash weight streaming for MLX, allowing models larger than available RAM to run by streaming weights from storage.

  • Anemll/anemll-flash-mlx
    https://github.com/Anemll/anemll-flash-mlx
    Explores Flash-MoE inference for large MoE models on Apple Silicon using MLX, with emphasis on expert selection, stable slot-banks, and expert IDs as data.
