GitHub - mstar-project/mstar: A high-performance, universal serving framework for any-to-any models.

A universal serving system for composite, any-to-any multimodal models

Models are dataflow graphs · requests are Walks · one runtime serves them all

One runtime that matches or beats state-of-the-art inference systems. Full methodology and current numbers in the blog post and paper.

What is M*?

M* (pronounced "M-star") is a serving system for the new generation of composite multimodal models — models built from structurally distinct components (vision encoders, transformer backbones, diffusion and flow heads, audio codecs, action generators, world-model predictors) whose execution path changes with the input and the task.

LLM serving stacks assume inference is a single autoregressive loop. Composite models broke that assumption. M*'s core idea is the Walk Graph: a model is a dataflow graph of its components, and every request is a Walk over that graph. A single runtime serves unified multimodal models, omni models, speech LMs, vision-language-action policies, and world models — at or above the performance of engines specialized for each.

Fast — per-component fast paths, matched to each component's bottleneck:

Paged attention (FlashInfer) and continuous batching for autoregressive backbones
CUDA-graph capture for encoders and decode
Classifier-free-guidance parallelism for diffusion / flow
Sliding-window chunk streaming for audio codecs
Component-level disaggregation with pluggable tensor transport (shared memory, TCP, RDMA)
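
To make the classifier-free-guidance item concrete: each denoising step needs a conditional and an unconditional prediction, and the two do not depend on each other, so they can run side by side. The following is a minimal sketch under stated assumptions, not M*'s implementation. `cfg_step` and `toy_denoise` are hypothetical names, plain Python lists stand in for tensors, and a thread pool stands in for whatever parallel execution the runtime actually uses.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional, Sequence

Vector = list[float]


def cfg_step(
    denoise: Callable[[Sequence[float], Optional[str]], Vector],
    latent: Sequence[float],
    prompt: str,
    guidance_scale: float,
) -> Vector:
    """Run the conditional and unconditional branches concurrently, then blend."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        cond_future = pool.submit(denoise, latent, prompt)  # conditioned on the prompt
        uncond_future = pool.submit(denoise, latent, None)  # prompt dropped
        cond, uncond = cond_future.result(), uncond_future.result()
    # Standard classifier-free guidance: uncond + scale * (cond - uncond)
    return [u + guidance_scale * (c - u) for c, u in zip(cond, uncond)]


def toy_denoise(latent: Sequence[float], prompt: Optional[str]) -> Vector:
    # Hypothetical stand-in for a diffusion / flow head.
    shift = 1.0 if prompt is not None else 0.0
    return [x + shift for x in latent]
```

With a guidance scale of 1.0 the result equals the conditional prediction; larger scales push the output further along the direction from the unconditional to the conditional prediction.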

Flexible — the abstraction mirrors the model:

One small Python file per model declares its component graph and its Walks
A YAML file maps components to GPUs at per-component, per-walk granularity — arbitrary disaggregation, no code changes
Text, image, audio, video, and robot actions, in and out
A Python SDK, an OpenAI-compatible API, and a native streaming endpoint
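
The placement idea in the list above can be illustrated without any M* code: the graph's edges are logical, and a separate mapping decides which worker hosts each component. The sketch below is an assumption-laden illustration. The component names, the `gpu0`/`gpu1` labels, and the dictionary layout are all hypothetical and do not reflect M*'s YAML schema. It shows which edges would need tensor transport once a component moves to another GPU.

```python
# Hypothetical component graph: (producer, consumer) edges.
EDGES = [
    ("vision_encoder", "backbone"),
    ("backbone", "diffusion_head"),
    ("backbone", "text_decoder"),
]


def cross_worker_edges(edges, placement):
    """Return the edges whose two endpoints are placed on different workers."""
    return [(src, dst) for src, dst in edges if placement[src] != placement[dst]]


# Everything on one GPU: no edge crosses a worker boundary.
single_gpu = {name: "gpu0" for edge in EDGES for name in edge}

# Move only the diffusion head; the graph itself is unchanged.
disaggregated = {**single_gpu, "diffusion_head": "gpu1"}

print(cross_worker_edges(EDGES, single_gpu))
print(cross_worker_edges(EDGES, disaggregated))
```

Only the placement mapping differs between the two calls, which mirrors the "no code changes" claim: the model definition stays fixed while the deployment varies.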

Roadmap. M* is evolving toward many-model, agentic multimodal serving — routing requests across many models and tools within one graph-scheduled runtime.

Quickstart

uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install --torch-backend=auto -e ".[all]"    # install M* (quoted so zsh does not glob the extras)
mstar serve bagel          # one command — launch a server (default: http://localhost:8000)

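To enable flash-attn support (required for Qwen3-Omni, recommended for BAGEL), install the prebuilt wheel that matches your torch CUDA build. The wheels below target torch 2.9, CPython 3.12, and Linux x86_64: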

# torch built for CUDA 12.x (cu12)
uv pip install \
  "https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.9cxx11abiTRUE-cp312-cp312-linux_x86_64.whl"

# torch built for CUDA 13.x (cu13)
uv pip install \
  "https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu13torch2.9cxx11abiTRUE-cp312-cp312-linux_x86_64.whl"

Other models: mstar serve qwen3_omni · mstar serve orpheus · mstar serve pi05 · mstar serve vjepa2

Python SDK — works for every model (text, image, audio, video):

from mstar import MStarClient
client = MStarClient("http://localhost:8000")

client.chat("What is the capital of France?").text          # text
client.generate_image("a cat in a hat")                     # → PNG bytes   (BAGEL)
client.tts("Hello there", voice="tara").to_wav("out.wav")   # → speech      (Orpheus)

for event in client.chat("Tell me a story", stream=True):   # streaming
    print(getattr(event, "text", ""), end="", flush=True)

OpenAI-compatible API — drop-in for bagel, qwen3_omni, and orpheus:

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

client.chat.completions.create(model="bagel", messages=[{"role": "user", "content": "hi"}])
client.audio.speech.create(model="orpheus", input="hi", voice="tara")   # text-to-speech
client.images.generate(model="bagel", prompt="a cat")                   # image generation

Runnable scripts and curl examples live in examples/. Power users can launch any deployment with an explicit config: mstar-serve --config configs/<model>.yaml.

Note: The first request(s) on a fresh environment can be slow, often tens of seconds to a few minutes. M* compiles the model with torch.compile, and that compilation happens lazily on the first request that exercises each path.
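
Because compilation is lazy and per path, it can help to send one throwaway request per path before real traffic arrives. The helper below is a minimal sketch, not part of M*. `warm_up` is a hypothetical name, and it takes any zero-argument callable, so it makes no assumption about client timeouts or error types.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def warm_up(
    send: Callable[[], T],
    attempts: int = 3,
    delay_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `send` until it succeeds, pausing between failed attempts."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return send()
        except Exception as exc:  # e.g. a client-side timeout during compilation
            last_error = exc
            if attempt < attempts:
                sleep(delay_s)
    raise RuntimeError(f"warm-up failed after {attempts} attempts") from last_error
```

With the SDK client from the Quickstart this could be called as `warm_up(lambda: client.chat("hi").text)`, once for each path the deployment will serve (chat, image generation, speech).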

Supported models

| Model | Family | Input → Output | Endpoints |
|---|---|---|---|
| BAGEL | Unified multimodal | text, image → text, image | `/v1/chat/completions`, `/v1/images/generations` |
| Qwen3-Omni | Omni | text, image, audio, video → text, speech | `/v1/chat/completions` |
| Orpheus | Speech LM | text → speech | `/v1/audio/speech` |
| Pi0.5 | Vision-language-action | text, image, state → robot actions | `/generate` |
| V-JEPA 2 / 2-AC | World model | video (+ actions) → latents, rollouts | `/generate` |

Every model is reachable through the SDK and the native /generate endpoint; the OpenAI-compatible routes cover the chat, speech, and image models.
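
For client code that needs to pick a route programmatically, the table can be restated as a small lookup. This is only a sketch: `routes_for` is a hypothetical helper, the keys are the names used with `mstar serve` in the Quickstart, and the pairing of `pi05` with Pi0.5 and `vjepa2` with V-JEPA 2 is an assumption based on those names.

```python
# OpenAI-compatible routes, copied from the supported-models table.
OPENAI_ROUTES = {
    "bagel": ["/v1/chat/completions", "/v1/images/generations"],
    "qwen3_omni": ["/v1/chat/completions"],
    "orpheus": ["/v1/audio/speech"],
}

# The native endpoint is available for every model.
NATIVE_ROUTE = "/generate"

MODELS = ["bagel", "qwen3_omni", "orpheus", "pi05", "vjepa2"]


def routes_for(model: str) -> list[str]:
    """Return every route a model can be reached on, OpenAI-compatible ones first."""
    if model not in MODELS:
        raise KeyError(f"unknown model: {model}")
    return OPENAI_ROUTES.get(model, []) + [NATIVE_ROUTE]
```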

How it works

HTTP / SDK  →  API Server  →  Conductor  →  Workers (one per GPU)  →  streaming results
                                  │              │
                          walks the graph,   own subgraphs; route tensors
                          schedules walks    directly to one another

A model declares a computation graph of components and a set of named Walks (e.g. prefill, decode, image_gen). The Conductor turns each request into a walk over that graph and schedules it; Workers each own a subgraph on their GPU and stream tensors directly to one another. Logical graph structure is decoupled from physical placement, so the same model runs single-GPU or fully disaggregated by changing only the YAML node_groups. Four composable primitives — Sequential, Parallel, Loop, and a cross-partition StreamingGraphEdge — express every model family above. See the paper for the full design.
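
To give a feel for how Sequential, Parallel, and Loop compose, here is a toy model of a walk. It is a sketch under stated assumptions, not M*'s API: the class fields, the `trace` function, and the component names are hypothetical, the loop uses a fixed iteration count instead of a stop condition, and StreamingGraphEdge is left out.

```python
from dataclasses import dataclass
from itertools import zip_longest
from typing import Union


@dataclass
class Component:
    name: str


@dataclass
class Sequential:
    steps: list


@dataclass
class Parallel:
    branches: list


@dataclass
class Loop:
    body: "Node"
    iterations: int  # fixed count here; a real loop would stop on a condition


Node = Union[Component, Sequential, Parallel, Loop]


def trace(node: Node) -> list:
    """Flatten a walk into stages; each stage lists components that may run together."""
    if isinstance(node, Component):
        return [[node.name]]
    if isinstance(node, Sequential):
        return [stage for step in node.steps for stage in trace(step)]
    if isinstance(node, Loop):
        return [stage for _ in range(node.iterations) for stage in trace(node.body)]
    if isinstance(node, Parallel):
        # Line the branches up stage by stage and merge what runs side by side.
        branch_traces = [trace(branch) for branch in node.branches]
        return [
            [name for stage in stages for name in stage]
            for stages in zip_longest(*branch_traces, fillvalue=[])
        ]
    raise TypeError(f"unsupported node: {node!r}")


# Hypothetical image-generation walk.
image_gen = Sequential([
    Parallel([Component("text_encoder"), Component("vision_encoder")]),
    Loop(Component("backbone"), iterations=2),
    Component("diffusion_head"),
])

for stage in trace(image_gen):
    print(stage)
```

The same component graph could back several such walks (for example one that skips the diffusion head for text-only output), which is the sense in which a request is a walk over the model's graph.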

Performance

Across every model we benchmark, M* matches or beats the system specialized for that family — unified models (BAGEL), omni and speech models (Qwen3-Omni, Orpheus), and world models (V-JEPA 2) — by executing only the components each request needs and giving each its own fast path: paged attention and continuous batching for autoregressive backbones, classifier-free-guidance parallelism for diffusion, chunk streaming for audio codecs, and persistent-cache loops for world-model rollouts.
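
Chunk streaming is the least self-explanatory of these fast paths, so a small illustration may help. The generator below is a generic sliding-window sketch, not M*'s codec path: `sliding_chunks`, `window`, and `hop` are hypothetical names, and integers stand in for audio tokens. Each window overlaps the previous one, so a decoder consuming the chunks sees some left context while output can start before the full sequence exists.

```python
from collections import deque
from typing import Iterable, Iterator


def sliding_chunks(tokens: Iterable[int], window: int, hop: int) -> Iterator[list[int]]:
    """Yield overlapping windows as soon as enough new tokens have arrived."""
    if not 0 < hop <= window:
        raise ValueError("require 0 < hop <= window")
    buffer: deque = deque(maxlen=window)  # keeps only the most recent `window` tokens
    since_last = 0  # tokens received since the last emitted chunk
    for token in tokens:
        buffer.append(token)
        since_last += 1
        if len(buffer) == window and since_last >= hop:
            yield list(buffer)
            since_last = 0
    if since_last > 0:
        # Flush the tail so trailing tokens are not dropped.
        yield list(buffer)


for chunk in sliding_chunks(range(7), window=4, hop=2):
    print(chunk)
```

Here `window - hop` tokens are shared between consecutive chunks; a smaller `hop` lowers latency per chunk at the cost of more repeated work.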

Benchmark numbers shift as systems evolve — ours and everyone else's — so rather than freeze figures here that go stale, we keep the current results and full methodology in the blog post and the paper.

Contributing

Issues and pull requests are welcome. Found a bug, or want a model or feature supported? Open an issue. To add a model yourself, follow the Adding a New Model guide. PRs to main go through review and CI (ruff).

Citation

If you use M* in your research, please cite:

@article{mstar2026,
  title  = {M*: A Modular, Extensible, Serving System for Multimodal Models},
  author = {Jha, Atindra and Sagan, Naomi and Kamahori, Keisuke and Sivgin, Irmak and
            Sanda, Rohan and Gao, Steven and Horowitz, Mark and Zettlemoyer, Luke and
            Hsu, Olivia and Leskovec, Jure and Kasikci, Baris and Wang, Stephanie},
  year   = {2026},
  eprint = {2606.12688},
  archivePrefix = {arXiv},
  primaryClass = {cs.LG}
}

From Stanford University & the University of Washington. Correspondence: atindra@cs.stanford.edu.

Acknowledgments

M* builds on ideas and proven primitives from the open-source community — paged attention and continuous batching (vLLM), FlashInfer kernels, streaming speech serving (VoxServe), and RDMA tensor transport (Mooncake).

License

Apache License 2.0.
