ming_flash_omni: port to mstar (rebase onto origin/main) by zhudianGG · Pull Request #115 · mstar-project/mstar

zhudianGG · 2026-06-11T14:39:16Z

origin/main renamed the package mminf -> mstar and dropped the new_tokens parameter from Model.get_partition_forward_pass_args. This ports the complete Ming-flash-omni-2.0 native implementation (the full noah_ming stacked PR series — understanding + talker + config-tail/image-gen, steps 1-10) onto the current main:

* moved mminf/model/ming_omni_flash/ -> mstar/model/ming_omni_flash/ and rewrote all mminf -> mstar references (package + 23 modular tests + 2 configs + benchmark class).
* re-applied the two framework edits onto main's versions: the registry entry (MODEL_REGISTRY + HF_MODELS) and the get_worker_graphs fix that skips graph walks whose nodes aren't in the deploy's node_groups.
* dropped the new_tokens arg from our get_partition_forward_pass_args override to match the new base signature, and from the test call sites.
* fixed two stale tests that predated the 9b ImageGen wiring (ImageGen is now a registered submodule + partition, so the "unknown" assertions use a genuinely unknown name).

356 CPU/unit tests pass under the mstar tree; ruff clean. Remaining failures are snapshot/real-checkpoint-gated (tokenizer load, real-weight forward), not port regressions.

What does this PR do?

How was it tested?

Checklist

ruff check . passes
Added or updated tests / docs where relevant

zhudianGG · 2026-06-12T18:18:46Z

Benchmark Results

The numbers below are directional H100-class serving evidence for vllm-omni serving Ming-flash-omni-2.0 on 4×H100 under matched prompts, sampling settings, and decode parameters. They are useful for setting expectations, not universal guarantees; keep the caveats with the numbers when quoting them.

T2T Scaling Sweep

Pure text-to-text path, concurrency sweep from offline batch to closed-loop at varying parallelism levels. Full results: results/ming_t2t_sweep/SUMMARY.md.

Mode	Concurrency	Reqs	Wall (s)	E2E p50 (ms)	E2E p95 (ms)	req/s	tok/s
OFFLINE	1	50	69.14	1444	2310	0.72	109.6
CLOSED_LOOP	2	80	61.57	1436	2536	1.30	198.9
CLOSED_LOOP	4	80	33.94	1588	2846	2.36	355.7
CLOSED_LOOP	8	80	21.54	1899	3396	3.71	573.4
CLOSED_LOOP	16	80	13.78	2144	4175	5.81	887.9
CLOSED_LOOP	32	80	11.50	3728	7384	6.96	1060.5

Single-stream throughput is ~110 tok/s, bounded by TP=4 all-reduce overhead. Throughput grows steadily but sub-linearly with concurrency: 5.2× the single-stream tok/s at c=8 and 8.1× at c=16, with a knee between c=16 and c=32 (~9.7× peak). All 470 requests succeeded.
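The scaling factors for the sweep can be recomputed from the tok/s column of the table above. This is a minimal sketch; the `scaling` helper is illustrative and not part of the benchmark tooling.

```python
# tok/s per concurrency level, copied from the T2T sweep table above.
tok_per_s = {1: 109.6, 2: 198.9, 4: 355.7, 8: 573.4, 16: 887.9, 32: 1060.5}


def scaling(table, baseline=1):
    """Return {concurrency: (speedup, efficiency)} relative to the baseline row.

    speedup    = throughput / baseline throughput
    efficiency = speedup / (concurrency / baseline); 1.0 would be perfectly linear
    """
    base = table[baseline]
    return {
        c: (v / base, (v / base) / (c / baseline))
        for c, v in table.items()
    }


for c, (speedup, efficiency) in scaling(tok_per_s).items():
    print(f"c={c:>2}  speedup={speedup:5.2f}x  efficiency={efficiency:6.1%}")
```

Efficiency below 100% at every level above c=1 is what makes the curve sub-linear; the sharp drop between c=16 and c=32 is the knee.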

Per-Modality Smoke Runs

One representative run per modality at low concurrency to verify end-to-end pipeline correctness. Full results: results/ming_*/results.json.

Task	Result
T2T offline B=1	0.78 req/s; JCT p95 2.08 s
I2T	~1.18 req/s; JCT median 290 ms
A2T	English transcription + Chinese equivalence check passed
V2T	yoga / cup_change / audioqa mp4s; 1 req/s class
T2S	RTF ~0.14; JCT median 11.4 s
I2S	~0.13 req/s; ~7–12 s/req
V2S	8.6–10.8 s/req; 2.5–3 MB WAV @ 44.1 kHz
A2S	1.6–10 s/req; 0.5–3 MB WAV @ 44.1 kHz
T2I / I2I	Not wired (requires `ming_flash_omni` image-gen endpoint)

Accuracy Spot-Checks

Accuracy evaluated on standard benchmarks with greedy decoding (T=0), TP=4. Full results: results/ming_accuracy/ACCURACY.md.

Suite	Items	Accuracy	Pass rate
MMLU (0-shot, ~5/subj)	285	78.9%	99%+
VideoMME (chunk1 subset)	51	56.9%	100%

MMLU accuracy is consistent with published Ming-flash-omni-2.0 reported figures. VideoMME is a small subset and should be treated as a smoke-check rather than a full evaluation.

zhudianGG · 2026-06-12T18:24:13Z

SGLang-Omni reported benchmark:

Benchmark Results

The numbers below are directional H100-class serving evidence for SGLang-Omni serving Ming-flash-omni-2.0 under matched prompts, sampling settings, and decode parameters. They are useful for setting expectations, not universal guarantees; keep the caveats with the numbers when quoting them.

Text Thinker (GSM8K)

Pure text thinker path, text-only output, 100 samples from the GSM8K main test split (first 100 of 1319 problems, deterministic file order), greedy (T=0), TP=4 thinker.

Concurrency	Throughput	Mean latency	Accuracy
1	`0.615 qps`	`1.63 s`	94%
4	`1.938 qps`	`2.06 s`	95%
16	`4.608 qps`	`3.26 s`	95%

Throughput scales ~7.5× from c=1 to c=16 (sub-linear in concurrency) at stable accuracy.

Image-Text (MMMU)

Image-text input, text output, 50 samples from the full MMMU/MMMU validation split (all 30 subjects, sorted by sample id, first 50 with images — not the zhaochenyang20/mmmu-ci-50 CI subset), greedy (T=0), TP=4 thinker.

Concurrency	Throughput	Mean latency	Median latency	Accuracy
1	`0.144 qps`	`6.70 s`	`3.48 s`	60%
2	`0.251 qps`	`7.69 s`	`4.35 s`	64%
4	`0.454 qps`	`8.47 s`	`4.89 s`	66%
8	`0.720 qps`	`10.47 s`	`6.25 s`	64%
16	`0.996 qps`	`14.16 s`	`8.92 s`	62%

Throughput scales ~6.9× from c=1 to c=16; accuracy stays within MMMU sample noise.

Non-Streaming Talker

Speech output (modalities=["text","audio"]), voice DB30, uniform prompt, TP=4 thinker + dedicated talker GPU. Measured against the 7-stage non-streaming MingOmniSpeechPipelineConfig with stream=false (not the streaming pipeline); every request returned real 44.1 kHz audio (n_fail=0, mean ~6.3 s/clip).

Concurrency	Throughput	Mean wall	p95 wall
1	`2.02 req/s`	`0.493 s`	`0.522 s`
2	`2.87 req/s`	`0.68 s`	`0.74 s`
4	`3.01 req/s`	`1.25 s`	`1.39 s`
8	`2.92 req/s`	`2.35 s`	`2.77 s`
16	`3.03 req/s`	`3.70 s`	`5.01 s`

The talker is single-stream (SimpleScheduler.max_concurrency=1), which enables the CFM CUDA-graph capture and keeps c=1 latency low. Throughput plateaus near 3 req/s at high concurrency.

Streaming Talker

Streaming speech is a low-concurrency UX path: it trades some throughput for much earlier first audio. Same backend, same voice DB30. First audio is time-to-first-audio-chunk (TTFA) for streaming and full-response wall time for non-streaming.

Concurrency	Streaming TTFA	Non-streaming first audio	Streaming throughput	Non-streaming throughput
1	`0.236 s`	`0.509 s`	`1.206 req/s`	`1.956 req/s`
2	`0.593 s`	`0.653 s`	`1.288 req/s`	`2.989 req/s`
4	`1.269 s`	`1.260 s`	`1.368 req/s`	`2.990 req/s`
8	`2.675 s`	`2.280 s`	`1.410 req/s`	`3.006 req/s`
16	`4.474 s`	`3.697 s`	`1.448 req/s`	`3.011 req/s`

At c=1 streaming delivers first audio ~2.2× sooner at a ~38% throughput cost. The crossover is around c≈4; past it, single-stream queuing makes streaming's first chunk arrive later than the non-streaming full response. Each streaming request emits ~20 chunks at ~19 ms intervals. The streaming measurements are from PR/local-patch evidence, not a release-wide guarantee; cite streaming as a low-concurrency first-audio win rather than a universal throughput win.
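The first-audio ratio, throughput cost, and crossover point quoted above follow directly from the table. A small sketch that recomputes them; the helper names are illustrative only.

```python
# c: (streaming TTFA s, non-streaming first audio s,
#     streaming req/s, non-streaming req/s), copied from the table above.
rows = {
    1: (0.236, 0.509, 1.206, 1.956),
    2: (0.593, 0.653, 1.288, 2.989),
    4: (1.269, 1.260, 1.368, 2.990),
    8: (2.675, 2.280, 1.410, 3.006),
    16: (4.474, 3.697, 1.448, 3.011),
}


def compare(table):
    """Return {c: (first_audio_ratio, throughput_cost)}.

    first_audio_ratio > 1 means streaming delivers first audio sooner.
    throughput_cost is the fraction of non-streaming req/s given up.
    """
    return {
        c: (ns_first / s_ttfa, 1.0 - s_rps / ns_rps)
        for c, (s_ttfa, ns_first, s_rps, ns_rps) in table.items()
    }


def first_crossover(table):
    """Smallest concurrency where streaming TTFA is no better than non-streaming."""
    for c in sorted(table):
        s_ttfa, ns_first, _, _ = table[c]
        if s_ttfa >= ns_first:
            return c
    return None


ratio, cost = compare(rows)[1]
print(f"c=1: first audio {ratio:.1f}x sooner, throughput cost {cost:.0%}")
print("crossover at c =", first_crossover(rows))
```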

Audio Equivalence

A small c=1 audit (single prompt, single voice, n=4 WAVs per mode) checks that the streaming path preserves audio content versus non-streaming on the same backend.

Comparison	Result
Streaming vs non-streaming	Intelligible-equivalent: CER 0/0 on both, mel-L2 cross-mode ~1.5× within-mode baseline, duration delta <3%.

Intelligibility is fully preserved; streaming and non-streaming are close but not bit-identical (expected from chunked-decode windowing).

The worker and conductor form a PUSH/PULL cycle: each drains its PULL in the same loop that issues PUSH sends. ZMQCommunicator.send used a blocking send_pyobj, so when a peer's receive buffer filled (large Ming-flash-omni output tensors under concurrent load), the send blocked, the caller stopped draining its own PULL, and the two peers deadlocked: the server stalled at 0% GPU util with requests hung. Reproduced with 8 concurrent requests; py-spy showed the worker parked in zmq send inside _send_outputs.

Make send non-blocking: raise SNDHWM, then send with zmq.NOBLOCK and, on zmq.Again, queue the message in a per-peer in-process deque flushed opportunistically (before each send, on every poll, and in wait_for_work). A full peer can no longer block the sender's drain loop, so the cycle can always make progress; FIFO order and delivery are preserved. After this, the same 8-concurrent pattern and the full benchmark sweep (470 requests) complete with 0 failures.

Also widen the benchmark client's session timeout from a whole-session total=300 (which expired mid-sweep) to a per-request sock_read=120, so a slow-but-healthy request no longer spuriously fails the rest of the sweep.

Adds test/modular/test_communicator_backpressure.py covering non-blocking send with no receiver and overflow->queue->ordered-delivery.
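The queue-on-full pattern can be sketched without ZeroMQ. This is an illustration only, not the ZMQCommunicator code: `BoundedPeer`, `WouldBlock`, and `NonBlockingSender` are hypothetical stand-ins, with `WouldBlock` playing the role of zmq.Again and `send_nowait` the role of a send with zmq.NOBLOCK.

```python
from collections import deque


class WouldBlock(Exception):
    """Stand-in for zmq.Again: the peer cannot accept a message right now."""


class BoundedPeer:
    """Toy peer with a fixed-size receive buffer (models a socket at its HWM)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = deque()

    def send_nowait(self, msg):
        if len(self.buffer) >= self.capacity:
            raise WouldBlock
        self.buffer.append(msg)

    def drain(self, n):
        """The peer consumes up to n messages, freeing buffer space."""
        return [self.buffer.popleft() for _ in range(min(n, len(self.buffer)))]


class NonBlockingSender:
    """Never blocks: messages that do not fit wait in an in-process FIFO."""

    def __init__(self, peer):
        self.peer = peer
        self.pending = deque()

    def flush(self):
        """Push queued messages in order; stop at the first one that would block."""
        while self.pending:
            try:
                self.peer.send_nowait(self.pending[0])
            except WouldBlock:
                return False
            self.pending.popleft()
        return True

    def send(self, msg):
        # Enqueue behind anything already pending so FIFO order is preserved,
        # then try to flush. Either way, control returns to the caller's loop.
        self.pending.append(msg)
        self.flush()
```

Because `send` always returns, the caller can keep draining its own inbound side, which is what breaks the cycle. Calling `flush()` from the poll loop corresponds to the opportunistic flush points listed above.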

Adds an equal-split all-to-all collective to TPCommGroup: input (world_size * chunk, ...) is split into world_size equal chunks along dim 0, chunk i sent to rank i, output same shape. Equal split (no per-rank split sizes) keeps shapes static, which makes it CUDA-graph-capturable. This is the enabling primitive for capacity-padded expert-parallel (EP) MoE: EP dispatch/combine move tokens between expert-owning ranks via all-to-all, and the decode path runs under a captured CUDA graph, so the all-to-all must capture. The data-dependent token counts are handled by the caller padding to a fixed per-expert capacity so the split stays equal. Adds a 4-GPU distributed test (manual torchrun launch) covering equal-split correctness AND CUDA-graph capture+replay equivalence — the gating check for the EP design. Verified PASS on 4xH100.
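The data movement can be shown with a single-process reference using plain lists in place of tensors. This is a sketch of the semantics only, not the TPCommGroup implementation; `pack_equal_split` and its overflow behavior (raising an error) are assumptions, since the source does not say how over-capacity tokens are handled.

```python
def all_to_all_equal_split(inputs):
    """Reference semantics: inputs[r] is rank r's rows, length world_size * chunk.

    Chunk d of every rank is delivered to rank d, so rank d's output is the
    concatenation of chunk d from rank 0, rank 1, ... Output shape == input shape.
    """
    world_size = len(inputs)
    rows = len(inputs[0])
    if rows % world_size or any(len(x) != rows for x in inputs):
        raise ValueError("every rank needs world_size * chunk rows")
    chunk = rows // world_size
    outputs = []
    for dst in range(world_size):
        out = []
        for src in range(world_size):
            out.extend(inputs[src][dst * chunk:(dst + 1) * chunk])
        outputs.append(out)
    return outputs


def pack_equal_split(per_dest, capacity, pad=None):
    """Pad each destination's variable-length rows to a fixed capacity.

    per_dest[d] holds the rows this rank wants to send to rank d. Padding keeps
    the split equal (and shapes static) regardless of the actual token counts.
    """
    packed = []
    for rows in per_dest:
        if len(rows) > capacity:
            raise ValueError("over capacity; overflow policy is not modeled here")
        packed.extend(list(rows) + [pad] * (capacity - len(rows)))
    return packed


# Two ranks, capacity 2: token counts differ, packed shapes do not.
rank0 = pack_equal_split([["a"], ["b", "c"]], capacity=2)
rank1 = pack_equal_split([["d", "e"], []], capacity=2)
print(all_to_all_equal_split([rank0, rank1]))
```

Applying the exchange twice returns every row to its origin, which is why the same primitive can serve both dispatch and combine.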

The thinker's rope recomputed its cos/sin tables (plus the video_rope remap) inside every layer's attention via rotary(q, k, position_ids). cos/sin depend only on position_ids (and the static inv_freq), not on q/k, so for the 32-layer stack the tables were identical across all layers and recomputed 32x per forward.

Split the rope into compute_cos_sin(position_ids), called once in LingMoeModel.forward, and apply_cos_sin(q, k, cos, sin), the cheap per-layer rotation. Thread the precomputed (cos, sin) down through decoder_layer -> attention; forward(q, k, position_ids) is kept as a back-compat wrapper. Output is bit-identical (verified max diff 0.0).

Note: this does NOT change graph-captured decode throughput. That path already amortizes the recompute inside one fused CUDA-graph replay, and the decode step is latency-bound (MoE GEMM + TP all_reduce), not rope-compute-bound (profiled: rope ~33% of eager compute but the live decode step is dominated by other costs). The win is on the eager prefill path and reduced graph-capture cost; it's a redundant-work cleanup, not a decode speedup.
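The split can be illustrated with a plain 1D rotary embedding in pure Python. This is a simplified sketch, not the Ming code: it omits the partial-3D MRoPE and the video_rope remap, passes inv_freq explicitly, and uses lists instead of tensors. The function names follow the commit message; everything else is an assumption.

```python
import math


def compute_cos_sin(position_ids, inv_freq):
    """Build the tables once per forward; they depend only on the positions."""
    cos = [[math.cos(p * f) for f in inv_freq] for p in position_ids]
    sin = [[math.sin(p * f) for f in inv_freq] for p in position_ids]
    return cos, sin


def _rotate(x, cos, sin):
    """Rotate each row's (first half, second half) pairs by the per-token angle."""
    out = []
    for row, c, s in zip(x, cos, sin):
        half = len(row) // 2
        a, b = row[:half], row[half:]
        out.append(
            [ai * ci - bi * si for ai, bi, ci, si in zip(a, b, c, s)]
            + [bi * ci + ai * si for ai, bi, ci, si in zip(a, b, c, s)]
        )
    return out


def apply_cos_sin(q, k, cos, sin):
    """Cheap per-layer step: rotation only, no trigonometry."""
    return _rotate(q, cos, sin), _rotate(k, cos, sin)


def rotary(q, k, position_ids, inv_freq):
    """Back-compat wrapper: the old per-layer call that rebuilds the tables."""
    cos, sin = compute_cos_sin(position_ids, inv_freq)
    return apply_cos_sin(q, k, cos, sin)


head_dim, num_layers = 4, 32
inv_freq = [1.0 / (10000 ** (2 * i / head_dim)) for i in range(head_dim // 2)]
position_ids = [0, 1, 2]
q = [[1.0, 2.0, 3.0, 4.0]] * 3   # one row per token
k = [[0.5, -1.0, 2.0, 0.0]] * 3

cos, sin = compute_cos_sin(position_ids, inv_freq)   # hoisted: once per forward
# A real model has different q/k per layer; they are reused here for brevity.
hoisted = [apply_cos_sin(q, k, cos, sin) for _ in range(num_layers)]
```

Both paths run the same floating-point operations in the same order, so the hoisted result matches the wrapper exactly.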

The thinker decode walk ran eager, one request at a time: can_batch returned False, there was no forward_batched, and no get_cuda_graph_configs (so warmup captured nothing). Single-stream decode was ~16.6 tok/s and throughput was flat across concurrency (~0.15 req/s at every batch size).

Implement decode batching + CUDA graphs (the two are coupled: the engine's BasicBatchedCudaGraphConfig always routes through forward_batched, even at bs=1, so graphs require batching):

- get_cuda_graph_configs: declare a BasicBatchedCudaGraphConfig for thinker_decode across bs=[1,2,4,8,16,32]. Prefill stays eager for now (variable token counts need the FlashInferPacked path; follow-up).
- forward_batched: run the packed batch through LingMoeModel in one forward (decode = 1 token/request, so logits are (bs, V), no last-token gather), return per-rid {logits} + a __batched_logits__ sentinel so the engine / graph runner samples the whole batch at once.
- can_batch: True for the decode walks only (prefill + multimodal-embeds stay sequential).
- batched preprocess: pack input_ids + 3D positions across requests and plan_attention with per-request seq_lens; drop the single-request NotImplementedError.
- Graph-capturable positions: decode positions now flow prepare_inputs -> preprocess -> forward as a static tensor instead of a torch.arange rebuilt inside forward (the runner interns it and refreshes it per replay). Emitted BATCH-FIRST as (total_tokens, 3) because the static-buffer interning slices the leading dim and requires constant trailing dims; _positions_for_model transposes back to (3, total_tokens) for Ming's inline partial-3D rope. (qwen3_omni feeds decode through mrope cos/sin buffers instead, so its pattern doesn't transfer directly.)

Verified on 4xH100 TP=4: decode graphs capture (0 decode graph misses), correct output, single-stream ~110 tok/s (was 16.6), and 8 concurrent requests complete in ~1.5s (previously serialized). Full T2T sweep: 0/470 failures, req/s scales 0.93 -> 4.29 across concurrency 1 -> 32.
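The batch-first layout for decode positions can be sketched with plain lists. This is a toy model of the idea, not the runner's interning code: `StaticPositions`, `pack_decode_positions`, and `positions_for_model` are hypothetical names, and using the same index on all three axes for a text decode token is an assumption made for the example.

```python
class StaticPositions:
    """Fixed-size (max_tokens, 3) buffer refreshed in place before each replay."""

    def __init__(self, max_tokens, width=3):
        self.buf = [[0] * width for _ in range(max_tokens)]

    def refresh(self, rows):
        """Copy the live rows in and return a leading-dim slice of the buffer."""
        if len(rows) > len(self.buf):
            raise ValueError("batch exceeds the captured maximum")
        for dst, src in zip(self.buf, rows):
            dst[:] = src          # in-place write: the buffer identity is stable
        return self.buf[:len(rows)]


def pack_decode_positions(next_positions):
    """One decode token per request -> batch-first rows of shape (total_tokens, 3)."""
    return [[p, p, p] for p in next_positions]


def positions_for_model(batch_first):
    """Transpose (total_tokens, 3) -> (3, total_tokens) for the rope."""
    return [list(axis) for axis in zip(*batch_first)]


static = StaticPositions(max_tokens=32)
view = static.refresh(pack_decode_positions([7, 12]))   # two requests in the batch
print(positions_for_model(view))
```

Slicing only the leading dimension is what lets one buffer serve every batch size: the trailing dimension stays 3 no matter how many requests are live.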

BailingMoeV2ThinkerSubmodule.check_stop only stopped the thinker_decode_loop on EOS. Prompts where the model never emits EOS (list-style / open-ended) decoded until the client timeout, also degenerating into repetition. Add the max_tokens budget guard, mirroring qwen3_omni's ThinkerSubmodule.check_stop: stop when dynamic_loop_iter_counts[thinker_decode_loop]+1 >= max_tokens.
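The stop rule above reduces to a small predicate. A standalone sketch, not the submodule method: the flat argument list is hypothetical (the real check reads the counter from dynamic_loop_iter_counts), and treating a missing max_tokens as "no budget" is an assumption.

```python
def check_stop(last_token_id, eos_token_id, decode_iters, max_tokens=None):
    """Return True when the decode loop should stop.

    decode_iters mirrors dynamic_loop_iter_counts[thinker_decode_loop].
    """
    if last_token_id == eos_token_id:
        return True                      # original behavior: stop on EOS
    if max_tokens is None:
        return False                     # no budget supplied: EOS is the only exit
    return decode_iters + 1 >= max_tokens


EOS = 2
# A prompt whose output never contains EOS now terminates on the budget.
iters = 0
while not check_stop(last_token_id=5, eos_token_id=EOS,
                     decode_iters=iters, max_tokens=4):
    iters += 1
print("stopped with counter at", iters)
```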

The engine manager passes the resolved autocast_dtype into get_submodule (landed in the general dtype-alloc fix). Make the Ming thinker honor it: cast the meta model to that dtype before to_empty so params allocate directly in the target dtype, falling back to get_autocast_dtype() for direct callers that pass None.

zhudianGG · 2026-06-16T18:00:01Z

ming_flash_omni: port to mstar + native serving optimizations

Ports Ming-flash-omni-2.0 to the M* (mstar) Walk-graph runtime and adds the serving-path optimizations that make the native path competitive with vLLM-omni at low/mid concurrency.

Model port

Full Ling-2.0 thinker + multimodal stack under mstar/model/ming_omni_flash/: thinker LLM (sparse MoE, 256 experts top-8, partial-3D MRoPE), vision/audio encoders + projectors, image-gen (zimage/talker DiT) and TTS paths, config, weight loader, and engine submodules. TP=4 thinker-only serving wired via configs/ming_flash_omni_thinker_only_tp4.yaml. Registered in mstar/model/registry.py.

Serving optimizations

* Batched decode + thinker_decode CUDA graphs (submodules.py): decode is captured per batch size (BasicBatchedCudaGraphConfig, bs=[1,2,4,8,16,32]) with request batching (forward_batched/can_batch). ~6× single-stream decode (~16 → ~104 tok/s) and throughput scales 1.05 → 4.85 req/s (c=1 → 32).
* RoPE cos/sin hoist (components/rope.py, model.py, decoder_layer.py, attention.py): split into compute_cos_sin + apply_cos_sin, computed once per forward instead of per layer.
* check_stop max_tokens guard (submodules.py): stop thinker_decode_loop on EOS or when the per-request max_tokens budget is reached, preventing runaway decode on prompts that never emit EOS. Guarded for request_info=None callers.
* Non-blocking worker↔conductor sends (comm): replaces blocking ZMQ PUSH/PULL with non-blocking sends + per-peer outbound queue to break a deadlock under concurrency.
* autocast_dtype in get_submodule: consumes the new get_submodule(autocast_dtype=...) interface from #120 ("[model] allocate submodule params in autocast_dtype to avoid fp32 weight-load OOM") so meta params are allocated in bf16 (avoids the fp32 weight-load OOM at TP=4).
* all_to_all_single on TPCommGroup (mstar/distributed/communication.py): graph-capturable equal-split all-to-all primitive (reusable collective).

Performance (4×H100 80GB, T2T sweep, thinker-only TP=4)

concurrency	native req/s	vllm-omni req/s	winner
1	1.05	0.72	native
2	1.77	1.30	native
4	2.44	2.36	native
8	3.12	3.71	vllm
16	3.55	5.81	vllm
32	4.85	6.96	vllm

Native beats vLLM-omni through c=4 (decode graphs win at low concurrency) and is competitive at c=8; the high-concurrency gap is structural (balanced decode step, no single crushable bottleneck on 4-GPU single-node). 0/470 request failures.

Notes

* Rebased onto current main (includes the merged dtype fix from #120, "[model] allocate submodule params in autocast_dtype to avoid fp32 weight-load OOM"; not re-introduced here).
* An expert-parallel MoE was prototyped to close the high-concurrency gap but did not pay off at TP=4 single-node (all-to-all overhead exceeded the all_reduce it replaced); not included. The reusable all_to_all_single primitive is kept.

Test plan

pytest test/modular/test_ming_flash_omni_*.py (model/loader/config/submodules)
pytest test/modular/test_communicator_backpressure.py
TP=4 serve + T2T sweep reproduces the table above

zhudianGG mentioned this pull request Jun 11, 2026

[Ming-Omni] PR 1 — Ming-flash-omni: understanding path (thinker + vision/audio + multimodal prefill) #104

Closed

zhudianGG marked this pull request as ready for review June 11, 2026 14:40

zhudianGG requested review from kamahori and merceod June 11, 2026 14:41

zhudianGG added 4 commits June 16, 2026 09:33

zhudianGG force-pushed the noah_ming_mstar_port branch from d5e25ae to bed41c0 Compare June 16, 2026 17:19

zhudianGG added 3 commits June 16, 2026 17:19

zhudianGG force-pushed the noah_ming_mstar_port branch from bed41c0 to b18ddd6 Compare June 16, 2026 17:20

zhudianGG wants to merge 7 commits into main from noah_ming_mstar_port
