ming_flash_omni: port to mstar (rebase onto origin/main) #115

Open
zhudianGG wants to merge 7 commits into main from noah_ming_mstar_port

@zhudianGG

Collaborator

origin/main renamed the package mminf -> mstar and dropped the new_tokens parameter from Model.get_partition_forward_pass_args. This ports the complete Ming-flash-omni-2.0 native implementation (the full noah_ming stacked PR series — understanding + talker + config-tail/image-gen, steps 1-10) onto the current main:

  • moved mminf/model/ming_omni_flash/ -> mstar/model/ming_omni_flash/ and rewrote all mminf -> mstar references (package + 23 modular tests + 2 configs + benchmark class).
  • re-applied the two framework edits onto main's versions: the registry entry (MODEL_REGISTRY + HF_MODELS) and the get_worker_graphs fix that skips graph walks whose nodes aren't in the deploy's node_groups.
  • dropped the new_tokens arg from our get_partition_forward_pass_args override to match the new base signature, and from the test call sites.
  • fixed two stale tests that predated the 9b ImageGen wiring (ImageGen is now a registered submodule + partition, so the "unknown" assertions use a genuinely unknown name).

356 CPU/unit tests pass under the mstar tree; ruff clean. Remaining failures are snapshot/real-checkpoint-gated (tokenizer load, real-weight forward), not port regressions.

Checklist

  • ruff check . passes
  • Added or updated tests / docs where relevant

@zhudianGG

zhudianGG commented Jun 12, 2026

Collaborator Author

Benchmark Results

The numbers below are directional H100-class serving evidence for vllm-omni serving Ming-flash-omni-2.0 on 4×H100 under matched prompts, sampling settings, and decode parameters. They are useful for setting expectations, not universal guarantees; keep the caveats with the numbers when quoting them.


T2T Scaling Sweep

Pure text-to-text path, concurrency sweep from offline batch to closed-loop at varying parallelism levels. Full results: results/ming_t2t_sweep/SUMMARY.md.

| Mode | Concurrency | Reqs | Wall (s) | E2E p50 (ms) | E2E p95 (ms) | req/s | tok/s |
|---|---|---|---|---|---|---|---|
| OFFLINE | 1 | 50 | 69.14 | 1444 | 2310 | 0.72 | 109.6 |
| CLOSED_LOOP | 2 | 80 | 61.57 | 1436 | 2536 | 1.30 | 198.9 |
| CLOSED_LOOP | 4 | 80 | 33.94 | 1588 | 2846 | 2.36 | 355.7 |
| CLOSED_LOOP | 8 | 80 | 21.54 | 1899 | 3396 | 3.71 | 573.4 |
| CLOSED_LOOP | 16 | 80 | 13.78 | 2144 | 4175 | 5.81 | 887.9 |
| CLOSED_LOOP | 32 | 80 | 11.50 | 3728 | 7384 | 6.96 | 1060.5 |

Single-stream throughput is ~110 tok/s, bounded by TP=4 all-reduce overhead. Throughput scales steadily but sub-linearly up to c=8 (5.2× over c=1), with a knee between c=16 and c=32 (9.6× peak). All 470 requests succeeded.
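
The scaling figures quoted above can be rechecked from the tok/s column. The snippet below is a minimal sketch: the function name and the "efficiency" metric (speedup divided by the concurrency ratio) are illustrative choices, not part of the benchmark harness.

```python
# Concurrency -> tok/s, copied from the T2T sweep table.
TOK_PER_S = {1: 109.6, 2: 198.9, 4: 355.7, 8: 573.4, 16: 887.9, 32: 1060.5}


def scaling_report(tok_per_s):
    """Return {concurrency: (speedup, efficiency)} relative to the lowest concurrency."""
    base_c = min(tok_per_s)
    base = tok_per_s[base_c]
    report = {}
    for c in sorted(tok_per_s):
        speedup = tok_per_s[c] / base
        # An efficiency of 1.0 would mean perfectly linear scaling.
        efficiency = speedup / (c / base_c)
        report[c] = (speedup, efficiency)
    return report


if __name__ == "__main__":
    for c, (speedup, eff) in scaling_report(TOK_PER_S).items():
        print(f"c={c:>2}  speedup={speedup:5.2f}x  efficiency={eff:4.0%}")
```

Efficiency falling at every step is what "sub-linear" means here; the sharpest drop, between c=16 and c=32, is the knee.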


Per-Modality Smoke Runs

One representative run per modality at low concurrency to verify end-to-end pipeline correctness. Full results: results/ming_*/results.json.

| Task | Result |
|---|---|
| T2T | offline B=1 0.78 req/s; JCT p95 2.08 s |
| I2T | ~1.18 req/s; JCT median 290 ms |
| A2T | English transcription + Chinese equivalence check passed |
| V2T | yoga / cup_change / audioqa mp4s; 1 req/s class |
| T2S | RTF ~0.14; JCT median 11.4 s |
| I2S | ~0.13 req/s; ~7–12 s/req |
| V2S | 8.6–10.8 s/req; 2.5–3 MB WAV @ 44.1 kHz |
| A2S | 1.6–10 s/req; 0.5–3 MB WAV @ 44.1 kHz |
| T2I / I2I | Not wired (requires ming_flash_omni image-gen endpoint) |

Accuracy Spot-Checks

Accuracy evaluated on standard benchmarks with greedy decoding (T=0), TP=4. Full results: results/ming_accuracy/ACCURACY.md.

| Suite | Items | Accuracy | Pass rate |
|---|---|---|---|
| MMLU (0-shot, ~5/subj) | 285 | 78.9% | 99%+ |
| VideoMME (chunk1 subset) | 51 | 56.9% | 100% |

MMLU accuracy is consistent with the published Ming-flash-omni-2.0 figures. VideoMME is a small subset and should be treated as a smoke-check rather than a full evaluation.

@zhudianGG

Collaborator Author

SGLang-Omni reported benchmark:

Benchmark Results

The numbers below are directional H100-class serving evidence for SGLang-Omni serving Ming-flash-omni-2.0 under matched prompts, sampling settings, and decode parameters. They are useful for setting expectations, not universal guarantees; keep the caveats with the numbers when quoting them.

Text Thinker (GSM8K)

Pure text thinker path, text-only output, 100 samples from the GSM8K main test split (first 100 of 1319 problems, deterministic file order), greedy (T=0), TP=4 thinker.

| Concurrency | Throughput | Mean latency | Accuracy |
|---|---|---|---|
| 1 | 0.615 qps | 1.63 s | 94% |
| 4 | 1.938 qps | 2.06 s | 95% |
| 16 | 4.608 qps | 3.26 s | 95% |

Throughput scales ~7.5× from c=1 to c=16 (sub-linear in concurrency) at stable accuracy.

Image-Text (MMMU)

Image-text input, text output, 50 samples from the full MMMU/MMMU validation split (all 30 subjects, sorted by sample id, first 50 with images — not the zhaochenyang20/mmmu-ci-50 CI subset), greedy (T=0), TP=4 thinker.

| Concurrency | Throughput | Mean latency | Median latency | Accuracy |
|---|---|---|---|---|
| 1 | 0.144 qps | 6.70 s | 3.48 s | 60% |
| 2 | 0.251 qps | 7.69 s | 4.35 s | 64% |
| 4 | 0.454 qps | 8.47 s | 4.89 s | 66% |
| 8 | 0.720 qps | 10.47 s | 6.25 s | 64% |
| 16 | 0.996 qps | 14.16 s | 8.92 s | 62% |

Throughput scales ~6.9× from c=1 to c=16; accuracy stays within MMMU sample noise.

Non-Streaming Talker

Speech output (modalities=["text","audio"]), voice DB30, uniform prompt, TP=4 thinker + dedicated talker GPU. Measured against the 7-stage non-streaming MingOmniSpeechPipelineConfig with stream=false (not the streaming pipeline); every request returned real 44.1 kHz audio (n_fail=0, mean ~6.3 s/clip).

| Concurrency | Throughput | Mean wall | p95 wall |
|---|---|---|---|
| 1 | 2.02 req/s | 0.493 s | 0.522 s |
| 2 | 2.87 req/s | 0.68 s | 0.74 s |
| 4 | 3.01 req/s | 1.25 s | 1.39 s |
| 8 | 2.92 req/s | 2.35 s | 2.77 s |
| 16 | 3.03 req/s | 3.70 s | 5.01 s |

The talker is single-stream (SimpleScheduler.max_concurrency=1), which enables the CFM CUDA-graph capture and keeps c=1 latency low. Throughput plateaus near 3 req/s at high concurrency.

Streaming Talker

Streaming speech is a low-concurrency UX path: it trades some throughput for much earlier first audio. Same backend, same voice DB30. First audio is time-to-first-audio-chunk (TTFA) for streaming and full-response wall time for non-streaming.

| Concurrency | Streaming TTFA | Non-streaming first audio | Streaming throughput | Non-streaming throughput |
|---|---|---|---|---|
| 1 | 0.236 s | 0.509 s | 1.206 req/s | 1.956 req/s |
| 2 | 0.593 s | 0.653 s | 1.288 req/s | 2.989 req/s |
| 4 | 1.269 s | 1.260 s | 1.368 req/s | 2.990 req/s |
| 8 | 2.675 s | 2.280 s | 1.410 req/s | 3.006 req/s |
| 16 | 4.474 s | 3.697 s | 1.448 req/s | 3.011 req/s |

At c=1 streaming delivers first audio ~2.2× sooner at a ~38% throughput cost. The crossover is around c≈4; past it, single-stream queuing makes streaming's first chunk arrive later than the non-streaming full response. Each streaming request emits ~20 chunks at ~19 ms intervals. The streaming measurements are from PR/local-patch evidence, not a release-wide guarantee; cite streaming as a low-concurrency first-audio win rather than a universal throughput win.
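
The first-audio gain, the throughput cost, and the crossover point follow directly from the table. The sketch below recomputes them; the helper names are illustrative and are not part of any benchmark tooling.

```python
# Concurrency -> (streaming TTFA s, non-streaming first audio s,
#                 streaming req/s, non-streaming req/s), from the table.
ROWS = {
    1: (0.236, 0.509, 1.206, 1.956),
    2: (0.593, 0.653, 1.288, 2.989),
    4: (1.269, 1.260, 1.368, 2.990),
    8: (2.675, 2.280, 1.410, 3.006),
    16: (4.474, 3.697, 1.448, 3.011),
}


def first_audio_gain(rows):
    """How many times sooner streaming delivers first audio (>1 favors streaming)."""
    return {c: non_stream / stream for c, (stream, non_stream, _, _) in rows.items()}


def throughput_cost(rows):
    """Fraction of non-streaming throughput given up by streaming."""
    return {c: 1.0 - s_tput / ns_tput for c, (_, _, s_tput, ns_tput) in rows.items()}


def crossover(rows):
    """Lowest concurrency at which streaming's first chunk is no longer earlier."""
    for c in sorted(rows):
        stream, non_stream, _, _ = rows[c]
        if stream >= non_stream:
            return c
    return None


if __name__ == "__main__":
    print(first_audio_gain(ROWS)[1], throughput_cost(ROWS)[1], crossover(ROWS))
```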

Audio Equivalence

A small c=1 audit (single prompt, single voice, n=4 WAVs per mode) checks that the streaming path preserves audio content versus non-streaming on the same backend.

| Comparison | Result |
|---|---|
| Streaming vs non-streaming | Intelligible-equivalent: CER 0/0 on both, mel-L2 cross-mode ~1.5× within-mode baseline, duration delta <3%. |

Intelligibility is fully preserved; streaming and non-streaming are close but not bit-identical (expected from chunked-decode windowing).

The worker and conductor form a PUSH/PULL cycle: each drains its PULL in the
same loop that issues PUSH sends. ZMQCommunicator.send used a blocking
send_pyobj, so when a peer's receive buffer filled (large Ming-flash-omni
output tensors under concurrent load), the send blocked, the caller stopped
draining its own PULL, and the two peers deadlocked — server stalled at 0%
GPU util with requests hung. Reproduced with 8 concurrent requests; py-spy
showed the worker parked in zmq send inside _send_outputs.

Make send non-blocking: raise SNDHWM, then send with zmq.NOBLOCK and, on
zmq.Again, queue the message in a per-peer in-process deque flushed
opportunistically (before each send, on every poll, and in wait_for_work).
A full peer can no longer block the sender's drain loop, so the cycle can
always make progress; FIFO order and delivery are preserved. After this,
the same 8-concurrent pattern and the full benchmark sweep (470 requests)
complete with 0 failures.
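
The queue-on-backpressure idea can be illustrated without ZMQ. The sketch below is a stdlib-only model, not the ZMQCommunicator code: `WouldBlock` stands in for zmq.Again, `BoundedPeer` is a hypothetical stand-in for a PUSH socket whose peer buffer is full at its high-water mark, and `NonBlockingSender` is an illustrative name.

```python
from collections import deque


class WouldBlock(Exception):
    """Stand-in for zmq.Again: the peer cannot accept a message right now."""


class BoundedPeer:
    """Hypothetical socket whose peer buffer holds at most `capacity` messages."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = deque()

    def send_nowait(self, msg):
        if len(self.buffer) >= self.capacity:
            raise WouldBlock
        self.buffer.append(msg)

    def drain(self, n):
        """Simulate the peer consuming up to n messages."""
        return [self.buffer.popleft() for _ in range(min(n, len(self.buffer)))]


class NonBlockingSender:
    """Never blocks: overflow is parked in a FIFO deque and flushed later."""

    def __init__(self, peer):
        self.peer = peer
        self.pending = deque()

    def flush(self):
        """Send queued messages in order until the peer pushes back."""
        while self.pending:
            try:
                self.peer.send_nowait(self.pending[0])
            except WouldBlock:
                return False
            self.pending.popleft()
        return True

    def send(self, msg):
        # Flush older messages first so FIFO order is preserved; if anything
        # is still queued, the new message must wait behind it.
        if not self.flush():
            self.pending.append(msg)
            return
        try:
            self.peer.send_nowait(msg)
        except WouldBlock:
            self.pending.append(msg)
```

Because `send` always returns, the caller's loop can keep draining its own PULL side, which is what breaks the cycle described above. In the commit the flush additionally runs on every poll and in wait_for_work.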

Also widen the benchmark client's session timeout from a whole-session
total=300 (which expired mid-sweep) to a per-request sock_read=120, so a
slow-but-healthy request no longer spuriously fails the rest of the sweep.

Adds test/modular/test_communicator_backpressure.py covering non-blocking
send with no receiver and overflow->queue->ordered-delivery.

Adds an equal-split all-to-all collective to TPCommGroup: input
(world_size * chunk, ...) is split into world_size equal chunks along dim
0, chunk i sent to rank i, output same shape. Equal split (no per-rank
split sizes) keeps shapes static, which makes it CUDA-graph-capturable.

This is the enabling primitive for capacity-padded expert-parallel (EP)
MoE: EP dispatch/combine move tokens between expert-owning ranks via
all-to-all, and the decode path runs under a captured CUDA graph, so the
all-to-all must capture. The data-dependent token counts are handled by
the caller padding to a fixed per-expert capacity so the split stays
equal.
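
The data movement can be shown with a single-process simulation. This is a sketch of the semantics only, assuming plain Python lists in place of tensors and no real communication; `all_to_all_equal_split` and `pad_to_capacity` are hypothetical names, and raising on overflow is a choice made for the sketch (the commit does not say how overflow is handled).

```python
def all_to_all_equal_split(inputs):
    """Simulate an equal-split all-to-all across len(inputs) ranks.

    inputs[r] is rank r's list of rows, of length world_size * chunk.
    Rank r sends its chunk i to rank i, so each rank's output is the
    concatenation of "its" chunk from every source rank, in rank order.
    """
    world_size = len(inputs)
    total = len(inputs[0])
    if any(len(rows) != total for rows in inputs) or total % world_size:
        raise ValueError("every rank must hold world_size * chunk rows")
    chunk = total // world_size
    outputs = []
    for dst in range(world_size):
        received = []
        for src in range(world_size):
            received.extend(inputs[src][dst * chunk:(dst + 1) * chunk])
        outputs.append(received)
    return outputs


def pad_to_capacity(tokens_per_dst, capacity, pad=None):
    """Pad each per-destination token list to `capacity` rows and concatenate."""
    padded = []
    for tokens in tokens_per_dst:
        if len(tokens) > capacity:
            # Sketch-only policy: the source does not specify overflow handling.
            raise ValueError("capacity exceeded")
        padded.extend(tokens + [pad] * (capacity - len(tokens)))
    return padded
```

Padding every destination to the same capacity is what keeps the chunks equal, so the input and output shapes never depend on how many tokens each expert actually received.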

Adds a 4-GPU distributed test (manual torchrun launch) covering equal-split
correctness AND CUDA-graph capture+replay equivalence — the gating check
for the EP design. Verified PASS on 4xH100.

The thinker's rope recomputed its cos/sin tables (plus the video_rope
remap) inside every layer's attention via rotary(q, k, position_ids).
cos/sin depend only on position_ids (and the static inv_freq), not on
q/k, so for the 32-layer stack the tables were identical across all
layers and recomputed 32x per forward.

Split the rope into compute_cos_sin(position_ids) — called once in
LingMoeModel.forward — and apply_cos_sin(q, k, cos, sin) — the cheap
per-layer rotation. Thread the precomputed (cos, sin) down through
decoder_layer -> attention; forward(q, k, position_ids) is kept as a
back-compat wrapper. Output is bit-identical (verified max diff 0.0).
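
The hoist can be demonstrated on a toy model. The sketch below assumes a plain 1D rotate-half RoPE in pure Python rather than Ming's partial-3D rope with the video_rope remap, and it rotates a single vector per position in every "layer" instead of per-layer q/k; only the names compute_cos_sin and apply_cos_sin come from the commit, and their signatures here are simplified.

```python
import math


def compute_cos_sin(position_ids, head_dim, base=10000.0):
    """Build per-position cos/sin tables; depends only on positions, not on q/k."""
    half = head_dim // 2
    inv_freq = [base ** (-2.0 * i / head_dim) for i in range(half)]
    cos = [[math.cos(p * f) for f in inv_freq] for p in position_ids]
    sin = [[math.sin(p * f) for f in inv_freq] for p in position_ids]
    return cos, sin


def apply_cos_sin(x, cos, sin):
    """Rotate each vector in x (one per position) using rotate-half pairing."""
    half = len(cos[0])
    out = []
    for vec, c, s in zip(x, cos, sin):
        lo, hi = vec[:half], vec[half:]
        out.append(
            [a * ci - b * si for a, b, ci, si in zip(lo, hi, c, s)]
            + [b * ci + a * si for a, b, ci, si in zip(lo, hi, c, s)]
        )
    return out


def forward_per_layer(x, position_ids, num_layers):
    """Old shape: every layer rebuilds the same tables."""
    for _ in range(num_layers):
        cos, sin = compute_cos_sin(position_ids, len(x[0]))
        x = apply_cos_sin(x, cos, sin)
    return x


def forward_hoisted(x, position_ids, num_layers):
    """New shape: tables are built once and threaded through the layers."""
    cos, sin = compute_cos_sin(position_ids, len(x[0]))
    for _ in range(num_layers):
        x = apply_cos_sin(x, cos, sin)
    return x
```

Both variants run the same floating-point operations in the same order, so their outputs match exactly; the hoisted one simply skips the repeated table construction.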

Note: this does NOT change graph-captured decode throughput — that path
already amortizes the recompute inside one fused CUDA-graph replay, and
the decode step is latency-bound (MoE GEMM + TP all_reduce), not
rope-compute-bound (profiled: rope ~33% of eager compute but the live
decode step is dominated by other costs). The win is on the eager
prefill path and reduced graph-capture cost; it's a redundant-work
cleanup, not a decode speedup.

zhudianGG force-pushed the noah_ming_mstar_port branch from d5e25ae to bed41c0 on June 16, 2026 17:19

The thinker decode walk ran eager, one request at a time: can_batch
returned False, there was no forward_batched, and no get_cuda_graph_configs
(so warmup captured nothing). Single-stream decode was ~16.6 tok/s and
throughput was flat across concurrency (~0.15 req/s at every batch size).

Implement decode batching + CUDA graphs (the two are coupled — the engine's
BasicBatchedCudaGraphConfig always routes through forward_batched, even at
bs=1, so graphs require batching):

- get_cuda_graph_configs: declare a BasicBatchedCudaGraphConfig for
  thinker_decode across bs=[1,2,4,8,16,32]. Prefill stays eager for now
  (variable token counts need the FlashInferPacked path — follow-up).
- forward_batched: run the packed batch through LingMoeModel in one forward
  (decode = 1 token/request, so logits are (bs, V), no last-token gather),
  return per-rid {logits} + a __batched_logits__ sentinel so the engine /
  graph runner samples the whole batch at once.
- can_batch: True for the decode walks only (prefill + multimodal-embeds
  stay sequential).
- batched preprocess: pack input_ids + 3D positions across requests and
  plan_attention with per-request seq_lens; drop the single-request
  NotImplementedError.
- Graph-capturable positions: decode positions now flow
  prepare_inputs -> preprocess -> forward as a static tensor instead of a
  torch.arange rebuilt inside forward (the runner interns it and refreshes
  it per replay). Emitted BATCH-FIRST as (total_tokens, 3) because the
  static-buffer interning slices the leading dim and requires constant
  trailing dims; _positions_for_model transposes back to (3, total_tokens)
  for Ming's inline partial-3D rope. (qwen3_omni feeds decode through mrope
  cos/sin buffers instead, so its pattern doesn't transfer directly.)
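
The batch-first position layout from the last bullet can be pictured with nested lists. This is a sketch under stated assumptions: lists stand in for tensors, each text decode token is assumed to use the same index on all three rope axes, and the helper names are hypothetical rather than the runner's API.

```python
def pack_decode_positions(seq_lens):
    """Batch-first decode positions: one 3-axis row per request.

    Assumption for this sketch: a text decode token uses the same index on
    all three rope axes, namely the request's current sequence length.
    """
    return [[n, n, n] for n in seq_lens]


def write_into_static(static_buf, rows):
    """Copy rows into the leading slice of a preallocated buffer, in place."""
    if len(rows) > len(static_buf):
        raise ValueError("batch exceeds captured buffer size")
    for i, row in enumerate(rows):
        static_buf[i][:] = row
    return static_buf[:len(rows)]


def positions_for_model(batch_first):
    """Transpose (total_tokens, 3) -> (3, total_tokens) for the rope."""
    return [list(axis) for axis in zip(*batch_first)]
```

Only the leading dimension varies with batch size, which is why the batch-first layout suits a buffer that is sliced along its first axis and refreshed on each replay.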

Verified on 4xH100 TP=4: decode graphs capture (0 decode graph misses),
correct output, single-stream ~110 tok/s (was 16.6), and 8 concurrent
requests complete in ~1.5s (previously serialized). Full T2T sweep: 0/470
failures, req/s scales 0.93 -> 4.29 across concurrency 1 -> 32.

BailingMoeV2ThinkerSubmodule.check_stop only stopped the thinker_decode_loop
on EOS. Prompts where the model never emits EOS (list-style / open-ended)
decoded until the client timeout and degenerated into repetition. Add the
max_tokens budget guard, mirroring qwen3_omni's ThinkerSubmodule.check_stop:
stop when dynamic_loop_iter_counts[thinker_decode_loop]+1 >= max_tokens.

The engine manager passes the resolved autocast_dtype into get_submodule
(landed in the general dtype-alloc fix). Make the Ming thinker honor it:
cast the meta model to that dtype before to_empty so params allocate
directly in the target dtype, falling back to get_autocast_dtype() for
direct callers that pass None.

zhudianGG force-pushed the noah_ming_mstar_port branch from bed41c0 to b18ddd6 on June 16, 2026 17:20

@zhudianGG

Collaborator Author

ming_flash_omni: port to mstar + native serving optimizations

Ports Ming-flash-omni-2.0 to the M* (mstar) Walk-graph runtime and adds the serving-path optimizations that make the native path competitive with vLLM-omni at low/mid concurrency.

Model port

Full Ling-2.0 thinker + multimodal stack under mstar/model/ming_omni_flash/: thinker LLM (sparse MoE, 256 experts top-8, partial-3D MRoPE), vision/audio encoders + projectors, image-gen (zimage/talker DiT) and TTS paths, config, weight loader, and engine submodules. TP=4 thinker-only serving wired via configs/ming_flash_omni_thinker_only_tp4.yaml. Registered in mstar/model/registry.py.

Serving optimizations

  • Batched decode + thinker_decode CUDA graphs (submodules.py): decode is captured per batch size (BasicBatchedCudaGraphConfig, bs=[1,2,4,8,16,32]) with request batching (forward_batched/can_batch). ~6× single-stream decode (~16 → ~104 tok/s) and throughput scales 1.05 → 4.85 req/s (c=1 → 32).
  • RoPE cos/sin hoist (components/rope.py, model.py, decoder_layer.py, attention.py): split into compute_cos_sin + apply_cos_sin, computed once per forward instead of per layer.
  • check_stop max_tokens guard (submodules.py): stop thinker_decode_loop on EOS or when the per-request max_tokens budget is reached, preventing runaway decode on prompts that never emit EOS. Guarded for request_info=None callers.
  • Non-blocking worker↔conductor sends (comm): replaces blocking ZMQ PUSH/PULL with non-blocking sends + per-peer outbound queue to break a deadlock under concurrency.
  • autocast_dtype in get_submodule: consumes the new get_submodule(autocast_dtype=...) interface from #120 ("[model] allocate submodule params in autocast_dtype to avoid fp32 weight-load OOM") so meta params are allocated in bf16 (avoids the fp32 weight-load OOM at TP=4).
  • all_to_all_single on TPCommGroup (mstar/distributed/communication.py): graph-capturable equal-split all-to-all primitive (reusable collective).
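
The check_stop guard from the list above reduces to two conditions. The function below is a standalone sketch, not the BailingMoeV2ThinkerSubmodule method: its flat signature is hypothetical, `decode_iters_done` plays the role of dynamic_loop_iter_counts[thinker_decode_loop], and treating a missing budget as "stop on EOS only" is an assumption about how the request_info=None guard behaves.

```python
def check_stop(last_token_id, eos_token_id, decode_iters_done, max_tokens=None):
    """Return True when the decode loop should stop.

    max_tokens=None models a caller that supplies no request info; in this
    sketch only EOS can stop decode in that case (an assumption).
    """
    if last_token_id == eos_token_id:
        return True
    if max_tokens is None:
        return False
    # Same comparison as the commit message; the +1 presumably counts the
    # token produced by the iteration being checked.
    return decode_iters_done + 1 >= max_tokens


if __name__ == "__main__":
    # A request that never emits EOS (token 2) still stops at its budget.
    iters = 0
    while not check_stop(last_token_id=5, eos_token_id=2,
                         decode_iters_done=iters, max_tokens=16):
        iters += 1
    print("stopped after", iters + 1, "tokens")
```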

Performance (4×H100 80GB, T2T sweep, thinker-only TP=4)

| concurrency | native req/s | vllm-omni req/s | winner |
|---|---|---|---|
| 1 | 1.05 | 0.72 | native |
| 2 | 1.77 | 1.30 | native |
| 4 | 2.44 | 2.36 | native |
| 8 | 3.12 | 3.71 | vllm |
| 16 | 3.55 | 5.81 | vllm |
| 32 | 4.85 | 6.96 | vllm |

Native beats vLLM-omni through c=4 (decode graphs win at low concurrency) and stays competitive at c=8 (~16% behind). The gap widens to ~39% at c=16 and ~30% at c=32; this high-concurrency gap is structural (balanced decode step, no single crushable bottleneck on 4-GPU single-node). 0/470 request failures.

Notes

Test plan

  • pytest test/modular/test_ming_flash_omni_*.py (model/loader/config/submodules)
  • pytest test/modular/test_communicator_backpressure.py
  • TP=4 serve + T2T sweep reproduces the table above
