ming_flash_omni: port to mstar (rebase onto origin/main) #115
Benchmark Results

The numbers below are directional H100-class serving evidence for vllm-omni serving Ming-flash-omni-2.0 on 4×H100 under matched prompts, sampling settings, and decode parameters. They are useful for setting expectations, not universal guarantees; keep the caveats with the numbers when quoting them.

T2T Scaling Sweep

Pure text-to-text path, concurrency sweep from offline batch to closed-loop at varying parallelism levels. Full results:
Single-stream throughput is ~110 tok/s, bounded by TP=4 all-reduce overhead. Throughput scales linearly up to c=8 (5.2×), with a knee between c=16 and c=32 (9.6× peak). All 470 requests succeeded.

Per-Modality Smoke Runs

One representative run per modality at low concurrency to verify end-to-end pipeline correctness. Full results:
Accuracy Spot-Checks

Accuracy was evaluated on standard benchmarks with greedy decoding (T=0) at TP=4. Full results:
MMLU accuracy is consistent with the published Ming-flash-omni-2.0 figures. VideoMME is a small subset and should be treated as a smoke check rather than a full evaluation.
SGLang-Omni reported benchmark:

Benchmark Results

The numbers below are directional H100-class serving evidence for SGLang-Omni serving Ming-flash-omni-2.0 under matched prompts, sampling settings, and decode parameters. They are useful for setting expectations, not universal guarantees; keep the caveats with the numbers when quoting them.

Text Thinker (GSM8K)

Pure text thinker path, text-only output, 100 samples from the GSM8K
Throughput scales roughly linearly from c=1 to c=16 (~7.5×) at stable accuracy.

Image-Text (MMMU)

Image-text input, text output, 50 samples from the full
Throughput scales ~6.9× from c=1 to c=16; accuracy stays within MMMU sample noise.

Non-Streaming Talker

Speech output (
The talker is single-stream (

Streaming Talker

Streaming speech is a low-concurrency UX path: it trades some throughput for much earlier first audio. Same backend, same voice
At c=1 streaming delivers first audio ~2.2× sooner at a ~38% throughput cost. The crossover is around c≈4; past it, single-stream queuing makes streaming's first chunk arrive later than the non-streaming full response. Each streaming request emits ~20 chunks at ~19 ms intervals.

The streaming measurements are from PR/local-patch evidence, not a release-wide guarantee; cite streaming as a low-concurrency first-audio win rather than a universal throughput win.

Audio Equivalence

A small c=1 audit (single prompt, single voice, n=4 WAVs per mode) checks that the streaming path preserves audio content versus non-streaming on the same backend.
Intelligibility is fully preserved; streaming and non-streaming are close but not bit-identical (expected from chunked-decode windowing).
origin/main renamed the package mminf -> mstar and dropped the `new_tokens`
parameter from Model.get_partition_forward_pass_args. This ports the complete
Ming-flash-omni-2.0 native implementation (the full noah_ming stacked PR series
— understanding + talker + config-tail/image-gen, steps 1-10) onto the current
main:
* moved mminf/model/ming_omni_flash/ -> mstar/model/ming_omni_flash/ and
rewrote all mminf -> mstar references (package + 23 modular tests + 2
configs + benchmark class).
* re-applied the two framework edits onto main's versions: the registry
entry (MODEL_REGISTRY + HF_MODELS) and the get_worker_graphs fix that
skips graph walks whose nodes aren't in the deploy's node_groups.
* dropped the `new_tokens` arg from our get_partition_forward_pass_args
override to match the new base signature, and from the test call sites.
* fixed two stale tests that predated the 9b ImageGen wiring (ImageGen is
now a registered submodule + partition, so the "unknown" assertions use a
genuinely unknown name).
356 CPU/unit tests pass under the mstar tree; ruff clean. Remaining failures
are snapshot/real-checkpoint-gated (tokenizer load, real-weight forward), not
port regressions.
The worker and conductor form a PUSH/PULL cycle: each drains its PULL in the same loop that issues PUSH sends. ZMQCommunicator.send used a blocking send_pyobj, so when a peer's receive buffer filled (large Ming-flash-omni output tensors under concurrent load), the send blocked, the caller stopped draining its own PULL, and the two peers deadlocked: the server stalled at 0% GPU util with requests hung. Reproduced with 8 concurrent requests; py-spy showed the worker parked in zmq send inside _send_outputs.

Make send non-blocking: raise SNDHWM, then send with zmq.NOBLOCK and, on zmq.Again, queue the message in a per-peer in-process deque flushed opportunistically (before each send, on every poll, and in wait_for_work). A full peer can no longer block the sender's drain loop, so the cycle can always make progress; FIFO order and delivery are preserved. After this, the same 8-concurrent pattern and the full benchmark sweep (470 requests) complete with 0 failures.

Also widen the benchmark client's session timeout from a whole-session total=300 (which expired mid-sweep) to a per-request sock_read=120, so a slow-but-healthy request no longer spuriously fails the rest of the sweep.

Adds test/modular/test_communicator_backpressure.py covering non-blocking send with no receiver and overflow->queue->ordered-delivery.
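The non-blocking send described above can be sketched without ZeroMQ at all. This is a minimal illustration, not the ZMQCommunicator code: a bounded `queue.Queue` stands in for a PUSH socket at its high-water mark, `put_nowait` stands in for a send with zmq.NOBLOCK, and `queue.Full` stands in for zmq.Again. The class name `NonBlockingSender` and its methods are hypothetical.

```python
from collections import deque
import queue


class NonBlockingSender:
    """Send without ever blocking; park overflow in a per-peer FIFO."""

    def __init__(self, peers):
        # peers maps a peer name to a bounded queue.Queue. A full queue is
        # the stand-in for a PUSH socket that has hit its high-water mark.
        self.peers = peers
        self.pending = {name: deque() for name in peers}

    def _try_send(self, peer, msg):
        try:
            self.peers[peer].put_nowait(msg)  # stand-in for a NOBLOCK send
            return True
        except queue.Full:  # stand-in for zmq.Again
            return False

    def flush(self, peer):
        """Drain as much of the backlog as the peer accepts, oldest first."""
        backlog = self.pending[peer]
        while backlog and self._try_send(peer, backlog[0]):
            backlog.popleft()

    def send(self, peer, msg):
        # Flush first so queued messages always leave before newer ones.
        self.flush(peer)
        if self.pending[peer] or not self._try_send(peer, msg):
            self.pending[peer].append(msg)


wire = queue.Queue(maxsize=2)  # a peer that buffers only two messages
sender = NonBlockingSender({"conductor": wire})
for i in range(5):
    sender.send("conductor", i)  # returns immediately every time

received = [wire.get_nowait(), wire.get_nowait()]  # peer drains its buffer
while sender.pending["conductor"]:
    sender.flush("conductor")  # e.g. called on every poll
    while not wire.empty():
        received.append(wire.get_nowait())
```

Because `send` only appends to the backlog when the peer is full or older messages are still queued, the caller's loop never stalls and messages still arrive in the order they were sent.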
Adds an equal-split all-to-all collective to TPCommGroup: input (world_size * chunk, ...) is split into world_size equal chunks along dim 0, chunk i sent to rank i, output same shape. Equal split (no per-rank split sizes) keeps shapes static, which makes it CUDA-graph-capturable.

This is the enabling primitive for capacity-padded expert-parallel (EP) MoE: EP dispatch/combine move tokens between expert-owning ranks via all-to-all, and the decode path runs under a captured CUDA graph, so the all-to-all must capture. The data-dependent token counts are handled by the caller padding to a fixed per-expert capacity so the split stays equal.

Adds a 4-GPU distributed test (manual torchrun launch) covering equal-split correctness AND CUDA-graph capture+replay equivalence, which is the gating check for the EP design. Verified PASS on 4xH100.
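The data movement of an equal-split all-to-all, and the capacity padding that keeps the split equal, can be simulated on plain Python lists. This is a single-process sketch of the semantics only; it does not use TPCommGroup or any GPU collective, and `all_to_all_equal_split` and `pad_to_capacity` are hypothetical names introduced for illustration.

```python
def all_to_all_equal_split(inputs):
    """Simulate an equal-split all-to-all across len(inputs) ranks.

    inputs[r] is rank r's input with world_size * chunk rows. outputs[r]
    has the same length and holds chunk r from every rank, ordered by
    source rank.
    """
    world_size = len(inputs)
    total = len(inputs[0])
    assert all(len(x) == total for x in inputs), "all ranks share one shape"
    assert total % world_size == 0, "dim 0 must divide evenly"
    chunk = total // world_size
    outputs = []
    for dst in range(world_size):
        out = []
        for src in range(world_size):
            out.extend(inputs[src][dst * chunk:(dst + 1) * chunk])
        outputs.append(out)
    return outputs


def pad_to_capacity(tokens_per_dest, capacity, pad=None):
    """Pad each destination's token list to a fixed capacity.

    The token counts are data-dependent; padding makes every chunk the
    same size so the split stays equal and the shape stays static.
    """
    rows = []
    for toks in tokens_per_dest:
        assert len(toks) <= capacity, "capacity overflow"
        rows.extend(toks + [pad] * (capacity - len(toks)))
    return rows


# Two ranks, capacity 2 per destination.
# Rank 0 routes "a" to rank 0 and "b", "c" to rank 1.
rank0 = pad_to_capacity([["a"], ["b", "c"]], capacity=2)
# Rank 1 routes "d", "e" to rank 0 and nothing to rank 1.
rank1 = pad_to_capacity([["d", "e"], []], capacity=2)
out = all_to_all_equal_split([rank0, rank1])
```

Every rank sends and receives the same number of rows regardless of routing, which is the property that lets the real collective run inside a captured graph.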
The thinker's rope recomputed its cos/sin tables (plus the video_rope remap) inside every layer's attention via rotary(q, k, position_ids). cos/sin depend only on position_ids (and the static inv_freq), not on q/k, so for the 32-layer stack the tables were identical across all layers and recomputed 32x per forward.

Split the rope into compute_cos_sin(position_ids), called once in LingMoeModel.forward, and apply_cos_sin(q, k, cos, sin), the cheap per-layer rotation. Thread the precomputed (cos, sin) down through decoder_layer -> attention; forward(q, k, position_ids) is kept as a back-compat wrapper. Output is bit-identical (verified max diff 0.0).

Note: this does NOT change graph-captured decode throughput. That path already amortizes the recompute inside one fused CUDA-graph replay, and the decode step is latency-bound (MoE GEMM + TP all_reduce), not rope-compute-bound (profiled: rope ~33% of eager compute but the live decode step is dominated by other costs). The win is on the eager prefill path and reduced graph-capture cost; it's a redundant-work cleanup, not a decode speedup.
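The hoisting can be illustrated with a toy 1D rope in pure Python. This is a sketch under simplifying assumptions, not Ming's partial-3D rope: there is no video_rope remap, only one tensor is rotated instead of q and k, and the interleaved (even, odd) pairing is chosen for brevity. The function names mirror the commit message, but the bodies are illustrative.

```python
import math


def compute_cos_sin(position_ids, inv_freq):
    """Build the cos/sin tables; they depend only on positions."""
    cos = [[math.cos(p * f) for f in inv_freq] for p in position_ids]
    sin = [[math.sin(p * f) for f in inv_freq] for p in position_ids]
    return cos, sin


def apply_cos_sin(x, cos, sin):
    """Rotate each (even, odd) feature pair of x; the cheap per-layer step."""
    out = []
    for row, c_row, s_row in zip(x, cos, sin):
        rotated = []
        for i, (c, s) in enumerate(zip(c_row, s_row)):
            a, b = row[2 * i], row[2 * i + 1]
            rotated += [a * c - b * s, a * s + b * c]
        out.append(rotated)
    return out


def rotary(x, position_ids, inv_freq):
    """Back-compat wrapper: recompute the tables, then rotate."""
    cos, sin = compute_cos_sin(position_ids, inv_freq)
    return apply_cos_sin(x, cos, sin)


head_dim, num_layers = 4, 3
inv_freq = [10000.0 ** (-2 * i / head_dim) for i in range(head_dim // 2)]
position_ids = [0, 1, 2]
q = [[1.0, 0.0, 0.5, -0.5], [0.2, 0.3, 0.4, 0.1], [1.0, 1.0, 1.0, 1.0]]

# Hoisted: tables built once per forward, reused by every layer.
cos, sin = compute_cos_sin(position_ids, inv_freq)
hoisted = [apply_cos_sin(q, cos, sin) for _ in range(num_layers)]

# Original shape of the code: every layer rebuilds the same tables.
per_layer = [rotary(q, position_ids, inv_freq) for _ in range(num_layers)]

max_diff = max(
    abs(a - b)
    for h, p in zip(hoisted, per_layer)
    for h_row, p_row in zip(h, p)
    for a, b in zip(h_row, p_row)
)
```

Both paths run the same arithmetic on the same inputs, so the results match exactly; the hoisted version simply skips the redundant table builds.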
d5e25ae to bed41c0
The thinker decode walk ran eager, one request at a time: can_batch
returned False, there was no forward_batched, and no get_cuda_graph_configs
(so warmup captured nothing). Single-stream decode was ~16.6 tok/s and
throughput was flat across concurrency (~0.15 req/s at every batch size).
Implement decode batching + CUDA graphs (the two are coupled — the engine's
BasicBatchedCudaGraphConfig always routes through forward_batched, even at
bs=1, so graphs require batching):
- get_cuda_graph_configs: declare a BasicBatchedCudaGraphConfig for
thinker_decode across bs=[1,2,4,8,16,32]. Prefill stays eager for now
(variable token counts need the FlashInferPacked path — follow-up).
- forward_batched: run the packed batch through LingMoeModel in one forward
(decode = 1 token/request, so logits are (bs, V), no last-token gather),
return per-rid {logits} + a __batched_logits__ sentinel so the engine /
graph runner samples the whole batch at once.
- can_batch: True for the decode walks only (prefill + multimodal-embeds
stay sequential).
- batched preprocess: pack input_ids + 3D positions across requests and
plan_attention with per-request seq_lens; drop the single-request
NotImplementedError.
- Graph-capturable positions: decode positions now flow
prepare_inputs -> preprocess -> forward as a static tensor instead of a
torch.arange rebuilt inside forward (the runner interns it and refreshes
it per replay). Emitted BATCH-FIRST as (total_tokens, 3) because the
static-buffer interning slices the leading dim and requires constant
trailing dims; _positions_for_model transposes back to (3, total_tokens)
for Ming's inline partial-3D rope. (qwen3_omni feeds decode through mrope
cos/sin buffers instead, so its pattern doesn't transfer directly.)
Verified on 4xH100 TP=4: decode graphs capture (0 decode graph misses),
correct output, single-stream ~110 tok/s (was 16.6), and 8 concurrent
requests complete in ~1.5s (previously serialized). Full T2T sweep: 0/470
failures, req/s scales 0.93 -> 4.29 across concurrency 1 -> 32.
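The batch-first position layout described above can be sketched on plain lists. This is an illustration of the shapes only, not the mstar code: `pack_decode_positions` and `pick_graph_batch_size` are hypothetical helpers, text-only decode is assumed (so the three rope axes carry the same value), and choosing the smallest captured batch size that fits is an assumed policy rather than something the commit states.

```python
CAPTURED_BATCH_SIZES = (1, 2, 4, 8, 16, 32)  # sizes declared for thinker_decode


def pack_decode_positions(next_positions):
    """Return batch-first positions of shape (total_tokens, 3).

    Decode contributes one token per request, so total_tokens equals the
    number of requests. The trailing dim is always 3, which keeps the
    trailing shape constant when a static buffer slices the leading dim.
    """
    return [[p, p, p] for p in next_positions]


def positions_for_model(batch_first):
    """Transpose (total_tokens, 3) to (3, total_tokens) for the rope."""
    return [list(axis) for axis in zip(*batch_first)]


def pick_graph_batch_size(num_requests, sizes=CAPTURED_BATCH_SIZES):
    """Smallest captured batch size that fits (assumed policy)."""
    for size in sizes:
        if size >= num_requests:
            return size
    raise ValueError("batch exceeds the largest captured size")


next_positions = [5, 9, 2]  # one pending decode position per request
batch_first = pack_decode_positions(next_positions)
model_positions = positions_for_model(batch_first)
graph_bs = pick_graph_batch_size(len(next_positions))
```

Slicing `batch_first[:n]` always yields rows of width 3, whereas slicing the leading dim of a (3, total_tokens) layout would cut rope axes instead of tokens, which is why the transpose is deferred to just before the model call.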
BailingMoeV2ThinkerSubmodule.check_stop only stopped the thinker_decode_loop on EOS. Prompts where the model never emits EOS (list-style / open-ended) decoded until the client timeout, also degenerating into repetition. Add the max_tokens budget guard, mirroring qwen3_omni's ThinkerSubmodule.check_stop: stop when dynamic_loop_iter_counts[thinker_decode_loop]+1 >= max_tokens.
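A minimal sketch of the budget guard, assuming the loop counter is zero-based so that counter + 1 is the number of tokens produced so far. The free function `check_stop` and the `decode` driver below are illustrative stand-ins, not the BailingMoeV2ThinkerSubmodule method or the engine loop.

```python
def check_stop(last_token_id, eos_token_id, loop_iter_count, max_tokens):
    """Return True when the decode loop should end."""
    if last_token_id == eos_token_id:
        return True
    # Budget guard: stop once the token just produced exhausts max_tokens.
    return loop_iter_count + 1 >= max_tokens


def decode(next_token, eos_token_id, max_tokens):
    """Toy decode loop driven by a next_token(iteration) callback."""
    tokens, iteration = [], 0
    while True:
        token = next_token(iteration)
        tokens.append(token)
        if check_stop(token, eos_token_id, iteration, max_tokens):
            return tokens
        iteration += 1


# A model that never emits EOS now stops at the budget instead of running
# until the client times out.
never_eos = decode(lambda i: 7, eos_token_id=0, max_tokens=4)

# EOS still ends generation early when it does appear.
with_eos = decode(lambda i: 0 if i == 1 else 7, eos_token_id=0, max_tokens=4)
```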
The engine manager passes the resolved autocast_dtype into get_submodule (landed in the general dtype-alloc fix). Make the Ming thinker honor it: cast the meta model to that dtype before to_empty so params allocate directly in the target dtype, falling back to get_autocast_dtype() for direct callers that pass None.
bed41c0 to b18ddd6
ming_flash_omni: port to mstar + native serving optimizations

Ports Ming-flash-omni-2.0 to the M* (mstar) Walk-graph runtime and adds the serving-path optimizations that make the native path competitive with vLLM-omni at low/mid concurrency.

Model port

Full Ling-2.0 thinker + multimodal stack under

Serving optimizations
Performance (4×H100 80GB, T2T sweep, thinker-only TP=4)
Native beats vLLM-omni through c=4 (decode graphs win at low concurrency) and is competitive at c=8; the high-concurrency gap is structural (balanced decode step, no single crushable bottleneck on 4-GPU single-node). 0/470 request failures.

Notes
Test plan
What does this PR do?
How was it tested?
Checklist
`ruff check .` passes