[Ming-Omni] PR 1 — Ming-flash-omni: understanding path (thinker + vision/audio + multimodal prefill) by zhudianGG · Pull Request #104 · mstar-project/mstar

zhudianGG · 2026-06-10T07:55:01Z

PR 1 — Ming-flash-omni: understanding path (thinker + vision/audio + multimodal prefill)

Base: noah_model_support · Head: noah_ming_understanding · 8 commits
Compare: noah_model_support...noah_ming_understanding

Benchmark (runnable today):
* benchmark/base.py: MingFlashOmni model (inclusionAI/Ming-flash-omni-2.0, all 8 omni modalities T2T/I2T/A2T/V2T + T2S/I2S/A2S/V2S, max_tokens=256 for cross-system fairness, no system preamble) + ModelType.MING_FLASH_OMNI.
* benchmark/vllm_omni_instructions.md: launch commands for vllm-omni's ming_flash_omni{,_thinker_only,_tts} deploy yamls.
* Benchmarks Ming today via --inference-system vllm_omni against a vllm-omni server.

Native mminf port (scaffold only: every abstractmethod raises NotImplementedError; mminf-serve will fail at startup until filled in):
* mminf/model/ming_omni_flash/{config,ming_omni_flash_model}.py: file/class shape mirroring mminf/model/qwen3_omni/ with pointers to the upstream vllm-omni reference (~6,500 LOC).
* mminf/model/ming_omni_flash/PORTING_NOTES.md: 12-step punch list mapping each mminf surface to the upstream vllm-omni file + closest Qwen3-Omni parallel.
* mminf/model/registry.py: registered under "ming_flash_omni" with HF id.
* configs/ming_flash_omni{,_thinker_only}.yaml: starter deploy topologies mirroring vllm-omni's, marked WIP.

Step 1 of mminf/model/ming_omni_flash/PORTING_NOTES.md.

Replaces the placeholder config.py with a full dataclass tree mirroring mminf/model/qwen3_omni/config.py:
* ThinkerLLMConfig (Ling-2.0 256-expert MoE, head_dim=128, partial_rotary_factor=0.5, mrope_section=[8,12,12])
* VisionEncoderConfig (Qwen3-MoE ViT 27L, out_hidden=4096)
* AudioEncoderConfig (Whisper 32L with Ming-side ds_kernel/ds_stride/norm_query knobs)
* skeleton TalkerConfig + ImageGenConfig that lazy-load from the released checkpoint's sibling subdirs (talker/{config,llm,vae}.json, transformer/, mlp/, etc.); those two get full field semantics at steps 6 and 9.

The released ckpt does NOT match upstream vllm-omni's flat MingFlashOmniConfig nesting; top-level config.json is the BailingMM2Config shape only, so the loader walks subdirs instead of parsing a single nested dict.

__post_init__ sanity checks fail loudly on the silent-miswire patterns (head_dim inconsistency, MRoPE section that doesn't partition the rotary cos/sin half, multimodal token IDs outside vocab).

MingFlashOmniModel.__init__ now resolves the snapshot and loads the config before raising NotImplementedError, so the load path is exercised end-to-end even though no submodules / graph walks exist yet (those are steps 3+).

Verified: pytest test/modular/test_ming_flash_omni_config.py passes 10/10 against the released checkpoint locally cached at ~/.cache/huggingface/hub/models--inclusionAI--Ming-flash-omni-2.0/; tests skip cleanly when the snapshot isn't present.
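The three __post_init__ checks can be illustrated with a minimal sketch. The class name, the hidden_size value and the exact form of the head_dim check are assumptions for illustration; only head_dim, partial_rotary_factor, mrope_section, the 157184 vocab size and the two token ids come from this PR.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ThinkerConfigSketch:
    """Hypothetical, trimmed-down stand-in for ThinkerLLMConfig."""

    hidden_size: int = 4096  # assumption: not stated in the PR text
    num_attention_heads: int = 32
    head_dim: int = 128
    partial_rotary_factor: float = 0.5
    mrope_section: List[int] = field(default_factory=lambda: [8, 12, 12])
    vocab_size: int = 157184
    special_token_ids: Dict[str, int] = field(
        default_factory=lambda: {"video_start_token": 157160, "role_end": 156895}
    )

    def __post_init__(self) -> None:
        # Check 1 (assumed form): heads * head_dim must reproduce hidden_size.
        if self.num_attention_heads * self.head_dim != self.hidden_size:
            raise ValueError("head_dim is inconsistent with hidden_size / num_heads")
        # Check 2: the MRoPE sections must partition half of the rotated dims.
        rotary_dim = int(self.head_dim * self.partial_rotary_factor)
        if sum(self.mrope_section) != rotary_dim // 2:
            raise ValueError(
                f"mrope_section {self.mrope_section} does not sum to {rotary_dim // 2}"
            )
        # Check 3: every multimodal / control token id must sit inside the vocab.
        for name, token_id in self.special_token_ids.items():
            if not 0 <= token_id < self.vocab_size:
                raise ValueError(f"{name}={token_id} is outside vocab {self.vocab_size}")
```

With the released geometry, rotary_dim is 64 and the half table has 32 entries, which is exactly 8 + 12 + 12.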

Step 2 of mminf/model/ming_omni_flash/PORTING_NOTES.md.

The released HF checkpoint ships only weights + sub-dir configs, with none of the tokenizer / processor / modeling Python modules that AutoTokenizer and AutoProcessor's trust_remote_code path expects to find next to config.json. Those live in the Ming source repo at https://github.com/inclusionAI/Ming.

This commit:
* Adds _prepare_tokenizer_dir + _find_ming_code_dir helpers that symlink the required Ming .py and .json assets from a separately cloned source repo (located via MING_CODE_DIR env, ./Ming, or /tmp/ming_repo) into the snapshot dir, and push the snapshot onto sys.path so transformers' dynamic-module loader's sibling imports resolve.
* Loads BailingTokenizer + BailingMM2Processor with graceful fallback: when the source repo or its extra deps are missing, init logs a clear how-to-fix warning and leaves self.tokenizer / self._processor as None instead of crashing.
* Documents the Ming source dependency + setup steps in PORTING_NOTES.md.

Also corrects the benchmark/base.py:MingFlashOmni docstring on role mapping: it previously claimed BailingMM2Processor maps OpenAI roles, but BailingMM2Processor is strict and rejects user/assistant. What actually happens is the *jinja* chat_template in tokenizer_config.json does the remap. vllm-omni serves via tokenizer.apply_chat_template (which uses the jinja), so the benchmark wire format is correct; the native mminf process_prompt (step 7) will need to remap roles before invoking BailingMM2Processor.apply_chat_template.

Verified: 11 new tests in test/modular/test_ming_flash_omni_tokenizer.py pass against the released ckpt + a clone of inclusionAI/Ming at /tmp/ming_repo. All 21 ming tests skip cleanly when either the snapshot or the source repo is absent. Ruff clean.
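A rough sketch of the lookup-and-symlink flow described above. The function names and the "link every top-level .py/.json file" policy are assumptions; the PR's _find_ming_code_dir and _prepare_tokenizer_dir link a specific set of required assets that is not listed here.

```python
import os
import sys
from pathlib import Path
from typing import List, Optional


def find_ming_code_dir() -> Optional[Path]:
    """Search MING_CODE_DIR, then ./Ming, then /tmp/ming_repo (order from the PR text)."""
    env = os.environ.get("MING_CODE_DIR")
    candidates = [Path(env)] if env else []
    candidates += [Path("./Ming"), Path("/tmp/ming_repo")]
    for candidate in candidates:
        if candidate.is_dir():
            return candidate.resolve()
    return None  # caller logs a how-to-fix warning and leaves the tokenizer as None


def stage_assets(code_dir: Path, snapshot_dir: Path,
                 patterns=("*.py", "*.json")) -> List[str]:
    """Symlink source-repo assets next to config.json without clobbering anything."""
    linked = []
    for pattern in patterns:
        for src in sorted(Path(code_dir).glob(pattern)):
            dst = Path(snapshot_dir) / src.name
            if dst.exists() or dst.is_symlink():
                continue  # files shipped with the checkpoint always win
            dst.symlink_to(src.resolve())
            linked.append(dst.name)
    # Make sibling imports resolvable for the dynamic-module loader.
    if str(snapshot_dir) not in sys.path:
        sys.path.insert(0, str(snapshot_dir))
    return linked
```

Skipping existing destinations keeps the helper idempotent, so it can run on every model init.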

The released inclusionAI/Ming-flash-omni-2.0 doesn't load straight into vllm-omni: the snapshot ships BailingMM2-flavoured processor configs and talker weights with an `audio.*` prefix, while vllm-omni's MingFlashOmniForConditionalGeneration registers Qwen2VLImageProcessor + MingWhisperFeatureExtractor and expects `audio_vae.*` for the talker.

The fix is to build a hybrid snapshot: inclusionAI's thinker safetensors (the only heavy bit, ~200 GB) plus Jonathan1909's repackaged metadata files + talker weights (~3 GB extra). This avoids re-downloading the thinker.

Adds the explicit launch + benchmark recipe to benchmark/vllm_omni_instructions.md, including the served-model-id quirk (vllm-omni reports the local serve path verbatim and 404s on the canonical HF id) and a results table from a local 4×H100 run on 2026-06-06:
* T2T offline B=1: 110 tok/s
* T2T closed-loop C=8: 493 tok/s
* T2S: RTF 0.14 (real-time factor; <1 = faster than real-time)
* I2T + A2T both validated end-to-end.

Adds results/ming_t2t_sweep/SUMMARY.md with the throughput curve from a 6-point concurrency sweep on 4×H100 against the running vllm-omni hybrid-snapshot Ming server:

  c=1  → 110 tok/s (single-stream baseline)
  c=2  → 199 tok/s (1.8×)
  c=4  → 356 tok/s (3.2×)
  c=8  → 573 tok/s (5.2×)
  c=16 → 888 tok/s (8.1×)
  c=32 → 1060 tok/s (9.6×; knee here)

All 470 requests across the sweep succeeded; TTFT stays 28-91 ms.

benchmark/vllm_omni_instructions.md: expand the modalities-exercised table from the 4 modalities run in the previous session (T2T/I2T/A2T/T2S) to all 8 omni paths (adds V2T/V2S/I2S/A2S, all green). Documents the direct-OpenAI-path workaround for V2T/V2S/A2S, used to sidestep UCF101 + LibriSpeech dataset downloads when disk is full.
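The speedup column can be re-derived from the raw throughputs. The short sketch below does that and adds a per-stream efficiency figure (speedup divided by concurrency), which is not in SUMMARY.md and is shown only to make the knee at c=32 easier to see.

```python
# Throughputs (tok/s) per concurrency level, copied from the sweep above.
sweep = {1: 110, 2: 199, 4: 356, 8: 573, 16: 888, 32: 1060}

baseline = sweep[1]
rows = []
for concurrency, tok_per_s in sweep.items():
    speedup = tok_per_s / baseline       # relative to the single-stream run
    efficiency = speedup / concurrency   # 1.0 would be perfect linear scaling
    rows.append((concurrency, tok_per_s, round(speedup, 1), round(efficiency, 2)))

for concurrency, tok_per_s, speedup, efficiency in rows:
    print(f"c={concurrency:<2} {tok_per_s:>5} tok/s  {speedup:>4}x  eff={efficiency}")
```

Doubling concurrency from 16 to 32 raises throughput by only about 19% (888 to 1060 tok/s), which is consistent with the knee called out above.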

Two small-N quality checks against the same vllm-omni Ming server used for the throughput sweep:

  MMLU      78.9% accuracy on 285 items (cais/mmlu, ~5 per subject, 0-shot)
  VideoMME  56.9% accuracy on 51 items (chunk1 subset, stratified by duration, 0-shot)

Both at temperature=0, parse rates ≥99%. MMLU runs in 13s (~22 req/s, text-only); VideoMME takes ~10 min wall (~11 s/req, base64-inlined mp4s).

ACCURACY.md ships the per-subject (worst/best 10) and per-task-type breakdowns. Notable: VideoMME medium-duration accuracy (29%) is much lower than short (77%) or long (65%). This is likely sample variance at N=17/bucket, but flagged. Temporal Reasoning subtype 0/3 is also worth a larger-sample follow-up.

These are spot checks, not publishable numbers; caveats are inlined in ACCURACY.md. Per-item results.json files (gitignored) sit beside it locally for drill-down.
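To put the "sample variance at N=17" remark in numbers, here is a small sketch that computes a 95% Wilson score interval per duration bucket. The per-bucket correct counts (13, 5, 11) are an assumption inferred from the rounded percentages; they do sum to 29/51 = 56.9%, matching the reported overall accuracy.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Assumed correct counts per bucket of 17 items (inferred, not from ACCURACY.md).
buckets = {"short": 13, "medium": 5, "long": 11}
for name, correct in buckets.items():
    low, high = wilson_interval(correct, 17)
    print(f"{name:<6} {correct}/17 = {correct / 17:.1%}  95% CI [{low:.1%}, {high:.1%}]")
```

Each interval spans roughly 40 percentage points at this sample size, which is why a larger-sample follow-up is needed before reading much into the medium-duration dip.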

Step 3a of mminf/model/ming_omni_flash/PORTING_NOTES.md.

Adds the three architecture-specific pieces of the Ling-2.0 thinker that don't map cleanly onto mminf's existing components/, ahead of assembling the full BailingMoeV2 decoder layer in step 3b.

components/router.py: LingMoeRouter
* sigmoid + learned (non-grad) expert bias + group-limited top-k (n_group=8 groups, topk_group=4) + routed_scaling_factor
* returns (logits, weights, indices) tuple so it drops straight into mminf's SparseMoeBlockWithSharedExpert + the fused-Triton dispatch

components/rope.py: LingPartialMRotaryEmbedding
* partial rotary (head_dim * 0.5 dims rotated, rest pass-through)
* 3D video_rope cos/sin remap [H W H W ... T T T], the unusual interleaving Ming uses instead of standard MRoPE's contiguous [T T H H W W] layout
* degenerates to plain 1D rotary on 1D position_ids

components/attention.py: LingAttention
* per-head RMSNorm on q and k before rope (use_qk_norm: True on the released ckpt; standard ParallelAttention doesn't bake this in)
* composes the rope module + GQA + causal SDPA
* step-3a scope is batch=1 unit-test; full TP path lands step 3b

test/modular/test_ming_flash_omni_components.py: 12 tests
* router: shapes/scaling, group-limit isolation, expert-bias shift, bad-config rejection, vllm-omni indices cross-check (skip when vllm-omni not importable in venv)
* rope: shapes + pass-through, 1D = plain rotary, video_rope axis assignment (zero-row sentinel test), inconsistent-section rejection
* attention: forward runs (CUDA only, since mminf RMSNorm uses flashinfer's CUDA kernel), QK-norm produces unit-RMS output, causal mask doesn't leak future tokens

Result: 11 component tests pass + 21 existing config/tokenizer tests still green (32 total Ming tests). vllm-omni cross-check skips cleanly in mminf's venv (vllm_omni is only installed in the vllm venv) and when run manually requires a vllm config context that's non-trivial to bootstrap outside vllm's own test harness.

Out of scope:
* BailingMoeV2DecoderLayer (hybrid dense/MoE per first_k_dense_replace): step 3b.
* BailingMoeV2Model + weight loader + mminf submodule wiring: step 3c.
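A single-token, pure-Python sketch of the routing recipe listed for LingMoeRouter (sigmoid scores, additive expert bias used for selection only, group-limited top-k, scaled weights). The group score (sum of the two best biased scores per group) and the weight normalisation follow the DeepSeek-V3-style recipe and are assumptions here, not something this PR text confirms; the function name is hypothetical and the real module returns a (logits, weights, indices) tuple over a batch.

```python
import math
from typing import List, Tuple


def group_limited_topk(logits: List[float], expert_bias: List[float], n_group: int,
                       topk_group: int, top_k: int,
                       routed_scaling_factor: float) -> Tuple[List[float], List[int]]:
    n_experts = len(logits)
    if n_experts % n_group != 0 or topk_group > n_group:
        raise ValueError("experts must split evenly into groups; topk_group <= n_group")
    group_size = n_experts // n_group

    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]    # sigmoid, not softmax
    biased = [s + b for s, b in zip(scores, expert_bias)]    # bias steers selection only

    # Assumed group score: sum of the two best biased scores inside each group.
    group_scores = []
    for g in range(n_group):
        members = sorted(biased[g * group_size:(g + 1) * group_size], reverse=True)
        group_scores.append(sum(members[:2]))
    kept = set(sorted(range(n_group), key=lambda g: group_scores[g],
                      reverse=True)[:topk_group])

    # Experts outside the kept groups are not eligible, however high they score.
    eligible = [e for e in range(n_experts) if e // group_size in kept]
    indices = sorted(eligible, key=lambda e: biased[e], reverse=True)[:top_k]

    # Weights come from the unbiased scores, normalised then scaled (assumed).
    picked = [scores[e] for e in indices]
    total = sum(picked)
    weights = [routed_scaling_factor * s / total for s in picked]
    return weights, indices
```

In the toy call `group_limited_topk([2.0, 1.5, -1, -1, 0.5, 0.4, 3.0, -3.0], [0.0] * 8, 4, 2, 2, 2.5)`, expert 6 has the highest individual score but is never selected because its group loses the group-level cut, which is the "group-limit isolation" behaviour the tests exercise.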

Step 3b of mminf/model/ming_omni_flash/PORTING_NOTES.md.

Assembles the step-3a components (LingMoeRouter, LingPartialMRotaryEmbedding, LingAttention) into the layer and full-thinker forward.

Real find while reading upstream: Ling's MultiRouter isn't a single grouped-topk router. It's THREE routers (text gate, image_gate, audio_gate) mixed per-token by image/audio modality masks. LingMoeRouter from step 3a is correct as the per-router primitive; this step adds the multi-router composition around it.

components/moe.py: LingMoeBlock
* 3 LingMoeRouter instances (gate / image_gate / audio_gate)
* Fused expert weights matching mminf SparseMoeBlock's packed layout (gate_up_proj, down_proj), so the step-3c weight loader can reuse the existing primitives
* GatedMLP shared expert of moe_intermediate_size * num_shared_experts width; output is added unconditionally and ungated (matches upstream: no shared_expert_gate sigmoid trick)
* forward(hidden, image_mask=None, audio_mask=None): text gate runs always, image/audio gates run + torch.where-swap their picks at masked positions

components/decoder_layer.py: LingDecoderLayer
* pre-norm pattern (RMSNorm + LingAttention + residual)
* branches on layer_idx: GatedMLP (intermediate_size=9216) when layer_idx < first_k_dense_replace, else LingMoeBlock
* threads image_mask/audio_mask only to the MoE branch

components/model.py: LingMoeModel
* Embed + ModuleList of N LingDecoderLayer + RMSNorm + lm_head
* Single shared LingPartialMRotaryEmbedding instance across layers
* forward accepts input_ids OR input_embeds (multimodal callers will splice vision/audio embeds in step 4+), returns (T, vocab_size) logits, with no last-position slicing here

test/modular/test_ming_flash_omni_model.py: 9 tests
* MoE block: text-only shape, image mask routes through image_gate, shared expert contributes, bad-mask-shape rejection
* Model: input_ids/embeds XOR contract; full forward shape; embed bypass; dense-vs-MoE layer-index branch differs; end-to-end causal

41 of 42 Ming tests passing (1 skipped: vllm-omni cross-check needs vllm-omni in mminf venv; step 3a). Lint clean.

Out of scope (step 3c):
- KV cache wiring on LingAttention
- Safetensors weight loader (per-expert gate/up/down fusion across 256 separate keys into the packed gate_up_proj param)
- BailingMoeV2ThinkerSubmodule wrapping LingMoeModel for mminf's engine/graph-walk machinery
- Real-checkpoint smoke test (load shard 1, run forward, verify finite outputs against vllm-omni's output)
- TP-aware ParallelAttention/ParallelMoeBlock variants
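The multi-router composition amounts to a per-token select between three routers' picks. The sketch below shows that select on plain Python lists in place of torch.where; the function name is hypothetical, and the precedence when a position is flagged in both masks (audio applied last) is an assumption, since the masks are expected to be disjoint.

```python
from typing import List, Optional, Sequence


def mix_router_picks(text_picks: Sequence, image_picks: Sequence, audio_picks: Sequence,
                     image_mask: Optional[Sequence[bool]] = None,
                     audio_mask: Optional[Sequence[bool]] = None) -> List:
    """Per-token picks: text router by default, modality router where masked."""
    num_tokens = len(text_picks)
    for name, mask in (("image_mask", image_mask), ("audio_mask", audio_mask)):
        if mask is not None and len(mask) != num_tokens:
            raise ValueError(f"{name} has {len(mask)} entries, expected {num_tokens}")

    mixed = list(text_picks)  # the text gate always runs and is the default
    if image_mask is not None:
        mixed = [img if m else cur for cur, img, m in zip(mixed, image_picks, image_mask)]
    if audio_mask is not None:
        mixed = [aud if m else cur for cur, aud, m in zip(mixed, audio_picks, audio_mask)]
    return mixed
```

With both masks left as None the result is exactly the text router's output, which matches the text-only path tested in step 3b.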

Step 3c of mminf/model/ming_omni_flash/PORTING_NOTES.md.

Maps the released inclusionAI/Ming-flash-omni-2.0 checkpoint into the LingMoeModel built in steps 3a + 3b, and verifies the load + forward end-to-end against the real shards.

loader.py:
* _RENAME_RULES: 18 patterns mapping the ckpt's HF naming convention (model.model.layers.{i}.attention.query_key_value.weight, .mlp.gate.weight, .mlp.experts.{j}.gate_proj.weight, etc.) into LingMoeModel's state_dict names (layers.{i}.self_attn.qkv_proj.weight, .mlp.gate.gate.weight, .mlp.experts.gate_up_proj after fusion).
* build_ling_weight_converters(): reuses mminf's existing MergeModulelist + Concatenate Operations to pack 256 per-expert gate_proj/up_proj/down_proj weights per MoE layer into the dense (256, 2*moe_inter, hidden) and (256, hidden, moe_inter) tensors LingMoeBlock expects.
* load_thinker_weights(model, local_dir, device, strict=True): iterates shards via iter_safetensors_shards, applies the rename pass, buckets per-expert weights per layer, runs the fusion converters, and assigns to model.state_dict. Strict mode raises on missing target params or unmatched ckpt keys; non-strict skips.

__init__.py: re-exports LingMoeModel and load_thinker_weights so external callers can `from mminf.model.ming_omni_flash import ...` without crawling into components/.

test_ming_flash_omni_loader.py: 6 tests
* Pure-Python (always run): rename rules cover layer-0 dense keys, rename rules cover MoE-layer keys, expert fusion produces correctly-packed (256, 2*inter, hidden) tensor with gate/up halves in expected positions, strict mode raises on missing params.
* Real-ckpt (CUDA + snapshot gated): load embed + dense layer 0 + norm + lm_head from the released shards (~3 GB) into a 1-layer LingMoeModel; forward 4 token ids returns (4, 157184) finite bf16 logits. Second test verifies every layer-0 attention parameter has the expected shape after load.

49 of 50 Ming tests passing (1 skipped: vllm-omni router cross-check needs vllm-omni in mminf venv; step 3a).

Real-ckpt smoke confirms the model-side code matches the upstream architecture: random tokens → finite logits after embed + 1 dense transformer layer + lm_head, with the 5120-dim packed QKV correctly split into Q (32×128) / K (4×128) / V (4×128), and SDPA running on bf16 weights.

Out of scope (step 3d):
- KV cache wiring on LingAttention (currently uses inline SDPA; needs mminf's cache_handle plumbing)
- BailingMoeV2ThinkerSubmodule in submodules.py: wraps LingMoeModel into mminf's ARNodeSubmodule interface so the engine can drive it
- Full multi-layer forward verification against a vllm-omni-served reference (the "byte-equality with upstream" test, which needs all 32 layers loaded across multiple GPUs)
- TP-aware variants (ParallelAttention / ParallelMoeBlock + a TP-rank-aware weight loader)
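A minimal sketch of the rename-then-bucket pass, using only the three key patterns quoted above (the real _RENAME_RULES table has 18). The regexes, the function names and the bucket layout are illustrative assumptions, not the loader's actual code.

```python
import re
from typing import Dict, Iterable, List, Tuple

# Three illustrative rules; the outer "model.model." prefix is stripped first.
RENAME_RULES = [
    (re.compile(r"^model\.model\."), ""),
    (re.compile(r"\.attention\.query_key_value\."), ".self_attn.qkv_proj."),
    (re.compile(r"\.mlp\.gate\.weight$"), ".mlp.gate.gate.weight"),
]
EXPERT_KEY = re.compile(
    r"^layers\.(\d+)\.mlp\.experts\.(\d+)\.(gate_proj|up_proj|down_proj)\.weight$"
)


def rename_key(key: str) -> str:
    for pattern, replacement in RENAME_RULES:
        key = pattern.sub(replacement, key)
    return key


def bucket_expert_keys(keys: Iterable[str]) -> Tuple[List[str], Dict]:
    """Split keys into directly-loadable names and per-(layer, proj) expert buckets."""
    direct: List[str] = []
    buckets: Dict[Tuple[int, str], Dict[int, str]] = {}
    for key in keys:
        renamed = rename_key(key)
        match = EXPERT_KEY.match(renamed)
        if match:
            layer, expert, proj = int(match[1]), int(match[2]), match[3]
            # Remember the original ckpt key so the fusion step can fetch the tensor.
            buckets.setdefault((layer, proj), {})[expert] = key
        else:
            direct.append(renamed)
    return direct, buckets
```

Once a (layer, proj) bucket holds all 256 experts, the fusion step can stack them in expert order into the packed tensor.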

… (step 3d)

Step 3d of mminf/model/ming_omni_flash/PORTING_NOTES.md.

Connects the LingMoeModel built in 3a-3c to mminf's engine: wires KV cache through attention, adds the submodule the engine calls, fills in every MingFlashOmniModel ABC method for the text-only path.

components/attention.py: LingAttention now calls cache_handle.run_attention(q, k, v) (paged KV write + masked SDPA via FlashInfer) instead of inline F.scaled_dot_product_attention. Keeps the custom partial-3D video_rope rotation inline (we don't use cache_handle.apply_rope). Forward signature is now packed-tokens (num_tokens, hidden) + cache_handle + position_ids, the layout the mminf engine actually uses.

components/decoder_layer.py + components/model.py: thread cache_handle through to attention; LingMoeModel.forward calls cache_handle.set_layer_idx(i) before each layer's forward. cache_handle is the new first positional arg of model.forward (everything after stays kwarg).

submodules.py (new): BailingMoeV2ThinkerSubmodule wraps LingMoeModel into mminf's ARNodeSubmodule contract: prepare_inputs builds ARNodeInputs from token ids; preprocess plans the cache + packs the batch (single-request only in 3d); forward runs the LingMoeModel + advance_seq_lens; check_stop returns {"decode_loop"} when the sampled token is <|role_end|> (id 156895). Mirrors Orpheus's text-LLM template closely.

ming_omni_flash_model.py: removed the raise-NotImplementedError that made the scaffold un-instantiable; implemented every Model ABC method for the thinker text-only path:
* get_kv_cache_config (Ling-2.0 dims from config.thinker_llm)
* get_node_engine_types ({"Thinker": KV_CACHE})
* get_graph_walk_graphs (prefill + decode_loop)
* get_partition_topology (single Thinker partition)
* get_initial_forward_pass_args + get_partition_forward_pass_args (mirrors Orpheus's prefill→decode→done flow)
* process_prompt (jinja chat_template with the model's tokenizer; OpenAI-standard "user" role works)
* postprocess (decode tokens to utf-8)
* get_submodule (builds LingMoeModel + calls load_thinker_weights + returns BailingMoeV2ThinkerSubmodule)

configs/ming_flash_omni_thinker_only.yaml: simplified to register only the Thinker node (audio_encoder/vision_encoder land at step 4+). Single-rank by default; TP=4 needs step-3e TP-aware variants.

Tests (test_ming_flash_omni_{components,model,loader}.py): updated to pass a _MockCacheHandle through every forward call. The mock implements set_layer_idx + run_attention (SDPA-based), the same behavior the inline path had before the refactor, so test semantics are unchanged. Real-ckpt smoke (step 3c's layer-0 forward through the embed + 1 dense layer + lm_head) still produces finite bf16 logits with the new signature.

End-to-end mminf-serve smoke (substep 4): mminf-serve --config ming_flash_omni_thinker_only.yaml --tensor-comm-protocol SHM successfully starts uvicorn, instantiates MingFlashOmniModel, calls get_submodule("Thinker"), and starts loading weights via load_thinker_weights, then fails with OOM after ~75 GB on a single 80 GB H100. This is the expected blocker without TP-aware code: the full 100B-param model needs TP=4 across 4 GPUs to fit. The engine plumbing itself works end-to-end; step 3e (TP-aware ParallelAttention / ParallelMoeBlock + TP-rank-aware weight loader) is the remaining piece for actual serving.

47 of 48 Ming tests pass (1 skipped: vllm-omni router cross-check needs vllm-omni in mminf venv from step 3a). Lint clean.

Step 3e of mminf/model/ming_omni_flash/PORTING_NOTES.md.

Makes the LingMoeModel TP-aware so the full 100B-param model actually fits across multiple H100s (single-GPU OOMed at 75 GB in step 3d's smoke).

components/attention.py: LingAttention now wraps mminf's QKVParallelLinear (per-rank head sharding, weight_loader handles "q"/"k"/"v" shard_ids) + RowParallelLinear (all-reduces output dim). Per-rank num_heads / num_kv_heads come from the qkv_proj after construction. QK-norm + partial-3D video_rope stay inline (head_dim-shaped operations identical at every rank).

components/moe.py: LingMoeBlock now allocates expert tensors with shard_inter = moe_intermediate_size // tp_size, attaches mminf's existing _gate_up_weight_loader / _down_proj_weight_loader (per-rank slicing along the intermediate dim, shard_ids "gate:N"/"up:N"/"down:N" per-expert). Shared expert becomes ParallelGatedMLP (its down_proj all-reduces internally). TP>1 forward mirrors ParallelSparseMoeBlock._dispatch_tp: fused_experts(reduce_results=False) + comm_group.all_reduce + moe_sum_reduce_triton.

components/decoder_layer.py + components/model.py: comm_group plumbed through every constructor. Dense layer-0 MLP becomes ParallelGatedMLP.

loader.py: full refactor onto mminf's load_hf_weights + StackedParamRule machinery (replaces step 3c's custom loader). New shape:
* _strip_outer_model_prefix + _apply_substring_renames + per-expert __expertN__ marker rewrite in _remap_thinker_keys
* _split_packed_qkv splits the ckpt's packed query_key_value.weight into three synthetic q_proj/k_proj/v_proj entries, which the standard q/k/v StackedParamRules route into QKVParallelLinear's fused qkv_proj
* _build_thinker_stacked_params dynamically builds 3 × num_experts rules + dense MLP gate/up + synthetic QKV rules (770 total for Ling-2.0's 256 experts)

Per-rank weight slicing is automatic via the parameter-attached weight_loaders on every Parallel* module.

ming_omni_flash_model.py: _create_thinker_submodule (no longer inline in get_submodule) constructs LingMoeModel(comm_group=tp_group) on the meta device, .to_empty(device=device).to(bf16), then loads via load_thinker_weights. get_default_sharding_config declares Thinker as TP-capable.

configs/ming_flash_omni_thinker_only.yaml: tp_size=8 on GPUs 0-7 (TP=4 hit OOM at 78.58/80 GB; TP=8 has plenty of headroom).

Tests:
* components/model tests: switched to _init_dispatch_weights helper that initialises every Parallel* param the constructor allocated (Parallel* modules use torch.empty for params; real weight loading overwrites them in production, tests need explicit init).
* test_ming_flash_omni_loader.py: rewritten for the new helpers (_remap_thinker_keys, _build_thinker_stacked_params, _split_packed_qkv). Real-ckpt smoke loads embed + 1 dense layer + norm + lm_head and runs a forward: 1 layer's worth of finite bf16 logits at vocab=157184.

47 of 48 Ming tests pass (1 skipped: vllm-omni router cross-check). Lint clean.

End-to-end mminf-serve smoke (TP=8 on 8 H100s):
✅ uvicorn starts on :8092
✅ All 8 workers load 507 thinker params each (~50 sec total)
✅ KVCacheEngine warmup_and_capture + torch.compile applied
✅ Dedicated GPU threads + plan_executor spin up
❌ First /generate request: IndexError in BailingMoeV2ThinkerSubmodule.prepare_inputs because the per-request text_inputs list arrives empty. Integration bug between get_initial_forward_pass_args / graph walks / the conductor's prompt-to-input-signals routing, NOT a model code bug.

All the heavy plumbing works; needs a small follow-up to wire the prompt tokens through to the first prefill call. Documented in PORTING_NOTES.md.

Out of scope (step 3f and step 4+):
- Fix the text_inputs-routing for the first prefill call (small but needs a debug session walking the conductor → worker dispatch path)
- Multi-request batching in BailingMoeV2ThinkerSubmodule
- Vision / audio encoders + their prefill walks
- Talker / AudioVAE / image-gen
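The packed-QKV split can be sketched on row indices alone. This assumes the checkpoint stores query_key_value.weight as contiguous [Q | K | V] row blocks (rather than interleaved per KV group), which this PR text does not spell out; the function name is hypothetical and a list of rows stands in for the weight tensor.

```python
from typing import List, Sequence, Tuple


def split_packed_qkv(rows: Sequence, num_heads: int, num_kv_heads: int,
                     head_dim: int) -> Tuple[List, List, List]:
    """Split a packed QKV weight (one entry per output row) into q, k, v blocks."""
    q_size = num_heads * head_dim
    kv_size = num_kv_heads * head_dim
    expected = q_size + 2 * kv_size
    if len(rows) != expected:
        raise ValueError(f"packed QKV has {len(rows)} rows, expected {expected}")
    q = list(rows[:q_size])
    k = list(rows[q_size:q_size + kv_size])
    v = list(rows[q_size + kv_size:])
    return q, k, v


# Head counts quoted in step 3c: 32 query heads, 4 KV heads, head_dim 128.
q, k, v = split_packed_qkv(range(5120), num_heads=32, num_kv_heads=4, head_dim=128)
print(len(q), len(k), len(v))
```

Splitting before the stacked-param rules run lets the standard "q"/"k"/"v" shard loaders handle per-rank head sharding without any Ming-specific logic.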

Closes two items from the mminf↔vllm-omni correctness review:

* Add a parametrised numeric parity test for ``LingPartialMRotaryEmbedding._remap_video_rope`` vs vllm-omni's ``MingVideoRopeMRotaryEmbedding._remap_video_rope``. mminf operates on the full ``(3, T, rotary_dim)`` neox-cat table while vllm operates on the ``(3, T, rotary_dim/2)`` half table; both halves of our output must equal vllm's half output. 6 cases cover the released ckpt geometry (mrope_section=[8,12,12]) plus edges where hw_size==half (no temporal tail), hw_size<<half, and asymmetric Nh≠Nw.
* Add the missing multimodal token IDs (``audio_patch_token``, ``audio_start_token``, ``audio_end_token``, ``image_end_token``, ``video_end_token``) and ``tokens_per_second`` to ``ThinkerLLMConfig`` with tokenizer-truth defaults. Without them, the vision/audio masking + MRoPE temporal-position pipeline (porting step 4) has nowhere to read these constants from.

Also repair an upstream mislabel found while wiring those defaults: the inclusionAI ckpt's ``llm_config.video_start_token`` is 157159, but per the tokenizer 157159 is ``</image>`` and the real ``<video>`` token is 157160. Jonathan1909's patched config and vllm-omni's hardcoded default both have 157160. ``__post_init__`` now detects the bogus value, repairs it in place, and warns loudly so a future ckpt that intentionally rebinds the field doesn't get silently overridden.

Extend the vocab-bounds validator to cover the five newly-added token fields and add regression tests for both behaviours.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
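To make the [H W H W ... T T T] layout concrete, the sketch below computes which position axis feeds each frequency slot of the half table and then applies that assignment to per-axis tables. It covers only the symmetric case where the H and W sections are equal; how the remap interleaves when Nh≠Nw is not described here, so the sketch refuses that input. All function names are hypothetical.

```python
from typing import Dict, List, Sequence


def video_rope_axes(mrope_section: Sequence[int], half_dim: int) -> List[str]:
    """Axis ('T', 'H' or 'W') used at each frequency index of the half table."""
    t, h, w = mrope_section
    if t + h + w != half_dim:
        raise ValueError("mrope_section must partition the rotary half")
    if h != w:
        raise NotImplementedError("sketch covers only equal H and W sections")
    axes: List[str] = []
    for _ in range(h):
        axes += ["H", "W"]   # spatial axes interleave first
    axes += ["T"] * t        # temporal tail fills the remaining slots
    return axes


def standard_mrope_axes(mrope_section: Sequence[int]) -> List[str]:
    """Contiguous [T.. H.. W..] layout of standard MRoPE, for comparison."""
    t, h, w = mrope_section
    return ["T"] * t + ["H"] * h + ["W"] * w


def remap_half_table(per_axis: Dict[str, Sequence[float]], axes: Sequence[str]) -> List[float]:
    """Pick, for each frequency slot, the value computed from that slot's axis."""
    return [per_axis[axis][i] for i, axis in enumerate(axes)]


axes = video_rope_axes([8, 12, 12], half_dim=32)
print("".join(axes))
```

For the released geometry this prints 24 interleaved H/W slots followed by 8 T slots. A full neox-cat table would repeat the same assignment in both halves, which is the property the parity test checks.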

Two model-side bugs blocked the first end-to-end /generate response on top of step 3e:

(a) BailingMoeV2ThinkerSubmodule had no postprocess hook, so the decode loop's text_inputs edge never received the freshly sampled new_token. Added postprocess that rebinds new_token -> text_inputs, mirroring OrpheusLLMSubmodule.
(b) Prefill / decode output edges used EMPTY_DESTINATION + conductor_new_token=True. With (a) fixed the loop produced tokens but the API server received {"outputs": {}} because no edge routed new_token to the client. Switched to Qwen3-Omni's pattern: emit each token via parallel EMIT_TO_CLIENT (output_modality="text") edges alongside the text_inputs loopback.

Also collected environment-side patches required to actually reach a working forward on this box:
* BailingTokenizer doesn't load under transformers >= 5.0 (verbose removed; add_bos_token setter touches not-yet-built _tokenizer). _patch_bailing_tokenizer_for_transformers5 applies both fixes lazily after the first AttributeError.
* LingMoeBlock._dispatch_tp now falls back to dispatch_experts_fused + all-reduce when sgl_kernel is unloadable, which is the case here due to an ABI mismatch against the installed torch. Math is equivalent (sum-over-TP and sum-over-top-k commute).

Verified via mminf-serve smoke (TP=8 on 8 H100s): /generate returns real model text. Updated configs/ming_flash_omni_thinker_only.yaml comments and PORTING_NOTES.md punch list accordingly.
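The "sums commute" claim for the fallback path can be checked on toy numbers. In the sketch, partials[rank][k] stands for one token's already-weighted output contribution from TP rank `rank` and top-k slot `k`; scalars stand in for hidden vectors and integers keep the comparison exact. The function names are illustrative only.

```python
from typing import List


def allreduce_then_topk_sum(partials: List[List[int]]) -> int:
    """Fused path order: all-reduce each top-k slot across ranks, then sum over k."""
    num_ranks, top_k = len(partials), len(partials[0])
    per_slot = [sum(partials[r][k] for r in range(num_ranks)) for k in range(top_k)]
    return sum(per_slot)


def topk_sum_then_allreduce(partials: List[List[int]]) -> int:
    """Fallback path order: sum over k locally on each rank, then all-reduce."""
    per_rank = [sum(rank_slots) for rank_slots in partials]
    return sum(per_rank)


# 2 TP ranks x 4 top-k slots of toy contributions.
partials = [[3, -1, 4, 1], [5, 9, -2, 6]]
assert allreduce_then_topk_sum(partials) == topk_sum_then_allreduce(partials)
```

In floating point the two orders can differ by rounding, so an end-to-end comparison of the two paths would need a tolerance rather than exact equality.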

Three new stateless modules with weight-key parity against the released ckpt's top-level prefixes:

* MingVisionProjector / MingAudioProjector (components/projectors.py): Port the nn.Sequential chains built inline in modeling_bailingmm2.py into standalone modules. Layer indices match the on-disk keys (linear_proj.{0,2} for vision, linear_proj_audio.{0,3} for audio).
* build_vision_encoder (components/vision_encoder.py): Construct Ming's Qwen3MoeVisionTransformer via dynamic import from the staged Ming source dir (the same path used by the tokenizer + processor). The encoder is ~1 GB at bf16 and runs on a single GPU, so we use the reference implementation directly rather than fork.
* MingAudioEncoder (components/audio_encoder.py): Self-contained port of vllm-omni's packed-sequence Whisper encoder (~250 LOC). No openai-whisper runtime dep; optional flash-attn varlen fast path with a manual padded-attention fallback. Param names match upstream Whisper (query/key/value/out, mlp.{0,2}.{weight,bias}) so the released ckpt's audio.blocks.N.* keys load by state-dict equality.

17 tests in test/modular/test_ming_flash_omni_encoders.py: 12 pure-Python (projector shapes/indices/forward, audio encoder weight-key parity, packed-attention fallback) + 1 snapshot-gated (vision encoder builds from real VisionEncoderConfig) + 1 CUDA-gated (forward smoke under eager attention, currently skipped on this box for missing libnvrtc-builtins; not a code bug, will re-verify when step 5 wires encoders into the prefill walk).

PORTING_NOTES step 4a updated; 4b (extend loader.py to actually load the vision/audio/projector subtrees from the snapshot) is the next sub-step before the encoders can be wired into a live graph walk.

Adds four loader entry points on top of a shared _load_prefixed_state_dict helper:
* load_vision_encoder_weights (prefix=vision.)
* load_audio_encoder_weights (prefix=audio.)
* load_vision_projector_weights (prefix=linear_proj., inner=proj.)
* load_audio_projector_weights (prefix=linear_proj_audio., inner=proj.)

None of these are TP-aware: vision + audio encoders colocate on rank 0 in the typical topology (see configs/ming_flash_omni.yaml), so plain prefix-strip + load_state_dict suffices. The projector loaders prepend `proj.` so the on-disk linear_proj.{0,2}.* and linear_proj_audio.{0,3}.* keys hit the nn.Sequential slot by integer index.

Verified by 4 snapshot-gated tests against /dev/shm/ming-hybrid: all four prefixes load strictly (no missing / unexpected keys). The audio encoder's positional_embedding is loaded as a buffer (overrides the local sinusoidal init); the vision encoder loads all 27 blocks + merger + deepstack_merger_list cleanly.

Snapshot lookup in the test helper now prefers /dev/shm/ming-hybrid (merged shards + index) over the HF-Hub snapshot dir (which only has the index symlink; shards live elsewhere on this box).

Step 4a + 4b complete; step 5 (wire encoders into prefill graph walks) is the next slice.
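The prefix-strip plus inner-prefix step can be sketched on a plain dict. The helper names and the strict-check function are illustrative, not the real _load_prefixed_state_dict; the checkpoint keys in the usage lines are made up to match the prefixes quoted above.

```python
from typing import Dict, Iterable, List, Tuple


def select_prefixed(state: Dict[str, object], prefix: str, inner: str = "") -> Dict[str, object]:
    """Keep keys under `prefix`, strip it, and prepend `inner` (e.g. 'proj.')."""
    return {
        inner + key[len(prefix):]: value
        for key, value in state.items()
        if key.startswith(prefix)
    }


def strict_diff(loaded: Iterable[str], expected: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (missing, unexpected) key lists, as a strict load would report them."""
    loaded_set, expected_set = set(loaded), set(expected)
    return sorted(expected_set - loaded_set), sorted(loaded_set - expected_set)


# Made-up checkpoint keys that follow the prefixes named in this commit.
ckpt = {
    "vision.blocks.0.attn.qkv.weight": "v0",
    "audio.blocks.0.attn.query.weight": "a0",
    "linear_proj.0.weight": "p0",
    "linear_proj.2.bias": "p2",
    "linear_proj_audio.0.weight": "pa0",
    "linear_proj_audio.3.weight": "pa3",
}
vision_proj = select_prefixed(ckpt, "linear_proj.", inner="proj.")
print(sorted(vision_proj))
```

The trailing dot in each prefix matters: "linear_proj." does not match "linear_proj_audio.0.weight", so the two projectors never pick up each other's weights.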

…n (step 5a)

Add the two encoder NodeSubmodules and their construction paths so the Thinker can pull vision/audio embeddings off graph nodes once step 5b/5c land the prefill walks.

* VisionEncoderSubmodule wraps Qwen3MoeVisionTransformer + MingVisionProjector and mirrors modeling_bailingmm2.extract_image_feature (encoder → projector → F.normalize). prepare_inputs raises clearly on missing pixel_values / image_grid_thw and promotes 1-D [T, H, W] grid_thw to (1, 3).
* AudioEncoderSubmodule wraps MingAudioEncoder + MingAudioProjector. Accepts a single (n_mels, T) clip or (B, n_mels, T) batched tensor, optionally trims the padded tail using audio_seqlens, and concatenates per-clip embeddings along time. L2-norm applies when audio_config.norm_query_embeds is set (true on the released ckpt, matching modeling_bailingmm2.extract_audio_feature).
* get_node_engine_types now registers vision_encoder and audio_encoder as EngineType.STATELESS alongside the KV-cache Thinker.

Construction routes through _create_vision_encoder_submodule / _create_audio_encoder_submodule helpers that build, dtype-cast, and weight-load via the loaders from step 4b. flash_attention_2 is the default for the vision encoder (override via MING_VISION_ATTN_IMPL env var for non-FA2 dev boxes); audio encoder uses flash-attn varlen when available, manual fallback otherwise.

12 tests in test/modular/test_ming_flash_omni_submodules.py: 10 pure-Python (input validation, output shape, L2 norm, batched/single equivalence, audio_seqlens trim, grid_thw promotion, node-type registration, friendly error on unknown node) + 2 snapshot-gated (_create_audio_encoder_submodule end-to-end on the real ckpt, verifying Conv1 + projector params are non-zero post-load).

PORTING_NOTES step 5 broken out into 5a (this), 5b (Thinker prefill dispatch for vision/audio modality routing), 5c (graph walks + partition wiring + initial-forward-pass arg routing).

…osition helpers (step 5b)

BailingMoeV2ThinkerSubmodule.prepare_inputs now dispatches on graph_walk and emits either input_ids (text-only walks) or input_embeds + custom_pos_ids (multimodal walks). preprocess and forward route both shapes through to LingMoeModel's existing dual input_ids/input_embeds + 1D/3D position_ids handling, so no new model.py path is needed.

Three new position-id helpers live in components/positions.py, each producing (3, T) long tensors compatible with LingPartialMRotaryEmbedding's video_rope branch:

* get_rope_index_text: three identical sequential rows. Pure-text branch of modeling_bailing_moe_v2.get_rope_index (:658-675).
* get_rope_index_audio: alias to text (Ming does not special-case audio in get_rope_index).
* get_rope_index_vision: per-image 3D grid math from :625-647 with optional video timestamp scaling via second_per_grid_t * tokens_per_second.

Thinker dispatch covers:

* prefill / prefill_text: backward-compat text path (unchanged).
* prefill_audio: wraps audio_embeds with audio_start / audio_end sentinel embeds, with text-like 3D positions for the span.
* prefill_vision / prefill_video: wraps vision_embeds with image_start / image_end (or video_start / video_end), with grid-aware 3D positions. The eos sentinel sits at global_max(vision_pos) + 1 so the next walk's text positions resume without collision (matches llm_pos_ids_list[-1].max() + 1 in the source).
* decode / thinker_decode: single-token AR step (unchanged).

Sentinel embeds are lazily computed per device on first use; the Thinker submodule now takes config= at construction so it can read vision.spatial_merge_size, thinker_llm.tokens_per_second, and the *_start_token / *_end_token ids. ming_omni_flash_model.py threads self.config through to the submodule.

Step 5b is restricted to single-image / single-clip requests; the multi-image splice via Sequential graph wiring lands in 5c.

21 new tests across test_ming_flash_omni_positions.py (11) and test_ming_flash_omni_submodules.py (10): position-id shape / offset / abs-time math, missing-input error paths, multi-image rejection, sentinel embed correctness for audio / image / video walks, start_pos advancement, legacy prefill walk name compat. All green.
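To make the (3, T) position layout concrete, here is a minimal sketch of the text and vision position math using plain lists. It assumes the common Qwen2-VL-style MRoPE layout (temporal, height, width rows; spatial dims divided by the merge size; all rows offset by the running start position); the actual helpers in components/positions.py may differ in detail, and the function names below are hypothetical.

```python
# Sketch under assumptions: Qwen2-VL-style MRoPE layout, plain lists instead
# of long tensors. Function names are illustrative, not the PR's helpers.


def rope_index_text(num_tokens, start_pos=0):
    """Three identical sequential rows, shape (3, num_tokens)."""
    row = list(range(start_pos, start_pos + num_tokens))
    return [row[:], row[:], row[:]]


def rope_index_vision(grid_thw, spatial_merge_size, start_pos=0,
                      second_per_grid_t=1.0, tokens_per_second=1.0):
    """Per-image (or per-clip) 3D positions, shape (3, t * llm_h * llm_w)."""
    t, h, w = grid_thw
    llm_h, llm_w = h // spatial_merge_size, w // spatial_merge_size
    t_row, h_row, w_row = [], [], []
    for ti in range(t):
        # Optional video timestamp scaling; both factors default to 1.0,
        # which leaves the temporal index unscaled for still images.
        t_pos = int(ti * second_per_grid_t * tokens_per_second)
        for hi in range(llm_h):
            for wi in range(llm_w):
                t_row.append(start_pos + t_pos)
                h_row.append(start_pos + hi)
                w_row.append(start_pos + wi)
    return [t_row, h_row, w_row]


def next_free_position(pos_ids):
    """Position for the token after the span: global max over all rows + 1."""
    return max(max(row) for row in pos_ids) + 1


vision_pos = rope_index_vision((1, 4, 4), spatial_merge_size=2, start_pos=5)
resume_at = next_free_position(vision_pos)
tail_text = rope_index_text(3, start_pos=resume_at)
```

Taking the maximum across all three rows, rather than the last temporal index, is what keeps the following text span from colliding with the largest height or width position inside the vision span.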

get_graph_walk_graphs now returns five walks instead of the step 3f text-only prefill/decode pair:

* prefill_text: bare Thinker node.
* prefill_audio: Sequential([audio_encoder, Thinker]); the encoder emits audio_embeds into the Thinker.
* prefill_vision: Sequential([vision_encoder, Thinker]); image_grid_thw routes to BOTH the encoder (for spatial positions on the patches) AND the Thinker (for 3D MRoPE math around the vision span).
* prefill_video: same shape as prefill_vision plus video_second_per_grid routed into the Thinker.
* thinker_decode: AR loop, renamed from step 3f's decode.

get_partitions lists all five walks under the single Thinker partition with initial_walk="prefill_text".

Two new helpers drive scheduling:

* _build_thinker_prefill_schedule(input_modalities, input_signals): one schedule step per modality, in input_modalities order; each step is (walk_name, {input_name: TensorPointerInfo}). Modalities listed without matching tensors in input_signals are silently skipped (parity with qwen3_omni).
* _get_thinker_prefill_inputs(metadata, input_signals): emits one GraphEdge per input for the current step, routing each to the right node (encoder vs Thinker), including the dual image_grid_thw edge for vision walks.

get_initial_forward_pass_args builds the schedule, picks the first walk, and stashes the schedule + step counter on the metadata. get_partition_forward_pass_args is the Thinker state machine: advance schedule → transition to thinker_decode → return request_done=True after the decode loop unwinds. Mirrors qwen3_omni_model.py:765+ minus the Talker / Code2Wav partitions. The empty-schedule edge case (no usable modalities) short-circuits to request_done=True so the conductor doesn't hang.

21 tests in test/modular/test_ming_flash_omni_graph.py covering walk structure, partition listing, schedule construction for all modality mixes (incl. unknown-modality / no-inputs), per-walk edge routing, and a full state-machine drive across a text+audio request (init → audio prefill → decode → done). The submodule's backward-compat aliases for "prefill"/"decode" stay in place so external callers that still emit the step 3f walk names keep working.
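The schedule construction and the prefill → decode → done progression can be sketched in a few lines. This is a simplified illustration, not the PR's implementation: the modality keys ("text", "audio", "image", "video"), the rule that a modality is skipped unless all of its listed inputs are present, and the helper names are assumptions, and opaque strings stand in for TensorPointerInfo.

```python
# Simplified sketch; modality keys, the "all inputs present" rule, and the
# helper names are assumptions. Strings stand in for TensorPointerInfo.

WALK_INPUTS = {
    "text": ("prefill_text", ("text_inputs",)),
    "audio": ("prefill_audio", ("audio_features", "audio_seqlens")),
    "image": ("prefill_vision", ("pixel_values", "image_grid_thw")),
    "video": ("prefill_video",
              ("pixel_values_videos", "video_grid_thw", "video_second_per_grid")),
}


def build_prefill_schedule(input_modalities, input_signals):
    """One (walk_name, inputs) step per usable modality, in request order."""
    schedule = []
    for modality in input_modalities:
        entry = WALK_INPUTS.get(modality)
        if entry is None:
            continue  # unknown modality: silently skipped
        walk_name, names = entry
        if not all(name in input_signals for name in names):
            continue  # modality listed without matching tensors: skipped
        schedule.append((walk_name, {n: input_signals[n] for n in names}))
    return schedule


def next_walk(schedule, step, decoding_done):
    """Return (walk_name, request_done) for the given schedule position."""
    if not schedule:
        return None, True  # nothing usable: finish instead of hanging
    if step < len(schedule):
        return schedule[step][0], False  # remaining prefill walks
    if not decoding_done:
        return "thinker_decode", False  # AR loop after the last prefill
    return None, True


signals = {"text_inputs": "t0", "audio_features": "a0", "audio_seqlens": "a1"}
schedule = build_prefill_schedule(["text", "audio", "video"], signals)
```

Keeping the schedule as an ordered list plus a step counter means the state machine only needs the two values stashed on the request metadata to decide the next walk.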

… video (step 7)

MingFlashOmniModel.process_prompt now produces the full NameToTensorList consumed by step 5c's prefill scheduler. The strategy mirrors qwen3_omni's process_prompt: apply the chat template to TEXT-ONLY messages (so the tokenizer doesn't insert placeholder tokens we'd later have to strip), then run the image / video / audio sub-processors separately for each modality. It uses tokenizer.apply_chat_template (jinja, accepts OpenAI user/assistant/system roles) rather than the stricter processor.apply_chat_template (asserts on uppercase HUMAN/ASSISTANT only), which keeps the API surface OpenAI-compatible.

Inputs (tensors: NameToTensorList):

* image_inputs: list of CHW float [0,1] tensors per image. The internal _image_to_processor_input converts to HWC uint8 to avoid the upstream's double-rescale-to-zero bug. Single-channel inputs auto-broadcast to 3 channels.
* audio_inputs: raw 1-D float tensors OR (waveform, sample_rate) tuples (sample rate inferred from the processor default of 16 kHz when a raw waveform is passed).
* video_inputs: list of (T, C, H, W) float tensors. Per-frame second_per_grid defaults to 1.0; override via kwargs["input_metadata"]["video"][i]["second_per_grid"].

Outputs (keys consumed by _build_thinker_prefill_schedule):

* text_inputs: list of 1-D long tensors per text turn.
* pixel_values, image_grid_thw: one entry per image.
* pixel_values_videos, video_grid_thw, video_second_per_grid: per video clip.
* audio_features (n_mels, T), audio_seqlens (length-1 long): per audio clip. Upstream returns (B, T, n_mels); we transpose to (n_mels, T) per clip so AudioEncoderSubmodule.prepare_inputs can splice without a reshape.

17 tests in test/modular/test_ming_flash_omni_process_prompt.py covering text-only / no-prompt / image / audio / video / mixed paths, per-modality dispatch, missing-processor error paths, CHW-float→HWC-uint8 conversion correctness (including grayscale + uint8 pass-through), multi-image, video metadata override, plus a snapshot-gated text+image end-to-end against the real BailingMM2Processor. 16 green + 1 env-skip on this box.

The image-gen <image><imagePatch>*256</image> query-token block is deferred to step 9 (ImageGen partition; text-out generation works without it).
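Two of the input conventions above, the audio (waveform, sample_rate) tuple handling and the per-video second_per_grid override, lend themselves to a short sketch. The helper names are hypothetical and plain Python values stand in for tensors; only the 16 kHz default, the 1.0 default, and the kwargs["input_metadata"]["video"][i]["second_per_grid"] path come from the description.

```python
# Hypothetical helpers illustrating the documented defaults; plain Python
# values stand in for tensors.

DEFAULT_SAMPLE_RATE = 16_000  # processor default named in the description


def split_audio_input(item, default_sample_rate=DEFAULT_SAMPLE_RATE):
    """Return (waveform, sample_rate) for a raw waveform or a 2-tuple."""
    if isinstance(item, tuple) and len(item) == 2:
        waveform, sample_rate = item
        return waveform, int(sample_rate)
    # Raw waveform: fall back to the processor's default sample rate.
    return item, default_sample_rate


def second_per_grid_for(kwargs, video_index, default=1.0):
    """Look up the per-video override, falling back to the 1.0 default."""
    videos = (kwargs.get("input_metadata") or {}).get("video") or []
    if video_index < len(videos):
        entry = videos[video_index] or {}
        return float(entry.get("second_per_grid", default))
    return default


request_kwargs = {"input_metadata": {"video": [{"second_per_grid": 0.5}, {}]}}
```

Treating missing or empty metadata as "use the default" keeps callers that pass no input_metadata at all on the same code path as callers that override only some clips.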

…groups

Found during the first live mminf-serve bring-up of Ming-flash-omni (thinker-only config, TP=4 on GPUs 4-7). get_worker_graphs iterated over EVERY graph walk, including prefill_audio / prefill_vision / prefill_video / talker, which reference encoder / talker nodes (audio_encoder, vision_encoder, Talker). The thinker-only deploy only declares `Thinker` in node_groups, so _divide_into_worker_graphs hit KeyError: 'audio_encoder' while dividing the prefill_audio walk and crashed conductor startup.

Fix: in get_worker_graphs, collect the node names a walk references via graph.get_nodes() and skip any walk whose required nodes aren't all present in the config's node_groups. A partial deploy (thinker-only, talker-only, etc.) simply can't serve the walks for nodes it doesn't host; that's correct behaviour, not an error. This is generic framework behaviour (any model with optional partitions benefits), not Ming-specific.

Verified: thinker-only conductor startup now proceeds past worker-graph division to weight loading (then OOMs at the documented TP=4 ~78.58/80 GB-per-rank wall, which is a hardware limit needing TP=8, not a code issue). test_ming_flash_omni_graph + talker_graph + test_graph all green; pre-existing test_worker_graphs_manager failures are unrelated (they fail with this change stashed too).
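The filtering rule in the fix reduces to a subset check per walk. Below is a minimal sketch, assuming each walk can be summarized as a list of node names (what graph.get_nodes() would yield) and that node_groups can be viewed as an iterable of hosted node names; the function name servable_walks is hypothetical.

```python
# Minimal sketch of the skip rule. The data shapes (walk -> node-name list,
# node_groups as an iterable of names) and the function name are assumptions.


def servable_walks(walk_nodes, node_groups):
    """Keep only walks whose referenced nodes are all hosted by this deploy."""
    hosted = set(node_groups)
    return {
        name: nodes
        for name, nodes in walk_nodes.items()
        if set(nodes) <= hosted  # every required node must be declared
    }


all_walks = {
    "prefill_text": ["Thinker"],
    "prefill_audio": ["audio_encoder", "Thinker"],
    "prefill_vision": ["vision_encoder", "Thinker"],
    "thinker_decode": ["Thinker"],
    "talker": ["Talker"],
}
thinker_only = servable_walks(all_walks, ["Thinker"])
```

With a thinker-only node_groups, only the walks that touch nothing but the Thinker survive, so worker-graph division never looks up an undeclared node such as audio_encoder.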

…whitespace, E501)

zhudianGG · 2026-06-11T14:40:07Z

close for rebase in #115

zhudianGG and others added 21 commits June 6, 2026 00:11

ming_flash_omni: ruff lint fixes (PR1 — import sort, unused imports, …

b693988

…whitespace, E501)

zhudianGG closed this Jun 11, 2026

zhudianGG wants to merge 21 commits into main from noah_ming_understanding
