Skip to content

[Ming-Omni] PR 1 — Ming-flash-omni: understanding path (thinker + vision/audio + multimodal prefill)#104

Closed
zhudianGG wants to merge 21 commits into
mainfrom
noah_ming_understanding
Closed

[Ming-Omni] PR 1 — Ming-flash-omni: understanding path (thinker + vision/audio + multimodal prefill)#104
zhudianGG wants to merge 21 commits into
mainfrom
noah_ming_understanding

Conversation

@zhudianGG

Copy link
Copy Markdown
Collaborator

PR 1 — Ming-flash-omni: understanding path (thinker + vision/audio + multimodal prefill)

Base: noah_model_support · Head: noah_ming_understanding · 8 commits
Compare: noah_model_support...noah_ming_understanding

zhudianGG and others added 21 commits June 6, 2026 00:11
Benchmark (runnable today):
  * benchmark/base.py: MingFlashOmni model (inclusionAI/Ming-flash-omni-2.0,
    all 8 omni modalities T2T/I2T/A2T/V2T + T2S/I2S/A2S/V2S, max_tokens=256
    for cross-system fairness, no system preamble) + ModelType.MING_FLASH_OMNI.
  * benchmark/vllm_omni_instructions.md: launch commands for vllm-omni's
    ming_flash_omni{,_thinker_only,_tts} deploy yamls.
  * Benchmarks Ming today via --inference-system vllm_omni against a
    vllm-omni server.

Native mminf port (scaffold only — every abstractmethod raises
NotImplementedError; mminf-serve will fail at startup until filled in):
  * mminf/model/ming_omni_flash/{config,ming_omni_flash_model}.py:
    file/class shape mirroring mminf/model/qwen3_omni/ with pointers to
    the upstream vllm-omni reference (~6,500 LOC).
  * mminf/model/ming_omni_flash/PORTING_NOTES.md: 12-step punch list
    mapping each mminf surface to the upstream vllm-omni file + closest
    Qwen3-Omni parallel.
  * mminf/model/registry.py: registered under "ming_flash_omni" with HF id.
  * configs/ming_flash_omni{,_thinker_only}.yaml: starter deploy topologies
    mirroring vllm-omni's, marked WIP.
Step 1 of mminf/model/ming_omni_flash/PORTING_NOTES.md. Replaces the
placeholder config.py with a full dataclass tree mirroring
mminf/model/qwen3_omni/config.py: ThinkerLLMConfig (Ling-2.0 256-expert
MoE, head_dim=128, partial_rotary_factor=0.5, mrope_section=[8,12,12]),
VisionEncoderConfig (Qwen3-MoE ViT 27L, out_hidden=4096),
AudioEncoderConfig (Whisper 32L with Ming-side ds_kernel/ds_stride/
norm_query knobs), plus skeleton TalkerConfig + ImageGenConfig that
lazy-load from the released checkpoint's sibling subdirs
(talker/{config,llm,vae}.json, transformer/, mlp/, etc.) — those two
get full field semantics at steps 6 and 9.

The released ckpt does NOT match upstream vllm-omni's flat
MingFlashOmniConfig nesting; top-level config.json is the
BailingMM2Config shape only, so the loader walks subdirs instead of
parsing a single nested dict.

__post_init__ sanity checks fail loudly on the silent-miswire patterns
(head_dim inconsistency, MRoPE section that doesn't partition the
rotary cos/sin half, multimodal token IDs outside vocab).

MingFlashOmniModel.__init__ now resolves the snapshot and loads the
config before raising NotImplementedError, so the load path is
exercised end-to-end even though no submodules / graph walks exist
yet (those are steps 3+).

Verified: pytest test/modular/test_ming_flash_omni_config.py passes
10/10 against the released checkpoint locally cached at
~/.cache/huggingface/hub/models--inclusionAI--Ming-flash-omni-2.0/;
tests skip cleanly when the snapshot isn't present.
Step 2 of mminf/model/ming_omni_flash/PORTING_NOTES.md. The released
HF checkpoint ships only weights + sub-dir configs — none of the
tokenizer / processor / modeling Python modules that AutoTokenizer
and AutoProcessor's trust_remote_code path expects to find next to
config.json. Those live in the Ming source repo at
https://github.com/inclusionAI/Ming .

This commit:

  * Adds _prepare_tokenizer_dir + _find_ming_code_dir helpers that
    symlink the required Ming .py and .json assets from a separately
    cloned source repo (located via MING_CODE_DIR env, ./Ming, or
    /tmp/ming_repo) into the snapshot dir, and push the snapshot onto
    sys.path so transformers' dynamic-module loader's sibling imports
    resolve.
  * Loads BailingTokenizer + BailingMM2Processor with graceful
    fallback: when the source repo or its extra deps are missing,
    init logs a clear how-to-fix warning and leaves self.tokenizer /
    self._processor as None instead of crashing.
  * Documents the Ming source dependency + setup steps in
    PORTING_NOTES.md.

Also corrects the benchmark/base.py:MingFlashOmni docstring on role
mapping: it previously claimed BailingMM2Processor maps OpenAI roles,
but BailingMM2Processor is strict and rejects user/assistant.
What actually happens is the *jinja* chat_template in
tokenizer_config.json does the remap. vllm-omni serves via
tokenizer.apply_chat_template (which uses the jinja), so the
benchmark wire format is correct; the native mminf process_prompt
(step 7) will need to remap roles before invoking
BailingMM2Processor.apply_chat_template.

Verified: 11 new tests in test/modular/test_ming_flash_omni_tokenizer.py
pass against the released ckpt + a clone of inclusionAI/Ming at
/tmp/ming_repo. All 21 ming tests skip cleanly when either the
snapshot or the source repo is absent. Ruff clean.
The released inclusionAI/Ming-flash-omni-2.0 doesn't load straight into
vllm-omni: the snapshot ships BailingMM2-flavoured processor configs
and talker weights with an `audio.*` prefix, while vllm-omni's
MingFlashOmniForConditionalGeneration registers Qwen2VLImageProcessor +
MingWhisperFeatureExtractor and expects `audio_vae.*` for the talker.

The fix is to build a hybrid snapshot — inclusionAI's thinker
safetensors (the only heavy bit, ~200 GB) plus Jonathan1909's
repackaged metadata files + talker weights (~3 GB extra). This avoids
re-downloading the thinker.

Adds the explicit launch + benchmark recipe to
benchmark/vllm_omni_instructions.md, including the served-model-id
quirk (vllm-omni reports the local serve path verbatim and 404s on
the canonical HF id) and a results table from a local 4×H100 run on
2026-06-06:
  T2T offline B=1:  110 tok/s
  T2T closed-loop C=8: 493 tok/s
  T2S: RTF 0.14 (real-time factor; <1 = faster than real-time)
  I2T + A2T both validated end-to-end.
Adds results/ming_t2t_sweep/SUMMARY.md with the throughput curve from
a 6-point concurrency sweep on 4×H100 against the running vllm-omni
hybrid-snapshot Ming server:

  c=1   →  110 tok/s   (single-stream baseline)
  c=2   →  199 tok/s   (1.8×)
  c=4   →  356 tok/s   (3.2×)
  c=8   →  573 tok/s   (5.2×)
  c=16  →  888 tok/s   (8.1×)
  c=32  → 1060 tok/s   (9.6×; knee here)

All 470 requests across the sweep succeeded; TTFT stays 28-91 ms.

benchmark/vllm_omni_instructions.md: expand the modalities-exercised
table from the 4 modalities run in the previous session (T2T/I2T/A2T/
T2S) to all 8 omni paths (adds V2T/V2S/I2S/A2S, all green). Documents
the direct-OpenAI-path workaround for V2T/V2S/A2S, used to sidestep
UCF101 + LibriSpeech dataset downloads when disk is full.
Two small-N quality checks against the same vllm-omni Ming server used
for the throughput sweep:

  MMLU      78.9% accuracy on 285 items (cais/mmlu, ~5 per subject, 0-shot)
  VideoMME  56.9% accuracy on  51 items (chunk1 subset, stratified by
                                          duration, 0-shot)

Both at temperature=0, parse rates ≥99%. MMLU runs in 13s (~22 req/s,
text-only); VideoMME takes ~10 min wall (~11 s/req, base64-inlined mp4s).

ACCURACY.md ships the per-subject (worst/best 10) and per-task-type
breakdowns. Notable: VideoMME medium-duration accuracy (29%) is much
lower than short (77%) or long (65%) — likely sample variance at N=17/
bucket, but flagged. Temporal Reasoning subtype 0/3 is also worth a
larger-sample follow-up.

These are spot checks, not publishable numbers; caveats are inlined in
ACCURACY.md. Per-item results.json files (gitignored) sit beside it
locally for drill-down.
Step 3a of mminf/model/ming_omni_flash/PORTING_NOTES.md. Adds the three
architecture-specific pieces of the Ling-2.0 thinker that don't map
cleanly onto mminf's existing components/, ahead of assembling the
full BailingMoeV2 decoder layer in step 3b:

components/router.py — LingMoeRouter:
  * sigmoid + learned (non-grad) expert bias + group-limited top-k
    (n_group=8 groups, topk_group=4) + routed_scaling_factor
  * returns (logits, weights, indices) tuple so it drops straight into
    mminf's SparseMoeBlockWithSharedExpert + the fused-Triton dispatch

components/rope.py — LingPartialMRotaryEmbedding:
  * partial rotary (head_dim * 0.5 dims rotated, rest pass-through)
  * 3D video_rope cos/sin remap [H W H W ... T T T] — the unusual
    interleaving Ming uses instead of standard MRoPE's contiguous
    [T T H H W W] layout
  * degenerates to plain 1D rotary on 1D position_ids

components/attention.py — LingAttention:
  * per-head RMSNorm on q and k before rope (use_qk_norm: True on the
    released ckpt — standard ParallelAttention doesn't bake this in)
  * composes the rope module + GQA + causal SDPA
  * step-3a scope is batch=1 unit-test; full TP path lands step 3b

test/modular/test_ming_flash_omni_components.py — 12 tests:
  * router: shapes/scaling, group-limit isolation, expert-bias shift,
    bad-config rejection, vllm-omni indices cross-check (skip when
    vllm-omni not importable in venv)
  * rope: shapes + pass-through, 1D = plain rotary, video_rope axis
    assignment (zero-row sentinel test), inconsistent-section rejection
  * attention: forward runs (CUDA only — mminf RMSNorm uses flashinfer's
    CUDA kernel), QK-norm produces unit-RMS output, causal mask doesn't
    leak future tokens

Result: 11 component tests pass + 21 existing config/tokenizer tests
still green (32 total Ming tests). vllm-omni cross-check skips cleanly
in mminf's venv (vllm_omni is only installed in the vllm venv) and
when run manually requires a vllm config context that's non-trivial to
bootstrap outside vllm's own test harness.

Out of scope: BailingMoeV2DecoderLayer (hybrid dense/MoE per
first_k_dense_replace) — step 3b. BailingMoeV2Model + weight loader +
mminf submodule wiring — step 3c.
Step 3b of mminf/model/ming_omni_flash/PORTING_NOTES.md. Assembles the
step-3a components (LingMoeRouter, LingPartialMRotaryEmbedding,
LingAttention) into the layer and full-thinker forward.

Real find while reading upstream: Ling's MultiRouter isn't a single
grouped-topk router — it's THREE routers (text gate, image_gate,
audio_gate) mixed per-token by image/audio modality masks.
LingMoeRouter from step 3a is correct as the per-router primitive;
this step adds the multi-router composition around it.

components/moe.py — LingMoeBlock:
  * 3 LingMoeRouter instances (gate / image_gate / audio_gate)
  * Fused expert weights matching mminf SparseMoeBlock's packed layout
    (gate_up_proj, down_proj) — step-3c weight loader can reuse the
    existing primitives
  * GatedMLP shared expert of moe_intermediate_size * num_shared_experts
    width; output is added unconditionally and ungated (matches
    upstream — no shared_expert_gate sigmoid trick)
  * forward(hidden, image_mask=None, audio_mask=None): text gate runs
    always, image/audio gates run + torch.where-swap their picks at
    masked positions

components/decoder_layer.py — LingDecoderLayer:
  * pre-norm pattern (RMSNorm + LingAttention + residual)
  * branches on layer_idx: GatedMLP (intermediate_size=9216) when
    layer_idx < first_k_dense_replace, else LingMoeBlock
  * threads image_mask/audio_mask only to the MoE branch

components/model.py — LingMoeModel:
  * Embed + ModuleList of N LingDecoderLayer + RMSNorm + lm_head
  * Single shared LingPartialMRotaryEmbedding instance across layers
  * forward accepts input_ids OR input_embeds (multimodal callers
    will splice vision/audio embeds in step 4+), returns
    (T, vocab_size) logits — no last-position slicing here

test/modular/test_ming_flash_omni_model.py — 9 tests:
  * MoE block: text-only shape, image mask routes through image_gate,
    shared expert contributes, bad-mask-shape rejection
  * Model: input_ids/embeds XOR contract; full forward shape; embed
    bypass; dense-vs-MoE layer-index branch differs; end-to-end causal

41 of 42 Ming tests passing (1 skipped: vllm-omni cross-check needs
vllm-omni in mminf venv; step 3a). Lint clean.

Out of scope (step 3c):
  - KV cache wiring on LingAttention
  - Safetensors weight loader (per-expert gate/up/down fusion across
    256 separate keys into the packed gate_up_proj param)
  - BailingMoeV2ThinkerSubmodule wrapping LingMoeModel for mminf's
    engine/graph-walk machinery
  - Real-checkpoint smoke test (load shard 1, run forward, verify
    finite outputs against vllm-omni's output)
  - TP-aware ParallelAttention/ParallelMoeBlock variants
Step 3c of mminf/model/ming_omni_flash/PORTING_NOTES.md. Maps the
released inclusionAI/Ming-flash-omni-2.0 checkpoint into the
LingMoeModel built in steps 3a + 3b, and verifies the load + forward
end-to-end against the real shards.

loader.py:
  * _RENAME_RULES — 18 patterns mapping the ckpt's HF naming convention
    (model.model.layers.{i}.attention.query_key_value.weight,
    .mlp.gate.weight, .mlp.experts.{j}.gate_proj.weight, etc.) into
    LingMoeModel's state_dict names (layers.{i}.self_attn.qkv_proj.weight,
    .mlp.gate.gate.weight, .mlp.experts.gate_up_proj after fusion).
  * build_ling_weight_converters() — reuses mminf's existing
    MergeModulelist + Concatenate Operations to pack 256 per-expert
    gate_proj/up_proj/down_proj weights per MoE layer into the dense
    (256, 2*moe_inter, hidden) and (256, hidden, moe_inter) tensors
    LingMoeBlock expects.
  * load_thinker_weights(model, local_dir, device, strict=True) —
    iterates shards via iter_safetensors_shards, applies the rename
    pass, buckets per-expert weights per layer, runs the fusion
    converters, and assigns to model.state_dict. Strict mode raises
    on missing target params or unmatched ckpt keys; non-strict skips.

__init__.py — re-exports LingMoeModel and load_thinker_weights so
external callers can `from mminf.model.ming_omni_flash import ...`
without crawling into components/.

test_ming_flash_omni_loader.py — 6 tests:
  * Pure-Python (always run): rename rules cover layer-0 dense keys,
    rename rules cover MoE-layer keys, expert fusion produces
    correctly-packed (256, 2*inter, hidden) tensor with gate/up halves
    in expected positions, strict mode raises on missing params.
  * Real-ckpt (CUDA + snapshot gated): load embed + dense layer 0 +
    norm + lm_head from the released shards (~3 GB) into a 1-layer
    LingMoeModel; forward 4 token ids returns (4, 157184) finite bf16
    logits. Second test verifies every layer-0 attention parameter has
    the expected shape after load.

49 of 50 Ming tests passing (1 skipped: vllm-omni router cross-check
needs vllm-omni in mminf venv; step 3a). Real-ckpt smoke confirms the
model-side code matches the upstream architecture: random tokens →
finite logits after embed + 1 dense transformer layer + lm_head, with
1024-dim packed QKV correctly split into Q (32×128) / K (4×128) /
V (4×128), and SDPA running on bf16 weights.

Out of scope (step 3d):
  - KV cache wiring on LingAttention (currently uses inline SDPA;
    needs mminf's cache_handle plumbing)
  - BailingMoeV2ThinkerSubmodule in submodules.py — wraps LingMoeModel
    into mminf's ARNodeSubmodule interface so the engine can drive it
  - Full multi-layer forward verification against a vllm-omni-served
    reference (the "byte-equality with upstream" test — needs all 32
    layers loaded across multiple GPUs)
  - TP-aware variants (ParallelAttention / ParallelMoeBlock + a
    TP-rank-aware weight loader)
… (step 3d)

Step 3d of mminf/model/ming_omni_flash/PORTING_NOTES.md. Connects the
LingMoeModel built in 3a-3c to mminf's engine: wires KV cache through
attention, adds the submodule the engine calls, fills in every
MingFlashOmniModel ABC method for the text-only path.

components/attention.py — LingAttention now calls
cache_handle.run_attention(q, k, v) (paged KV write + masked SDPA via
FlashInfer) instead of inline F.scaled_dot_product_attention. Keeps
the custom partial-3D video_rope rotation inline (we don't use
cache_handle.apply_rope). Forward signature is now packed-tokens
(num_tokens, hidden) + cache_handle + position_ids — the layout the
mminf engine actually uses.

components/decoder_layer.py + components/model.py — thread cache_handle
through to attention; LingMoeModel.forward calls cache_handle.set_layer_idx(i)
before each layer's forward. cache_handle is the new first positional
arg of model.forward (everything after stays kwarg).

submodules.py (new) — BailingMoeV2ThinkerSubmodule wraps LingMoeModel
into mminf's ARNodeSubmodule contract: prepare_inputs builds
ARNodeInputs from token ids; preprocess plans the cache + packs the
batch (single-request only in 3d); forward runs the LingMoeModel +
advance_seq_lens; check_stop returns {"decode_loop"} when the
sampled token is <|role_end|> (id 156895). Mirrors Orpheus's text-LLM
template closely.

ming_omni_flash_model.py — removed the raise-NotImplementedError that
made the scaffold un-instantiable; implemented every Model ABC method
for the thinker text-only path: get_kv_cache_config (Ling-2.0 dims
from config.thinker_llm), get_node_engine_types ({"Thinker": KV_CACHE}),
get_graph_walk_graphs (prefill + decode_loop), get_partition_topology
(single Thinker partition), get_initial_forward_pass_args +
get_partition_forward_pass_args (mirrors Orpheus's prefill→decode→done
flow), process_prompt (jinja chat_template with the model's tokenizer
— OpenAI-standard "user" role works), postprocess (decode tokens to
utf-8), get_submodule (builds LingMoeModel + calls load_thinker_weights
+ returns BailingMoeV2ThinkerSubmodule).

configs/ming_flash_omni_thinker_only.yaml — simplified to register
only the Thinker node (audio_encoder/vision_encoder lands at step 4+).
Single-rank by default — TP=4 needs step-3e TP-aware variants.

Tests (test_ming_flash_omni_{components,model,loader}.py) — updated
to pass a _MockCacheHandle through every forward call. The mock
implements set_layer_idx + run_attention(SDPA-based) — the same
behavior the inline path had before the refactor, so test semantics
are unchanged. Real-ckpt smoke (step 3c's layer-0 forward through the
embed + 1 dense layer + lm_head) still produces finite bf16 logits
with the new signature.

End-to-end mminf-serve smoke (substep 4): mminf-serve --config
ming_flash_omni_thinker_only.yaml --tensor-comm-protocol SHM
successfully starts uvicorn, instantiates MingFlashOmniModel, calls
get_submodule("Thinker"), and starts loading weights via
load_thinker_weights — failing with OOM after ~75 GB on a single 80
GB H100. This is the expected blocker without TP-aware code: the
full 100B-param model needs TP=4 across 4 GPUs to fit. The engine
plumbing itself works end-to-end; step 3e (TP-aware ParallelAttention /
ParallelMoeBlock + TP-rank-aware weight loader) is the remaining
piece for actual serving.

47 of 48 Ming tests pass (1 skipped: vllm-omni router cross-check
needs vllm-omni in mminf venv from step 3a). Lint clean.
Step 3e of mminf/model/ming_omni_flash/PORTING_NOTES.md. Makes the
LingMoeModel TP-aware so the full 100B-param model actually fits
across multiple H100s (single-GPU OOMed at 75 GB in step 3d's smoke).

components/attention.py — LingAttention now wraps mminf's
QKVParallelLinear (per-rank head sharding, weight_loader handles
"q"/"k"/"v" shard_ids) + RowParallelLinear (all-reduces output dim).
Per-rank num_heads / num_kv_heads come from the qkv_proj after
construction. QK-norm + partial-3D video_rope stay inline (head_dim-
shaped operations identical at every rank).

components/moe.py — LingMoeBlock now allocates expert tensors with
shard_inter = moe_intermediate_size // tp_size, attaches mminf's
existing _gate_up_weight_loader / _down_proj_weight_loader (per-rank
slicing along the intermediate dim, shard_ids "gate:N"/"up:N"/"down:N"
per-expert). Shared expert becomes ParallelGatedMLP (its down_proj
all-reduces internally). TP>1 forward mirrors
ParallelSparseMoeBlock._dispatch_tp: fused_experts(reduce_results=False)
+ comm_group.all_reduce + moe_sum_reduce_triton.

components/decoder_layer.py + components/model.py — comm_group plumbed
through every constructor. Dense layer-0 MLP becomes ParallelGatedMLP.

loader.py — full refactor onto mminf's load_hf_weights + StackedParamRule
machinery (replaces step 3c's custom loader). New shape:
  * _strip_outer_model_prefix + _apply_substring_renames + per-expert
    __expertN__ marker rewrite in _remap_thinker_keys
  * _split_packed_qkv splits the ckpt's packed query_key_value.weight
    into three synthetic q_proj/k_proj/v_proj entries, which the
    standard q/k/v StackedParamRules route into QKVParallelLinear's
    fused qkv_proj
  * _build_thinker_stacked_params dynamically builds 3 × num_experts
    rules + dense MLP gate/up + synthetic QKV rules (770 total for
    Ling-2.0's 256 experts)
Per-rank weight slicing is automatic via the parameter-attached
weight_loaders on every Parallel* module.

ming_omni_flash_model.py — _create_thinker_submodule (no longer in
inline get_submodule) constructs LingMoeModel(comm_group=tp_group) on
the meta device, .to_empty(device=device).to(bf16), then loads via
load_thinker_weights. get_default_sharding_config declares Thinker as
TP-capable. configs/ming_flash_omni_thinker_only.yaml: tp_size=8 on
GPUs 0-7 (TP=4 hit OOM at 78.58/80 GB; TP=8 has plenty of headroom).

Tests:
  * components/model tests: switched to _init_dispatch_weights helper
    that initialises every Parallel* param the constructor allocated
    (Parallel* modules use torch.empty for params; real weight loading
    overwrites them in production, tests need explicit init).
  * test_ming_flash_omni_loader.py: rewritten for the new helpers
    (_remap_thinker_keys, _build_thinker_stacked_params,
    _split_packed_qkv). Real-ckpt smoke loads embed + 1 dense layer +
    norm + lm_head and runs a forward — 1 layer's worth of finite
    bf16 logits at vocab=157184.

47 of 48 Ming tests pass (1 skipped: vllm-omni router cross-check).
Lint clean.

End-to-end mminf-serve smoke (TP=8 on 8 H100s):
  ✅ uvicorn starts on :8092
  ✅ All 8 workers load 507 thinker params each (~50 sec total)
  ✅ KVCacheEngine warmup_and_capture + torch.compile applied
  ✅ Dedicated GPU threads + plan_executor spin up
  ❌ First /generate request: IndexError in
     BailingMoeV2ThinkerSubmodule.prepare_inputs — per-request
     text_inputs list arrives empty. Integration bug between
     get_initial_forward_pass_args / graph walks / the conductor's
     prompt-to-input-signals routing, NOT a model code bug. All the
     heavy plumbing works; needs a small follow-up to wire the prompt
     tokens through to the first prefill call. Documented in
     PORTING_NOTES.md.

Out of scope (step 3f and step 4+):
  - Fix the text_inputs-routing for the first prefill call (small but
    needs a debug session walking the conductor → worker dispatch path)
  - Multi-request batching in BailingMoeV2ThinkerSubmodule
  - Vision / audio encoders + their prefill walks
  - Talker / AudioVAE / image-gen
Closes two items from the mminf↔vllm-omni correctness review:

* Add a parametrised numeric parity test for ``LingPartialMRotaryEmbedding._remap_video_rope`` vs
  vllm-omni's ``MingVideoRopeMRotaryEmbedding._remap_video_rope``. mminf operates on the full
  ``(3, T, rotary_dim)`` neox-cat table while vllm operates on the ``(3, T, rotary_dim/2)`` half
  table; both halves of our output must equal vllm's half output. 6 cases cover the released
  ckpt geometry (mrope_section=[8,12,12]) plus edges where hw_size==half (no temporal tail),
  hw_size<<half, and asymmetric Nh≠Nw.

* Add the missing multimodal token IDs (``audio_patch_token``, ``audio_start_token``,
  ``audio_end_token``, ``image_end_token``, ``video_end_token``) and ``tokens_per_second`` to
  ``ThinkerLLMConfig`` with tokenizer-truth defaults. Without them, the vision/audio masking +
  MRoPE temporal-position pipeline (porting step 4) has nowhere to read these constants from.

Also repair an upstream mislabel found while wiring those defaults: the inclusionAI ckpt's
``llm_config.video_start_token`` is 157159, but per the tokenizer 157159 is ``</image>`` and
the real ``<video>`` token is 157160. Jonathan1909's patched config and vllm-omni's hardcoded
default both have 157160. ``__post_init__`` now detects the bogus value, repairs it in place,
and warns loudly so a future ckpt that intentionally rebinds the field doesn't get silently
overridden. Extend the vocab-bounds validator to cover the five newly-added token fields and
add regression tests for both behaviours.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two model-side bugs blocked the first end-to-end /generate response on
top of step 3e:

(a) BailingMoeV2ThinkerSubmodule had no postprocess hook, so the
    decode loop's text_inputs edge never received the freshly sampled
    new_token. Added postprocess that rebinds new_token -> text_inputs,
    mirroring OrpheusLLMSubmodule.

(b) Prefill / decode output edges used EMPTY_DESTINATION +
    conductor_new_token=True. With (a) fixed the loop produced tokens
    but the API server received {"outputs": {}} because no edge routed
    new_token to the client. Switched to Qwen3-Omni's pattern: emit
    each token via parallel EMIT_TO_CLIENT (output_modality="text")
    edges alongside the text_inputs loopback.

Also collected environment-side patches required to actually reach a
working forward on this box:

* BailingTokenizer doesn't load under transformers >= 5.0 (verbose
  removed; add_bos_token setter touches not-yet-built _tokenizer).
  _patch_bailing_tokenizer_for_transformers5 applies both fixes
  lazily after the first AttributeError.

* LingMoeBlock._dispatch_tp now falls back to dispatch_experts_fused
  + all-reduce when sgl_kernel is unloadable, which is the case here
  due to an ABI mismatch against the installed torch. Math is
  equivalent (sum-over-TP and sum-over-top-k commute).

Verified via mminf-serve smoke (TP=8 on 8 H100s): /generate returns
real model text. Updated configs/ming_flash_omni_thinker_only.yaml
comments and PORTING_NOTES.md punch list accordingly.
Three new stateless modules with weight-key parity against the
released ckpt's top-level prefixes:

* MingVisionProjector / MingAudioProjector (components/projectors.py):
  Port the nn.Sequential chains built inline in modeling_bailingmm2.py
  into standalone modules. Layer indices match the on-disk keys
  (linear_proj.{0,2} for vision, linear_proj_audio.{0,3} for audio).

* build_vision_encoder (components/vision_encoder.py):
  Construct Ming's Qwen3MoeVisionTransformer via dynamic import from
  the staged Ming source dir (the same path used by the tokenizer +
  processor). The encoder is ~1 GB at bf16 and runs on a single GPU,
  so we use the reference implementation directly rather than fork.

* MingAudioEncoder (components/audio_encoder.py):
  Self-contained port of vllm-omni's packed-sequence Whisper encoder
  (~250 LOC). No openai-whisper runtime dep — optional flash-attn
  varlen fast path with a manual padded-attention fallback. Param
  names match upstream Whisper (query/key/value/out,
  mlp.{0,2}.{weight,bias}) so the released ckpt's audio.blocks.N.*
  keys load by state-dict equality.

17 tests in test/modular/test_ming_flash_omni_encoders.py: 12
pure-Python (projector shapes/indices/forward, audio encoder weight-key
parity, packed-attention fallback) + 1 snapshot-gated (vision encoder
builds from real VisionEncoderConfig) + 1 CUDA-gated (forward smoke
under eager attention, currently skipped on this box for missing
libnvrtc-builtins — not a code bug; will re-verify when step 5 wires
encoders into the prefill walk).

PORTING_NOTES step 4a updated; 4b (extend loader.py to actually load
the vision/audio/projector subtrees from the snapshot) is the next
sub-step before the encoders can be wired into a live graph walk.
Adds four loader entry points on top of a shared
_load_prefixed_state_dict helper:

  * load_vision_encoder_weights    (prefix=vision.)
  * load_audio_encoder_weights     (prefix=audio.)
  * load_vision_projector_weights  (prefix=linear_proj., inner=proj.)
  * load_audio_projector_weights   (prefix=linear_proj_audio., inner=proj.)

None of these are TP-aware — vision + audio encoders colocate on
rank 0 in the typical topology (see configs/ming_flash_omni.yaml),
so plain prefix-strip + load_state_dict suffices. The projector
loaders prepend `proj.` so the on-disk linear_proj.{0,2}.* and
linear_proj_audio.{0,3}.* keys hit the nn.Sequential slot by
integer index.

Verified by 4 snapshot-gated tests against /dev/shm/ming-hybrid: all
four prefixes load strictly (no missing / unexpected keys). The audio
encoder's positional_embedding is loaded as a buffer (overrides the
local sinusoidal init); the vision encoder loads all 27 blocks +
merger + deepstack_merger_list cleanly.

Snapshot lookup in the test helper now prefers /dev/shm/ming-hybrid
(merged shards + index) over the HF-Hub snapshot dir (which only has
the index symlink — shards live elsewhere on this box).

Step 4a + 4b complete; step 5 (wire encoders into prefill graph
walks) is the next slice.
…n (step 5a)

Add the two encoder NodeSubmodules and their construction paths so
the Thinker can pull vision/audio embeddings off graph nodes once
step 5b/5c land the prefill walks.

* VisionEncoderSubmodule wraps Qwen3MoeVisionTransformer +
  MingVisionProjector and mirrors
  modeling_bailingmm2.extract_image_feature (encoder → projector
  → F.normalize). prepare_inputs raises clearly on missing
  pixel_values / image_grid_thw and promotes 1-D [T, H, W] grid_thw
  to (1, 3).

* AudioEncoderSubmodule wraps MingAudioEncoder + MingAudioProjector.
  Accepts a single (n_mels, T) clip or (B, n_mels, T) batched
  tensor, optionally trims the padded tail using audio_seqlens, and
  concatenates per-clip embeddings along time. L2-norm applies when
  audio_config.norm_query_embeds is set (true on the released ckpt
  — matches modeling_bailingmm2.extract_audio_feature).

* get_node_engine_types now registers vision_encoder and audio_encoder
  as EngineType.STATELESS alongside the KV-cache Thinker.
  Construction routes through _create_vision_encoder_submodule /
  _create_audio_encoder_submodule helpers that build, dtype-cast, and
  weight-load via the loaders from step 4b. flash_attention_2 is the
  default for the vision encoder (override via MING_VISION_ATTN_IMPL
  env var for non-FA2 dev boxes); audio encoder uses flash-attn varlen
  when available, manual fallback otherwise.

12 tests in test/modular/test_ming_flash_omni_submodules.py: 10
pure-Python (input validation, output shape, L2 norm, batched/single
equivalence, audio_seqlens trim, grid_thw promotion, node-type
registration, friendly error on unknown node) + 2 snapshot-gated
(_create_audio_encoder_submodule end-to-end on the real ckpt —
verifies Conv1 + projector params are non-zero post-load).

PORTING_NOTES step 5 broken out into 5a (this), 5b (Thinker prefill
dispatch for vision/audio modality routing), 5c (graph walks +
partition wiring + initial-forward-pass arg routing).
…osition helpers (step 5b)

BailingMoeV2ThinkerSubmodule.prepare_inputs now dispatches on
graph_walk and emits either input_ids (text-only walks) or
input_embeds + custom_pos_ids (multimodal walks). preprocess and
forward route both shapes through to LingMoeModel's existing dual
input_ids/input_embeds + 1D/3D position_ids handling — no new
model.py path needed.

Three new position-id helpers live in components/positions.py, each
producing (3, T) long tensors compatible with
LingPartialMRotaryEmbedding's video_rope branch:

* get_rope_index_text — three identical sequential rows.
  Pure-text branch of modeling_bailing_moe_v2.get_rope_index (:658-675).
* get_rope_index_audio — alias to text (Ming does not special-case
  audio in get_rope_index).
* get_rope_index_vision — per-image 3D grid math from :625-647 with
  optional video timestamp scaling via second_per_grid_t * tokens_per_second.

Thinker dispatch covers:
* prefill / prefill_text — backward-compat text path (unchanged).
* prefill_audio — wraps audio_embeds with audio_start / audio_end
  sentinel embeds, text-like 3D positions for the span.
* prefill_vision / prefill_video — wraps vision_embeds with
  image_start/image_end (or video_start/video_end), grid-aware 3D
  positions. eos sentinel sits at global_max(vision_pos) + 1 so the
  next walk's text positions resume without collision (matches
  llm_pos_ids_list[-1].max() + 1 in the source).
* decode / thinker_decode — single-token AR step (unchanged).

Sentinel embeds are lazily computed per device on first use; the
Thinker submodule now takes config= at construction so it can read
vision.spatial_merge_size, thinker_llm.tokens_per_second, and the
*_start_token / *_end_token ids. ming_omni_flash_model.py threads
self.config through to the submodule.

Step 5b restricts to single-image / single-clip requests; the
multi-image splice via Sequential graph wiring lands in 5c.

21 new tests across test_ming_flash_omni_positions.py (11) and
test_ming_flash_omni_submodules.py (10): position-id shape / offset
/ abs-time math, missing-input error paths, multi-image rejection,
sentinel embed correctness for audio / image / video walks,
start_pos advancement, legacy prefill walk name compat. All green.
get_graph_walk_graphs now returns five walks instead of the step 3f
text-only prefill/decode pair:

* prefill_text — bare Thinker node.
* prefill_audio — Sequential([audio_encoder, Thinker]); encoder emits
  audio_embeds into the Thinker.
* prefill_vision — Sequential([vision_encoder, Thinker]);
  image_grid_thw routes to BOTH the encoder (for spatial positions
  on the patches) AND the Thinker (for 3D MRoPE math around the
  vision span).
* prefill_video — same shape as prefill_vision plus
  video_second_per_grid routed into the Thinker.
* thinker_decode — AR loop, renamed from step 3f's decode.

get_partitions lists all five walks under the single Thinker partition
with initial_walk="prefill_text".

Two new helpers drive scheduling:

* _build_thinker_prefill_schedule(input_modalities, input_signals) —
  one schedule step per modality, in input_modalities order; each
  step is (walk_name, {input_name: TensorPointerInfo}). Modalities
  listed without matching tensors in input_signals are silently
  skipped (parity with qwen3_omni).
* _get_thinker_prefill_inputs(metadata, input_signals) — emits one
  GraphEdge per input for the current step, routing each to the right
  node (encoder vs Thinker), including the dual image_grid_thw edge
  for vision walks.

get_initial_forward_pass_args builds the schedule, picks the first
walk, and stashes the schedule + step counter on the metadata.
get_partition_forward_pass_args is the Thinker state machine: advance
schedule → transition to thinker_decode → return request_done=True
after the decode loop unwinds. Mirrors qwen3_omni_model.py:765+ minus
the Talker / Code2Wav partitions.

Empty-schedule edge case (no usable modalities) short-circuits to
request_done=True so the conductor doesn't hang.

21 tests in test/modular/test_ming_flash_omni_graph.py covering walk
structure, partition listing, schedule construction for all modality
mixes (incl. unknown-modality / no-inputs), per-walk edge routing,
and full state-machine drive across a text+audio request (init →
audio prefill → decode → done).

The submodule's backward-compat aliases for "prefill"/"decode" stay
in place so external callers that still emit the step 3f walk names
keep working.
… video (step 7)

MingFlashOmniModel.process_prompt now produces the full NameToTensorList
consumed by step 5c's prefill scheduler. Strategy mirrors
qwen3_omni's process_prompt: apply the chat template to TEXT-ONLY
messages (so the tokenizer doesn't insert placeholder tokens we'd
later have to strip), then run image / video / audio sub-processors
separately for each modality.

Uses tokenizer.apply_chat_template (jinja, accepts OpenAI
user/assistant/system roles) rather than the stricter
processor.apply_chat_template (asserts on uppercase HUMAN/ASSISTANT
only) — keeps the API surface OpenAI-compatible.

Inputs (tensors: NameToTensorList):
* image_inputs — list of CHW float [0,1] tensors per image. The
  internal _image_to_processor_input converts to HWC uint8 to avoid
  the upstream's double-rescale-to-zero bug. Single-channel inputs
  auto-broadcast to 3 channels.
* audio_inputs — raw 1-D float tensors OR (waveform, sample_rate)
  tuples (sample rate inferred from processor default 16 kHz when
  raw waveform is passed).
* video_inputs — list of (T, C, H, W) float tensors. Per-frame
  second_per_grid defaults to 1.0; override via
  kwargs["input_metadata"]["video"][i]["second_per_grid"].

Outputs (keys consumed by _build_thinker_prefill_schedule):
* text_inputs — list of 1-D long tensors per text turn.
* pixel_values, image_grid_thw — one entry per image.
* pixel_values_videos, video_grid_thw, video_second_per_grid — per
  video clip.
* audio_features (n_mels, T), audio_seqlens (length-1 long) — per
  audio clip. Upstream returns (B, T, n_mels); we transpose to
  (n_mels, T) per clip so AudioEncoderSubmodule.prepare_inputs can
  splice without a reshape.

17 tests in test/modular/test_ming_flash_omni_process_prompt.py
covering text-only / no-prompt / image / audio / video / mixed
paths, per-modality dispatch, missing-processor error paths,
CHW-float→HWC-uint8 conversion correctness (including grayscale +
uint8 pass-through), multi-image, video metadata override, plus a
snapshot-gated text+image end-to-end against the real
BailingMM2Processor. 16 green + 1 env-skip on this box.

Image-gen <image><imagePatch>*256</image> query-token block deferred
to step 9 (ImageGen partition; text-out generation works without it).
…groups

Found during the first live mminf-serve bring-up of Ming-flash-omni
(thinker-only config, TP=4 on GPUs 4-7). get_worker_graphs iterated
EVERY graph walk, including prefill_audio / prefill_vision /
prefill_video / talker, which reference encoder / talker nodes
(audio_encoder, vision_encoder, Talker). The thinker-only deploy only
declares `Thinker` in node_groups, so _divide_into_worker_graphs hit
KeyError: 'audio_encoder' while dividing the prefill_audio walk and
crashed conductor startup.

Fix: in get_worker_graphs, collect the node names a walk references via
graph.get_nodes() and skip any walk whose required nodes aren't all
present in the config's node_groups. A partial deploy (thinker-only,
talker-only, etc.) simply can't serve the walks for nodes it doesn't
host — that's correct behaviour, not an error.

This is generic framework behaviour (any model with optional partitions
benefits), not Ming-specific. Verified: thinker-only conductor startup
now proceeds past worker-graph division to weight loading (then OOMs at
the documented TP=4 ~78.58/80 GB-per-rank wall, which is a hardware
limit needing TP=8, not a code issue). test_ming_flash_omni_graph +
talker_graph + test_graph all green; pre-existing
test_worker_graphs_manager failures are unrelated (fail with this
change stashed too).
@zhudianGG zhudianGG closed this Jun 11, 2026
@zhudianGG

Copy link
Copy Markdown
Collaborator Author

close for rebase in #115

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant