[Ming-Omni] PR 1 — Ming-flash-omni: understanding path (thinker + vision/audio + multimodal prefill)#104
Closed
zhudianGG wants to merge 21 commits into
Benchmark (runnable today):
* benchmark/base.py: MingFlashOmni model (inclusionAI/Ming-flash-omni-2.0,
all 8 omni modalities T2T/I2T/A2T/V2T + T2S/I2S/A2S/V2S, max_tokens=256
for cross-system fairness, no system preamble) + ModelType.MING_FLASH_OMNI.
* benchmark/vllm_omni_instructions.md: launch commands for vllm-omni's
ming_flash_omni{,_thinker_only,_tts} deploy yamls.
* Benchmarks Ming today via --inference-system vllm_omni against a
vllm-omni server.
Native mminf port (scaffold only — every abstractmethod raises
NotImplementedError; mminf-serve will fail at startup until filled in):
* mminf/model/ming_omni_flash/{config,ming_omni_flash_model}.py:
file/class shape mirroring mminf/model/qwen3_omni/ with pointers to
the upstream vllm-omni reference (~6,500 LOC).
* mminf/model/ming_omni_flash/PORTING_NOTES.md: 12-step punch list
mapping each mminf surface to the upstream vllm-omni file + closest
Qwen3-Omni parallel.
* mminf/model/registry.py: registered under "ming_flash_omni" with HF id.
* configs/ming_flash_omni{,_thinker_only}.yaml: starter deploy topologies
mirroring vllm-omni's, marked WIP.
Step 1 of mminf/model/ming_omni_flash/PORTING_NOTES.md. Replaces the
placeholder config.py with a full dataclass tree mirroring
mminf/model/qwen3_omni/config.py: ThinkerLLMConfig (Ling-2.0 256-expert
MoE, head_dim=128, partial_rotary_factor=0.5, mrope_section=[8,12,12]),
VisionEncoderConfig (Qwen3-MoE ViT 27L, out_hidden=4096),
AudioEncoderConfig (Whisper 32L with Ming-side ds_kernel/ds_stride/
norm_query knobs), plus skeleton TalkerConfig + ImageGenConfig that
lazy-load from the released checkpoint's sibling subdirs
(talker/{config,llm,vae}.json, transformer/, mlp/, etc.) — those two
get full field semantics at steps 6 and 9.
The released ckpt does NOT match upstream vllm-omni's flat
MingFlashOmniConfig nesting; top-level config.json is the
BailingMM2Config shape only, so the loader walks subdirs instead of
parsing a single nested dict.
__post_init__ sanity checks fail loudly on the silent-miswire patterns
(head_dim inconsistency, MRoPE section that doesn't partition the
rotary cos/sin half, multimodal token IDs outside vocab).
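A minimal sketch of what such fail-loud checks could look like, using the numbers quoted in these commit messages (head_dim=128, partial_rotary_factor=0.5, mrope_section=[8,12,12], vocab 157184). The class name RopeConfigSketch and its field layout are hypothetical and are not the real ThinkerLLMConfig.

```python
from dataclasses import dataclass


@dataclass
class RopeConfigSketch:
    # Hypothetical stand-in for the real config; values come from the commit text.
    head_dim: int = 128
    partial_rotary_factor: float = 0.5
    mrope_section: tuple = (8, 12, 12)
    vocab_size: int = 157184
    video_start_token: int = 157160

    def __post_init__(self):
        rotary_dim = int(self.head_dim * self.partial_rotary_factor)
        if rotary_dim <= 0 or rotary_dim % 2:
            raise ValueError(f"rotary_dim={rotary_dim} must be a positive even number")
        half = rotary_dim // 2
        # The three MRoPE sections must exactly partition the cos/sin half table.
        if sum(self.mrope_section) != half:
            raise ValueError(
                f"mrope_section {self.mrope_section} sums to "
                f"{sum(self.mrope_section)}, expected {half}"
            )
        # Multimodal token ids must be valid embedding rows.
        if not 0 <= self.video_start_token < self.vocab_size:
            raise ValueError(f"token id {self.video_start_token} is outside the vocab")


cfg = RopeConfigSketch()  # 8 + 12 + 12 == 32 == (128 * 0.5) / 2, so this passes
```

With the released geometry the section check reduces to 8 + 12 + 12 = 32, which is half of the 64 rotated dimensions.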
MingFlashOmniModel.__init__ now resolves the snapshot and loads the
config before raising NotImplementedError, so the load path is
exercised end-to-end even though no submodules / graph walks exist
yet (those are steps 3+).
Verified: pytest test/modular/test_ming_flash_omni_config.py passes
10/10 against the released checkpoint locally cached at
~/.cache/huggingface/hub/models--inclusionAI--Ming-flash-omni-2.0/;
tests skip cleanly when the snapshot isn't present.
Step 2 of mminf/model/ming_omni_flash/PORTING_NOTES.md. The released HF
checkpoint ships only weights + sub-dir configs, and none of the
tokenizer / processor / modeling Python modules that AutoTokenizer and
AutoProcessor's trust_remote_code path expects to find next to
config.json. Those live in the Ming source repo at
https://github.com/inclusionAI/Ming .
This commit:
* Adds _prepare_tokenizer_dir + _find_ming_code_dir helpers that symlink
the required Ming .py and .json assets from a separately cloned source
repo (located via MING_CODE_DIR env, ./Ming, or /tmp/ming_repo) into
the snapshot dir, and push the snapshot onto sys.path so transformers'
dynamic-module loader's sibling imports resolve.
* Loads BailingTokenizer + BailingMM2Processor with graceful fallback:
when the source repo or its extra deps are missing, init logs a clear
how-to-fix warning and leaves self.tokenizer / self._processor as None
instead of crashing.
* Documents the Ming source dependency + setup steps in PORTING_NOTES.md.
Also corrects the benchmark/base.py:MingFlashOmni docstring on role
mapping: it previously claimed BailingMM2Processor maps OpenAI roles,
but BailingMM2Processor is strict and rejects user/assistant. What
actually happens is the *jinja* chat_template in tokenizer_config.json
does the remap. vllm-omni serves via tokenizer.apply_chat_template
(which uses the jinja), so the benchmark wire format is correct; the
native mminf process_prompt (step 7) will need to remap roles before
invoking BailingMM2Processor.apply_chat_template.
Verified: 11 new tests in test/modular/test_ming_flash_omni_tokenizer.py
pass against the released ckpt + a clone of inclusionAI/Ming at
/tmp/ming_repo. All 21 ming tests skip cleanly when either the snapshot
or the source repo is absent. Ruff clean.
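The lookup order and symlink step described above can be sketched roughly as follows. The function names, the "link every top-level .py/.json file" policy, and the never-overwrite rule are assumptions for illustration; the real _find_ming_code_dir / _prepare_tokenizer_dir helpers may select assets differently.

```python
import os
from pathlib import Path


def find_ming_code_dir(env=None, candidates=("./Ming", "/tmp/ming_repo")):
    """Return the first existing source dir: MING_CODE_DIR first, then fallbacks."""
    env = os.environ if env is None else env
    search = []
    if env.get("MING_CODE_DIR"):
        search.append(env["MING_CODE_DIR"])
    search.extend(candidates)
    for raw in search:
        path = Path(raw).expanduser()
        if path.is_dir():
            return path.resolve()
    return None  # caller logs a how-to-fix warning instead of crashing


def link_assets(code_dir, snapshot_dir, suffixes=(".py", ".json")):
    """Symlink assets into the snapshot, leaving files that already exist alone."""
    linked = []
    for src in sorted(Path(code_dir).iterdir()):
        dst = Path(snapshot_dir) / src.name
        if not src.is_file() or src.suffix not in suffixes:
            continue
        if dst.exists() or dst.is_symlink():
            continue  # assumption: the snapshot's own files always win
        dst.symlink_to(src.resolve())
        linked.append(dst.name)
    return linked
```

Returning None rather than raising matches the graceful-fallback behaviour described in the commit, where the tokenizer and processor are left unset when the source repo is absent.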
The released inclusionAI/Ming-flash-omni-2.0 doesn't load straight into
vllm-omni: the snapshot ships BailingMM2-flavoured processor configs and
talker weights with an `audio.*` prefix, while vllm-omni's
MingFlashOmniForConditionalGeneration registers Qwen2VLImageProcessor +
MingWhisperFeatureExtractor and expects `audio_vae.*` for the talker.
The fix is to build a hybrid snapshot: inclusionAI's thinker safetensors
(the only heavy bit, ~200 GB) plus Jonathan1909's repackaged metadata
files + talker weights (~3 GB extra). This avoids re-downloading the
thinker.
Adds the explicit launch + benchmark recipe to
benchmark/vllm_omni_instructions.md, including the served-model-id quirk
(vllm-omni reports the local serve path verbatim and 404s on the
canonical HF id) and a results table from a local 4×H100 run on
2026-06-06:
T2T offline B=1: 110 tok/s
T2T closed-loop C=8: 493 tok/s
T2S: RTF 0.14 (real-time factor; <1 = faster than real-time)
I2T + A2T both validated end-to-end.
Adds results/ming_t2t_sweep/SUMMARY.md with the throughput curve from a
6-point concurrency sweep on 4×H100 against the running vllm-omni
hybrid-snapshot Ming server:
c=1 → 110 tok/s (single-stream baseline)
c=2 → 199 tok/s (1.8×)
c=4 → 356 tok/s (3.2×)
c=8 → 573 tok/s (5.2×)
c=16 → 888 tok/s (8.1×)
c=32 → 1060 tok/s (9.6×; knee here)
All 470 requests across the sweep succeeded; TTFT stays 28-91 ms.
benchmark/vllm_omni_instructions.md: expand the modalities-exercised
table from the 4 modalities run in the previous session (T2T/I2T/A2T/
T2S) to all 8 omni paths (adds V2T/V2S/I2S/A2S, all green). Documents
the direct-OpenAI-path workaround for V2T/V2S/A2S, used to sidestep
UCF101 + LibriSpeech dataset downloads when disk is full.
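The speedup figures in the sweep follow directly from the reported throughputs. A small sketch that recomputes them, plus a per-stream efficiency column that is not in the original summary and is added here only to make the knee visible:

```python
# concurrency -> aggregate tok/s, copied from the sweep above
sweep = {1: 110, 2: 199, 4: 356, 8: 573, 16: 888, 32: 1060}

base = sweep[1]
rows = []
for concurrency, tok_s in sweep.items():
    speedup = tok_s / base              # relative to the single-stream baseline
    efficiency = speedup / concurrency  # 1.0 would be perfect linear scaling
    rows.append((concurrency, round(speedup, 1), round(efficiency, 2)))

for concurrency, speedup, efficiency in rows:
    print(f"c={concurrency:<3} speedup={speedup:>4}x  efficiency={efficiency:.2f}")

# Gain from the last doubling of concurrency (16 -> 32).
last_doubling_gain = sweep[32] / sweep[16]
```

The last doubling yields roughly a 1.19× gain for 2× the concurrency, which is consistent with the knee flagged at c=32.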
Two small-N quality checks against the same vllm-omni Ming server used
for the throughput sweep:
MMLU 78.9% accuracy on 285 items (cais/mmlu, ~5 per subject, 0-shot)
VideoMME 56.9% accuracy on 51 items (chunk1 subset, stratified by
duration, 0-shot)
Both at temperature=0, parse rates ≥99%. MMLU runs in 13s (~22 req/s,
text-only); VideoMME takes ~10 min wall (~11 s/req, base64-inlined mp4s).
ACCURACY.md ships the per-subject (worst/best 10) and per-task-type
breakdowns. Notable: VideoMME medium-duration accuracy (29%) is much
lower than short (77%) or long (65%) — likely sample variance at N=17/
bucket, but flagged. Temporal Reasoning subtype 0/3 is also worth a
larger-sample follow-up.
These are spot checks, not publishable numbers; caveats are inlined in
ACCURACY.md. Per-item results.json files (gitignored) sit beside it
locally for drill-down.
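To put the "sample variance at N=17/bucket" caveat in numbers, here is a rough sketch using the normal-approximation standard error of a proportion. The approximation is itself crude at N=17, so treat the intervals as indicative only; the bucket accuracies are the ones quoted above.

```python
import math


def binomial_se(p, n):
    """Standard error of an observed proportion p over n independent items."""
    return math.sqrt(p * (1.0 - p) / n)


buckets = {"short": 0.77, "medium": 0.29, "long": 0.65}
n_per_bucket = 17

for name, acc in buckets.items():
    half_width = 1.96 * binomial_se(acc, n_per_bucket)  # approx. 95% interval
    print(f"{name:<6} {acc:.2f} +/- {half_width:.2f}")
```

Each bucket carries an uncertainty of roughly ±0.2, so the medium-duration dip is suggestive but far from conclusive at this sample size.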
Step 3a of mminf/model/ming_omni_flash/PORTING_NOTES.md. Adds the three
architecture-specific pieces of the Ling-2.0 thinker that don't map
cleanly onto mminf's existing components/, ahead of assembling the
full BailingMoeV2 decoder layer in step 3b:
components/router.py — LingMoeRouter:
* sigmoid + learned (non-grad) expert bias + group-limited top-k
(n_group=8 groups, topk_group=4) + routed_scaling_factor
* returns (logits, weights, indices) tuple so it drops straight into
mminf's SparseMoeBlockWithSharedExpert + the fused-Triton dispatch
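A pure-Python sketch of group-limited top-k routing for a single token, to make the bullet above concrete. Several details are assumptions rather than confirmed LingMoeRouter behaviour: the group score is taken as the best biased score in the group (upstream may aggregate differently), the bias is used for selection only, and the selected weights are renormalised before scaling.

```python
import math


def group_limited_topk(logits, bias, n_group, topk_group, top_k, scale):
    """Route one token; returns (weights, indices) for the chosen experts."""
    if len(logits) % n_group or topk_group > n_group:
        raise ValueError("experts must split evenly into groups; topk_group <= n_group")
    per_group = len(logits) // n_group

    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]  # sigmoid, not softmax
    biased = [s + b for s, b in zip(scores, bias)]          # bias steers selection

    # Assumed group score: best biased expert inside each group.
    group_best = [max(biased[g * per_group:(g + 1) * per_group]) for g in range(n_group)]
    kept = sorted(range(n_group), key=lambda g: -group_best[g])[:topk_group]

    # Only experts inside the kept groups are eligible.
    candidates = [e for g in kept for e in range(g * per_group, (g + 1) * per_group)]
    indices = sorted(candidates, key=lambda e: -biased[e])[:top_k]

    # Assumed: weights use the unbiased scores, renormalised, then scaled.
    total = sum(scores[e] for e in indices)
    weights = [scale * scores[e] / total for e in indices]
    return weights, indices


logits = [0.0, 0.1, 3.0, 2.0, 2.5, 0.0, 0.0, 0.2]  # 8 experts in 4 groups of 2
weights, indices = group_limited_topk(logits, [0.0] * 8, 4, 2, 2, scale=2.5)
```

With topk_group=1 the same logits pick experts 2 and 3, even though expert 4 outscores expert 3, which is the group-limit isolation property the tests exercise.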
components/rope.py — LingPartialMRotaryEmbedding:
* partial rotary (head_dim * 0.5 dims rotated, rest pass-through)
* 3D video_rope cos/sin remap [H W H W ... T T T] — the unusual
interleaving Ming uses instead of standard MRoPE's contiguous
[T T H H W W] layout
* degenerates to plain 1D rotary on 1D position_ids
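The two layouts can be compared by labelling which position axis drives each of the rotary_dim/2 frequency slots. This sketch assumes the interleaving starts with H and that the H and W sections are equal (true for [8,12,12]); it illustrates the layout only and is not the actual _remap_video_rope code.

```python
def video_rope_axes(mrope_section):
    """Axis per frequency slot for the interleaved [H W H W ... T T T] layout."""
    t, h, w = mrope_section
    if h != w:
        raise ValueError("this sketch only covers equal H and W sections")
    axes = ["H" if slot % 2 == 0 else "W" for slot in range(h + w)]
    axes.extend(["T"] * t)  # the temporal slots form the tail
    return axes


def contiguous_mrope_axes(mrope_section):
    """Axis per frequency slot for the contiguous [T.. H.. W..] layout."""
    t, h, w = mrope_section
    return ["T"] * t + ["H"] * h + ["W"] * w


ming = video_rope_axes((8, 12, 12))
standard = contiguous_mrope_axes((8, 12, 12))
print("".join(ming))      # HWHW... then TTTTTTTT
print("".join(standard))  # TTTTTTTT then H... then W...
```

Both layouts use the same number of slots per axis; only the slot-to-axis assignment differs. When all three position rows are identical (1D text positions) the assignment has no effect, which is why the module degenerates to plain rotary.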
components/attention.py — LingAttention:
* per-head RMSNorm on q and k before rope (use_qk_norm: True on the
released ckpt — standard ParallelAttention doesn't bake this in)
* composes the rope module + GQA + causal SDPA
* step-3a scope is batch=1 unit-test; full TP path lands step 3b
test/modular/test_ming_flash_omni_components.py — 12 tests:
* router: shapes/scaling, group-limit isolation, expert-bias shift,
bad-config rejection, vllm-omni indices cross-check (skip when
vllm-omni not importable in venv)
* rope: shapes + pass-through, 1D = plain rotary, video_rope axis
assignment (zero-row sentinel test), inconsistent-section rejection
* attention: forward runs (CUDA only — mminf RMSNorm uses flashinfer's
CUDA kernel), QK-norm produces unit-RMS output, causal mask doesn't
leak future tokens
Result: 11 component tests pass + 21 existing config/tokenizer tests
still green (32 total Ming tests). vllm-omni cross-check skips cleanly
in mminf's venv (vllm_omni is only installed in the vllm venv) and
when run manually requires a vllm config context that's non-trivial to
bootstrap outside vllm's own test harness.
Out of scope: BailingMoeV2DecoderLayer (hybrid dense/MoE per
first_k_dense_replace) — step 3b. BailingMoeV2Model + weight loader +
mminf submodule wiring — step 3c.
Step 3b of mminf/model/ming_omni_flash/PORTING_NOTES.md. Assembles the
step-3a components (LingMoeRouter, LingPartialMRotaryEmbedding,
LingAttention) into the layer and full-thinker forward.
Real find while reading upstream: Ling's MultiRouter isn't a single
grouped-topk router — it's THREE routers (text gate, image_gate,
audio_gate) mixed per-token by image/audio modality masks.
LingMoeRouter from step 3a is correct as the per-router primitive;
this step adds the multi-router composition around it.
components/moe.py — LingMoeBlock:
* 3 LingMoeRouter instances (gate / image_gate / audio_gate)
* Fused expert weights matching mminf SparseMoeBlock's packed layout
(gate_up_proj, down_proj) — step-3c weight loader can reuse the
existing primitives
* GatedMLP shared expert of moe_intermediate_size * num_shared_experts
width; output is added unconditionally and ungated (matches
upstream — no shared_expert_gate sigmoid trick)
* forward(hidden, image_mask=None, audio_mask=None): text gate runs
always, image/audio gates run + torch.where-swap their picks at
masked positions
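The per-token swap described in the forward bullet can be sketched without tensors. Each router's output is modelled as a list with one (weights, indices) pick per token; the precedence when both masks are set on the same token is an assumption, since that case should not occur in practice.

```python
def mix_router_picks(text_picks, image_picks, audio_picks,
                     image_mask=None, audio_mask=None):
    """Choose, per token, which router's pick to use (a where-style swap)."""
    n = len(text_picks)
    image_mask = [False] * n if image_mask is None else image_mask
    audio_mask = [False] * n if audio_mask is None else audio_mask
    if len(image_mask) != n or len(audio_mask) != n:
        raise ValueError("mask length must match the number of tokens")

    mixed = []
    for i in range(n):
        if image_mask[i]:
            mixed.append(image_picks[i])   # image tokens use image_gate
        elif audio_mask[i]:
            mixed.append(audio_picks[i])   # audio tokens use audio_gate
        else:
            mixed.append(text_picks[i])    # everything else uses the text gate
    return mixed


text = [("t_w0", "t_i0"), ("t_w1", "t_i1"), ("t_w2", "t_i2")]
image = [("i_w0", "i_i0"), ("i_w1", "i_i1"), ("i_w2", "i_i2")]
audio = [("a_w0", "a_i0"), ("a_w1", "a_i1"), ("a_w2", "a_i2")]
mixed = mix_router_picks(text, image, audio,
                         image_mask=[False, True, False],
                         audio_mask=[False, False, True])
```

With no masks supplied the result is the text router's output unchanged, matching the text-only path.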
components/decoder_layer.py — LingDecoderLayer:
* pre-norm pattern (RMSNorm + LingAttention + residual)
* branches on layer_idx: GatedMLP (intermediate_size=9216) when
layer_idx < first_k_dense_replace, else LingMoeBlock
* threads image_mask/audio_mask only to the MoE branch
components/model.py — LingMoeModel:
* Embed + ModuleList of N LingDecoderLayer + RMSNorm + lm_head
* Single shared LingPartialMRotaryEmbedding instance across layers
* forward accepts input_ids OR input_embeds (multimodal callers
will splice vision/audio embeds in step 4+), returns
(T, vocab_size) logits — no last-position slicing here
test/modular/test_ming_flash_omni_model.py — 9 tests:
* MoE block: text-only shape, image mask routes through image_gate,
shared expert contributes, bad-mask-shape rejection
* Model: input_ids/embeds XOR contract; full forward shape; embed
bypass; dense-vs-MoE layer-index branch differs; end-to-end causal
41 of 42 Ming tests passing (1 skipped: vllm-omni cross-check needs
vllm-omni in mminf venv; step 3a). Lint clean.
Out of scope (step 3c):
- KV cache wiring on LingAttention
- Safetensors weight loader (per-expert gate/up/down fusion across
256 separate keys into the packed gate_up_proj param)
- BailingMoeV2ThinkerSubmodule wrapping LingMoeModel for mminf's
engine/graph-walk machinery
- Real-checkpoint smoke test (load shard 1, run forward, verify
finite outputs against vllm-omni's output)
- TP-aware ParallelAttention/ParallelMoeBlock variants
Step 3c of mminf/model/ming_omni_flash/PORTING_NOTES.md. Maps the
released inclusionAI/Ming-flash-omni-2.0 checkpoint into the
LingMoeModel built in steps 3a + 3b, and verifies the load + forward
end-to-end against the real shards.
loader.py:
* _RENAME_RULES — 18 patterns mapping the ckpt's HF naming convention
(model.model.layers.{i}.attention.query_key_value.weight,
.mlp.gate.weight, .mlp.experts.{j}.gate_proj.weight, etc.) into
LingMoeModel's state_dict names (layers.{i}.self_attn.qkv_proj.weight,
.mlp.gate.gate.weight, .mlp.experts.gate_up_proj after fusion).
* build_ling_weight_converters() — reuses mminf's existing
MergeModulelist + Concatenate Operations to pack 256 per-expert
gate_proj/up_proj/down_proj weights per MoE layer into the dense
(256, 2*moe_inter, hidden) and (256, hidden, moe_inter) tensors
LingMoeBlock expects.
* load_thinker_weights(model, local_dir, device, strict=True) —
iterates shards via iter_safetensors_shards, applies the rename
pass, buckets per-expert weights per layer, runs the fusion
converters, and assigns to model.state_dict. Strict mode raises
on missing target params or unmatched ckpt keys; non-strict skips.
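A toy sketch of the rename pass and the per-expert fusion, using nested lists in place of tensors. Only two rename rules are shown (both taken from the examples above), and the gate-before-up ordering inside the packed tensor is an assumption inferred from the gate_up_proj name; the real loader uses mminf's MergeModulelist + Concatenate operations.

```python
PREFIX = "model.model."
SUBSTRING_RULES = [
    (".attention.query_key_value.", ".self_attn.qkv_proj."),
    (".mlp.gate.weight", ".mlp.gate.gate.weight"),
]


def rename_key(ckpt_key):
    """Map a checkpoint key to the model's state_dict naming (two rules only)."""
    key = ckpt_key[len(PREFIX):] if ckpt_key.startswith(PREFIX) else ckpt_key
    for old, new in SUBSTRING_RULES:
        key = key.replace(old, new)
    return key


def fuse_experts(per_expert):
    """Pack per-expert matrices into the two stacked layouts.

    per_expert[e] holds "gate_proj"/"up_proj" as (inter, hidden) row lists and
    "down_proj" as (hidden, inter) row lists.
    """
    # Concatenate gate rows then up rows: result is (E, 2*inter, hidden).
    gate_up = [w["gate_proj"] + w["up_proj"] for w in per_expert]
    # Stack down projections: result is (E, hidden, inter).
    down = [w["down_proj"] for w in per_expert]
    return gate_up, down


experts = [
    {"gate_proj": [[1, 1, 1], [2, 2, 2]], "up_proj": [[3, 3, 3], [4, 4, 4]],
     "down_proj": [[5, 5], [6, 6], [7, 7]]},
    {"gate_proj": [[8, 8, 8], [9, 9, 9]], "up_proj": [[0, 0, 0], [1, 1, 1]],
     "down_proj": [[2, 2], [3, 3], [4, 4]]},
]
gate_up, down = fuse_experts(experts)
```

In this toy case E=2, inter=2 and hidden=3, so gate_up has shape (2, 4, 3) with the gate half in rows 0-1 and the up half in rows 2-3 of each expert.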
__init__.py — re-exports LingMoeModel and load_thinker_weights so
external callers can `from mminf.model.ming_omni_flash import ...`
without crawling into components/.
test_ming_flash_omni_loader.py — 6 tests:
* Pure-Python (always run): rename rules cover layer-0 dense keys,
rename rules cover MoE-layer keys, expert fusion produces
correctly-packed (256, 2*inter, hidden) tensor with gate/up halves
in expected positions, strict mode raises on missing params.
* Real-ckpt (CUDA + snapshot gated): load embed + dense layer 0 +
norm + lm_head from the released shards (~3 GB) into a 1-layer
LingMoeModel; forward 4 token ids returns (4, 157184) finite bf16
logits. Second test verifies every layer-0 attention parameter has
the expected shape after load.
49 of 50 Ming tests passing (1 skipped: vllm-omni router cross-check
needs vllm-omni in mminf venv; step 3a). Real-ckpt smoke confirms the
model-side code matches the upstream architecture: random tokens →
finite logits after embed + 1 dense transformer layer + lm_head, with
5120-dim packed QKV correctly split into Q (32×128) / K (4×128) /
V (4×128), and SDPA running on bf16 weights.
Out of scope (step 3d):
- KV cache wiring on LingAttention (currently uses inline SDPA;
needs mminf's cache_handle plumbing)
- BailingMoeV2ThinkerSubmodule in submodules.py — wraps LingMoeModel
into mminf's ARNodeSubmodule interface so the engine can drive it
- Full multi-layer forward verification against a vllm-omni-served
reference (the "byte-equality with upstream" test — needs all 32
layers loaded across multiple GPUs)
- TP-aware variants (ParallelAttention / ParallelMoeBlock + a
TP-rank-aware weight loader)
… (step 3d)
Step 3d of mminf/model/ming_omni_flash/PORTING_NOTES.md. Connects the
LingMoeModel built in 3a-3c to mminf's engine: wires KV cache through
attention, adds the submodule the engine calls, fills in every
MingFlashOmniModel ABC method for the text-only path.
components/attention.py — LingAttention now calls
cache_handle.run_attention(q, k, v) (paged KV write + masked SDPA via
FlashInfer) instead of inline F.scaled_dot_product_attention. Keeps
the custom partial-3D video_rope rotation inline (we don't use
cache_handle.apply_rope). Forward signature is now packed-tokens
(num_tokens, hidden) + cache_handle + position_ids — the layout the
mminf engine actually uses.
components/decoder_layer.py + components/model.py — thread cache_handle
through to attention; LingMoeModel.forward calls cache_handle.set_layer_idx(i)
before each layer's forward. cache_handle is the new first positional
arg of model.forward (everything after stays kwarg).
submodules.py (new) — BailingMoeV2ThinkerSubmodule wraps LingMoeModel
into mminf's ARNodeSubmodule contract: prepare_inputs builds
ARNodeInputs from token ids; preprocess plans the cache + packs the
batch (single-request only in 3d); forward runs the LingMoeModel +
advance_seq_lens; check_stop returns {"decode_loop"} when the
sampled token is <|role_end|> (id 156895). Mirrors Orpheus's text-LLM
template closely.
ming_omni_flash_model.py — removed the raise-NotImplementedError that
made the scaffold un-instantiable; implemented every Model ABC method
for the thinker text-only path: get_kv_cache_config (Ling-2.0 dims
from config.thinker_llm), get_node_engine_types ({"Thinker": KV_CACHE}),
get_graph_walk_graphs (prefill + decode_loop), get_partition_topology
(single Thinker partition), get_initial_forward_pass_args +
get_partition_forward_pass_args (mirrors Orpheus's prefill→decode→done
flow), process_prompt (jinja chat_template with the model's tokenizer
— OpenAI-standard "user" role works), postprocess (decode tokens to
utf-8), get_submodule (builds LingMoeModel + calls load_thinker_weights
+ returns BailingMoeV2ThinkerSubmodule).
configs/ming_flash_omni_thinker_only.yaml — simplified to register
only the Thinker node (audio_encoder/vision_encoder lands at step 4+).
Single-rank by default — TP=4 needs step-3e TP-aware variants.
Tests (test_ming_flash_omni_{components,model,loader}.py) — updated
to pass a _MockCacheHandle through every forward call. The mock
implements set_layer_idx + run_attention(SDPA-based) — the same
behavior the inline path had before the refactor, so test semantics
are unchanged. Real-ckpt smoke (step 3c's layer-0 forward through the
embed + 1 dense layer + lm_head) still produces finite bf16 logits
with the new signature.
End-to-end mminf-serve smoke (substep 4): mminf-serve --config
ming_flash_omni_thinker_only.yaml --tensor-comm-protocol SHM
successfully starts uvicorn, instantiates MingFlashOmniModel, calls
get_submodule("Thinker"), and starts loading weights via
load_thinker_weights — failing with OOM after ~75 GB on a single 80
GB H100. This is the expected blocker without TP-aware code: the
full 100B-param model needs TP=4 across 4 GPUs to fit. The engine
plumbing itself works end-to-end; step 3e (TP-aware ParallelAttention /
ParallelMoeBlock + TP-rank-aware weight loader) is the remaining
piece for actual serving.
47 of 48 Ming tests pass (1 skipped: vllm-omni router cross-check
needs vllm-omni in mminf venv from step 3a). Lint clean.
Step 3e of mminf/model/ming_omni_flash/PORTING_NOTES.md. Makes the
LingMoeModel TP-aware so the full 100B-param model actually fits
across multiple H100s (single-GPU OOMed at 75 GB in step 3d's smoke).
components/attention.py — LingAttention now wraps mminf's
QKVParallelLinear (per-rank head sharding, weight_loader handles
"q"/"k"/"v" shard_ids) + RowParallelLinear (all-reduces output dim).
Per-rank num_heads / num_kv_heads come from the qkv_proj after
construction. QK-norm + partial-3D video_rope stay inline (head_dim-
shaped operations identical at every rank).
components/moe.py — LingMoeBlock now allocates expert tensors with
shard_inter = moe_intermediate_size // tp_size, attaches mminf's
existing _gate_up_weight_loader / _down_proj_weight_loader (per-rank
slicing along the intermediate dim, shard_ids "gate:N"/"up:N"/"down:N"
per-expert). Shared expert becomes ParallelGatedMLP (its down_proj
all-reduces internally). TP>1 forward mirrors
ParallelSparseMoeBlock._dispatch_tp: fused_experts(reduce_results=False)
+ comm_group.all_reduce + moe_sum_reduce_triton.
components/decoder_layer.py + components/model.py — comm_group plumbed
through every constructor. Dense layer-0 MLP becomes ParallelGatedMLP.
loader.py — full refactor onto mminf's load_hf_weights + StackedParamRule
machinery (replaces step 3c's custom loader). New shape:
* _strip_outer_model_prefix + _apply_substring_renames + per-expert
__expertN__ marker rewrite in _remap_thinker_keys
* _split_packed_qkv splits the ckpt's packed query_key_value.weight
into three synthetic q_proj/k_proj/v_proj entries, which the
standard q/k/v StackedParamRules route into QKVParallelLinear's
fused qkv_proj
* _build_thinker_stacked_params dynamically builds 3 × num_experts
rules + dense MLP gate/up + synthetic QKV rules (770 total for
Ling-2.0's 256 experts)
Per-rank weight slicing is automatic via the parameter-attached
weight_loaders on every Parallel* module.
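The packed-QKV split can be illustrated on a row list. This sketch assumes the packed weight is laid out as contiguous [Q; K; V] blocks along the output dimension (not interleaved per head), and the toy head counts are illustrative rather than the released geometry.

```python
def split_packed_qkv(weight_rows, num_heads, num_kv_heads, head_dim):
    """Split a packed query_key_value weight into q/k/v row blocks."""
    q_rows = num_heads * head_dim
    kv_rows = num_kv_heads * head_dim
    expected = q_rows + 2 * kv_rows
    if len(weight_rows) != expected:
        raise ValueError(f"expected {expected} rows, got {len(weight_rows)}")
    return {
        "q_proj": weight_rows[:q_rows],
        "k_proj": weight_rows[q_rows:q_rows + kv_rows],
        "v_proj": weight_rows[q_rows + kv_rows:],
    }


# Toy geometry: 4 query heads, 1 KV head, head_dim 2 -> 8 + 2 + 2 = 12 rows.
packed = [[float(r)] for r in range(12)]
parts = split_packed_qkv(packed, num_heads=4, num_kv_heads=1, head_dim=2)
```

Emitting three synthetic entries lets the ordinary q/k/v stacking rules handle the per-rank head sharding, so the splitter itself needs no TP awareness.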
ming_omni_flash_model.py — _create_thinker_submodule (no longer in
inline get_submodule) constructs LingMoeModel(comm_group=tp_group) on
the meta device, .to_empty(device=device).to(bf16), then loads via
load_thinker_weights. get_default_sharding_config declares Thinker as
TP-capable. configs/ming_flash_omni_thinker_only.yaml: tp_size=8 on
GPUs 0-7 (TP=4 hit OOM at 78.58/80 GB; TP=8 has plenty of headroom).
Tests:
* components/model tests: switched to _init_dispatch_weights helper
that initialises every Parallel* param the constructor allocated
(Parallel* modules use torch.empty for params; real weight loading
overwrites them in production, tests need explicit init).
* test_ming_flash_omni_loader.py: rewritten for the new helpers
(_remap_thinker_keys, _build_thinker_stacked_params,
_split_packed_qkv). Real-ckpt smoke loads embed + 1 dense layer +
norm + lm_head and runs a forward — 1 layer's worth of finite
bf16 logits at vocab=157184.
47 of 48 Ming tests pass (1 skipped: vllm-omni router cross-check).
Lint clean.
End-to-end mminf-serve smoke (TP=8 on 8 H100s):
✅ uvicorn starts on :8092
✅ All 8 workers load 507 thinker params each (~50 sec total)
✅ KVCacheEngine warmup_and_capture + torch.compile applied
✅ Dedicated GPU threads + plan_executor spin up
❌ First /generate request: IndexError in
BailingMoeV2ThinkerSubmodule.prepare_inputs — per-request
text_inputs list arrives empty. Integration bug between
get_initial_forward_pass_args / graph walks / the conductor's
prompt-to-input-signals routing, NOT a model code bug. All the
heavy plumbing works; needs a small follow-up to wire the prompt
tokens through to the first prefill call. Documented in
PORTING_NOTES.md.
Out of scope (step 3f and step 4+):
- Fix the text_inputs-routing for the first prefill call (small but
needs a debug session walking the conductor → worker dispatch path)
- Multi-request batching in BailingMoeV2ThinkerSubmodule
- Vision / audio encoders + their prefill walks
- Talker / AudioVAE / image-gen
Closes two items from the mminf↔vllm-omni correctness review:
* Add a parametrised numeric parity test for
``LingPartialMRotaryEmbedding._remap_video_rope`` vs vllm-omni's
``MingVideoRopeMRotaryEmbedding._remap_video_rope``. mminf operates on
the full ``(3, T, rotary_dim)`` neox-cat table while vllm operates on
the ``(3, T, rotary_dim/2)`` half table; both halves of our output must
equal vllm's half output. 6 cases cover the released ckpt geometry
(mrope_section=[8,12,12]) plus edges where hw_size==half (no temporal
tail), hw_size<<half, and asymmetric Nh≠Nw.
* Add the missing multimodal token IDs (``audio_patch_token``,
``audio_start_token``, ``audio_end_token``, ``image_end_token``,
``video_end_token``) and ``tokens_per_second`` to ``ThinkerLLMConfig``
with tokenizer-truth defaults. Without them, the vision/audio masking +
MRoPE temporal-position pipeline (porting step 4) has nowhere to read
these constants from.
Also repair an upstream mislabel found while wiring those defaults: the
inclusionAI ckpt's ``llm_config.video_start_token`` is 157159, but per
the tokenizer 157159 is ``</image>`` and the real ``<video>`` token is
157160. Jonathan1909's patched config and vllm-omni's hardcoded default
both have 157160. ``__post_init__`` now detects the bogus value, repairs
it in place, and warns loudly so a future ckpt that intentionally
rebinds the field doesn't get silently overridden.
Extend the vocab-bounds validator to cover the five newly-added token
fields and add regression tests for both behaviours.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
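The detect, repair and warn behaviour for the mislabelled token can be sketched as a small helper. The function and constant names are hypothetical; only the two token ids and their tokenizer meanings come from the commit message.

```python
import warnings

MISLABELLED_VIDEO_START = 157159  # per the tokenizer this id is </image>
TOKENIZER_VIDEO_START = 157160    # per the tokenizer this id is <video>


def repair_video_start_token(token_id):
    """Return a usable video_start_token, warning when a repair was applied."""
    if token_id == MISLABELLED_VIDEO_START:
        warnings.warn(
            f"video_start_token={token_id} is </image> in the tokenizer; "
            f"using {TOKENIZER_VIDEO_START} (<video>) instead",
            stacklevel=2,
        )
        return TOKENIZER_VIDEO_START
    # Any other value is left alone so an intentional rebind is respected.
    return token_id
```

Only the one known-bad value is rewritten, so a checkpoint that deliberately assigns a different id passes through untouched.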
Two model-side bugs blocked the first end-to-end /generate response on
top of step 3e:
(a) BailingMoeV2ThinkerSubmodule had no postprocess hook, so the
decode loop's text_inputs edge never received the freshly sampled
new_token. Added postprocess that rebinds new_token -> text_inputs,
mirroring OrpheusLLMSubmodule.
(b) Prefill / decode output edges used EMPTY_DESTINATION +
conductor_new_token=True. With (a) fixed the loop produced tokens
but the API server received {"outputs": {}} because no edge routed
new_token to the client. Switched to Qwen3-Omni's pattern: emit
each token via parallel EMIT_TO_CLIENT (output_modality="text")
edges alongside the text_inputs loopback.
Also collected environment-side patches required to actually reach a
working forward on this box:
* BailingTokenizer doesn't load under transformers >= 5.0 (verbose
removed; add_bos_token setter touches not-yet-built _tokenizer).
_patch_bailing_tokenizer_for_transformers5 applies both fixes
lazily after the first AttributeError.
* LingMoeBlock._dispatch_tp now falls back to dispatch_experts_fused
+ all-reduce when sgl_kernel is unloadable, which is the case here
due to an ABI mismatch against the installed torch. Math is
equivalent (sum-over-TP and sum-over-top-k commute).
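The commutation claim is just a reordering of a double sum. A tiny check with integer stand-ins for the per-rank, per-expert-slot partial outputs (the values are made up for illustration):

```python
# partial[rank][k]: contribution of top-k slot k as computed on TP rank `rank`
partial = [
    [3, -1, 4],
    [1, 5, -9],
    [2, 6, 5],
]
num_slots = len(partial[0])

# Fused path: reduce over top-k on each rank, then all-reduce across ranks.
topk_then_ranks = sum(sum(row) for row in partial)

# Fallback path: all-reduce each slot across ranks, then reduce over top-k.
ranks_then_topk = sum(sum(row[k] for row in partial) for k in range(num_slots))

assert topk_then_ranks == ranks_then_topk
```

The equality is exact for integers. In bf16 or fp32 the two orders can differ by rounding, so a numeric parity test between the paths would need a tolerance rather than bitwise equality.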
Verified via mminf-serve smoke (TP=8 on 8 H100s): /generate returns
real model text. Updated configs/ming_flash_omni_thinker_only.yaml
comments and PORTING_NOTES.md punch list accordingly.
Three new stateless modules with weight-key parity against the
released ckpt's top-level prefixes:
* MingVisionProjector / MingAudioProjector (components/projectors.py):
Port the nn.Sequential chains built inline in modeling_bailingmm2.py
into standalone modules. Layer indices match the on-disk keys
(linear_proj.{0,2} for vision, linear_proj_audio.{0,3} for audio).
* build_vision_encoder (components/vision_encoder.py):
Construct Ming's Qwen3MoeVisionTransformer via dynamic import from
the staged Ming source dir (the same path used by the tokenizer +
processor). The encoder is ~1 GB at bf16 and runs on a single GPU,
so we use the reference implementation directly rather than fork.
* MingAudioEncoder (components/audio_encoder.py):
Self-contained port of vllm-omni's packed-sequence Whisper encoder
(~250 LOC). No openai-whisper runtime dep — optional flash-attn
varlen fast path with a manual padded-attention fallback. Param
names match upstream Whisper (query/key/value/out,
mlp.{0,2}.{weight,bias}) so the released ckpt's audio.blocks.N.*
keys load by state-dict equality.
17 tests in test/modular/test_ming_flash_omni_encoders.py: 12
pure-Python (projector shapes/indices/forward, audio encoder weight-key
parity, packed-attention fallback) + 1 snapshot-gated (vision encoder
builds from real VisionEncoderConfig) + 1 CUDA-gated (forward smoke
under eager attention, currently skipped on this box for missing
libnvrtc-builtins — not a code bug; will re-verify when step 5 wires
encoders into the prefill walk).
PORTING_NOTES step 4a updated; 4b (extend loader.py to actually load
the vision/audio/projector subtrees from the snapshot) is the next
sub-step before the encoders can be wired into a live graph walk.
Adds four loader entry points on top of a shared
_load_prefixed_state_dict helper:
* load_vision_encoder_weights (prefix=vision.)
* load_audio_encoder_weights (prefix=audio.)
* load_vision_projector_weights (prefix=linear_proj., inner=proj.)
* load_audio_projector_weights (prefix=linear_proj_audio., inner=proj.)
None of these are TP-aware — vision + audio encoders colocate on
rank 0 in the typical topology (see configs/ming_flash_omni.yaml),
so plain prefix-strip + load_state_dict suffices. The projector
loaders prepend `proj.` so the on-disk linear_proj.{0,2}.* and
linear_proj_audio.{0,3}.* keys hit the nn.Sequential slot by
integer index.
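A sketch of the prefix-strip step on a plain dict. The helper names are hypothetical and the keys below are shortened examples, not a listing of the real checkpoint.

```python
def select_prefixed(state, prefix, inner=""):
    """Keep keys under `prefix`, strip it, and optionally prepend `inner`."""
    return {
        inner + key[len(prefix):]: value
        for key, value in state.items()
        if key.startswith(prefix)
    }


def strict_key_diff(expected_keys, loaded):
    """Return (missing, unexpected) key lists, both empty on a strict match."""
    missing = sorted(set(expected_keys) - set(loaded))
    unexpected = sorted(set(loaded) - set(expected_keys))
    return missing, unexpected


ckpt = {
    "linear_proj.0.weight": "w0",
    "linear_proj.2.bias": "b2",
    "linear_proj_audio.0.weight": "aw0",
    "vision.blocks.0.attn.weight": "vw",
}
vision_proj = select_prefixed(ckpt, "linear_proj.", inner="proj.")
audio_proj = select_prefixed(ckpt, "linear_proj_audio.", inner="proj.")
```

Including the trailing dot in the prefix is what keeps linear_proj. from also matching the linear_proj_audio.* keys.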
Verified by 4 snapshot-gated tests against /dev/shm/ming-hybrid: all
four prefixes load strictly (no missing / unexpected keys). The audio
encoder's positional_embedding is loaded as a buffer (overrides the
local sinusoidal init); the vision encoder loads all 27 blocks +
merger + deepstack_merger_list cleanly.
Snapshot lookup in the test helper now prefers /dev/shm/ming-hybrid
(merged shards + index) over the HF-Hub snapshot dir (which only has
the index symlink — shards live elsewhere on this box).
Step 4a + 4b complete; step 5 (wire encoders into prefill graph
walks) is the next slice.
…n (step 5a)
Add the two encoder NodeSubmodules and their construction paths so the
Thinker can pull vision/audio embeddings off graph nodes once step
5b/5c land the prefill walks.
* VisionEncoderSubmodule wraps Qwen3MoeVisionTransformer +
MingVisionProjector and mirrors
modeling_bailingmm2.extract_image_feature (encoder → projector →
F.normalize). prepare_inputs raises clearly on missing pixel_values /
image_grid_thw and promotes 1-D [T, H, W] grid_thw to (1, 3).
* AudioEncoderSubmodule wraps MingAudioEncoder + MingAudioProjector.
Accepts a single (n_mels, T) clip or (B, n_mels, T) batched tensor,
optionally trims the padded tail using audio_seqlens, and concatenates
per-clip embeddings along time. L2-norm applies when
audio_config.norm_query_embeds is set (true on the released ckpt,
matching modeling_bailingmm2.extract_audio_feature).
* get_node_engine_types now registers vision_encoder and audio_encoder
as EngineType.STATELESS alongside the KV-cache Thinker.
Construction routes through _create_vision_encoder_submodule /
_create_audio_encoder_submodule helpers that build, dtype-cast, and
weight-load via the loaders from step 4b. flash_attention_2 is the
default for the vision encoder (override via MING_VISION_ATTN_IMPL env
var for non-FA2 dev boxes); audio encoder uses flash-attn varlen when
available, manual fallback otherwise.
12 tests in test/modular/test_ming_flash_omni_submodules.py: 10
pure-Python (input validation, output shape, L2 norm, batched/single
equivalence, audio_seqlens trim, grid_thw promotion, node-type
registration, friendly error on unknown node) + 2 snapshot-gated
(_create_audio_encoder_submodule end-to-end on the real ckpt, verifying
that Conv1 + projector params are non-zero post-load).
PORTING_NOTES step 5 broken out into 5a (this), 5b (Thinker prefill
dispatch for vision/audio modality routing), 5c (graph walks +
partition wiring + initial-forward-pass arg routing).
…osition helpers (step 5b)
BailingMoeV2ThinkerSubmodule.prepare_inputs now dispatches on graph_walk
and emits either input_ids (text-only walks) or input_embeds +
custom_pos_ids (multimodal walks). preprocess and forward route both
shapes through to LingMoeModel's existing dual input_ids/input_embeds +
1D/3D position_ids handling, so no new model.py path is needed.
Three new position-id helpers live in components/positions.py, each
producing (3, T) long tensors compatible with
LingPartialMRotaryEmbedding's video_rope branch:
* get_rope_index_text: three identical sequential rows. Pure-text
branch of modeling_bailing_moe_v2.get_rope_index (:658-675).
* get_rope_index_audio: alias to text (Ming does not special-case
audio in get_rope_index).
* get_rope_index_vision: per-image 3D grid math from :625-647 with
optional video timestamp scaling via second_per_grid_t *
tokens_per_second.
Thinker dispatch covers:
* prefill / prefill_text: backward-compat text path (unchanged).
* prefill_audio: wraps audio_embeds with audio_start / audio_end
sentinel embeds, text-like 3D positions for the span.
* prefill_vision / prefill_video: wraps vision_embeds with
image_start/image_end (or video_start/video_end), grid-aware 3D
positions. eos sentinel sits at global_max(vision_pos) + 1 so the next
walk's text positions resume without collision (matches
llm_pos_ids_list[-1].max() + 1 in the source).
* decode / thinker_decode: single-token AR step (unchanged).
Sentinel embeds are lazily computed per device on first use; the
Thinker submodule now takes config= at construction so it can read
vision.spatial_merge_size, thinker_llm.tokens_per_second, and the
*_start_token / *_end_token ids. ming_omni_flash_model.py threads
self.config through to the submodule.
Step 5b restricts to single-image / single-clip requests; the
multi-image splice via Sequential graph wiring lands in 5c.
21 new tests across test_ming_flash_omni_positions.py (11) and test_ming_flash_omni_submodules.py (10): position-id shape / offset / abs-time math, missing-input error paths, multi-image rejection, sentinel embed correctness for audio / image / video walks, start_pos advancement, legacy prefill walk name compat. All green.
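The position-id layout described for step 5b can be illustrated with a small pure-Python sketch. It assumes the common Qwen2-VL-style MRoPE convention (temporal, height, and width rows over the merged patch grid), which is a guess at what the referenced lines of modeling_bailing_moe_v2.get_rope_index do. The function names, the `spatial_merge_size=2` default, and the use of lists instead of (3, T) long tensors are illustrative assumptions.

```python
def rope_index_text(length, start_pos=0):
    """Three identical sequential rows, i.e. a (3, length) layout."""
    row = list(range(start_pos, start_pos + length))
    return [row[:], row[:], row[:]]


def rope_index_vision(grid_thw, spatial_merge_size=2, start_pos=0,
                      second_per_grid_t=None, tokens_per_second=1):
    """Per-image 3D grid positions over the merged patch grid."""
    t, h, w = grid_thw
    gh, gw = h // spatial_merge_size, w // spatial_merge_size
    # Optional video timestamp scaling of the temporal row.
    scale = 1 if second_per_grid_t is None else second_per_grid_t * tokens_per_second
    t_row, h_row, w_row = [], [], []
    for ti in range(t):
        for hi in range(gh):
            for wi in range(gw):
                t_row.append(start_pos + int(ti * scale))
                h_row.append(start_pos + hi)
                w_row.append(start_pos + wi)
    return [t_row, h_row, w_row]


def wrap_vision_span(grid_thw, start_pos=0, **kwargs):
    """Start sentinel, vision span, then end sentinel at global max + 1."""
    start = rope_index_text(1, start_pos)
    body = rope_index_vision(grid_thw, start_pos=start_pos + 1, **kwargs)
    end_pos = max(max(row) for row in body) + 1
    end = rope_index_text(1, end_pos)
    return [s + b + e for s, b, e in zip(start, body, end)]
```

Placing the end sentinel one past the largest position in any of the three rows is what lets the next walk's text positions resume from `end_pos + 1` without colliding with the vision span.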
get_graph_walk_graphs now returns five walks instead of the step 3f
text-only prefill/decode pair:
* prefill_text — bare Thinker node.
* prefill_audio — Sequential([audio_encoder, Thinker]); encoder emits
audio_embeds into the Thinker.
* prefill_vision — Sequential([vision_encoder, Thinker]);
image_grid_thw routes to BOTH the encoder (for spatial positions
on the patches) AND the Thinker (for 3D MRoPE math around the
vision span).
* prefill_video — same shape as prefill_vision plus
video_second_per_grid routed into the Thinker.
* thinker_decode — AR loop, renamed from step 3f's decode.
get_partitions lists all five walks under the single Thinker partition
with initial_walk="prefill_text".
Two new helpers drive scheduling:
* _build_thinker_prefill_schedule(input_modalities, input_signals) —
one schedule step per modality, in input_modalities order; each
step is (walk_name, {input_name: TensorPointerInfo}). Modalities
listed without matching tensors in input_signals are silently
skipped (parity with qwen3_omni).
* _get_thinker_prefill_inputs(metadata, input_signals) — emits one
GraphEdge per input for the current step, routing each to the right
node (encoder vs Thinker), including the dual image_grid_thw edge
for vision walks.
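The schedule construction described above can be sketched as follows. This is a simplified stand-in, not the actual helper: plain values replace TensorPointerInfo, the modality keys (`"text"`, `"audio"`, `"image"`, `"video"`) and the `WALK_INPUTS` table are assumptions, and skipping unknown modalities is also an assumption, since the commit only states that modalities without matching tensors are skipped.

```python
# Hypothetical mapping: modality -> (walk name, input names the walk needs).
WALK_INPUTS = {
    "text": ("prefill_text", ["text_inputs"]),
    "audio": ("prefill_audio", ["audio_features", "audio_seqlens"]),
    "image": ("prefill_vision", ["pixel_values", "image_grid_thw"]),
    "video": ("prefill_video",
              ["pixel_values_videos", "video_grid_thw", "video_second_per_grid"]),
}


def build_prefill_schedule(input_modalities, input_signals):
    """Return one (walk_name, {input_name: value}) step per usable modality."""
    schedule = []
    for modality in input_modalities:
        spec = WALK_INPUTS.get(modality)
        if spec is None:
            continue  # assumption: unknown modalities are ignored
        walk_name, names = spec
        if not all(name in input_signals for name in names):
            continue  # listed modality without matching tensors: skipped silently
        schedule.append((walk_name, {name: input_signals[name] for name in names}))
    return schedule
```

The steps keep the order of `input_modalities`, so a request listing text before audio prefills the text span first.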
get_initial_forward_pass_args builds the schedule, picks the first
walk, and stashes the schedule + step counter on the metadata.
get_partition_forward_pass_args is the Thinker state machine: advance
schedule → transition to thinker_decode → return request_done=True
after the decode loop unwinds. Mirrors qwen3_omni_model.py:765+ minus
the Talker / Code2Wav partitions.
Empty-schedule edge case (no usable modalities) short-circuits to
request_done=True so the conductor doesn't hang.
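The state machine above (advance schedule, switch to thinker_decode, finish) can be sketched in a few lines. The metadata layout, the function names, and the `decode_finished` flag are hypothetical; the real get_partition_forward_pass_args operates on framework metadata and graph edges rather than plain dicts.

```python
def initial_step(schedule):
    """Pick the first walk and stash the schedule plus a step counter."""
    metadata = {"schedule": schedule, "step": 0, "phase": "prefill"}
    if not schedule:
        return metadata, None, True  # empty schedule: request is done at once
    return metadata, schedule[0][0], False


def next_step(metadata, decode_finished=False):
    """Return (walk_name, request_done) for the next forward pass."""
    schedule = metadata["schedule"]
    if not schedule:
        return None, True
    if metadata["phase"] == "prefill":
        metadata["step"] += 1
        if metadata["step"] < len(schedule):
            return schedule[metadata["step"]][0], False
        metadata["phase"] = "decode"  # all prefill steps consumed
        return "thinker_decode", False
    if decode_finished:
        return None, True
    return "thinker_decode", False
```

For a text+audio request this walks through prefill_text, prefill_audio, thinker_decode, and finally reports the request as done once the decode loop signals completion.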
21 tests in test/modular/test_ming_flash_omni_graph.py covering walk
structure, partition listing, schedule construction for all modality
mixes (incl. unknown-modality / no-inputs), per-walk edge routing,
and full state-machine drive across a text+audio request (init →
audio prefill → decode → done).
The submodule's backward-compat aliases for "prefill"/"decode" stay
in place so external callers that still emit the step 3f walk names
keep working.
… video (step 7)
MingFlashOmniModel.process_prompt now produces the full
NameToTensorList consumed by step 5c's prefill scheduler.
Strategy mirrors qwen3_omni's process_prompt: apply the chat template
to TEXT-ONLY messages (so the tokenizer doesn't insert placeholder
tokens we'd later have to strip), then run image / video / audio
sub-processors separately for each modality. Uses
tokenizer.apply_chat_template (jinja, accepts OpenAI
user/assistant/system roles) rather than the stricter
processor.apply_chat_template (asserts on uppercase HUMAN/ASSISTANT
only) — keeps the API surface OpenAI-compatible.
Inputs (tensors: NameToTensorList):
* image_inputs — list of CHW float [0,1] tensors per image. The
internal _image_to_processor_input converts to HWC uint8 to avoid the
upstream's double-rescale-to-zero bug. Single-channel inputs
auto-broadcast to 3 channels.
* audio_inputs — raw 1-D float tensors OR (waveform, sample_rate)
tuples (sample rate inferred from processor default 16 kHz when raw
waveform is passed).
* video_inputs — list of (T, C, H, W) float tensors. Per-frame
second_per_grid defaults to 1.0; override via
kwargs["input_metadata"]["video"][i]["second_per_grid"].
Outputs (keys consumed by _build_thinker_prefill_schedule):
* text_inputs — list of 1-D long tensors per text turn.
* pixel_values, image_grid_thw — one entry per image.
* pixel_values_videos, video_grid_thw, video_second_per_grid — per
video clip.
* audio_features (n_mels, T), audio_seqlens (length-1 long) — per
audio clip. Upstream returns (B, T, n_mels); we transpose to
(n_mels, T) per clip so AudioEncoderSubmodule.prepare_inputs can
splice without a reshape.
17 tests in test/modular/test_ming_flash_omni_process_prompt.py
covering text-only / no-prompt / image / audio / video / mixed paths,
per-modality dispatch, missing-processor error paths,
CHW-float→HWC-uint8 conversion correctness (including grayscale +
uint8 pass-through), multi-image, video metadata override, plus a
snapshot-gated text+image end-to-end against the real
BailingMM2Processor. 16 green + 1 env-skip on this box.
Image-gen <image><imagePatch>*256</image> query-token block deferred
to step 9 (ImageGen partition; text-out generation works without it).
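The CHW-float to HWC-uint8 conversion mentioned for image_inputs can be sketched with nested lists. This is an illustration of the idea only: the real _image_to_processor_input works on tensors, and the function name, the rounding choice, and the clamping are assumptions. The point is that a processor which rescales by 1/255 expects 0..255 inputs, so handing it floats already in [0, 1] would shrink them toward zero.

```python
def chw_float_to_hwc_uint8(image):
    """Convert a CHW image with floats in [0, 1] to HWC integers in 0..255.

    image: nested list indexed as image[channel][row][col].
    Single-channel inputs are broadcast to three channels.
    """
    channels = len(image)
    if channels == 1:
        image = [image[0]] * 3
    elif channels != 3:
        raise ValueError(f"expected 1 or 3 channels, got {channels}")
    height, width = len(image[0]), len(image[0][0])

    def to_byte(value):
        # Scale to 0..255, round, and clamp out-of-range values.
        return min(255, max(0, int(round(value * 255.0))))

    return [
        [[to_byte(image[c][y][x]) for c in range(3)] for x in range(width)]
        for y in range(height)
    ]
```

A grayscale 1x2 image `[[[0.0, 1.0]]]` becomes one row of two RGB pixels, black and white.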
…groups
Found during the first live mminf-serve bring-up of Ming-flash-omni
(thinker-only config, TP=4 on GPUs 4-7).
get_worker_graphs iterated EVERY graph walk, including prefill_audio /
prefill_vision / prefill_video / talker, which reference encoder /
talker nodes (audio_encoder, vision_encoder, Talker). The thinker-only
deploy only declares `Thinker` in node_groups, so
_divide_into_worker_graphs hit KeyError: 'audio_encoder' while dividing
the prefill_audio walk and crashed conductor startup.
Fix: in get_worker_graphs, collect the node names a walk references via
graph.get_nodes() and skip any walk whose required nodes aren't all
present in the config's node_groups. A partial deploy (thinker-only,
talker-only, etc.) simply can't serve the walks for nodes it doesn't
host — that's correct behaviour, not an error. This is generic
framework behaviour (any model with optional partitions benefits), not
Ming-specific.
Verified: thinker-only conductor startup now proceeds past worker-graph
division to weight loading (then OOMs at the documented TP=4
~78.58/80 GB-per-rank wall, which is a hardware limit needing TP=8, not
a code issue). test_ming_flash_omni_graph + talker_graph + test_graph
all green; pre-existing test_worker_graphs_manager failures are
unrelated (fail with this change stashed too).
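The filtering rule in this fix amounts to a subset check per walk. The sketch below shows that rule in isolation; it is not the get_worker_graphs implementation. The shape of `node_groups` (group name mapped to a list of node names), the `walk_nodes` dict standing in for graph.get_nodes(), and the function name are all assumptions.

```python
def servable_walks(walk_nodes, node_groups):
    """Keep only walks whose required nodes are all hosted by this deploy.

    walk_nodes:  {walk_name: iterable of node names the walk references}
    node_groups: {group_name: list of node names hosted in that group}
    """
    hosted = {name for group in node_groups.values() for name in group}
    return {
        walk: list(nodes)
        for walk, nodes in walk_nodes.items()
        if set(nodes) <= hosted  # skip walks that need a node we do not host
    }
```

On a thinker-only deploy this keeps prefill_text and thinker_decode and drops the walks that reference audio_encoder or vision_encoder, rather than raising a KeyError on the first missing node.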
…whitespace, E501)
Closing for rebase in #115.
PR 1 — Ming-flash-omni: understanding path (thinker + vision/audio + multimodal prefill)
Base: noah_model_support · Head: noah_ming_understanding · 8 commits
Compare: noah_model_support...noah_ming_understanding