Noah ming tp4 oom fix #119
Open
zhudianGG wants to merge 6 commits into
origin/main renamed the package mminf -> mstar and dropped the `new_tokens`
parameter from Model.get_partition_forward_pass_args. This ports the complete
Ming-flash-omni-2.0 native implementation (the full noah_ming stacked PR series
— understanding + talker + config-tail/image-gen, steps 1-10) onto the current
main:
* moved mminf/model/ming_omni_flash/ -> mstar/model/ming_omni_flash/ and
rewrote all mminf -> mstar references (package + 23 modular tests + 2
configs + benchmark class).
* re-applied the two framework edits onto main's versions: the registry
entry (MODEL_REGISTRY + HF_MODELS) and the get_worker_graphs fix that
skips graph walks whose nodes aren't in the deploy's node_groups.
* dropped the `new_tokens` arg from our get_partition_forward_pass_args
override to match the new base signature, and from the test call sites.
* fixed two stale tests that predated the 9b ImageGen wiring (ImageGen is
now a registered submodule + partition, so the "unknown" assertions use a
genuinely unknown name).
356 CPU/unit tests pass under the mstar tree; ruff clean. Remaining failures
are snapshot/real-checkpoint-gated (tokenizer load, real-weight forward), not
port regressions.
Found during live TP=8 bring-up of the mstar port: the KV-cache engine now calls submodule.prepare_inputs with a seen_token_mask kwarg (sampler token mask) in addition to pos_info. Our Thinker submodule's prepare_inputs had an explicit pos_info param but no **kwargs, so every /generate crashed with "unexpected keyword argument 'seen_token_mask'". Added **kwargs to BailingMoeV2ThinkerSubmodule.prepare_inputs (mirrors the peer models, e.g. qwen3_omni, which all absorb engine extras this way). The other four prepare_inputs overrides already had **kwargs. Regression test asserts prepare_inputs accepts seen_token_mask + arbitrary future kwargs without error.
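The failure mode and the fix can be reproduced in isolation. The sketch below uses hypothetical stand-in classes (`StrictSubmodule`, `TolerantSubmodule`) and a hypothetical helper `accepts_extra_kwargs`; none of them are mstar code, and the real `prepare_inputs` signature has more to it than shown here.

```python
import inspect


class StrictSubmodule:
    """Hypothetical stand-in for the pre-fix override: explicit pos_info, no **kwargs."""

    def prepare_inputs(self, batch, pos_info=None):
        return {"batch": batch, "pos_info": pos_info}


class TolerantSubmodule:
    """Hypothetical stand-in for the post-fix override."""

    def prepare_inputs(self, batch, pos_info=None, **kwargs):
        # Extra engine kwargs (e.g. seen_token_mask) are accepted and ignored.
        return {"batch": batch, "pos_info": pos_info}


def accepts_extra_kwargs(fn):
    """True if fn declares a **kwargs catch-all parameter."""
    params = inspect.signature(fn).parameters.values()
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params)


def call_with_engine_extras(submodule):
    """Call prepare_inputs the way the text describes the engine calling it."""
    return submodule.prepare_inputs([1, 2, 3], pos_info=0, seen_token_mask=None)
```

Calling `call_with_engine_extras(StrictSubmodule())` raises `TypeError` for the unexpected `seen_token_mask` keyword, while the tolerant variant returns normally. A signature check like `accepts_extra_kwargs` is one way a regression test could cover every override at once.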
The flashinfer sampler FFI on this box (0.6.2 cubin path) takes scalar int seed/offset; mstar's Sampler builds them as [batch] tensors (the >=0.6.6 API), crashing every /generate with "Mismatched type on argument #7 ... Expected int but got ffi.Tensor". Added _flashinfer_rng_scalars() to collapse the tensors to Python ints at both sampling call sites (top_p and top_k_top_p).
Correctness: greedy decode (temperature=0) ignores RNG, so this is exact for the greedy path; for stochastic sampling it only affects cross-row RNG independence (reproducibility), not the distribution.
Validated live: native mstar Ming-flash-omni server (TP=8, real 195GB ckpt) now generates correct text end-to-end ("What is the capital of France?" -> "The capital of France is Paris.").
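As an illustration of what collapsing a per-row seed/offset tensor to a scalar can look like, here is a minimal sketch. The function name `rng_scalars_sketch` is hypothetical, and taking the first row's value is an assumption; the commit does not say how `_flashinfer_rng_scalars()` picks the scalar.

```python
import torch


def rng_scalars_sketch(seed, offset):
    """Collapse seed/offset to plain Python ints (hypothetical helper).

    Accepts either ints or tensors. For tensors, the first element is used
    for the whole batch; that choice is an assumption of this sketch.
    """

    def to_int(value):
        if isinstance(value, torch.Tensor):
            return int(value.reshape(-1)[0].item())
        return int(value)

    return to_int(seed), to_int(offset)
```

With one scalar seed shared by the batch, rows no longer receive independent per-row seeds, which is the cross-row independence trade-off the commit message describes.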
…ode_loop)
The Thinker submodule's check_stop returned {"decode_loop"} on EOS, but the
graph declares the decode Loop as "thinker_decode_loop". On the EOS step the
worker's dynamic-loop registry looked up the wrong name and crashed the rank
with KeyError(NodeAndGraphWalk(node='decode_loop', graph_walk='thinker_decode')),
wedging the whole TP group (no further generation, hung streams).
Found during live TP=8 serving: short generations that didn't hit EOS within
the smoke window worked, but any request reaching EOS killed a worker.
Regression test pins check_stop's returned name to the actual graph Loop name
so they can't drift again.
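A regression test of this kind might look like the sketch below. `GRAPH_LOOP_NAMES` and this simplified `check_stop` are hypothetical stand-ins; the real graph declaration and submodule API in mstar are not shown in this PR text.

```python
# Hypothetical, simplified view of the graph: walk name -> declared Loop names.
GRAPH_LOOP_NAMES = {"thinker_decode": {"thinker_decode_loop"}}


def check_stop(token_id, eos_token_id):
    """Return the Loop names to stop on this step (empty set means continue)."""
    if token_id == eos_token_id:
        return {"thinker_decode_loop"}
    return set()


def test_check_stop_names_exist_in_graph():
    stopped = check_stop(token_id=2, eos_token_id=2)
    assert stopped, "EOS must stop at least one loop"
    # Every returned name must be a Loop the graph walk actually declares.
    assert stopped <= GRAPH_LOOP_NAMES["thinker_decode"]


def test_check_stop_continues_before_eos():
    assert check_stop(token_id=5, eos_token_id=2) == set()
```

Asserting a subset relation against the declared names, rather than comparing to a string literal, means the test fails if either side is renamed on its own.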
Pinned to ranks 0-3 for launch with CUDA_VISIBLE_DEVICES=4,5,6,7. TP=4 is dimensionally valid (32 heads/4, 4 KV/4, 256 experts/4 all divide), unlike TP=6 which the model's power-of-2 dims reject outright. Re-verified on the mstar tree (2026-06-12): TP=4 OOMs during weight load at 78.58/80 GB per rank even with PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True (worker rank failed: "CUDA out of memory. Tried to allocate 1024 MiB ... Process has 78.58 GiB in use"). Matches the prior finding. TP=8 remains the only layout that fits the 195 GB model.
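The divisibility claim for TP=4 versus TP=6 is simple arithmetic and can be checked directly. The helper name `tp_remainders` is hypothetical; the dimensions are the ones quoted above. The check is deliberately limited to the two layouts being compared and does not model how other TP sizes are handled.

```python
# Dimensions quoted in the commit message.
DIMS = {"attention_heads": 32, "kv_heads": 4, "experts": 256}


def tp_remainders(tp, dims=DIMS):
    """Remainder of each dimension when split across tp ranks (0 means it divides)."""
    return {name: size % tp for name, size in dims.items()}


def divides_evenly(tp, dims=DIMS):
    return all(rem == 0 for rem in tp_remainders(tp, dims).values())


for tp in (4, 6):
    print(tp, divides_evenly(tp), tp_remainders(tp))
```

For TP=4 every remainder is 0; for TP=6 the remainders are 2, 4 and 4, so none of the three dimensions splits evenly.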
MingFlashOmniModel.get_submodule built the LingMoeModel on the meta device, then called `model.to_empty(device)` followed by `model.to(bfloat16)`. The meta constructor's `torch.empty(...)` calls default to float32, so `to_empty` allocated every parameter in fp32 (2x the bf16 footprint) and only then cast down. That fp32 allocation peak hit ~78.5/80 GB per rank and OOM'd at TP=4 (traceback at to_empty -> empty_like). TP=8 only "fit" because the halved per-rank shard kept the fp32 peak just under 80 GB.
Cast the meta model to bf16 BEFORE to_empty. Casting a meta module is metadata-only (no allocation), so to_empty then allocates directly in bf16.
Verified on 4xH100: all 4 TP ranks load + capture CUDA graphs at ~57-62 GB/rank (rank 0 higher due to colocated encoders/embeddings), well under the 80 GB limit. This is purely a load-time fix; steady-state sharding was always correct (which is why vllm-omni/sglang-omni serve the same thinker at TP=4).
Also corrects the now-stale "TP=4 OOMs" warnings in the three ming configs, which had misattributed the cause to loader streaming overhead.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
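The ordering fix can be illustrated on a toy module. This is a minimal sketch, assuming a small `nn.Linear` in place of LingMoeModel and CPU in place of the GPU; `param_bytes` is a hypothetical helper, not part of mstar.

```python
import torch
from torch import nn


def param_bytes(module):
    """Bytes the module's parameters would occupy at their current dtype."""
    return sum(p.numel() * p.element_size() for p in module.parameters())


# Built on the meta device: shapes and dtypes only, no storage. Default dtype is fp32.
model = nn.Linear(1024, 1024, device="meta")
fp32_bytes = param_bytes(model)

# Cast while still on meta: only metadata changes, nothing is allocated.
model = model.to(torch.bfloat16)
bf16_bytes = param_bytes(model)

# Materialize: storage is now allocated directly in bf16 (contents uninitialized).
model = model.to_empty(device="cpu")

print(fp32_bytes, bf16_bytes, model.weight.dtype, model.weight.device)
```

Reversing the last two steps would materialize the fp32-sized storage first and only then cast, which is the transient peak described above. After `to_empty` the weights hold arbitrary values until a loader fills them.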
Summary
After the TP=4 OOM fix (#), the native M* Ming-flash-omni-2.0 thinker-only server loads weights and captures CUDA graphs on all 4 ranks, but then hangs forever before binding the HTTP port. The workers finish setup and enter their main loop, but the conductor never observes their SETUP_DONE messages, so the API server's readiness wait never returns and uvicorn never starts.
Observed on 4×H100 (configs/ming_flash_omni_thinker_only_tp4.yaml, CUDA_VISIBLE_DEVICES=4,5,6,7), 2026-06-12. Process sat idle for 56+ minutes; port 8000 never listened.
Impact
Blocks all native (--inference-system ours) Ming serving and the T2T benchmark sweep. The model is otherwise fully loaded and GPU-resident (~57–62 GB/rank), so this is purely a startup-handshake failure, not a model/compute problem.
Evidence
Log (/dev/shm/ming_native_tp4.log, frozen at last line):
[conductor] mstar.conductor.conductor: Conductor waiting for 4 worker(s) to finish setup
... (per-rank) loader: Loaded 507 unique target params into LingMoeModel(num_hidden_layers=32) ... (rank 0/4 .. 3/4)
... (per-rank) cuda_graph_runner: CudaGraphRunner[Thinker]: warmup_and_capture done.
... (per-rank) worker: Worker worker_N: engine runs on dedicated GPU thread
... (per-rank) worker: Worker worker_N: plan_executor enabled — speculative plan() pre-run
Notably absent: conductor's "Conductor: all 4 worker(s) ready" (conductor.py:1012) and the API server's "All workers ready" (entrypoint.py:210).
py-spy stacks (all 6 processes, while hung):
API server:
finalize_setup (mstar/api_server/entrypoint.py:216) # while True: sleep(0.01) until setup_done
main (mstar/api_server/entrypoint.py:703)
Conductor:
_wait_for_workers_ready (mstar/conductor/conductor.py:1011) # while pending: sleep(0.01)
run (mstar/conductor/conductor.py:1017)
Worker:
poll (zmq/sugar/poll.py:106)
wait_for_work (mstar/worker/communicator.py:83)
run (mstar/worker/worker.py:2324)
Why this is contradictory
In worker.py:run(), the worker sends SETUP_DONE to the conductor at line 1936, then logs "plan_executor enabled" at line ~1974. All 4 workers logged the plan_executor line, so all 4 executed the SETUP_DONE send. Yet the conductor's pending set (conductor.py:1004) is still non-empty — it discards a worker only on receiving ConductorMessageType.SETUP_DONE (conductor.py:1006-1007). So the messages were sent but not received: a ZMQ delivery/routing gap on the worker→conductor channel during startup.
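The wait loop described above can be modeled in a few lines to show why a lost message stalls startup indefinitely. This is a simplified sketch, not the code in conductor.py: `wait_for_setup_done` and its `poll` callback are hypothetical, and the deadline exists only so the sketch terminates and can report which workers are still pending.

```python
import time


def wait_for_setup_done(expected, poll, timeout_s, interval_s=0.01):
    """Model of a pending-set wait loop.

    expected: iterable of worker ids that must report.
    poll: callable returning the worker ids whose SETUP_DONE arrived since the last call.
    Returns the set of workers still pending when the loop exits (empty on success).
    """
    pending = set(expected)
    deadline = time.monotonic() + timeout_s
    while pending and time.monotonic() < deadline:
        for worker_id in poll():
            pending.discard(worker_id)
        if pending:
            time.sleep(interval_s)
    return pending


WORKERS = ["worker_0", "worker_1", "worker_2", "worker_3"]

# All four messages delivered: the loop exits with nothing pending.
delivered = wait_for_setup_done(WORKERS, lambda: WORKERS, timeout_s=0.05)

# One message never arrives: without a deadline this loop would spin forever.
lost = wait_for_setup_done(WORKERS, lambda: WORKERS[:3], timeout_s=0.05)

print(delivered, lost)
```

In this model a worker that has sent its message and moved on looks identical, from the waiting side, to one that never sent it, which matches the observed state: workers idle in their main loop while the waiter keeps polling.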
Suspected cause (unverified)
The worker→conductor SETUP_DONE (mstar/worker/communicator.py send path) appears to be lost between the worker's send (worker.py:1936) and the conductor's get_all_new_messages() poll (conductor.py:1005). Candidates to check:
Not yet root-caused — flagging the handshake as the locus, not the fix.
May be nondeterministic
PORTING_NOTES records a prior TP=8 mstar-serve smoke that booted and answered /generate, so this handshake may be a startup race that doesn't always trigger (or is TP=4-specific). A first reproduction step is simply to relaunch a few times and see if it's intermittent.
Reproduction
cd
CONFIG=configs/ming_flash_omni_thinker_only_tp4.yaml GPUS=4,5,6,7
./scripts/ming_native/serve_ming_tp4.sh &
Wait; observe port 8000 never binds:
until curl -sf http://0.0.0.0:8000/health; do sleep 5; done # hangs
Confirm with py-spy:
py-spy dump --pid <conductor_pid> # stuck at conductor.py:1011
py-spy dump --pid <worker_pid> # idle at worker.py:2324 (past setup)
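Because the `until curl` loop above never exits when the server hangs, a bounded wait is handy when relaunching several times to test for intermittency. The sketch below is one way to do that with a plain TCP connect; `wait_for_port` is a hypothetical helper and it checks only that the port accepts connections, not that /health returns success.

```python
import socket
import time


def wait_for_port(host, port, timeout_s, interval_s=0.5):
    """Return True once host:port accepts a TCP connection, False after timeout_s."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Refused or timed out: retry until the deadline passes.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)
```

For example, `wait_for_port("127.0.0.1", 8000, timeout_s=1800)` after each launch gives a yes/no result per attempt instead of a shell loop that has to be interrupted by hand.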
Environment