Noah ming tp4 oom fix by zhudianGG · Pull Request #119 · mstar-project/mstar

zhudianGG · 2026-06-12T20:30:35Z

Summary

After the TP=4 OOM fix (#), the native M* Ming-flash-omni-2.0 thinker-only server loads weights and captures CUDA graphs on all 4 ranks, but then hangs forever before binding the HTTP port. The workers finish setup and enter their main loop, but the conductor never observes their SETUP_DONE messages, so the API server's readiness wait never returns and uvicorn never starts.

Observed on 4×H100 (configs/ming_flash_omni_thinker_only_tp4.yaml, CUDA_VISIBLE_DEVICES=4,5,6,7), 2026-06-12. Process sat idle for 56+ minutes; port 8000 never listened.

Impact

Blocks all native (--inference-system ours) Ming serving and the T2T benchmark sweep. The model is otherwise fully loaded and GPU-resident (~57–62 GB/rank), so this is purely a startup-handshake failure, not a model/compute problem.

Evidence

Log (/dev/shm/ming_native_tp4.log, frozen at last line):
[conductor] mstar.conductor.conductor: Conductor waiting for 4 worker(s) to finish setup
... (per-rank) loader: Loaded 507 unique target params into LingMoeModel(num_hidden_layers=32) ... (rank 0/4 .. 3/4)
... (per-rank) cuda_graph_runner: CudaGraphRunner[Thinker]: warmup_and_capture done.
... (per-rank) worker: Worker worker_N: engine runs on dedicated GPU thread
... (per-rank) worker: Worker worker_N: plan_executor enabled — speculative plan() pre-run

Notably absent: conductor's "Conductor: all 4 worker(s) ready" (conductor.py:1012) and the API server's "All workers ready" (entrypoint.py:210).

py-spy stacks (all 6 processes, while hung):

API server main (entrypoint.py parent), waiting on conductor:
    finalize_setup (mstar/api_server/entrypoint.py:216) # while True: sleep(0.01) until setup_done
    main (mstar/api_server/entrypoint.py:703)

Conductor, waiting on workers:
    _wait_for_workers_ready (mstar/conductor/conductor.py:1011) # while pending: sleep(0.01)
    run (mstar/conductor/conductor.py:1017)

All 4 workers, already past setup, idle in main loop:
    poll (zmq/sugar/poll.py:106)
    wait_for_work (mstar/worker/communicator.py:83)
    run (mstar/worker/worker.py:2324)

Why this is contradictory

In worker.py:run(), the worker sends SETUP_DONE to the conductor at line 1936, then logs "plan_executor enabled" at line ~1974. All 4 workers logged the plan_executor line, so all 4 executed the SETUP_DONE send. Yet the conductor's pending set (conductor.py:1004) is still non-empty — it discards a worker only on receiving ConductorMessageType.SETUP_DONE (conductor.py:1006-1007). So the messages were sent but not received: a ZMQ delivery/routing gap on the worker→conductor channel during startup.
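
The wait described above can be modeled in a few lines. This is a sketch under stated assumptions, not the code in conductor.py: `wait_for_workers`, `poll_messages`, and the timeout values are hypothetical names introduced for illustration. It keeps the same pending-set logic but adds a deadline, so a lost SETUP_DONE would surface as an error naming the missing workers rather than an indefinite spin.

```python
import time
from typing import Callable, Iterable, Set, Tuple

SETUP_DONE = "SETUP_DONE"  # stand-in for ConductorMessageType.SETUP_DONE


def wait_for_workers(
    worker_ids: Iterable[str],
    poll_messages: Callable[[], Iterable[Tuple[str, str]]],
    timeout_s: float = 600.0,
    poll_interval_s: float = 0.01,
) -> None:
    """Block until every worker has reported SETUP_DONE, or raise on timeout.

    poll_messages is a hypothetical callable returning the
    (worker_id, message_type) pairs received since the previous call.
    """
    pending: Set[str] = set(worker_ids)
    deadline = time.monotonic() + timeout_s
    while pending:
        for worker_id, message_type in poll_messages():
            if message_type == SETUP_DONE:
                pending.discard(worker_id)
        if not pending:
            break
        if time.monotonic() >= deadline:
            # Name the silent workers so the hang is diagnosable from the log.
            raise TimeoutError(f"no SETUP_DONE from: {sorted(pending)}")
        time.sleep(poll_interval_s)
```

A deadline does not fix the lost message, but it would turn a silent 56-minute hang into a fast, attributable failure.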

Suspected cause (unverified)

The worker→conductor SETUP_DONE (mstar/worker/communicator.py send path) appears to be lost between the worker's send (worker.py:1936) and the conductor's get_all_new_messages() poll (conductor.py:1005). Candidates to check:

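* A ZMQ socket connect/bind race: if the conductor's receive socket isn't fully connected when the worker sends, the message can be dropped. Whether it is dropped depends on the socket type: PUB silently discards messages sent before a subscriber has joined (the late-joiner problem), and ROUTER discards messages addressed to an unknown peer unless ZMQ_ROUTER_MANDATORY is set, whereas PUSH and DEALER queue or block up to SNDHWM rather than dropping. The socket types used on this channel need to be confirmed first.
* Send ordering relative to the two barrier_all() fences at worker.py:1916 / 1930 (the message goes out right after the second barrier).
* Whether SETUP_DONE shares a socket with main-loop traffic and is being consumed/misrouted elsewhere.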

Not yet root-caused — flagging the handshake as the locus, not the fix.

May be nondeterministic

PORTING_NOTES records a prior TP=8 mstar-serve smoke that booted and answered /generate, so this handshake may be a startup race that doesn't always trigger (or is TP=4-specific). A first reproduction step is simply to relaunch a few times and see if it's intermittent.

Reproduction

cd
CONFIG=configs/ming_flash_omni_thinker_only_tp4.yaml GPUS=4,5,6,7
./scripts/ming_native/serve_ming_tp4.sh &

Wait; observe port 8000 never binds:

until curl -sf http://0.0.0.0:8000/health; do sleep 5; done # never exits: the port never binds
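
Because the loop above never exits when the hang occurs, a bounded probe is more convenient for checking whether the failure is intermittent across relaunches. The following is a minimal sketch, assuming only that the server listens on TCP port 8000 as described; `wait_for_port` is a hypothetical helper, and it checks that the port accepts connections rather than calling /health.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float, interval_s: float = 1.0) -> bool:
    """Return True once a TCP connection to (host, port) succeeds.

    Returns False if no connection succeeds before timeout_s elapses.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect means some process has bound and is listening.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            pass  # refused or timed out: nothing is listening yet
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

For example, calling `wait_for_port("127.0.0.1", 8000, timeout_s=1800.0, interval_s=5.0)` after each launch returns False when the hang reproduces; the 1800-second budget is an arbitrary choice, not a measured startup time.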

Confirm with py-spy:

py-spy dump --pid <conductor_pid> # stuck at conductor.py:1011
py-spy dump --pid <worker_pid> # idle at worker.py:2324 (past setup)

Environment

4×H100 80GB, ming_flash_omni_thinker_only_tp4.yaml (TP=4, ranks→GPUs 4–7)
Branch: noah_ming_tp4_oom_fix (on top of noah_ming_mstar_port)
Requires the TP=4 OOM fix (#) to reach this point; before it, startup died earlier at to_empty.

origin/main renamed the package mminf -> mstar and dropped the `new_tokens` parameter from Model.get_partition_forward_pass_args. This ports the complete Ming-flash-omni-2.0 native implementation (the full noah_ming stacked PR series: understanding + talker + config-tail/image-gen, steps 1-10) onto the current main:

* moved mminf/model/ming_omni_flash/ -> mstar/model/ming_omni_flash/ and rewrote all mminf -> mstar references (package + 23 modular tests + 2 configs + benchmark class).
* re-applied the two framework edits onto main's versions: the registry entry (MODEL_REGISTRY + HF_MODELS) and the get_worker_graphs fix that skips graph walks whose nodes aren't in the deploy's node_groups.
* dropped the `new_tokens` arg from our get_partition_forward_pass_args override to match the new base signature, and from the test call sites.
* fixed two stale tests that predated the 9b ImageGen wiring (ImageGen is now a registered submodule + partition, so the "unknown" assertions use a genuinely unknown name).

356 CPU/unit tests pass under the mstar tree; ruff clean. Remaining failures are snapshot/real-checkpoint-gated (tokenizer load, real-weight forward), not port regressions.

Found during live TP=8 bring-up of the mstar port: the KV-cache engine now calls submodule.prepare_inputs with a seen_token_mask kwarg (sampler token mask) in addition to pos_info. Our Thinker submodule's prepare_inputs had an explicit pos_info param but no **kwargs, so every /generate crashed with "unexpected keyword argument 'seen_token_mask'". Added **kwargs to BailingMoeV2ThinkerSubmodule.prepare_inputs (mirrors the peer models, e.g. qwen3_omni, which all absorb engine extras this way). The other four prepare_inputs overrides already had **kwargs. Regression test asserts prepare_inputs accepts seen_token_mask + arbitrary future kwargs without error.
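
The failure mode and the fix can be illustrated with two toy classes. This is a sketch, not the mstar code: `StrictSubmodule`, `TolerantSubmodule`, `engine_call`, and the argument shapes are hypothetical; only the `pos_info` and `seen_token_mask` keyword names come from the description above.

```python
from typing import Any, Dict


class StrictSubmodule:
    """Hypothetical stand-in for the pre-fix signature: no **kwargs."""

    def prepare_inputs(self, inputs: Dict[str, Any], pos_info: Any) -> Dict[str, Any]:
        return {"inputs": inputs, "pos_info": pos_info}


class TolerantSubmodule:
    """Hypothetical stand-in for the post-fix signature: extras are absorbed."""

    def prepare_inputs(
        self, inputs: Dict[str, Any], pos_info: Any, **kwargs: Any
    ) -> Dict[str, Any]:
        # Engine-supplied extras such as seen_token_mask land in kwargs
        # and are simply ignored by this submodule.
        return {"inputs": inputs, "pos_info": pos_info}


def engine_call(submodule: Any) -> Dict[str, Any]:
    """Mimic an engine that passes seen_token_mask alongside pos_info."""
    return submodule.prepare_inputs(
        {"tokens": [1, 2, 3]}, pos_info=0, seen_token_mask=[True, False, True]
    )
```

Calling `engine_call(StrictSubmodule())` raises TypeError for the unexpected keyword, while `engine_call(TolerantSubmodule())` succeeds, which is the property the regression test pins.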

The flashinfer sampler FFI on this box (0.6.2 cubin path) takes scalar int seed/offset; mstar's Sampler builds them as [batch] tensors (the >=0.6.6 API), crashing every /generate with "Mismatched type on argument #7 ... Expected int but got ffi.Tensor". Added _flashinfer_rng_scalars() to collapse the tensors to Python ints at both sampling call sites (top_p and top_k_top_p). Correctness: greedy decode (temperature=0) ignores RNG, so this is exact for the greedy path; for stochastic sampling it only affects cross-row RNG independence (reproducibility), not the distribution. Validated live: native mstar Ming-flash-omni server (TP=8, real 195GB ckpt) now generates correct text end-to-end ("What is the capital of France?" -> "The capital of France is Paris.").

…ode_loop) The Thinker submodule's check_stop returned {"decode_loop"} on EOS, but the graph declares the decode Loop as "thinker_decode_loop". On the EOS step the worker's dynamic-loop registry looked up the wrong name and crashed the rank with KeyError(NodeAndGraphWalk(node='decode_loop', graph_walk='thinker_decode')), wedging the whole TP group (no further generation, hung streams). Found during live TP=8 serving: short generations that didn't hit EOS within the smoke window worked, but any request reaching EOS killed a worker. Regression test pins check_stop's returned name to the actual graph Loop name so they can't drift again.
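
One way to keep the two names from drifting is to derive both from a single constant. The sketch below is illustrative only: `THINKER_DECODE_LOOP`, `LOOP_REGISTRY`, `lookup_loops`, and this `NodeAndGraphWalk` tuple are hypothetical stand-ins modeled on the error message above, not the worker's actual registry.

```python
from typing import Dict, List, NamedTuple, Set


class NodeAndGraphWalk(NamedTuple):
    """Toy registry key mirroring the fields shown in the KeyError."""

    node: str
    graph_walk: str


# Single source of truth for the Loop name (hypothetical constant).
THINKER_DECODE_LOOP = "thinker_decode_loop"

# Hypothetical registry of dynamic loops declared by the graph.
LOOP_REGISTRY: Dict[NodeAndGraphWalk, str] = {
    NodeAndGraphWalk(THINKER_DECODE_LOOP, "thinker_decode"): "decode loop state",
}


def check_stop(hit_eos: bool) -> Set[str]:
    """Return the names of loops to stop; empty while generation continues."""
    return {THINKER_DECODE_LOOP} if hit_eos else set()


def lookup_loops(names: Set[str], graph_walk: str) -> List[str]:
    """Resolve loop names against the registry; raises KeyError on a mismatch."""
    return [LOOP_REGISTRY[NodeAndGraphWalk(name, graph_walk)] for name in names]
```

With this layout, `lookup_loops(check_stop(True), "thinker_decode")` resolves, whereas the stale name `{"decode_loop"}` raises the same kind of KeyError that crashed the rank.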

Pinned to ranks 0-3 for launch with CUDA_VISIBLE_DEVICES=4,5,6,7. TP=4 is dimensionally valid (32 heads/4, 4 KV/4, 256 experts/4 all divide), unlike TP=6 which the model's power-of-2 dims reject outright. Re-verified on the mstar tree (2026-06-12): TP=4 OOMs during weight load at 78.58/80 GB per rank even with PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True (worker rank failed: "CUDA out of memory. Tried to allocate 1024 MiB ... Process has 78.58 GiB in use"). Matches the prior finding. TP=8 remains the only layout that fits the 195 GB model.
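
The dimensional check mentioned above is simple divisibility arithmetic. The sketch below uses the head, KV-head, and expert counts quoted in this message; `tp_is_valid` is a hypothetical helper, and the KV-head replication rule for TP degrees larger than the KV-head count is an assumption based on common tensor-parallel practice, not something confirmed for mstar.

```python
def tp_is_valid(
    tp: int, num_heads: int = 32, num_kv_heads: int = 4, num_experts: int = 256
) -> bool:
    """Return True if heads, KV heads, and experts can be laid out across tp ranks."""
    heads_ok = num_heads % tp == 0
    # Assumption: KV heads are either split evenly across ranks, or replicated
    # when tp is a multiple of num_kv_heads (common practice, unverified here).
    kv_ok = num_kv_heads % tp == 0 or tp % num_kv_heads == 0
    experts_ok = num_experts % tp == 0
    return heads_ok and kv_ok and experts_ok
```

Under these assumptions `tp_is_valid(4)` and `tp_is_valid(8)` hold, while `tp_is_valid(6)` fails because 32 heads do not divide by 6. Passing this check says nothing about whether the shards fit in memory.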

MingFlashOmniModel.get_submodule built the LingMoeModel on the meta device, then called `model.to_empty(device)` followed by `model.to(bfloat16)`. The meta constructor's `torch.empty(...)` calls default to float32, so `to_empty` allocated every parameter in fp32 (2x the bf16 footprint) and only then cast down. That fp32 allocation peak hit ~78.5/80 GB per rank and OOM'd at TP=4 (traceback at to_empty -> empty_like). TP=8 only "fit" because the halved per-rank shard kept the fp32 peak just under 80 GB.

Cast the meta model to bf16 BEFORE to_empty. Casting a meta module is metadata-only (no allocation), so to_empty then allocates directly in bf16.

Verified on 4xH100: all 4 TP ranks load + capture CUDA graphs at ~57-62 GB/rank (rank 0 higher due to colocated encoders/embeddings), well under the 80 GB limit. This is purely a load-time fix; steady-state sharding was always correct (which is why vllm-omni/sglang-omni serve the same thinker at TP=4).

Also corrects the now-stale "TP=4 OOMs" warnings in the three ming configs, which had misattributed the cause to loader streaming overhead.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
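
The effect of the ordering on peak memory can be illustrated with a toy allocation model. This is not PyTorch and not the mstar loader: `Param`, `Tracker`, and `peak_bytes` are hypothetical, the parameter sizes are arbitrary, and the model only assumes that casting an unmaterialized (meta) parameter changes metadata while casting a materialized one allocates a new copy before releasing the old one.

```python
from dataclasses import dataclass
from typing import List

BYTES_PER_ELEMENT = {"float32": 4, "bfloat16": 2}


@dataclass
class Param:
    numel: int
    dtype: str = "float32"      # meta constructors default to float32
    materialized: bool = False  # False models a tensor on the meta device


class Tracker:
    """Toy allocator that records live and peak bytes."""

    def __init__(self) -> None:
        self.live = 0
        self.peak = 0

    def alloc(self, nbytes: int) -> None:
        self.live += nbytes
        self.peak = max(self.peak, self.live)

    def free(self, nbytes: int) -> None:
        self.live -= nbytes


def cast(params: List[Param], dtype: str, mem: Tracker) -> None:
    for p in params:
        if p.materialized:
            mem.alloc(p.numel * BYTES_PER_ELEMENT[dtype])   # new copy in target dtype
            mem.free(p.numel * BYTES_PER_ELEMENT[p.dtype])  # old copy released
        p.dtype = dtype  # for meta params this is the only effect


def to_empty(params: List[Param], mem: Tracker) -> None:
    for p in params:
        mem.alloc(p.numel * BYTES_PER_ELEMENT[p.dtype])  # allocate in current dtype
        p.materialized = True


def peak_bytes(cast_first: bool, numels: List[int]) -> int:
    """Peak bytes for cast-then-materialize versus materialize-then-cast."""
    params = [Param(n) for n in numels]
    mem = Tracker()
    if cast_first:
        cast(params, "bfloat16", mem)
        to_empty(params, mem)
    else:
        to_empty(params, mem)
        cast(params, "bfloat16", mem)
    return mem.peak
```

In this model the materialize-then-cast order peaks at no less than the full fp32 footprint, while the cast-first order never exceeds the bf16 footprint, which matches the 2x difference described above.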

zhudianGG and others added 6 commits June 11, 2026 14:31

zhudianGG marked this pull request as ready for review June 12, 2026 23:46

zhudianGG wants to merge 6 commits into main from noah_ming_tp4_oom_fix
