Noah ming tp4 oom fix#119

Open
zhudianGG wants to merge 6 commits into main from noah_ming_tp4_oom_fix
zhudianGG (Collaborator) commented Jun 12, 2026

Summary

After the TP=4 OOM fix (#), the native M* Ming-flash-omni-2.0 thinker-only server loads weights and captures CUDA graphs on all 4 ranks, but then hangs forever before binding the HTTP port. The workers finish setup and enter their main loop, but the conductor never observes their SETUP_DONE messages, so the API server's readiness wait never returns and uvicorn never starts.

Observed on 4×H100 (configs/ming_flash_omni_thinker_only_tp4.yaml, CUDA_VISIBLE_DEVICES=4,5,6,7), 2026-06-12. Process sat idle for 56+ minutes; port 8000 never listened.

Impact

Blocks all native (--inference-system ours) Ming serving and the T2T benchmark sweep. The model is otherwise fully loaded and GPU-resident (~57–62 GB/rank), so this is purely a startup-handshake failure, not a model/compute problem.

Evidence

Log (/dev/shm/ming_native_tp4.log, frozen at last line):
[conductor] mstar.conductor.conductor: Conductor waiting for 4 worker(s) to finish setup
... (per-rank) loader: Loaded 507 unique target params into LingMoeModel(num_hidden_layers=32) ... (rank 0/4 .. 3/4)
... (per-rank) cuda_graph_runner: CudaGraphRunner[Thinker]: warmup_and_capture done.
... (per-rank) worker: Worker worker_N: engine runs on dedicated GPU thread
... (per-rank) worker: Worker worker_N: plan_executor enabled — speculative plan() pre-run

Notably absent: conductor's "Conductor: all 4 worker(s) ready" (conductor.py:1012) and the API server's "All workers ready" (entrypoint.py:210).

py-spy stacks (all 6 processes, while hung):

  • API server main (entrypoint.py parent) — waiting on conductor:
    finalize_setup (mstar/api_server/entrypoint.py:216) # while True: sleep(0.01) until setup_done
    main (mstar/api_server/entrypoint.py:703)
  • Conductor — waiting on workers:
    _wait_for_workers_ready (mstar/conductor/conductor.py:1011) # while pending: sleep(0.01)
    run (mstar/conductor/conductor.py:1017)
  • All 4 workers — already past setup, idle in main loop:
    poll (zmq/sugar/poll.py:106)
    wait_for_work (mstar/worker/communicator.py:83)
    run (mstar/worker/worker.py:2324)

Why this is contradictory

In worker.py:run(), the worker sends SETUP_DONE to the conductor at line 1936, then logs "plan_executor enabled" at line ~1974. All 4 workers logged the plan_executor line, so all 4 executed the SETUP_DONE send. Yet the conductor's pending set (conductor.py:1004) is still non-empty — it discards a worker only on receiving ConductorMessageType.SETUP_DONE (conductor.py:1006-1007). So the messages were sent but not received: a ZMQ delivery/routing gap on the worker→conductor channel during startup.
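The wait described above can be sketched as a bounded loop that reports which workers are still pending instead of spinning forever. This is a minimal illustration, not the code in conductor.py: `wait_for_workers_ready`, the `inbox` queue, and the `(sender, msg_type)` tuple shape are all hypothetical stand-ins, and the string `SETUP_DONE` stands in for `ConductorMessageType.SETUP_DONE`.

```python
import queue
import time

SETUP_DONE = "SETUP_DONE"  # stand-in for ConductorMessageType.SETUP_DONE


def wait_for_workers_ready(inbox, worker_ids, timeout_s, poll_s=0.01):
    """Return the set of workers that did not report SETUP_DONE within timeout_s.

    `inbox` is a queue.Queue of (sender, msg_type) tuples; this shape is an
    assumption made for the sketch, not the real conductor message format.
    """
    pending = set(worker_ids)
    deadline = time.monotonic() + timeout_s
    while pending and time.monotonic() < deadline:
        try:
            sender, msg_type = inbox.get(timeout=poll_s)
        except queue.Empty:
            continue  # nothing arrived in this poll interval
        if msg_type == SETUP_DONE:
            pending.discard(sender)  # same discard-on-SETUP_DONE rule as the real loop
    return pending


if __name__ == "__main__":
    inbox = queue.Queue()
    for i in range(3):  # worker_3's message is "lost" in this demo
        inbox.put((f"worker_{i}", SETUP_DONE))
    missing = wait_for_workers_ready(inbox, [f"worker_{i}" for i in range(4)], timeout_s=0.2)
    print("still pending:", sorted(missing))
```

With a deadline, a lost SETUP_DONE would surface as a named list of pending workers in the log rather than a silent hang.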

Suspected cause (unverified)

The worker→conductor SETUP_DONE (mstar/worker/communicator.py send path) appears to be lost between the worker's send (worker.py:1936) and the conductor's get_all_new_messages() poll (conductor.py:1005). Candidates to check:

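  • A ZMQ socket connect/bind race: if the conductor's receive socket isn't fully connected when the worker sends, the message can be dropped. Whether it is depends on the socket type: PUB drops messages that have no connected subscriber (the late-joiner problem), and ROUTER silently discards messages addressed to an unknown peer identity unless ZMQ_ROUTER_MANDATORY is set, whereas PUSH and DEALER block or queue up to SNDHWM instead of dropping.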
  • Send ordering relative to the two barrier_all() fences at worker.py:1916 / 1930 (the message goes out right after the second barrier).
  • Whether SETUP_DONE shares a socket with main-loop traffic and is being consumed/misrouted elsewhere.

Not yet root-caused — flagging the handshake as the locus, not the fix.

May be nondeterministic

PORTING_NOTES records a prior TP=8 mstar-serve smoke test that booted and answered /generate, so this handshake failure may be a startup race that doesn't always trigger (or may be TP=4-specific). A first reproduction step is simply to relaunch a few times and check whether the hang is intermittent.
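To classify each relaunch as "booted" or "hung" without waiting indefinitely, a bounded readiness probe is enough. The sketch below uses only the standard library; `wait_for_port` and `summarize` are hypothetical helper names, and the timeout value is an arbitrary choice rather than anything measured in this PR.

```python
import socket
import time


def wait_for_port(host, port, timeout_s, interval_s=0.5):
    """Return True once a TCP connect to host:port succeeds, False after timeout_s."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval_s):
                return True  # something is listening
        except OSError:
            time.sleep(interval_s)  # refused or timed out; retry until the deadline
    return False


def summarize(outcomes):
    """Format a list of per-launch booleans as a short tally."""
    return f"{sum(outcomes)}/{len(outcomes)} launches bound the port"


if __name__ == "__main__":
    # One probe against the port used in this PR; wrap in a relaunch loop as needed.
    print(summarize([wait_for_port("127.0.0.1", 8000, timeout_s=5)]))
```

Recording the outcome of several launches this way would show whether the hang is deterministic at TP=4 or only intermittent.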

Reproduction

cd
CONFIG=configs/ming_flash_omni_thinker_only_tp4.yaml GPUS=4,5,6,7
./scripts/ming_native/serve_ming_tp4.sh &

Wait; observe port 8000 never binds:

until curl -sf http://0.0.0.0:8000/health; do sleep 5; done # hangs

Confirm with py-spy:

py-spy dump --pid <conductor_pid> # stuck at conductor.py:1011
py-spy dump --pid <worker_pid> # idle at worker.py:2324 (past setup)

Environment

  • 4×H100 80GB, ming_flash_omni_thinker_only_tp4.yaml (TP=4, ranks→GPUs 4–7)
  • Branch: noah_ming_tp4_oom_fix (on top of noah_ming_mstar_port)
  • Requires the TP=4 OOM fix (#) to reach this point; before it, startup died earlier at to_empty.

zhudianGG and others added 6 commits June 11, 2026 14:31
origin/main renamed the package mminf -> mstar and dropped the `new_tokens`
parameter from Model.get_partition_forward_pass_args. This ports the complete
Ming-flash-omni-2.0 native implementation (the full noah_ming stacked PR series
— understanding + talker + config-tail/image-gen, steps 1-10) onto the current
main:

  * moved mminf/model/ming_omni_flash/ -> mstar/model/ming_omni_flash/ and
    rewrote all mminf -> mstar references (package + 23 modular tests + 2
    configs + benchmark class).
  * re-applied the two framework edits onto main's versions: the registry
    entry (MODEL_REGISTRY + HF_MODELS) and the get_worker_graphs fix that
    skips graph walks whose nodes aren't in the deploy's node_groups.
  * dropped the `new_tokens` arg from our get_partition_forward_pass_args
    override to match the new base signature, and from the test call sites.
  * fixed two stale tests that predated the 9b ImageGen wiring (ImageGen is
    now a registered submodule + partition, so the "unknown" assertions use a
    genuinely unknown name).

356 CPU/unit tests pass under the mstar tree; ruff clean. Remaining failures
are snapshot/real-checkpoint-gated (tokenizer load, real-weight forward), not
port regressions.

Found during live TP=8 bring-up of the mstar port: the KV-cache engine now
calls submodule.prepare_inputs with a seen_token_mask kwarg (sampler token
mask) in addition to pos_info. Our Thinker submodule's prepare_inputs had an
explicit pos_info param but no **kwargs, so every /generate crashed with
"unexpected keyword argument 'seen_token_mask'".

Added **kwargs to BailingMoeV2ThinkerSubmodule.prepare_inputs (mirrors the
peer models, e.g. qwen3_omni, which all absorb engine extras this way). The
other four prepare_inputs overrides already had **kwargs.
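The shape of the fix and of a regression test for it can be sketched as follows. `ThinkerSubmoduleSketch` is a hypothetical stand-in for BailingMoeV2ThinkerSubmodule, and the return value is invented for illustration; only the `pos_info` parameter, the `seen_token_mask` kwarg, and the `**kwargs` catch-all come from the commit message.

```python
class ThinkerSubmoduleSketch:
    """Hypothetical stand-in for BailingMoeV2ThinkerSubmodule."""

    def prepare_inputs(self, pos_info, **kwargs):
        # Engine extras such as seen_token_mask land in kwargs and are ignored
        # here, so a new engine-side kwarg no longer raises TypeError.
        return {"pos_info": pos_info}


def test_prepare_inputs_absorbs_engine_extras():
    sub = ThinkerSubmoduleSketch()
    out = sub.prepare_inputs(
        pos_info=[0, 1, 2],
        seen_token_mask=[True, False, True],  # the kwarg that triggered the crash
        some_future_kwarg=123,                # arbitrary extra, also accepted
    )
    assert out == {"pos_info": [0, 1, 2]}


if __name__ == "__main__":
    test_prepare_inputs_absorbs_engine_extras()
    print("ok")
```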

Regression test asserts prepare_inputs accepts seen_token_mask + arbitrary
future kwargs without error.

The flashinfer sampler FFI on this box (0.6.2 cubin path) takes scalar int
seed/offset; mstar's Sampler builds them as [batch] tensors (the >=0.6.6 API),
crashing every /generate with "Mismatched type on argument #7 ... Expected int
but got ffi.Tensor". Added _flashinfer_rng_scalars() to collapse the tensors to
Python ints at both sampling call sites (top_p and top_k_top_p).

Correctness: greedy decode (temperature=0) ignores RNG, so this is exact for
the greedy path; for stochastic sampling it only affects cross-row RNG
independence (reproducibility), not the distribution.

Validated live: native mstar Ming-flash-omni server (TP=8, real 195GB ckpt)
now generates correct text end-to-end ("What is the capital of France?" ->
"The capital of France is Paris.").

…ode_loop)

The Thinker submodule's check_stop returned {"decode_loop"} on EOS, but the
graph declares the decode Loop as "thinker_decode_loop". On the EOS step the
worker's dynamic-loop registry looked up the wrong name and crashed the rank
with KeyError(NodeAndGraphWalk(node='decode_loop', graph_walk='thinker_decode')),
wedging the whole TP group (no further generation, hung streams).
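One way to keep the two names from drifting is to share a single constant and assert that whatever check_stop returns is a subset of the loops the graph declares. The sketch below is illustrative only: the `check_stop` signature, `GRAPH_LOOPS`, and the token ids are hypothetical, and only the name "thinker_decode_loop" is taken from this commit.

```python
THINKER_DECODE_LOOP = "thinker_decode_loop"  # Loop name declared by the graph

# Hypothetical stand-in for the set of Loop names the graph actually declares.
GRAPH_LOOPS = {THINKER_DECODE_LOOP}


def check_stop(token_id, eos_token_id):
    """Return the loop names to stop; an empty set means keep decoding."""
    if token_id == eos_token_id:
        return {THINKER_DECODE_LOOP}  # previously the stale name "decode_loop"
    return set()


def test_check_stop_names_exist_in_graph():
    # Every name returned on EOS must be a loop the graph knows about.
    assert check_stop(2, eos_token_id=2) <= GRAPH_LOOPS
    assert check_stop(5, eos_token_id=2) == set()


if __name__ == "__main__":
    test_check_stop_names_exist_in_graph()
    print("ok")
```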

Found during live TP=8 serving: short generations that didn't hit EOS within
the smoke window worked, but any request reaching EOS killed a worker.

Regression test pins check_stop's returned name to the actual graph Loop name
so they can't drift again.

Pinned to ranks 0-3 for launch with CUDA_VISIBLE_DEVICES=4,5,6,7. TP=4 is
dimensionally valid (32 heads/4, 4 KV/4, 256 experts/4 all divide), unlike
TP=6 which the model's power-of-2 dims reject outright.
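The divisibility claim can be checked with a few lines of arithmetic. This sketch covers only the three dimensions named above (32 heads, 4 KV heads, 256 experts); `tp_remainders` and `tp_divides_all` are hypothetical helpers and are not the model's actual validation, which may treat some dimensions differently (for example by replicating KV heads when TP exceeds their count).

```python
# Dimensions quoted in the commit message.
DIMS = {"attention_heads": 32, "kv_heads": 4, "experts": 256}


def tp_remainders(tp, dims=DIMS):
    """Return the remainder of each dimension when split across tp ranks."""
    return {name: size % tp for name, size in dims.items()}


def tp_divides_all(tp, dims=DIMS):
    """True if every listed dimension splits evenly across tp ranks."""
    return all(rem == 0 for rem in tp_remainders(tp, dims).values())


if __name__ == "__main__":
    for tp in (4, 6):
        print(tp, tp_divides_all(tp), tp_remainders(tp))
```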

Re-verified on the mstar tree (2026-06-12): TP=4 OOMs during weight load at
78.58/80 GB per rank even with PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
(worker rank failed: "CUDA out of memory. Tried to allocate 1024 MiB ...
Process has 78.58 GiB in use"). Matches the prior finding. TP=8 remains the
only layout that fits the 195 GB model.

MingFlashOmniModel.get_submodule built the LingMoeModel on the meta
device, then called `model.to_empty(device)` followed by
`model.to(bfloat16)`. The meta constructor's `torch.empty(...)` calls
default to float32, so `to_empty` allocated every parameter in fp32
(2x the bf16 footprint) and only then cast down. That fp32 allocation
peak hit ~78.5/80 GB per rank and OOM'd at TP=4 (traceback at
to_empty -> empty_like). TP=8 only "fit" because the halved per-rank
shard kept the fp32 peak just under 80 GB.

Cast the meta model to bf16 BEFORE to_empty. Casting a meta module is
metadata-only (no allocation), so to_empty then allocates directly in
bf16. Verified on 4xH100: all 4 TP ranks load + capture CUDA graphs at
~57-62 GB/rank (rank 0 higher due to colocated encoders/embeddings),
well under the 80 GB limit.
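The 2x effect of allocating in fp32 can be illustrated with back-of-envelope arithmetic. The sketch assumes the 195 GB checkpoint is stored in bf16 and is sharded evenly across ranks, and it ignores unsharded components, CUDA graphs, and the KV cache, so it is not expected to reproduce the per-rank figures measured above; `per_rank_gb` is a hypothetical helper.

```python
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2}


def per_rank_gb(checkpoint_gb, tp, alloc_dtype, stored_dtype="bf16"):
    """Estimate per-rank weight memory if every parameter is allocated in alloc_dtype.

    Assumes an evenly sharded checkpoint stored in stored_dtype (an assumption,
    not something the PR states).
    """
    scale = BYTES_PER_PARAM[alloc_dtype] / BYTES_PER_PARAM[stored_dtype]
    return checkpoint_gb / tp * scale


if __name__ == "__main__":
    for tp in (4, 8):
        fp32_alloc = per_rank_gb(195, tp, "fp32")  # to_empty before the bf16 cast
        bf16_alloc = per_rank_gb(195, tp, "bf16")  # bf16 cast on meta, then to_empty
        print(f"TP={tp}: fp32 ~{fp32_alloc:.1f} GB/rank, bf16 ~{bf16_alloc:.1f} GB/rank")
```

Under these assumptions the fp32 allocation at TP=4 alone would exceed an 80 GB device, while the bf16 allocation would not, which matches the direction of the fix even though the absolute numbers are only rough.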

This is purely a load-time fix; steady-state sharding was always
correct (which is why vllm-omni/sglang-omni serve the same thinker at
TP=4). Also corrects the now-stale "TP=4 OOMs" warnings in the three
ming configs, which had misattributed the cause to loader streaming
overhead.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
zhudianGG marked this pull request as ready for review June 12, 2026 23:46