Fix Qwen3-Omni Talker sampler crash on FlashInfer 0.6.x by zhudianGG · Pull Request #139 · mstar-project/mstar

zhudianGG · 2026-06-20T20:16:02Z

Summary: Follow-up to #138. That PR fixed the Talker init-dtype crash but didn't exercise the Talker's CUDA-graph sampler, which still crashes on FlashInfer 0.6.x because the samplers pass tensor seed/offset (rejected by the reworked binding). Adds a one-time capability probe and routes graph paths to a scalar-int seed (not None, which trips an illegal current_seed() under capture).
Test plan: Launched qwen3omni_thinker_tp2 + TTS benchmark (seed_tts, concurrency 1–32, matched request counts 12/20/24/40/80/160). 0 failures, real audio, throughput within ±10% of reference at every level.

The Talker / CodePredictor samplers pass per-request philox seed/offset as int tensors, but FlashInfer 0.6.x reworked its sample-from-probs binding to accept only a scalar int (or torch.Generator) and rejects a tensor at the C/TVM-FFI layer ("Mismatched type on argument #7 ... ffi.Tensor"). This crashes the Talker during CUDA-graph decode capture, so TTS cannot run on FlashInfer 0.6.x even after the init-dtype fix (#138).

Add _flashinfer_accepts_tensor_seed(): probe the binding once (during eager warmup, cached; never under capture) and route accordingly:
- non-graph path (sample_tokens) coerces tensor seed/offset to scalar ints
- CUDA-graph paths (sample_cuda_graphable_gpu, CudaGraphableSampler) pass a fixed scalar int, NOT None -- None makes FlashInfer fall back to its default CUDA generator, whose current_seed() read is illegal during graph capture. These paths are deterministic=True so the seed value does not affect captured output.

Validated end-to-end: launched qwen3omni_thinker_tp2 and ran the TTS benchmark (text_to_speech, seed_tts, concurrency 1-32, request counts 12/20/24/40/80/160). Talker captures all graph configs, generates real audio with 0 failures, and throughput matches the reference within ±10% at every concurrency level.
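The probe-and-route logic described above can be sketched in plain Python. This is an illustration only, not the code from this PR: `FakeTensor`, `strict_binding`, `GRAPH_SEED`, `GRAPH_OFFSET`, `accepts_tensor_seed`, `coerce_seed_offset` and `pick_seed_args` are hypothetical stand-ins, and the exception types caught by the probe are an assumption about how a rejected argument surfaces.

```python
from typing import Any, Callable, Optional, Tuple

# Hypothetical fixed scalars for graph paths (assumed values, not from the PR).
GRAPH_SEED = 0
GRAPH_OFFSET = 0


class FakeTensor:
    """Stand-in for a 0-dim int tensor; only .item() is modeled."""

    def __init__(self, value: int) -> None:
        self._value = value

    def item(self) -> int:
        return self._value


def strict_binding(probs: Any, seed: Any = None, offset: Any = None) -> str:
    """Stand-in for a sampling binding that accepts only scalar ints or None."""
    for name, arg in (("seed", seed), ("offset", offset)):
        if arg is not None and not isinstance(arg, int):
            raise TypeError(f"{name} must be a scalar int")
    return "sampled"


_probe_cache: Optional[bool] = None  # result of the one-time probe


def accepts_tensor_seed(binding: Callable[..., Any]) -> bool:
    """Probe the binding once with tensor-like seed/offset and cache the answer.

    Meant to run during eager warmup, never while a graph is being captured.
    """
    global _probe_cache
    if _probe_cache is None:
        try:
            binding([1.0], seed=FakeTensor(0), offset=FakeTensor(0))
            _probe_cache = True
        except (TypeError, RuntimeError):  # assumed failure modes
            _probe_cache = False
    return _probe_cache


def coerce_seed_offset(seed: Any, offset: Any) -> Tuple[Any, Any]:
    """Turn tensor-like values into ints; pass ints and None through untouched."""

    def to_scalar(value: Any) -> Any:
        if value is None or isinstance(value, int):
            return value
        return int(value.item())

    return to_scalar(seed), to_scalar(offset)


def pick_seed_args(binding: Callable[..., Any], seed: Any, offset: Any,
                   under_graph: bool) -> Tuple[Any, Any]:
    """Choose the seed/offset arguments for the given path."""
    if accepts_tensor_seed(binding):
        return seed, offset  # older binding: tensors are fine as-is
    if under_graph:
        # Fixed scalars instead of None, so no default generator is consulted.
        return GRAPH_SEED, GRAPH_OFFSET
    return coerce_seed_offset(seed, offset)  # eager path: read values on host
```

With the stand-in binding, `pick_seed_args(strict_binding, FakeTensor(7), FakeTensor(7), under_graph=False)` yields plain ints, while the same call with `under_graph=True` yields the fixed scalars. The graph path avoids `.item()` because reading a device value on the host requires a synchronization, which is not permitted during capture.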

Covers the Talker/CodePredictor sampler path that crashed on FlashInfer 0.6.x and had no automated coverage (the talker_prefill graph test only checks hidden-state determinism, not the decode sampler):
- probe resolves to a concrete bool and caches
- _coerce_seed_offset passes scalars/None through untouched
- sample_tokens greedy with tensor seed returns the per-row argmax
- sample_cuda_graphable_gpu CAPTURES + REPLAYS under a real torch.cuda.graph with tensor seed/offset buffers (the exact path that crashed) and the replayed tokens match the static input's argmax

CUDA/FlashInfer-gated. Verified the graph-capture test fails on the pre-fix sampler ("Mismatched type on argument #7 ... ffi.Tensor").

zhudianGG marked this pull request as ready for review June 20, 2026 20:16

zhudianGG requested review from NSagan271 and merceod June 20, 2026 20:17

zhudianGG force-pushed the fix/qwen3omni-talker-sampler-flashinfer063 branch from 3ffd3f3 to 970dd34 June 20, 2026 21:12

zhudianGG wants to merge 2 commits into main from fix/qwen3omni-talker-sampler-flashinfer063
