feat(runtime): support multimodal VLM #236

Merged
lightseek-bot merged 2 commits into main from hongtaoc/support-vlm on May 25, 2026

Conversation

@chenht2022
Contributor

Summary

Adds VLM inference support for Kimi-K2.5 (MoonViT) and Qwen3.5 (with the Qwen3-VL vision tower). Multimodal inputs arrive from the SMG gateway (companion PR: lightseekorg/smg#1515) as precomputed pixel values plus per-image content hashes, so the engine does no HF preprocessing. SHM pixel transport uses explicit publish/attach/consume phases. MM-aware prefix caching works by substituting placeholder ids with a hash-derived pad_value at pad_input_tokens, leaving the cache module untouched. ViT CUDA-graph capture is budget-bucketed, and the ViT shares the LM attention TP group via Mapping.
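The placeholder substitution behind MM-aware prefix caching can be sketched roughly as below. This is an illustration, not the PR's code: `MMItem`, `derive_pad_value`, the inclusive `(start, end)` offset convention, and the mapping of the hash to an id above the text vocabulary are all assumptions. Only the names `pad_input_tokens`, `pad_value`, and `offsets` come from this PR.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class MMItem:
    """Hypothetical stand-in for a multimodal item; not the PR's class."""
    hash: int
    # Assumed convention: inclusive (start, end) placeholder ranges in input_ids.
    offsets: List[Tuple[int, int]] = field(default_factory=list)
    pad_value: Optional[int] = None


def derive_pad_value(content_hash: int, vocab_size: int) -> int:
    # Assumption: place the id above the text vocabulary so it cannot
    # collide with a real token id.
    return vocab_size + content_hash % (1 << 30)


def pad_input_tokens(input_ids: List[int], items: List[MMItem]) -> List[int]:
    """Overwrite each placeholder range with the item's hash-derived id."""
    padded = list(input_ids)
    for item in items:
        if item.pad_value is None or not item.offsets:
            continue
        for start, end in item.offsets:
            padded[start : end + 1] = [item.pad_value] * (end - start + 1)
    return padded


PLACEHOLDER = 7
prompt = [1, PLACEHOLDER, PLACEHOLDER, PLACEHOLDER, 2]
image = MMItem(hash=0xABC, offsets=[(1, 3)])
image.pad_value = derive_pad_value(image.hash, vocab_size=1000)
padded = pad_input_tokens(prompt, [image])
```

Because the padded ids depend on the image content, two prompts with identical text but different images no longer share a token prefix, so a token-id-keyed prefix cache can tell them apart without modification.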

Test Plan

  • OCRBench end-to-end through the SMG gateway, both models, 8× B200, TP=8, 1000 requests, conc=64

Launch — TokenSpeed gRPC servicer (Qwen and Kimi differ only on --model, --served-model-name, --attention-backend):

# Qwen3.5-397B-NVFP4
TOKENSPEED_MM_SKIP_COMPUTE_HASH=1 \
TOKENSPEED_MM_ENABLE_ENCODER_CUDA_GRAPH=1 \
python -m smg_grpc_servicer.tokenspeed \
  --model <path-to-Qwen3.5-397B-A17B-NVFP4> \
  --served-model-name qwen3-vl-397b-nvfp4 \
  --attn-tp-size 8 --moe-tp-size 8 \
  --max-model-len 131072 --max-num-seqs 64 \
  --max-prefill-tokens 32768 --chunked-prefill-size 32768 \
  --gpu-memory-utilization 0.85 \
  --attention-backend trtllm \
  --moe-backend flashinfer_trtllm --sampling-backend flashinfer \
  --quantization nvfp4 --kv-cache-dtype fp8 --disable-kvstore \
  --host 127.0.0.1 --port 50051 --trust-remote-code

# Kimi-K2.5-NVFP4
TOKENSPEED_MM_SKIP_COMPUTE_HASH=1 \
TOKENSPEED_MM_ENABLE_ENCODER_CUDA_GRAPH=1 \
python -m smg_grpc_servicer.tokenspeed \
  --model <path-to-Kimi-K2.5-NVFP4> \
  --served-model-name kimi-k25-nvfp4 \
  --attn-tp-size 8 --moe-tp-size 8 \
  --max-model-len 131072 --max-num-seqs 64 \
  --max-prefill-tokens 32768 --chunked-prefill-size 32768 \
  --gpu-memory-utilization 0.85 \
  --attention-backend tokenspeed_mla \
  --moe-backend flashinfer_trtllm --sampling-backend flashinfer \
  --quantization nvfp4 --kv-cache-dtype fp8 --disable-kvstore \
  --host 127.0.0.1 --port 50051 --trust-remote-code

SMG gateway (in front of the TokenSpeed gRPC worker):

smg launch --worker-urls grpc://127.0.0.1:50051 \
           --host 127.0.0.1 --port 30000

Results:

| Model | Accuracy | Wall (s) |
|---|---|---|
| Qwen3.5-397B-NVFP4 | 0.901 | 88.16 |
| Kimi-K2.5-NVFP4 | 0.909 | 54.97 |
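
For a rough sense of end-to-end throughput, the wall times above can be converted to requests per second. This assumes "Wall (s)" is the total wall-clock time for all 1000 requests, which the results do not state explicitly.

```python
# Figures copied from the results above; the request count is from the test plan.
REQUESTS = 1000
wall_seconds = {"Qwen3.5-397B-NVFP4": 88.16, "Kimi-K2.5-NVFP4": 54.97}

# Assumption: wall time covers the whole 1000-request run.
throughput = {model: REQUESTS / secs for model, secs in wall_seconds.items()}

for model, rps in throughput.items():
    print(f"{model}: {rps:.1f} req/s")
```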

Companion PR

lightseekorg/smg#1515: gateway-side change; must land in lockstep with this PR (smg's tokenspeed-chat CI is currently red there because GenerateReqInput.precomputed_multimodal_inputs only exists once this PR lands).

Copilot AI review requested due to automatic review settings May 24, 2026 08:54
@chenht2022 chenht2022 requested a review from a team as a code owner May 24, 2026 08:54
@chenht2022 chenht2022 changed the title feat(runtime): port multimodal VLM support feat(runtime): support multimodal VLM May 24, 2026

Copilot AI left a comment


Pull request overview

This PR ports multimodal/VLM support into the TokenSpeed runtime, adding inference paths for Kimi-K2.5 (MoonViT) and Qwen3.5 (Qwen3-VL vision tower) using gateway-provided precomputed pixel tensors + content hashes (no in-engine HF preprocessing). It also introduces shared-memory (POSIX SHM) transport for cross-process feature tensors, M-RoPE position computation for Qwen-VL style models, and optional vision-encoder CUDA-graph capture.
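
The publish/attach/consume lifecycle for shared-memory transport might look roughly like the sketch below. It uses raw bytes and only the standard library instead of tensors; `ShmHandle` and the three function signatures are hypothetical and are not the PR's `ShmTensorHandle` API. In a multi-rank setup a barrier between attach and consume would presumably ensure every rank has attached before any rank unlinks the segment.

```python
from dataclasses import dataclass
from multiprocessing import shared_memory


@dataclass(frozen=True)
class ShmHandle:
    """Hypothetical picklable handle: segment name plus payload size."""
    name: str
    nbytes: int


def publish(payload: bytes) -> ShmHandle:
    # Producer: create the segment, copy the payload in, close the local
    # mapping. The segment itself lives on until someone unlinks it.
    # Note: payload must be non-empty; SharedMemory rejects size=0.
    shm = shared_memory.SharedMemory(create=True, size=len(payload))
    try:
        shm.buf[: len(payload)] = payload
        return ShmHandle(name=shm.name, nbytes=len(payload))
    finally:
        shm.close()


def attach(handle: ShmHandle) -> shared_memory.SharedMemory:
    # Consumer: map the existing segment by name.
    return shared_memory.SharedMemory(name=handle.name)


def consume(shm: shared_memory.SharedMemory, handle: ShmHandle, unlink: bool) -> bytes:
    # Copy out exactly nbytes (the mapping may be rounded up to a page),
    # then close; only one designated consumer should unlink.
    try:
        return bytes(shm.buf[: handle.nbytes])
    finally:
        shm.close()
        if unlink:
            shm.unlink()


handle = publish(b"pixels")
segment = attach(handle)
data = consume(segment, handle, unlink=True)
```

Carrying `nbytes` in the handle matters because the operating system may round the mapped size up, so the segment's reported size is not a reliable payload length.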

Changes:

  • Add multimodal request structures, SHM feature transport, feature hashing, M-RoPE computation, and a vision embedding splice pipeline.
  • Integrate vision towers/models (Qwen3.5 visual tower + Kimi-K2.5 VLM) with backend-dispatched vision attention (FA3/FA4/Triton/FlashInfer cuDNN).
  • Wire multimodal context + M-RoPE position overrides through the engine execution pipeline (tokenization → scheduler IO → forward → detokenization).

Reviewed changes

Copilot reviewed 40 out of 40 changed files in this pull request and generated 4 comments.

| File | Description |
|---|---|
| `tokenspeed-kernel/python/tokenspeed_kernel/ops/attention/triton/context.py` | Adds Triton “context attention” kernel used by vision encoder attention backend. |
| `tokenspeed-kernel/python/tokenspeed_kernel/ops/attention/flashinfer/__init__.py` | Exposes FlashInfer cuDNN prefill symbol for vision attention backend. |
| `test/runtime/test_detokenizer_parity.py` | Removes obsolete multimodal detokenizer placeholder/test. |
| `python/tokenspeed/runtime/utils/server_args.py` | Replaces mm_mode with --mm-attention-backend CLI option. |
| `python/tokenspeed/runtime/utils/hf_transformers_utils.py` | Adds KimiK25 config plumbing; removes AutoProcessor/get_processor. |
| `python/tokenspeed/runtime/utils/env.py` | Adds MM env flags and a global server-args accessor. |
| `python/tokenspeed/runtime/utils/common.py` | Expands image loading helpers; adds audio loading and list-flattening utility. |
| `python/tokenspeed/runtime/multimodal/shm_transport.py` | Introduces POSIX SHM tensor handle + sync pipeline across ranks. |
| `python/tokenspeed/runtime/multimodal/mrope.py` | Adds M-RoPE position computation + retraction extension helper. |
| `python/tokenspeed/runtime/multimodal/inputs.py` | Adds core multimodal data structures and SHM feature lifecycle helpers. |
| `python/tokenspeed/runtime/multimodal/hash.py` | Adds deterministic 64-bit hashing for multimodal feature dedup/caching. |
| `python/tokenspeed/runtime/multimodal/encoder_cudagraph.py` | Adds budget-bucketed CUDA graph capture/replay wrapper for vision encoders. |
| `python/tokenspeed/runtime/multimodal/embedder.py` | Adds vision embedding planning/encoding/scattering pipeline + pad-token substitution helper. |
| `python/tokenspeed/runtime/models/qwen3_vision.py` | Adds Qwen3-VL vision tower implementation used by Qwen3.5 models. |
| `python/tokenspeed/runtime/models/qwen3_5.py` | Integrates Qwen3.5 VLM path (vision tower + embedder + cudagraph hook). |
| `python/tokenspeed/runtime/models/kimi_k25.py` | Adds Kimi-K2.5 VLM (DeepSeekV3 LM + MoonViT tower) with embedder + cudagraph hook. |
| `python/tokenspeed/runtime/layers/rotary_embedding.py` | Adds interleaved M-RoPE support and native rotary apply helper. |
| `python/tokenspeed/runtime/layers/conv.py` | Adds optimized Conv2d/Conv3d layers for patch embeddings (unfold+linear fastpath). |
| `python/tokenspeed/runtime/layers/attention/mm_encoder_attention.py` | Adds cache-less vision attention layer and backend dispatch table. |
| `python/tokenspeed/runtime/execution/model_runner.py` | Plumbs multimodal_context through model forward invocation. |
| `python/tokenspeed/runtime/execution/model_executor.py` | Adds M-RoPE position override buffer logic and installs encoder CUDA-graph wrapper. |
| `python/tokenspeed/runtime/execution/input_buffer.py` | Adds mrope_positions buffer and zeroing behavior. |
| `python/tokenspeed/runtime/entrypoints/engine.py` | Removes legacy image/audio/video args from public generate entrypoints. |
| `python/tokenspeed/runtime/entrypoints/engine_base.py` | Removes legacy image_data arg from base engine interface. |
| `python/tokenspeed/runtime/engine/request_handler.py` | Adds SHM attach/barrier/consume after broadcast in recv_reqs. |
| `python/tokenspeed/runtime/engine/parallel_sampling.py` | Deep-copies multimodal inputs per replica to avoid SHM handle reuse/unlink races. |
| `python/tokenspeed/runtime/engine/output_processor.py` | Removes unsupported BatchMultimodalOut plumbing. |
| `python/tokenspeed/runtime/engine/io_struct.py` | Replaces legacy multimodal inputs with precomputed_multimodal_inputs + unpadded ids fields. |
| `python/tokenspeed/runtime/engine/input_processor.py` | Consumes precomputed multimodal inputs, computes M-RoPE, and pads ids for cache friendliness. |
| `python/tokenspeed/runtime/engine/generation_output_processor.py` | Tracks unpadded prompt ids for detokenization and extends M-RoPE on retraction. |
| `python/tokenspeed/runtime/engine/event_loop.py` | Builds per-forward MultimodalForwardContext and triggers M-RoPE extension. |
| `python/tokenspeed/runtime/engine/core_client.py` | Updates docs/comments to remove multimodal batch outputs from detokenizer channel. |
| `python/tokenspeed/runtime/engine/async_llm.py` | Removes processor creation; publishes SHM features before scheduler send. |
| `python/tokenspeed/runtime/distributed/mapping.py` | Adds vision mapping colocated on attention TP group. |
| `python/tokenspeed/runtime/distributed/comm_backend/custom_allreduce.py` | Adds custom AR capture context helper for cudagraph capture. |
| `python/tokenspeed/runtime/configs/qwen3_vision_config.py` | Adds Qwen3-VL vision config type. |
| `python/tokenspeed/runtime/configs/qwen3_5_config.py` | Ensures vision_config dicts are converted to the proper config class. |
| `python/tokenspeed/runtime/configs/model_config.py` | Adds KimiK25 architecture support and improves MLA-config selection. |
| `python/tokenspeed/runtime/configs/kimi_k25_config.py` | Adds Kimi-K2.5 HF-compatible config types. |
| `python/tokenspeed/runtime/configs/__init__.py` | Exposes KimiK25Config in configs package. |
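
The "budget-bucketed" CUDA graph capture listed for `encoder_cudagraph.py` is not spelled out on this page. A common pattern, assumed here, is to capture one graph per token budget and replay the smallest one that fits, falling back to eager execution when none does. `pick_bucket` and the budget values are hypothetical.

```python
import bisect
from typing import Optional, Sequence


def pick_bucket(num_tokens: int, budgets: Sequence[int]) -> Optional[int]:
    """Return the smallest captured budget >= num_tokens, else None.

    `budgets` must be sorted ascending. None means no captured graph is
    large enough, so the caller would run the encoder eagerly.
    """
    idx = bisect.bisect_left(budgets, num_tokens)
    return budgets[idx] if idx < len(budgets) else None


# Hypothetical capture budgets, in vision tokens.
BUDGETS = [256, 1024, 4096]
chosen = pick_bucket(300, BUDGETS)
```

Inputs smaller than the chosen budget would be padded up to it before replay, which trades some wasted compute for a fixed set of graph shapes.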


    sequence_lengths is not None
    and max_seqlen is not None
    and isinstance(cu_seqlens, torch.Tensor)
), "flashinfer_cudnn needs sequence_lengths, max_seqlen, and packed indptrs"
Comment on lines +105 to +106
if item.pad_value is None or not item.offsets:
    continue
Comment on lines +49 to +54
def publish(cls, tensor: torch.Tensor) -> ShmTensorHandle:
    nbytes = tensor.numel() * tensor.element_size()
    shm = shared_memory.SharedMemory(create=True, size=nbytes)
    try:
        shm_bytes = torch.frombuffer(shm.buf, dtype=torch.uint8)
        shm_bytes.copy_(tensor.view(torch.uint8).reshape(-1))
Comment on lines +21 to +25
"""Multimodal request data structures used across processors and model adapters."""

from __future__ import annotations

import dataclasses
@zhyncs zhyncs requested review from syuoni and yweng0828 May 24, 2026 09:00

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a50019aaa5


Comment on lines +61 to +65
for item in mm_items:
    if "image_grid_thw" in item.model_specific_data:
        image_grid_thw = item.model_specific_data["image_grid_thw"]
    if "video_grid_thw" in item.model_specific_data:
        video_grid_thw = item.model_specific_data["video_grid_thw"]

P1: Aggregate all vision grids when computing M-RoPE positions

Collecting image_grid_thw / video_grid_thw by simple reassignment keeps only the last multimodal item's grid, but MRotaryEmbedding.get_rope_index indexes these arrays once per image/video token group. For prompts containing multiple images/videos, this produces incorrect position IDs (or index errors once image_index/video_index advances), which directly degrades or breaks VLM inference for multi-item inputs.
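
One possible shape for the fix is to accumulate grids across items instead of reassigning. The sketch below uses plain lists so it runs without torch; `MMItem` and `collect_grids` are hypothetical, and it assumes each item stores its grid as a list of `[t, h, w]` rows. With tensors the equivalent would be a concatenation along dim 0.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Grid = List[int]  # one [t, h, w] row per image or video


@dataclass
class MMItem:
    """Hypothetical stand-in for the PR's multimodal item type."""
    model_specific_data: Dict[str, List[Grid]] = field(default_factory=dict)


def collect_grids(
    mm_items: List[MMItem],
) -> Tuple[Optional[List[Grid]], Optional[List[Grid]]]:
    image_grids: List[Grid] = []
    video_grids: List[Grid] = []
    for item in mm_items:
        # extend, not assign: earlier items' grids are kept, in prompt order
        image_grids.extend(item.model_specific_data.get("image_grid_thw", []))
        video_grids.extend(item.model_specific_data.get("video_grid_thw", []))
    return image_grids or None, video_grids or None


items = [
    MMItem({"image_grid_thw": [[1, 4, 6]]}),
    MMItem({"image_grid_thw": [[1, 8, 8]]}),
]
image_grid_thw, video_grid_thw = collect_grids(items)
```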


Comment on lines +305 to +309
# precomputed_multimodal_inputs is a single prompt's MM; the SMG
# path only clears is_single via n>1 (batch_size == 1), so all n
# parallel samples correctly share it. Without this the image is
# silently dropped on the n>1 fan-out (placeholders -> text path).
precomputed_multimodal_inputs=self.precomputed_multimodal_inputs,

P1: Preserve per-request multimodal inputs when splitting batches

When a batched GenerateReqInput is split via __getitem__, every sub-request reuses the same precomputed_multimodal_inputs object instead of selecting per-index multimodal data. In a true batch of different multimodal prompts, this aliases all requests to one image/video payload, so offsets/features no longer match each prompt’s input_ids, leading to wrong embeddings and incorrect outputs.
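
A minimal sketch of per-index selection follows. `BatchedReq` and `SingleReq` are hypothetical stand-ins for `GenerateReqInput`, and the convention that a list means "one payload per prompt" while any other value is shared across the n > 1 fan-out is an assumption made for illustration only.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class SingleReq:
    input_ids: List[int]
    precomputed_multimodal_inputs: Any = None


@dataclass
class BatchedReq:
    """Hypothetical batched request; the field layout is an assumption."""
    input_ids: List[List[int]]
    # Assumed convention: a list holds one payload per prompt; anything else
    # is a single payload shared by every sub-request (n > 1 fan-out).
    precomputed_multimodal_inputs: Any = None

    def __getitem__(self, i: int) -> SingleReq:
        mm = self.precomputed_multimodal_inputs
        if isinstance(mm, list):
            if len(mm) != len(self.input_ids):
                raise ValueError("need one multimodal payload per prompt")
            mm = mm[i]
        return SingleReq(self.input_ids[i], mm)


batch = BatchedReq(
    input_ids=[[1, 7, 2], [1, 7, 3]],
    precomputed_multimodal_inputs=[{"hash": 0xA}, {"hash": 0xB}],
)
fanout = BatchedReq(
    input_ids=[[1, 7, 2], [1, 7, 2]],
    precomputed_multimodal_inputs={"hash": 0xA},
)
```

Here `batch[1]` receives only the second prompt's payload, while both sub-requests of `fanout` keep sharing the single payload.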


@chenht2022 chenht2022 force-pushed the hongtaoc/support-vlm branch from a50019a to b7dce54 Compare May 24, 2026 09:01

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b7dce5423e



# Continue the incremental sequence from the last input position.
last_position = mrope_positions[:, -1] # (3,)
start_pos = last_position[0] + 1

P1: Start retracted M-RoPE extension from global max position

When extending mrope_positions for a retracted request, the new text-token range is seeded from last_position[0] + 1, but M-RoPE continuation should start after the maximum position across all three axes. If a prompt ends in vision tokens (so axis tails differ), this produces smaller-than-valid positions on some axes and incorrect RoPE indices for resumed decoding after retraction.
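
The difference is easiest to see on a small prompt that ends in vision tokens. The sketch below is torch-free and uses three plain lists for the t/h/w axes; `extend_mrope_positions` is a hypothetical helper, not the PR's function.

```python
from typing import List


def extend_mrope_positions(
    mrope_positions: List[List[int]], num_new_tokens: int
) -> List[List[int]]:
    """mrope_positions holds three rows (t, h, w), one position per token."""
    # Seed from the maximum over all axes, not from row 0's last entry.
    start = max(max(axis) for axis in mrope_positions) + 1
    tail = list(range(start, start + num_new_tokens))
    # Text tokens use the same position on every axis.
    return [axis + tail for axis in mrope_positions]


# Two text tokens followed by a 1x2x2 image (four vision tokens).
positions = [
    [0, 1, 2, 2, 2, 2],  # t
    [0, 1, 2, 2, 3, 3],  # h
    [0, 1, 2, 3, 2, 3],  # w
]
extended = extend_mrope_positions(positions, 2)
naive_start = positions[0][-1] + 1  # what last_position[0] + 1 would give
```

In this example the naive start is 3, which the h and w axes have already used for the last vision token, whereas the max-based start is 4.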


Comment on lines +242 to +245
if item.encoded is not None:
    canonical = item
elif item.hash is not None and item.hash in canonical_by_hash:
    canonical = canonical_by_hash[item.hash]

P2: Reuse encoded canonical item for same-hash duplicates

Within _plan, an item that is already encoded is not inserted into canonical_by_hash. If that encoded item appears first and another request in the same batch carries the same hash, the duplicate misses deduplication and is scheduled for a redundant encoder run. This regresses multimodal prefill latency in mixed batches where cached/chunked requests and new same-image requests coexist.
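
One way to get this behavior, sketched under assumptions, is to register already-encoded items in `canonical_by_hash` before planning the rest. `Item` and `plan` below are hypothetical simplifications of the PR's `_plan`; the two-pass structure is a choice made here so that an encoded item wins regardless of batch order.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Item:
    """Hypothetical item; `encoded` stands in for an embedding tensor."""
    hash: Optional[int]
    encoded: Optional[str] = None


def plan(items: List[Item]) -> Tuple[List[Item], List[Item]]:
    """Return (items needing an encoder run, canonical item per input)."""
    canonical_by_hash: Dict[int, Item] = {}
    # Pass 1: already-encoded items become canonical for their hash.
    for item in items:
        if item.encoded is not None and item.hash is not None:
            canonical_by_hash.setdefault(item.hash, item)

    to_encode: List[Item] = []
    canonical_of: List[Item] = []
    # Pass 2: reuse a canonical item when the hash matches, else encode.
    for item in items:
        if item.encoded is not None:
            canonical = item
        elif item.hash is not None and item.hash in canonical_by_hash:
            canonical = canonical_by_hash[item.hash]
        else:
            canonical = item
            to_encode.append(item)
            if item.hash is not None:
                canonical_by_hash[item.hash] = item
        canonical_of.append(canonical)
    return to_encode, canonical_of


items = [Item(hash=1, encoded="emb"), Item(hash=1), Item(hash=2), Item(hash=2)]
to_encode, canonical_of = plan(items)
```

Only the first hash-2 item is scheduled for encoding; the unencoded hash-1 duplicate reuses the encoded item's result.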


Kimi-K2.5 + Qwen3.5 (with the Qwen3-VL vision tower) inference via
SMG gateway inputs (lightseekorg/smg#1515). OCRBench-validated.

Signed-off-by: chenht2022 <chenht2022@gmail.com>
@chenht2022 chenht2022 force-pushed the hongtaoc/support-vlm branch from b7dce54 to 3859973 Compare May 24, 2026 09:15
@lightseek-bot
Contributor

Hi @lightseekorg/code-owner, let's merge this first. We can refine it in a follow-up.

@lightseek-bot lightseek-bot merged commit 85f664d into main May 25, 2026
51 of 57 checks passed
@lightseek-bot lightseek-bot deleted the hongtaoc/support-vlm branch May 25, 2026 06:23
4 participants