feat(runtime): support multimodal VLM by chenht2022 · Pull Request #236 · lightseekorg/tokenspeed

chenht2022 · 2026-05-24T08:54:31Z

Summary

Adds VLM inference support for Kimi-K2.5 (MoonViT) and Qwen3.5 (with the Qwen3-VL vision tower). Multimodal inputs come from the SMG gateway (companion PR: lightseekorg/smg#1515) as precomputed pixel values plus per-image content hashes; the engine does no HF preprocessing. SHM pixel transport uses explicit publish/attach/consume phases. MM-aware prefix caching substitutes placeholder ids with a hash-derived pad_value at pad_input_tokens, leaving the cache module untouched. ViT CUDA-graph capture is budget-bucketed and shares the LM attention TP group via Mapping.
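
A minimal sketch of the placeholder substitution idea, for illustration only. Every name and constant here (`ImageItem`, `pad_value_from_hash`, `pad_input_tokens_sketch`, `VOCAB_SIZE`, `PLACEHOLDER_ID`) is hypothetical and not taken from the PR. The point is that once placeholder ids carry a value derived from the image hash, an ordinary token-id prefix cache can tell two images apart without any change to the cache itself.

```python
from dataclasses import dataclass
from typing import List, Tuple

VOCAB_SIZE = 32_000       # illustrative value, not the real model vocabulary
PLACEHOLDER_ID = 31_999   # illustrative image placeholder token id


@dataclass
class ImageItem:
    content_hash: int                # per-image hash supplied with the pixels
    offsets: List[Tuple[int, int]]   # inclusive (start, end) placeholder spans


def pad_value_from_hash(content_hash: int) -> int:
    # Assumption: map the hash to an id above the vocabulary so it can
    # never collide with a real text token.
    return VOCAB_SIZE + (content_hash % (1 << 30))


def pad_input_tokens_sketch(input_ids: List[int], items: List[ImageItem]) -> List[int]:
    padded = list(input_ids)  # leave the caller's ids untouched
    for item in items:
        pad = pad_value_from_hash(item.content_hash)
        for start, end in item.offsets:
            for i in range(start, end + 1):
                padded[i] = pad
    return padded
```

With this scheme, two prompts with identical text but different images produce different id sequences, while repeated use of the same image produces identical ones.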

Test Plan

OCRBench, run end-to-end through the SMG gateway for both models: 8× B200, TP=8, 1000 requests, concurrency 64.

Launch the TokenSpeed gRPC servicer (the Qwen and Kimi commands differ only in --model, --served-model-name, and --attention-backend):

# Qwen3.5-397B-NVFP4
TOKENSPEED_MM_SKIP_COMPUTE_HASH=1 \
TOKENSPEED_MM_ENABLE_ENCODER_CUDA_GRAPH=1 \
python -m smg_grpc_servicer.tokenspeed \
  --model <path-to-Qwen3.5-397B-A17B-NVFP4> \
  --served-model-name qwen3-vl-397b-nvfp4 \
  --attn-tp-size 8 --moe-tp-size 8 \
  --max-model-len 131072 --max-num-seqs 64 \
  --max-prefill-tokens 32768 --chunked-prefill-size 32768 \
  --gpu-memory-utilization 0.85 \
  --attention-backend trtllm \
  --moe-backend flashinfer_trtllm --sampling-backend flashinfer \
  --quantization nvfp4 --kv-cache-dtype fp8 --disable-kvstore \
  --host 127.0.0.1 --port 50051 --trust-remote-code

# Kimi-K2.5-NVFP4
TOKENSPEED_MM_SKIP_COMPUTE_HASH=1 \
TOKENSPEED_MM_ENABLE_ENCODER_CUDA_GRAPH=1 \
python -m smg_grpc_servicer.tokenspeed \
  --model <path-to-Kimi-K2.5-NVFP4> \
  --served-model-name kimi-k25-nvfp4 \
  --attn-tp-size 8 --moe-tp-size 8 \
  --max-model-len 131072 --max-num-seqs 64 \
  --max-prefill-tokens 32768 --chunked-prefill-size 32768 \
  --gpu-memory-utilization 0.85 \
  --attention-backend tokenspeed_mla \
  --moe-backend flashinfer_trtllm --sampling-backend flashinfer \
  --quantization nvfp4 --kv-cache-dtype fp8 --disable-kvstore \
  --host 127.0.0.1 --port 50051 --trust-remote-code

SMG gateway (in front of the TokenSpeed gRPC worker):

smg launch --worker-urls grpc://127.0.0.1:50051 \
           --host 127.0.0.1 --port 30000

Results:

Model	Accuracy	Wall (s)
Qwen3.5-397B-NVFP4	0.901	88.16
Kimi-K2.5-NVFP4	0.909	54.97
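
For a rough sense of throughput, the wall times can be converted to requests per second. This assumes the reported wall time covers all 1000 requests of the run, which the table does not state explicitly.

```python
REQUESTS = 1000  # from the test plan

# Wall times in seconds, copied from the results table.
wall_seconds = {
    "Qwen3.5-397B-NVFP4": 88.16,
    "Kimi-K2.5-NVFP4": 54.97,
}

throughput = {model: REQUESTS / secs for model, secs in wall_seconds.items()}

for model, rps in throughput.items():
    print(f"{model}: {rps:.2f} req/s")
```

Under that assumption the figures come to roughly 11.3 req/s for Qwen3.5 and 18.2 req/s for Kimi-K2.5.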

Companion PR

lightseekorg/smg#1515 is the gateway-side change and must land in lockstep with this PR. smg's tokenspeed-chat CI is currently red there because GenerateReqInput.precomputed_multimodal_inputs only exists once this PR lands.

Copilot

Pull request overview

This PR ports multimodal/VLM support into the TokenSpeed runtime, adding inference paths for Kimi-K2.5 (MoonViT) and Qwen3.5 (Qwen3-VL vision tower) using gateway-provided precomputed pixel tensors + content hashes (no in-engine HF preprocessing). It also introduces shared-memory (POSIX SHM) transport for cross-process feature tensors, M-RoPE position computation for Qwen-VL style models, and optional vision-encoder CUDA-graph capture.
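
The publish/attach/consume lifecycle can be sketched with the standard library alone. This is not the PR's `shm_transport.py`: it moves raw bytes rather than tensors, runs in a single process, and the names `ShmHandle`, `publish`, `attach`, and `consume` are illustrative. It also assumes POSIX semantics, where a segment outlives the creator's mapping until it is unlinked.

```python
from dataclasses import dataclass
from multiprocessing import shared_memory


@dataclass
class ShmHandle:
    name: str     # segment name, small enough to send over any channel
    nbytes: int   # payload size (the segment itself may be page-rounded)


def publish(payload: bytes) -> ShmHandle:
    # Producer side: copy the payload into a fresh segment.
    shm = shared_memory.SharedMemory(create=True, size=max(len(payload), 1))
    try:
        shm.buf[: len(payload)] = payload
        return ShmHandle(name=shm.name, nbytes=len(payload))
    finally:
        shm.close()  # drop this mapping; the segment lives until unlink()


def attach(handle: ShmHandle) -> shared_memory.SharedMemory:
    # Consumer side: map the existing segment by name.
    return shared_memory.SharedMemory(name=handle.name)


def consume(handle: ShmHandle, shm: shared_memory.SharedMemory) -> bytes:
    # Copy the data out, then release the mapping and remove the segment.
    try:
        return bytes(shm.buf[: handle.nbytes])
    finally:
        shm.close()
        shm.unlink()
```

In a multi-rank setting, every rank would need to attach before any rank unlinks, which is presumably why the file summary for request_handler.py mentions a barrier between attach and consume.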

Changes:

Add multimodal request structures, SHM feature transport, feature hashing, M-RoPE computation, and a vision embedding splice pipeline.
Integrate vision towers/models (Qwen3.5 visual tower + Kimi-K2.5 VLM) with backend-dispatched vision attention (FA3/FA4/Triton/FlashInfer cuDNN).
Wire multimodal context + M-RoPE position overrides through the engine execution pipeline (tokenization → scheduler IO → forward → detokenization).
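
As an illustration of deterministic 64-bit feature hashing, here is a minimal sketch built on `hashlib.blake2b`. The function name `feature_hash64` and the choice of BLAKE2b are assumptions; the PR's `hash.py` may use a different algorithm and different inputs.

```python
import hashlib
from typing import Sequence


def feature_hash64(data: bytes, shape: Sequence[int], dtype: str) -> int:
    # An 8-byte digest gives a 64-bit value that is stable across
    # processes and runs (unlike Python's built-in hash()).
    h = hashlib.blake2b(digest_size=8)
    # Mix in shape and dtype so equal bytes with a different layout
    # do not collide trivially.
    h.update(repr((tuple(shape), dtype)).encode("utf-8"))
    h.update(data)
    return int.from_bytes(h.digest(), "little")
```

A stable hash of this kind is what lets identical images be recognized across requests, both for encoder dedup and for prefix caching.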

Reviewed changes

Copilot reviewed 40 out of 40 changed files in this pull request and generated 4 comments.

File	Description
tokenspeed-kernel/python/tokenspeed_kernel/ops/attention/triton/context.py	Adds Triton “context attention” kernel used by vision encoder attention backend.
tokenspeed-kernel/python/tokenspeed_kernel/ops/attention/flashinfer/__init__.py	Exposes FlashInfer cuDNN prefill symbol for vision attention backend.
test/runtime/test_detokenizer_parity.py	Removes obsolete multimodal detokenizer placeholder/test.
python/tokenspeed/runtime/utils/server_args.py	Replaces mm_mode with --mm-attention-backend CLI option.
python/tokenspeed/runtime/utils/hf_transformers_utils.py	Adds KimiK25 config plumbing; removes AutoProcessor/get_processor.
python/tokenspeed/runtime/utils/env.py	Adds MM env flags and a global server-args accessor.
python/tokenspeed/runtime/utils/common.py	Expands image loading helpers; adds audio loading and list-flattening utility.
python/tokenspeed/runtime/multimodal/shm_transport.py	Introduces POSIX SHM tensor handle + sync pipeline across ranks.
python/tokenspeed/runtime/multimodal/mrope.py	Adds M-RoPE position computation + retraction extension helper.
python/tokenspeed/runtime/multimodal/inputs.py	Adds core multimodal data structures and SHM feature lifecycle helpers.
python/tokenspeed/runtime/multimodal/hash.py	Adds deterministic 64-bit hashing for multimodal feature dedup/caching.
python/tokenspeed/runtime/multimodal/encoder_cudagraph.py	Adds budget-bucketed CUDA graph capture/replay wrapper for vision encoders.
python/tokenspeed/runtime/multimodal/embedder.py	Adds vision embedding planning/encoding/scattering pipeline + pad-token substitution helper.
python/tokenspeed/runtime/models/qwen3_vision.py	Adds Qwen3-VL vision tower implementation used by Qwen3.5 models.
python/tokenspeed/runtime/models/qwen3_5.py	Integrates Qwen3.5 VLM path (vision tower + embedder + cudagraph hook).
python/tokenspeed/runtime/models/kimi_k25.py	Adds Kimi-K2.5 VLM (DeepSeekV3 LM + MoonViT tower) with embedder + cudagraph hook.
python/tokenspeed/runtime/layers/rotary_embedding.py	Adds interleaved M-RoPE support and native rotary apply helper.
python/tokenspeed/runtime/layers/conv.py	Adds optimized Conv2d/Conv3d layers for patch embeddings (unfold+linear fastpath).
python/tokenspeed/runtime/layers/attention/mm_encoder_attention.py	Adds cache-less vision attention layer and backend dispatch table.
python/tokenspeed/runtime/execution/model_runner.py	Plumbs multimodal_context through model forward invocation.
python/tokenspeed/runtime/execution/model_executor.py	Adds M-RoPE position override buffer logic and installs encoder CUDA-graph wrapper.
python/tokenspeed/runtime/execution/input_buffer.py	Adds mrope_positions buffer and zeroing behavior.
python/tokenspeed/runtime/entrypoints/engine.py	Removes legacy image/audio/video args from public generate entrypoints.
python/tokenspeed/runtime/entrypoints/engine_base.py	Removes legacy image_data arg from base engine interface.
python/tokenspeed/runtime/engine/request_handler.py	Adds SHM attach/barrier/consume after broadcast in recv_reqs.
python/tokenspeed/runtime/engine/parallel_sampling.py	Deep-copies multimodal inputs per replica to avoid SHM handle reuse/unlink races.
python/tokenspeed/runtime/engine/output_processor.py	Removes unsupported BatchMultimodalOut plumbing.
python/tokenspeed/runtime/engine/io_struct.py	Replaces legacy multimodal inputs with precomputed_multimodal_inputs + unpadded ids fields.
python/tokenspeed/runtime/engine/input_processor.py	Consumes precomputed multimodal inputs, computes M-RoPE, and pads ids for cache friendliness.
python/tokenspeed/runtime/engine/generation_output_processor.py	Tracks unpadded prompt ids for detokenization and extends M-RoPE on retraction.
python/tokenspeed/runtime/engine/event_loop.py	Builds per-forward MultimodalForwardContext and triggers M-RoPE extension.
python/tokenspeed/runtime/engine/core_client.py	Updates docs/comments to remove multimodal batch outputs from detokenizer channel.
python/tokenspeed/runtime/engine/async_llm.py	Removes processor creation; publishes SHM features before scheduler send.
python/tokenspeed/runtime/distributed/mapping.py	Adds vision mapping colocated on attention TP group.
python/tokenspeed/runtime/distributed/comm_backend/custom_allreduce.py	Adds custom AR capture context helper for cudagraph capture.
python/tokenspeed/runtime/configs/qwen3_vision_config.py	Adds Qwen3-VL vision config type.
python/tokenspeed/runtime/configs/qwen3_5_config.py	Ensures vision_config dicts are converted to the proper config class.
python/tokenspeed/runtime/configs/model_config.py	Adds KimiK25 architecture support and improves MLA-config selection.
python/tokenspeed/runtime/configs/kimi_k25_config.py	Adds Kimi-K2.5 HF-compatible config types.
python/tokenspeed/runtime/configs/__init__.py	Exposes KimiK25Config in configs package.
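
One plausible reading of "budget-bucketed" capture, sketched below, is that graphs are captured for a fixed set of token budgets and each encoder call replays the smallest graph that fits, padding up to that budget. This interpretation, the helper name `pick_bucket`, and the bucket sizes are all assumptions; the PR's `encoder_cudagraph.py` may work differently.

```python
import bisect
from typing import Optional, Sequence


def pick_bucket(num_tokens: int, buckets: Sequence[int]) -> Optional[int]:
    """Return the smallest captured budget >= num_tokens, else None.

    `buckets` must be sorted in ascending order. None signals that no
    captured graph is large enough, so the caller would run eagerly.
    """
    i = bisect.bisect_left(buckets, num_tokens)
    return buckets[i] if i < len(buckets) else None


# Illustrative budgets only.
BUCKETS = [256, 512, 1024]
padding = pick_bucket(300, BUCKETS) - 300  # tokens of padding for a 300-token input
```

The trade-off is the usual one for bucketed graphs: more buckets mean less padding waste but more capture time and memory.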

+        sequence_lengths is not None
+        and max_seqlen is not None
+        and isinstance(cu_seqlens, torch.Tensor)
+    ), "flashinfer_cudnn needs sequence_lengths, max_seqlen, and packed indptrs"


+        if item.pad_value is None or not item.offsets:
+            continue


+    def publish(cls, tensor: torch.Tensor) -> ShmTensorHandle:
+        nbytes = tensor.numel() * tensor.element_size()
+        shm = shared_memory.SharedMemory(create=True, size=nbytes)
+        try:
+            shm_bytes = torch.frombuffer(shm.buf, dtype=torch.uint8)
+            shm_bytes.copy_(tensor.view(torch.uint8).reshape(-1))


+"""Multimodal request data structures used across processors and model adapters."""
+
+from __future__ import annotations
+
+import dataclasses


chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a50019aaa5

chatgpt-codex-connector · 2026-05-24T09:00:20Z

+    for item in mm_items:
+        if "image_grid_thw" in item.model_specific_data:
+            image_grid_thw = item.model_specific_data["image_grid_thw"]
+        if "video_grid_thw" in item.model_specific_data:
+            video_grid_thw = item.model_specific_data["video_grid_thw"]


Aggregate all vision grids when computing M-RoPE positions

Collecting image_grid_thw / video_grid_thw by simple reassignment keeps only the last multimodal item's grid, but MRotaryEmbedding.get_rope_index indexes these arrays once per image/video token group. For prompts containing multiple images/videos, this produces incorrect position IDs (or index errors once image_index/video_index advances), which directly degrades or breaks VLM inference for multi-item inputs.
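
A minimal sketch of the aggregation this comment asks for, using plain lists in place of tensors. `MMItem` and `collect_grids` are illustrative names; with real tensors the equivalent would be concatenating per-item grids along the first dimension.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Grid = List[int]  # [t, h, w] for one image or video


@dataclass
class MMItem:
    model_specific_data: Dict[str, List[Grid]] = field(default_factory=dict)


def collect_grids(mm_items: List[MMItem]) -> Tuple[List[Grid], List[Grid]]:
    image_grids: List[Grid] = []
    video_grids: List[Grid] = []
    for item in mm_items:
        # Extend rather than reassign, so every item's grid survives
        # and the order matches the order of items in the prompt.
        image_grids.extend(item.model_specific_data.get("image_grid_thw", []))
        video_grids.extend(item.model_specific_data.get("video_grid_thw", []))
    return image_grids, video_grids
```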

chatgpt-codex-connector · 2026-05-24T09:00:20Z

+            # precomputed_multimodal_inputs is a single prompt's MM; the SMG
+            # path only clears is_single via n>1 (batch_size == 1), so all n
+            # parallel samples correctly share it. Without this the image is
+            # silently dropped on the n>1 fan-out (placeholders -> text path).
+            precomputed_multimodal_inputs=self.precomputed_multimodal_inputs,


Preserve per-request multimodal inputs when splitting batches

When a batched GenerateReqInput is split via __getitem__, every sub-request reuses the same precomputed_multimodal_inputs object instead of selecting per-index multimodal data. In a true batch of different multimodal prompts, this aliases all requests to one image/video payload, so offsets/features no longer match each prompt’s input_ids, leading to wrong embeddings and incorrect outputs.
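
If true multi-prompt batches were ever to reach this path, per-index selection might look like the sketch below. `BatchReq` and `SingleReq` are simplified stand-ins, not the real `GenerateReqInput`. The code comment quoted above says the SMG path only splits for n>1 sampling of a single prompt, in which case sharing one payload is intended.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class SingleReq:
    input_ids: List[int]
    mm_input: Optional[Any]


@dataclass
class BatchReq:
    input_ids: List[List[int]]
    # Assumption: one multimodal payload per prompt, or None for text-only.
    mm_inputs: Optional[List[Any]] = None

    def __getitem__(self, i: int) -> SingleReq:
        # Select the payload that belongs to prompt i instead of
        # aliasing the whole batch to one shared object.
        mm = None if self.mm_inputs is None else self.mm_inputs[i]
        return SingleReq(input_ids=self.input_ids[i], mm_input=mm)
```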

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b7dce5423e

chatgpt-codex-connector · 2026-05-24T09:09:28Z

+
+    # Continue the incremental sequence from the last input position.
+    last_position = mrope_positions[:, -1]  # (3,)
+    start_pos = last_position[0] + 1


Start retracted M-RoPE extension from global max position

When extending mrope_positions for a retracted request, the new text-token range is seeded from last_position[0] + 1, but M-RoPE continuation should start after the maximum position across all three axes. If a prompt ends in vision tokens (so axis tails differ), this produces smaller-than-valid positions on some axes and incorrect RoPE indices for resumed decoding after retraction.
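
The difference is easiest to see on a small example. The sketch below uses three Python lists for the (t, h, w) axes instead of a tensor, and `extend_mrope_positions` is an illustrative name rather than the helper in `mrope.py`.

```python
from typing import List


def extend_mrope_positions(
    mrope_positions: List[List[int]], num_new_tokens: int
) -> List[List[int]]:
    # Continue after the largest position on any axis, not after the
    # last value of axis 0.
    start_pos = max(max(axis) for axis in mrope_positions) + 1
    tail = list(range(start_pos, start_pos + num_new_tokens))
    # Text tokens advance all three axes together.
    return [axis + tail for axis in mrope_positions]


# Two text tokens followed by one image whose merged grid is 1 x 2 x 2.
prompt = [
    [0, 1, 2, 2, 2, 2],  # t
    [0, 1, 2, 2, 3, 3],  # h
    [0, 1, 2, 3, 2, 3],  # w
]
extended = extend_mrope_positions(prompt, 2)
```

Here the last column is (2, 3, 3), so seeding from axis 0 would start the new tokens at 3 and reuse a position already taken on the h and w axes, while the global maximum gives 4.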

chatgpt-codex-connector · 2026-05-24T09:09:28Z

+                if item.encoded is not None:
+                    canonical = item
+                elif item.hash is not None and item.hash in canonical_by_hash:
+                    canonical = canonical_by_hash[item.hash]


Reuse encoded canonical item for same-hash duplicates

Within _plan, an item that is already encoded is not inserted into canonical_by_hash. If that encoded item appears first and another request in the same batch carries the same hash, the duplicate misses deduplication and is scheduled for a redundant encoder run. This regresses multimodal prefill latency in mixed batches where cached/chunked requests and new same-image requests coexist.
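
One way to close this gap is to register already-encoded items in the hash map before planning, so later duplicates resolve to them. The sketch below is a simplified stand-in for `_plan`: `Item` and `plan` are illustrative, and the real planner tracks more state.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Item:
    hash: Optional[int]
    encoded: Optional[str] = None  # stands in for a cached embedding


def plan(items: List[Item]) -> Tuple[List[Item], List[Item]]:
    # Pass 1: already-encoded items become canonical for their hash.
    canonical_by_hash: Dict[int, Item] = {
        it.hash: it for it in items
        if it.encoded is not None and it.hash is not None
    }
    to_encode: List[Item] = []
    canonical_of: List[Item] = []
    # Pass 2: resolve each item to its canonical copy.
    for item in items:
        if item.encoded is not None:
            canonical = item
        elif item.hash is not None and item.hash in canonical_by_hash:
            canonical = canonical_by_hash[item.hash]
        else:
            canonical = item
            to_encode.append(item)  # the only branch that costs an encoder run
            if item.hash is not None:
                canonical_by_hash[item.hash] = item
        canonical_of.append(canonical)
    return to_encode, canonical_of
```

The two-pass form also covers the case where the encoded copy appears after its duplicate in the batch.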

lightseek-bot · 2026-05-25T06:22:19Z

https://github.com/lightseekorg/tokenspeed/actions/runs/26357314869

lightseek-bot · 2026-05-25T06:22:58Z

Hi @lightseekorg/code-owner, let's merge this first. We can refine it in a follow-up.

Copilot AI review requested due to automatic review settings May 24, 2026 08:54

chenht2022 requested a review from a team as a code owner May 24, 2026 08:54

Copilot started reviewing on behalf of chenht2022 May 24, 2026 08:54

zhyncs requested review from FlamingoPg, minedec and tuanzhangCS May 24, 2026 08:59

chenht2022 changed the title from "feat(runtime): port multimodal VLM support" to "feat(runtime): support multimodal VLM" May 24, 2026

Copilot AI reviewed May 24, 2026

zhyncs added the high priority label May 24, 2026

zhyncs requested review from syuoni and yweng0828 May 24, 2026 09:00

chatgpt-codex-connector Bot reviewed May 24, 2026

chenht2022 force-pushed the hongtaoc/support-vlm branch from a50019a to b7dce54 May 24, 2026 09:01

chatgpt-codex-connector Bot reviewed May 24, 2026

feat(runtime): port multimodal VLM support

3859973

Kimi-K2.5 + Qwen3.5 (with the Qwen3-VL vision tower) inference via SMG gateway inputs (lightseekorg/smg#1515). OCRBench-validated. Signed-off-by: chenht2022 <chenht2022@gmail.com>

chenht2022 force-pushed the hongtaoc/support-vlm branch from b7dce54 to 3859973 May 24, 2026 09:15

Merge branch 'main' into hongtaoc/support-vlm

69e6c1d

lightseek-bot approved these changes May 25, 2026

lightseek-bot merged commit 85f664d into main May 25, 2026
51 of 57 checks passed

lightseek-bot deleted the hongtaoc/support-vlm branch May 25, 2026 06:23

lightseek-bot merged 2 commits into main from hongtaoc/support-vlm

4 participants
