feat(runtime): support multimodal VLM #236
Pull request overview
This PR ports multimodal/VLM support into the TokenSpeed runtime, adding inference paths for Kimi-K2.5 (MoonViT) and Qwen3.5 (Qwen3-VL vision tower) using gateway-provided precomputed pixel tensors + content hashes (no in-engine HF preprocessing). It also introduces shared-memory (POSIX SHM) transport for cross-process feature tensors, M-RoPE position computation for Qwen-VL style models, and optional vision-encoder CUDA-graph capture.
Changes:
- Add multimodal request structures, SHM feature transport, feature hashing, M-RoPE computation, and a vision embedding splice pipeline.
- Integrate vision towers/models (Qwen3.5 visual tower + Kimi-K2.5 VLM) with backend-dispatched vision attention (FA3/FA4/Triton/FlashInfer cuDNN).
- Wire multimodal context + M-RoPE position overrides through the engine execution pipeline (tokenization → scheduler IO → forward → detokenization).
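To make the "budget-bucketed" CUDA-graph idea concrete, here is a minimal sketch of how a bucket lookup might work. The bucket sizes and the function name `pick_bucket` are hypothetical and are not taken from this PR; the only assumption is that a graph is captured per token budget and a request replays the smallest captured graph that fits.

```python
import bisect

# Hypothetical vision-token budgets; the PR's real bucket sizes are not shown here.
BUCKETS = [256, 512, 1024, 2048]


def pick_bucket(num_tokens, buckets=BUCKETS):
    """Return the smallest bucket that fits num_tokens, or None if none fits.

    A None result stands for "no captured graph is large enough", in which
    case a wrapper would presumably fall back to eager execution.
    """
    i = bisect.bisect_left(buckets, num_tokens)
    return buckets[i] if i < len(buckets) else None
```

Inputs smaller than the chosen bucket would need padding up to the bucket size before replay, which is the usual trade-off of bucketed capture.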
Reviewed changes
Copilot reviewed 40 out of 40 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| tokenspeed-kernel/python/tokenspeed_kernel/ops/attention/triton/context.py | Adds Triton “context attention” kernel used by vision encoder attention backend. |
| tokenspeed-kernel/python/tokenspeed_kernel/ops/attention/flashinfer/__init__.py | Exposes FlashInfer cuDNN prefill symbol for vision attention backend. |
| test/runtime/test_detokenizer_parity.py | Removes obsolete multimodal detokenizer placeholder/test. |
| python/tokenspeed/runtime/utils/server_args.py | Replaces mm_mode with --mm-attention-backend CLI option. |
| python/tokenspeed/runtime/utils/hf_transformers_utils.py | Adds KimiK25 config plumbing; removes AutoProcessor/get_processor. |
| python/tokenspeed/runtime/utils/env.py | Adds MM env flags and a global server-args accessor. |
| python/tokenspeed/runtime/utils/common.py | Expands image loading helpers; adds audio loading and list-flattening utility. |
| python/tokenspeed/runtime/multimodal/shm_transport.py | Introduces POSIX SHM tensor handle + sync pipeline across ranks. |
| python/tokenspeed/runtime/multimodal/mrope.py | Adds M-RoPE position computation + retraction extension helper. |
| python/tokenspeed/runtime/multimodal/inputs.py | Adds core multimodal data structures and SHM feature lifecycle helpers. |
| python/tokenspeed/runtime/multimodal/hash.py | Adds deterministic 64-bit hashing for multimodal feature dedup/caching. |
| python/tokenspeed/runtime/multimodal/encoder_cudagraph.py | Adds budget-bucketed CUDA graph capture/replay wrapper for vision encoders. |
| python/tokenspeed/runtime/multimodal/embedder.py | Adds vision embedding planning/encoding/scattering pipeline + pad-token substitution helper. |
| python/tokenspeed/runtime/models/qwen3_vision.py | Adds Qwen3-VL vision tower implementation used by Qwen3.5 models. |
| python/tokenspeed/runtime/models/qwen3_5.py | Integrates Qwen3.5 VLM path (vision tower + embedder + cudagraph hook). |
| python/tokenspeed/runtime/models/kimi_k25.py | Adds Kimi-K2.5 VLM (DeepSeekV3 LM + MoonViT tower) with embedder + cudagraph hook. |
| python/tokenspeed/runtime/layers/rotary_embedding.py | Adds interleaved M-RoPE support and native rotary apply helper. |
| python/tokenspeed/runtime/layers/conv.py | Adds optimized Conv2d/Conv3d layers for patch embeddings (unfold+linear fastpath). |
| python/tokenspeed/runtime/layers/attention/mm_encoder_attention.py | Adds cache-less vision attention layer and backend dispatch table. |
| python/tokenspeed/runtime/execution/model_runner.py | Plumbs multimodal_context through model forward invocation. |
| python/tokenspeed/runtime/execution/model_executor.py | Adds M-RoPE position override buffer logic and installs encoder CUDA-graph wrapper. |
| python/tokenspeed/runtime/execution/input_buffer.py | Adds mrope_positions buffer and zeroing behavior. |
| python/tokenspeed/runtime/entrypoints/engine.py | Removes legacy image/audio/video args from public generate entrypoints. |
| python/tokenspeed/runtime/entrypoints/engine_base.py | Removes legacy image_data arg from base engine interface. |
| python/tokenspeed/runtime/engine/request_handler.py | Adds SHM attach/barrier/consume after broadcast in recv_reqs. |
| python/tokenspeed/runtime/engine/parallel_sampling.py | Deep-copies multimodal inputs per replica to avoid SHM handle reuse/unlink races. |
| python/tokenspeed/runtime/engine/output_processor.py | Removes unsupported BatchMultimodalOut plumbing. |
| python/tokenspeed/runtime/engine/io_struct.py | Replaces legacy multimodal inputs with precomputed_multimodal_inputs + unpadded ids fields. |
| python/tokenspeed/runtime/engine/input_processor.py | Consumes precomputed multimodal inputs, computes M-RoPE, and pads ids for cache friendliness. |
| python/tokenspeed/runtime/engine/generation_output_processor.py | Tracks unpadded prompt ids for detokenization and extends M-RoPE on retraction. |
| python/tokenspeed/runtime/engine/event_loop.py | Builds per-forward MultimodalForwardContext and triggers M-RoPE extension. |
| python/tokenspeed/runtime/engine/core_client.py | Updates docs/comments to remove multimodal batch outputs from detokenizer channel. |
| python/tokenspeed/runtime/engine/async_llm.py | Removes processor creation; publishes SHM features before scheduler send. |
| python/tokenspeed/runtime/distributed/mapping.py | Adds vision mapping colocated on attention TP group. |
| python/tokenspeed/runtime/distributed/comm_backend/custom_allreduce.py | Adds custom AR capture context helper for cudagraph capture. |
| python/tokenspeed/runtime/configs/qwen3_vision_config.py | Adds Qwen3-VL vision config type. |
| python/tokenspeed/runtime/configs/qwen3_5_config.py | Ensures vision_config dicts are converted to the proper config class. |
| python/tokenspeed/runtime/configs/model_config.py | Adds KimiK25 architecture support and improves MLA-config selection. |
| python/tokenspeed/runtime/configs/kimi_k25_config.py | Adds Kimi-K2.5 HF-compatible config types. |
| python/tokenspeed/runtime/configs/__init__.py | Exposes KimiK25Config in configs package. |
        sequence_lengths is not None
        and max_seqlen is not None
        and isinstance(cu_seqlens, torch.Tensor)
    ), "flashinfer_cudnn needs sequence_lengths, max_seqlen, and packed indptrs"
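The assertion above requires three inputs for the cuDNN path. As a rough illustration, and assuming that "packed indptrs" means cumulative sequence offsets (the common `cu_seqlens` convention for variable-length attention), they could be derived from per-image sequence lengths like this. The helper name `build_cu_seqlens` is hypothetical, and plain lists stand in for tensors.

```python
from itertools import accumulate


def build_cu_seqlens(sequence_lengths):
    """Return (cu_seqlens, max_seqlen) for a packed batch of sequences.

    cu_seqlens[i] is the start offset of sequence i in the packed buffer,
    and cu_seqlens[-1] is the total number of packed tokens.
    """
    cu_seqlens = [0] + list(accumulate(sequence_lengths))
    max_seqlen = max(sequence_lengths, default=0)
    return cu_seqlens, max_seqlen
```

For three images with 3, 5, and 2 patch tokens, this gives offsets `[0, 3, 8, 10]` and a maximum length of 5.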
    if item.pad_value is None or not item.offsets:
        continue
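The guard above skips items that have no `pad_value` or no `offsets`. A minimal sketch of the surrounding substitution step, assuming each offset is a half-open `(start, end)` range of placeholder positions and using dicts in place of the PR's item type, might look like the following. The function name `substitute_pad_values` is hypothetical.

```python
def substitute_pad_values(input_ids, items):
    """Replace placeholder token ids with each item's pad_value.

    Assumption: item["offsets"] is a list of half-open (start, end) ranges.
    The real offset convention in the PR is not visible in this fragment.
    """
    padded = list(input_ids)  # leave the caller's unpadded ids untouched
    for item in items:
        if item["pad_value"] is None or not item["offsets"]:
            continue
        for start, end in item["offsets"]:
            padded[start:end] = [item["pad_value"]] * (end - start)
    return padded
```

Because the pad value is derived from the content hash, two prompts with the same text but different images end up with different padded ids, so a token-id-keyed prefix cache can tell them apart without any change to the cache itself.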
    def publish(cls, tensor: torch.Tensor) -> ShmTensorHandle:
        nbytes = tensor.numel() * tensor.element_size()
        shm = shared_memory.SharedMemory(create=True, size=nbytes)
        try:
            shm_bytes = torch.frombuffer(shm.buf, dtype=torch.uint8)
            shm_bytes.copy_(tensor.view(torch.uint8).reshape(-1))
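The fragment above is cut off before its error handling. As a standalone illustration of a publish-then-consume lifecycle on POSIX shared memory, here is a sketch that uses only the standard library and raw bytes instead of tensors. The names `publish_bytes` and `consume_bytes` are hypothetical, and this is not the PR's `ShmTensorHandle` implementation.

```python
from multiprocessing import shared_memory


def publish_bytes(payload: bytes):
    """Copy payload into a new shared-memory segment; return (name, nbytes)."""
    # SharedMemory rejects size 0, so allocate at least one byte.
    shm = shared_memory.SharedMemory(create=True, size=max(len(payload), 1))
    try:
        shm.buf[: len(payload)] = payload
    except BaseException:
        # Do not leak the segment if the copy fails.
        shm.close()
        shm.unlink()
        raise
    name = shm.name
    shm.close()  # on POSIX the segment persists until it is unlinked
    return name, len(payload)


def consume_bytes(name: str, nbytes: int) -> bytes:
    """Attach to a segment, copy its contents out, then unlink it."""
    shm = shared_memory.SharedMemory(name=name)
    try:
        return bytes(shm.buf[:nbytes])
    finally:
        shm.close()
        shm.unlink()
```

Because exactly one consumer unlinks the segment, each handle must be consumed once. That is consistent with the file summary's note that parallel sampling deep-copies multimodal inputs per replica to avoid handle reuse and unlink races.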
    """Multimodal request data structures used across processors and model adapters."""

    from __future__ import annotations

    import dataclasses
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a50019aaa5
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
    for item in mm_items:
        if "image_grid_thw" in item.model_specific_data:
            image_grid_thw = item.model_specific_data["image_grid_thw"]
        if "video_grid_thw" in item.model_specific_data:
            video_grid_thw = item.model_specific_data["video_grid_thw"]
Aggregate all vision grids when computing M-RoPE positions
Collecting image_grid_thw / video_grid_thw by simple reassignment keeps only the last multimodal item's grid, but MRotaryEmbedding.get_rope_index indexes these arrays once per image/video token group. For prompts containing multiple images/videos, this produces incorrect position IDs (or index errors once image_index/video_index advances), which directly degrades or breaks VLM inference for multi-item inputs.
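One way to address this is to accumulate the grids rather than overwrite them. The sketch below assumes each item's grid is a list of `[t, h, w]` rows and uses dicts in place of the PR's item type. The helper name `collect_grids` is hypothetical.

```python
def collect_grids(mm_items):
    """Gather every image/video grid across all multimodal items, in order."""
    image_grids, video_grids = [], []
    for item in mm_items:
        data = item["model_specific_data"]
        if "image_grid_thw" in data:
            image_grids.extend(data["image_grid_thw"])  # append, do not overwrite
        if "video_grid_thw" in data:
            video_grids.extend(data["video_grid_thw"])
    return image_grids, video_grids
```

With tensors, the equivalent would be to collect the per-item grids in a list and concatenate them along the first dimension before passing them to the position computation.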
    # precomputed_multimodal_inputs is a single prompt's MM; the SMG
    # path only clears is_single via n>1 (batch_size == 1), so all n
    # parallel samples correctly share it. Without this the image is
    # silently dropped on the n>1 fan-out (placeholders -> text path).
    precomputed_multimodal_inputs=self.precomputed_multimodal_inputs,
Preserve per-request multimodal inputs when splitting batches
When a batched GenerateReqInput is split via __getitem__, every sub-request reuses the same precomputed_multimodal_inputs object instead of selecting per-index multimodal data. In a true batch of different multimodal prompts, this aliases all requests to one image/video payload, so offsets/features no longer match each prompt’s input_ids, leading to wrong embeddings and incorrect outputs.
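The code comment and the review describe two different cases: `n > 1` fan-out of a single prompt, where sharing is intended, and a true batch of different prompts, where it is not. A hypothetical selector that distinguishes them could look like this. It assumes a true batch would carry one multimodal entry per prompt in a list, which is not something this PR's `GenerateReqInput` is shown to do.

```python
def select_mm_inputs(mm_inputs, index, batch_size):
    """Pick the multimodal payload for sub-request `index`.

    Assumption (not from the PR): a per-prompt batch is represented as a
    list with exactly batch_size entries; anything else is treated as a
    single prompt's payload shared by all parallel samples.
    """
    if isinstance(mm_inputs, list) and batch_size > 1 and len(mm_inputs) == batch_size:
        return mm_inputs[index]
    return mm_inputs
```

Under that assumption, the `n > 1` path keeps its current behavior while a real multi-prompt batch gets per-index payloads.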
a50019a to b7dce54
💡 Codex Review
Reviewed commit: b7dce5423e
    # Continue the incremental sequence from the last input position.
    last_position = mrope_positions[:, -1]  # (3,)
    start_pos = last_position[0] + 1
Start retracted M-RoPE extension from global max position
When extending mrope_positions for a retracted request, the new text-token range is seeded from last_position[0] + 1, but M-RoPE continuation should start after the maximum position across all three axes. If a prompt ends in vision tokens (so axis tails differ), this produces smaller-than-valid positions on some axes and incorrect RoPE indices for resumed decoding after retraction.
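The suggested fix can be sketched with plain lists standing in for the `(3, seq_len)` tensor. The function name `extend_mrope_positions` is hypothetical; the only behavior taken from the review is that the continuation starts after the maximum position across all three axes.

```python
def extend_mrope_positions(mrope_positions, num_new_tokens):
    """Append text-token positions to a 3-axis (t, h, w) position table.

    mrope_positions is a list of three equal-length lists. New text tokens
    share one incrementing position on every axis, starting just past the
    largest position used so far on any axis.
    """
    start = max(max(axis) for axis in mrope_positions) + 1
    tail = list(range(start, start + num_new_tokens))
    return [axis + tail for axis in mrope_positions]
```

For a prompt made of one text token followed by a 2x2 image, the temporal axis ends at 1 while the height and width axes end at 2, so the extension starts at 3. Seeding from `last_position[0] + 1` would start at 2 and collide with positions the image already uses.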
    if item.encoded is not None:
        canonical = item
    elif item.hash is not None and item.hash in canonical_by_hash:
        canonical = canonical_by_hash[item.hash]
Reuse encoded canonical item for same-hash duplicates
Within _plan, an item that is already encoded is not inserted into canonical_by_hash. If that encoded item appears first and another request in the same batch carries the same hash, the duplicate misses deduplication and is scheduled for a redundant encoder run. This regresses multimodal prefill latency in mixed batches where cached/chunked requests and new same-image requests coexist.
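A sketch of the suggested behavior follows, with dicts in place of the PR's item type and a hypothetical function name `plan_encoding`. The only change relative to the fragment above is that an already-encoded item is also registered under its hash, so later same-hash items can reuse it.

```python
def plan_encoding(items):
    """Return (to_encode, canonical_of) for a batch of multimodal items.

    to_encode lists items that need an encoder run; canonical_of[i] is the
    item whose embedding items[i] should use.
    """
    canonical_by_hash = {}
    to_encode, canonical_of = [], []
    for item in items:
        h = item["hash"]
        if item["encoded"] is not None:
            canonical = item
            if h is not None:
                # Register encoded items too, so duplicates can find them.
                canonical_by_hash.setdefault(h, item)
        elif h is not None and h in canonical_by_hash:
            canonical = canonical_by_hash[h]
        else:
            canonical = item
            if h is not None:
                canonical_by_hash[h] = item
            to_encode.append(item)
        canonical_of.append(canonical)
    return to_encode, canonical_of
```

This sketch keeps the first item seen for a hash as canonical, so an encoded item that arrives after an unencoded duplicate does not cancel the already scheduled run.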
Kimi-K2.5 + Qwen3.5 (with the Qwen3-VL vision tower) inference via SMG gateway inputs (lightseekorg/smg#1515). OCRBench-validated. Signed-off-by: chenht2022 <chenht2022@gmail.com>
b7dce54 to 3859973
Hi @lightseekorg/code-owner, let's merge this first. We can refine it in a follow-up.
Summary
Adds VLM inference support for Kimi-K2.5 (MoonViT) and Qwen3.5 (with the Qwen3-VL vision tower). Multimodal inputs come from the SMG gateway (companion PR: lightseekorg/smg#1515) as precomputed pixel values plus per-image content hashes, with no in-engine HF preprocessing. SHM pixel transport uses explicit publish/attach/consume phases; MM-aware prefix caching is achieved by substituting placeholder ids with a hash-derived `pad_value` at `pad_input_tokens` (cache module untouched); ViT CUDA-graph capture is budget-bucketed and shares the LM attn TP group via `Mapping`.
Test Plan
Launch the TokenSpeed gRPC servicer (Qwen and Kimi differ only on `--model`, `--served-model-name`, `--attention-backend`):
SMG gateway (in front of the TokenSpeed gRPC worker):
    smg launch --worker-urls grpc://127.0.0.1:50051 \
        --host 127.0.0.1 --port 30000
Results:
Companion PR
lightseekorg/smg#1515: gateway-side change; must land in lockstep (smg's `tokenspeed-chat` CI is currently red there because `GenerateReqInput.precomputed_multimodal_inputs` only exists once this PR lands).