feat(grpc): add TokenSpeed multimodal (VLM) input support by chenht2022 · Pull Request #1515 · lightseekorg/smg

chenht2022 · 2026-05-21T15:46:22Z

Description

Problem

TokenSpeed is wired into the SMG gateway for text generation, but its gRPC path carries no multimodal inputs: an image request routed to a TokenSpeed worker is rejected at the router ("TokenSpeed backend does not support multimodal inputs").

Solution

Extend the TokenSpeed gRPC path to carry the gateway's already-preprocessed multimodal tensors. The gateway performs per-model image preprocessing in crates/multimodal; the servicer reconstructs the tensors and sets precomputed_multimodal_inputs so the engine consumes them directly and skips its own preprocessing.

Changes

crates/grpc_client/proto/tokenspeed_scheduler.proto: add MultimodalInputs / TensorData / PlaceholderRange, GenerateRequest.mm_inputs, and GetModelInfoResponse.supports_vision.
crates/grpc_client/src/tokenspeed_scheduler.rs, model_gateway/src/routers/grpc/{client,multimodal,proto_wrapper}.rs: assemble and forward TokenSpeed multimodal requests via a new MultimodalData::TokenSpeed variant.
grpc_servicer/smg_grpc_servicer/tokenspeed/{servicer,server}.py: decode the precomputed multimodal payload into the engine's inputs; raise the gRPC message-size ceiling for large VLM pixel_values.

Test Plan

cargo +nightly fmt --all and cargo clippy --workspace --all-targets --all-features -- -D warnings pass; Python ruff format / ruff check are clean. (cargo test runs in CI.)
The Python proto stubs regenerate cleanly and the servicer imports and round-trips the new MultimodalInputs message.
Scoped to TokenSpeed: only the TokenSpeed router arm, proto, and servicer change; the other backends' assembly paths and the shared crates/multimodal crate are untouched.

Checklist

cargo +nightly fmt passes
cargo clippy --all-targets --all-features -- -D warnings passes
(Optional) Documentation updated
(Optional) Please join us on Slack #sig-smg to discuss, review, and merge PRs

Summary by CodeRabbit

New Features
- End-to-end multimodal support for TokenSpeed: send preprocessed image/audio tensors with placeholder offset handling.
- Chat and messages requests can include multimodal payloads for vision/audio generation.
- Models now report a supports_vision flag so routers can route multimodal-capable models.
Chores
- Increased gRPC message size limits to handle large image/video payloads.

coderabbitai · 2026-05-21T15:46:34Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Adds TokenSpeed multimodal proto types and GetModelInfo vision flag; propagates mm_inputs through Rust client builders and gateway assembly; implements gateway->tokenspeed proto conversion and routing; servicer decodes precomputed tensor payloads (including bfloat16 handling) and increases gRPC message limits to 2 GiB.

Changes

TokenSpeed Multimodal End-to-End Support

Layer / File(s)	Summary
Proto schema: multimodal messages and fields `crates/grpc_client/proto/tokenspeed_scheduler.proto`	Adds `MultimodalInputs`, `TensorData`, `PlaceholderRange` message types and `mm_inputs` field to `GenerateRequest`; adds `supports_vision` boolean to `GetModelInfoResponse`.
Rust client request builders: mm_inputs wiring `crates/grpc_client/src/tokenspeed_scheduler.rs`	`build_generate_request_from_chat` and `build_generate_request_from_messages` now accept `multimodal_inputs` and populate `GenerateRequest.mm_inputs`; `build_plain_generate_request` explicitly sets `mm_inputs = None`.
Gateway multimodal assembly: TokenSpeed-specific routing and proto conversion `model_gateway/src/routers/grpc/proto_wrapper.rs`, `model_gateway/src/routers/grpc/multimodal.rs`	Adds `TokenSpeedMultimodalData` and `MultimodalData::TokenSpeed` variant; implements `into_proto` to create `tokenspeed::MultimodalInputs`; adds `assemble_tokenspeed` to serialize pixel/model-specific tensors and compute placeholder ranges; updates dispatch to route TokenSpeed to assembler and clears mm_inputs appropriately.
Gateway request builders: multimodal proto passing `model_gateway/src/routers/grpc/client.rs`, `model_gateway/src/routers/grpc/harmony/stages/request_building.rs`	`build_chat_request` and `build_messages_request` construct TokenSpeed multimodal proto from assembled multimodal data and pass it into client request builders; Harmony path passes explicit `None` where not wired.
Servicer: multimodal decoding and model info vision flag `grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py`	`_build_generate_req` reconstructs precomputed multimodal inputs from `request.mm_inputs` using `_mm_inputs_from_proto` and passes them as `precomputed_multimodal_inputs`; `GetModelInfo` sets `supports_vision` from model config; adds `_tensor_from_proto` helper (supports bfloat16 reinterpretation and dtype casting).
Infrastructure: gRPC message size limits increased to 2GB `grpc_servicer/smg_grpc_servicer/tokenspeed/server.py`	Server and warmup channel options updated to set `grpc.max_send_message_length` and `grpc.max_receive_message_length` to `(2 << 30) - 1` bytes.

Sequence Diagram(s)

sequenceDiagram
  participant ModelGateway
  participant GrpcClient
  participant TokenSpeedServicer._build_generate_req
  participant TokenSpeedServicer._mm_inputs_from_proto
  participant TokenSpeedServicer._tensor_from_proto
  participant Engine
  ModelGateway->>GrpcClient: send GenerateRequest(mm_inputs)
  GrpcClient->>TokenSpeedServicer._build_generate_req: gRPC Generate call
  _build_generate_req->>_mm_inputs_from_proto: pass mm_inputs proto
  _mm_inputs_from_proto->>_tensor_from_proto: decode TensorData -> tensors
  _tensor_from_proto->>Engine: provide tensors for engine call

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

multimodal

Suggested reviewers

key4ng
slin1237
CatherineSue

Poem

🐰 In buffers snug and shapes that softly hum,

Pixel hops join tokens, now together they come.
Gateway packs the tensors, the client sends the call,
Servicer decodes and wakes the engine for all.
A rabbit twitches whiskers — multimodal for all!

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately describes the main change: adding TokenSpeed multimodal (VLM) input support to the gRPC integration.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

gemini-code-assist

Code Review

This pull request introduces multimodal (text and image) support for the TokenSpeed scheduler. Key changes include updating gRPC proto definitions to include multimodal input structures, increasing gRPC message size limits to accommodate large image payloads, and implementing the necessary serialization and deserialization logic in both the Rust gateway and the Python servicer. A critical feedback point was raised regarding memory safety in the Python servicer, where tensors reconstructed from gRPC buffers could potentially alias transient memory, leading to corruption; a fix using explicit cloning was suggested.

gemini-code-assist · 2026-05-21T15:52:42Z

+        view = np.frombuffer(tensor_data.data, dtype=np.dtype(tensor_data.dtype)).reshape(shape)
+        if cast_to is not None and np.issubdtype(view.dtype, np.floating):
+            return torch.from_numpy(view).to(cast_to)
+        return torch.from_numpy(view.copy())


The current implementation of _tensor_from_proto can return a tensor that aliases the transient bytes buffer of the gRPC request message. Specifically, when cast_to is provided and matches the input dtype, torch.from_numpy(view).to(cast_to) returns a view of the original memory. If the proto message is garbage collected before the engine finishes processing the tensor, this will lead to memory corruption or crashes.

To ensure safety as stated in the docstring ("the buffer is copied so it never aliases the transient proto bytes"), you should always ensure a copy is made. Using t.clone() is an idiomatic way to achieve this in PyTorch.

Suggested change

view = np.frombuffer(tensor_data.data, dtype=np.dtype(tensor_data.dtype)).reshape(shape)

if cast_to is not None and np.issubdtype(view.dtype, np.floating):

return torch.from_numpy(view).to(cast_to)

return torch.from_numpy(view.copy())

view = np.frombuffer(tensor_data.data, dtype=np.dtype(tensor_data.dtype)).reshape(shape)

t = torch.from_numpy(view)

if cast_to is not None and t.dtype != cast_to and np.issubdtype(view.dtype, np.floating):

return t.to(cast_to)

return t.clone()

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py`:
- Around line 726-728: The conversion from (offset, length) to inclusive ranges
in the mm_inputs.mm_placeholders handling can produce invalid ranges when a
placeholder has length == 0 (end = offset - 1); validate each placeholder before
converting by checking p.length > 0 and if any zero-length placeholder is found,
abort the RPC with INVALID_ARGUMENT (use the servicer's gRPC context.abort or
equivalent) so malformed payloads fail fast; update the logic around the offsets
= [...] comprehension in servicer.py to perform this validation on
mm_inputs.mm_placeholders and only build offsets after confirming all lengths
are positive.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: fca3ad61-f637-49f7-ba95-7be1c95f15d9

📥 Commits

Reviewing files that changed from the base of the PR and between 9ed4954 and 5fad09e.

📒 Files selected for processing (8)

crates/grpc_client/proto/tokenspeed_scheduler.proto
crates/grpc_client/src/tokenspeed_scheduler.rs
grpc_servicer/smg_grpc_servicer/tokenspeed/server.py
grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py
model_gateway/src/routers/grpc/client.rs
model_gateway/src/routers/grpc/harmony/stages/request_building.rs
model_gateway/src/routers/grpc/multimodal.rs
model_gateway/src/routers/grpc/proto_wrapper.rs

Copilot

Pull request overview

Adds multimodal (vision) request forwarding for the TokenSpeed gRPC backend by extending the TokenSpeed proto to carry preprocessed tensors from the gateway and decoding them in the TokenSpeed Python servicer so the engine can consume precomputed_multimodal_inputs.

Changes:

Extend TokenSpeed scheduler proto with MultimodalInputs (tensor payload + placeholders) and supports_vision in GetModelInfoResponse.
Add a TokenSpeed-specific multimodal assembly path in the gateway and plumb mm_inputs through the TokenSpeed gRPC client request builders.
Decode multimodal tensors in the TokenSpeed servicer and raise gRPC message-size limits to accommodate large pixel tensors.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
model_gateway/src/routers/grpc/proto_wrapper.rs	Adds `MultimodalData::TokenSpeed` and converts TokenSpeed multimodal payloads into TokenSpeed proto types.
model_gateway/src/routers/grpc/multimodal.rs	Assembles TokenSpeed multimodal tensors/placeholders from the gateway’s intermediate representation.
model_gateway/src/routers/grpc/harmony/stages/request_building.rs	Updates TokenSpeed Harmony request building call site to pass the new multimodal argument (currently `None`).
model_gateway/src/routers/grpc/client.rs	Plumbs TokenSpeed multimodal proto into TokenSpeed request builders for chat/messages.
grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py	Reconstructs engine multimodal inputs from proto and passes them via `precomputed_multimodal_inputs`.
grpc_servicer/smg_grpc_servicer/tokenspeed/server.py	Raises gRPC max send/receive message sizes to support large VLM payloads.
crates/grpc_client/src/tokenspeed_scheduler.rs	Adds `mm_inputs` parameter support to TokenSpeed request builder APIs and populates the proto field.
crates/grpc_client/proto/tokenspeed_scheduler.proto	Defines multimodal messages (`TensorData`, `MultimodalInputs`, `PlaceholderRange`) and adds `supports_vision`.

Comments suppressed due to low confidence (1)

grpc_servicer/smg_grpc_servicer/tokenspeed/server.py:129

The warmup probe channel also sets grpc.max_send/receive_message_length to 2 GiB - 1. If you keep the higher ceiling, it should be driven by the same configuration as the server options; otherwise warmup and runtime can diverge and the higher limit still increases memory risk in the client process.

    channel = grpc.insecure_channel(
        grpc_url,
        options=[
            ("grpc.max_send_message_length", (2 << 30) - 1),
            ("grpc.max_receive_message_length", (2 << 30) - 1),
        ],

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+        if cast_to is not None and np.issubdtype(view.dtype, np.floating):
+            return torch.from_numpy(view).to(cast_to)
+        return torch.from_numpy(view.copy())


    server = grpc.aio.server(
        futures.ThreadPoolExecutor(max_workers=10),
        options=[
-            ("grpc.max_send_message_length", 1024 * 1024 * 256),
-            ("grpc.max_receive_message_length", 1024 * 1024 * 256),
+            # VLM pixel_values can be large; raise to gRPC's max (2 GiB - 1).
+            ("grpc.max_send_message_length", (2 << 30) - 1),
+            ("grpc.max_receive_message_length", (2 << 30) - 1),
            # Permissive keepalive so long prefill stalls don't trip GOAWAY.
            ("grpc.http2.min_recv_ping_interval_without_data_ms", 10000),
            ("grpc.keepalive_permit_without_calls", True),


coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (1)

grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py (1)

726-728: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate placeholder length before converting to inclusive ranges.

(offset, length) → (start, end) can produce an invalid end when length == 0. Reject malformed placeholders as INVALID_ARGUMENT before building offsets.

Suggested fix

-        offsets = [(p.offset, p.offset + p.length - 1) for p in mm_inputs.mm_placeholders]
+        offsets = []
+        for p in mm_inputs.mm_placeholders:
+            if p.length == 0:
+                raise ValueError("mm_placeholders.length must be > 0")
+            offsets.append((p.offset, p.offset + p.length - 1))

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py` around lines 726 -
728, Before converting placeholders to inclusive ranges in the block that builds
offsets from mm_inputs.mm_placeholders, validate each placeholder's length
(p.length) to ensure it's > 0 and reject any malformed placeholder with a gRPC
INVALID_ARGUMENT response; specifically, check mm_inputs.mm_placeholders for any
p.length <= 0 and abort/return with grpc.StatusCode.INVALID_ARGUMENT and a clear
message before executing the offsets = [(p.offset, p.offset + p.length - 1) ...]
comprehension so you never produce an invalid end value.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py`:
- Around line 748-773: In _tensor_from_proto, validate tensor_data.dtype,
tensor_data.shape and tensor_data.data length before deserializing: ensure dtype
is one of the supported types (handle "bfloat16" specially), ensure shape is a
sequence of non-negative ints and compute the element count (product of shape,
treating empty shape as 1), determine the expected byte size (use numpy
dtype.itemsize for normal dtypes; for "bfloat16" use uint16/itemsize=2), and
compare expected_bytes to len(tensor_data.data) raising a clear ValueError if
mismatched or dtype unsupported; keep the existing bfloat16 path and casting
behavior but only proceed to np.frombuffer/torch.from_numpy after the checks
pass to avoid late failures or unexpected large allocations.

---

Duplicate comments:
In `@grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py`:
- Around line 726-728: Before converting placeholders to inclusive ranges in the
block that builds offsets from mm_inputs.mm_placeholders, validate each
placeholder's length (p.length) to ensure it's > 0 and reject any malformed
placeholder with a gRPC INVALID_ARGUMENT response; specifically, check
mm_inputs.mm_placeholders for any p.length <= 0 and abort/return with
grpc.StatusCode.INVALID_ARGUMENT and a clear message before executing the
offsets = [(p.offset, p.offset + p.length - 1) ...] comprehension so you never
produce an invalid end value.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: ac5b7923-bbbe-48d9-bb74-174937fb3e1b

📥 Commits

Reviewing files that changed from the base of the PR and between 5fad09e and abf25a3.

📒 Files selected for processing (8)

crates/grpc_client/proto/tokenspeed_scheduler.proto
crates/grpc_client/src/tokenspeed_scheduler.rs
grpc_servicer/smg_grpc_servicer/tokenspeed/server.py
grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py
model_gateway/src/routers/grpc/client.rs
model_gateway/src/routers/grpc/harmony/stages/request_building.rs
model_gateway/src/routers/grpc/multimodal.rs
model_gateway/src/routers/grpc/proto_wrapper.rs

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (1)

grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py (1)

750-772: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate tensor dtype and byte-size before np.frombuffer/reshape.

Malformed dtype/shape/data currently fails late and may surface as INTERNAL instead of a clear client error. Add explicit checks and raise ValueError for invalid payloads.

Proposed fix

     def _tensor_from_proto(
         tensor_data: tokenspeed_scheduler_pb2.TensorData,
         cast_to: torch.dtype | None = None,
     ):
@@
         import numpy as np
         import torch
 
         shape = list(tensor_data.shape)
+        numel = int(np.prod(shape, dtype=np.int64)) if shape else 1
 
         if tensor_data.dtype == "bfloat16":
+            expected = numel * 2
+            if len(tensor_data.data) != expected:
+                raise ValueError(
+                    f"bfloat16 tensor byte-size mismatch: got {len(tensor_data.data)}, expected {expected}"
+                )
             # numpy has no bfloat16 — read the raw bits as uint16, reinterpret.
             t = torch.from_numpy(
-                np.frombuffer(tensor_data.data, dtype=np.uint16).reshape(shape)
+                np.frombuffer(tensor_data.data, dtype=np.uint16, count=numel).reshape(shape)
             ).view(torch.bfloat16)
         else:
+            try:
+                np_dtype = np.dtype(tensor_data.dtype)
+            except TypeError as e:
+                raise ValueError(f"unsupported tensor dtype: {tensor_data.dtype}") from e
+            expected = numel * np_dtype.itemsize
+            if len(tensor_data.data) != expected:
+                raise ValueError(
+                    f"tensor byte-size mismatch: got {len(tensor_data.data)}, expected {expected}"
+                )
             t = torch.from_numpy(
-                np.frombuffer(tensor_data.data, dtype=np.dtype(tensor_data.dtype)).reshape(shape)
+                np.frombuffer(tensor_data.data, dtype=np_dtype, count=numel).reshape(shape)
             )

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py` around lines 750 -
772, The _tensor_from_proto function should validate tensor_data.dtype,
tensor_data.shape, and the byte-length of tensor_data.data before calling
np.frombuffer/reshape: ensure shape is a sequence of non-negative ints and
compute the element count (product of shape); map the declared dtype
(special-case "bfloat16" → uint16 element size 2) to a numpy dtype and determine
expected_bytes = element_size * element_count, then verify len(tensor_data.data)
== expected_bytes and raise a ValueError with a clear message if the dtype is
unknown, shape is invalid, or the byte length mismatches; perform these checks
before any np.frombuffer/reshape calls in _tensor_from_proto so malformed
payloads raise deterministic ValueError client errors.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py`:
- Around line 579-583: The handler currently treats a present mm_inputs without
pixel_values as text-only; instead, in the block that checks
request.HasField("mm_inputs") you should detect the malformed partial multimodal
payload (i.e., request.mm_inputs is present but not
request.mm_inputs.HasField("pixel_values")) and raise a ValueError so the RPC
returns INVALID_ARGUMENT; otherwise, continue to call
_mm_inputs_from_proto(request.mm_inputs,
model_dtype=getattr(self.async_llm.model_config, "dtype", None)) as before.

---

Duplicate comments:
In `@grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py`:
- Around line 750-772: The _tensor_from_proto function should validate
tensor_data.dtype, tensor_data.shape, and the byte-length of tensor_data.data
before calling np.frombuffer/reshape: ensure shape is a sequence of non-negative
ints and compute the element count (product of shape); map the declared dtype
(special-case "bfloat16" → uint16 element size 2) to a numpy dtype and determine
expected_bytes = element_size * element_count, then verify len(tensor_data.data)
== expected_bytes and raise a ValueError with a clear message if the dtype is
unknown, shape is invalid, or the byte length mismatches; perform these checks
before any np.frombuffer/reshape calls in _tensor_from_proto so malformed
payloads raise deterministic ValueError client errors.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 1893392f-f6e6-4a57-b70f-7908560af4fc

📥 Commits

Reviewing files that changed from the base of the PR and between abf25a3 and 5e0e753.

📒 Files selected for processing (8)

crates/grpc_client/proto/tokenspeed_scheduler.proto
crates/grpc_client/src/tokenspeed_scheduler.rs
grpc_servicer/smg_grpc_servicer/tokenspeed/server.py
grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py
model_gateway/src/routers/grpc/client.rs
model_gateway/src/routers/grpc/harmony/stages/request_building.rs
model_gateway/src/routers/grpc/multimodal.rs
model_gateway/src/routers/grpc/proto_wrapper.rs

coderabbitai · 2026-05-22T07:29:52Z

+        if request.HasField("mm_inputs") and request.mm_inputs.HasField("pixel_values"):
+            precomputed_mm = self._mm_inputs_from_proto(
+                request.mm_inputs,
+                model_dtype=getattr(self.async_llm.model_config, "dtype", None),
+            )


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reject partial multimodal payloads instead of silently dropping them.

When mm_inputs is present but pixel_values is missing, the request currently falls back to text-only execution. That hides malformed payloads and can produce incorrect results; fail fast with INVALID_ARGUMENT by raising ValueError.

Proposed fix

- precomputed_mm = None - if request.HasField("mm_inputs") and request.mm_inputs.HasField("pixel_values"): - precomputed_mm = self._mm_inputs_from_proto( - request.mm_inputs, - model_dtype=getattr(self.async_llm.model_config, "dtype", None), - ) + precomputed_mm = None + if request.HasField("mm_inputs"): + if not request.mm_inputs.HasField("pixel_values"): + raise ValueError( + "GenerateRequest.mm_inputs.pixel_values is required when mm_inputs is set" + ) + precomputed_mm = self._mm_inputs_from_proto( + request.mm_inputs, + model_dtype=getattr(self.async_llm.model_config, "dtype", None), + )

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py` around lines 579 - 583, The handler currently treats a present mm_inputs without pixel_values as text-only; instead, in the block that checks request.HasField("mm_inputs") you should detect the malformed partial multimodal payload (i.e., request.mm_inputs is present but not request.mm_inputs.HasField("pixel_values")) and raise a ValueError so the RPC returns INVALID_ARGUMENT; otherwise, continue to call _mm_inputs_from_proto(request.mm_inputs, model_dtype=getattr(self.async_llm.model_config, "dtype", None)) as before.

Extend the TokenSpeed gRPC path to carry the gateway's preprocessed multimodal tensors, so an image request routed to a TokenSpeed worker is served instead of rejected at the router. - proto: add MultimodalInputs (pixel_values + model-specific tensors + placeholder offsets + im_token_id), GenerateRequest.mm_inputs, and GetModelInfoResponse.supports_vision. - gateway: route the per-engine multimodal assembly to a new MultimodalData::TokenSpeed variant; the client forwards the preprocessed tensors instead of returning an error. - servicer: reconstruct the proto into the engine's MultimodalInputs and set precomputed_multimodal_inputs so the engine skips its own preprocessing; raise the gRPC message-size ceiling for large VLM pixel_values. Signed-off-by: chenht2022 <chenht2022@gmail.com>

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@grpc_servicer/smg_grpc_servicer/tokenspeed/server.py`:
- Around line 36-38: Replace the duplicated hardcoded near-2GiB values for
"grpc.max_send_message_length" and "grpc.max_receive_message_length" with a
single config-backed constant and validator used by both the server and warmup
channel; add a module-level constant (e.g., MAX_GRPC_MESSAGE_BYTES) and a helper
function (e.g., validate_grpc_message_size(size)) that enforces a sane upper
bound (<= (2 << 30) - 1) and a reasonable default, load/override it from
configuration/env, and replace the two tuple entries in server.py with the
validated constant so both call sites reference the same shared value and
validation logic.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 68eb0c6e-7f9a-43b1-bcaa-b5bbec4d22d6

📥 Commits

Reviewing files that changed from the base of the PR and between 5e0e753 and 73e54b5.

📒 Files selected for processing (8)

crates/grpc_client/proto/tokenspeed_scheduler.proto
crates/grpc_client/src/tokenspeed_scheduler.rs
grpc_servicer/smg_grpc_servicer/tokenspeed/server.py
grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py
model_gateway/src/routers/grpc/client.rs
model_gateway/src/routers/grpc/harmony/stages/request_building.rs
model_gateway/src/routers/grpc/multimodal.rs
model_gateway/src/routers/grpc/proto_wrapper.rs

Kimi-K2.5 + Qwen3-VL inference via SMG gateway inputs (lightseekorg/smg#1515). OCRBench-validated. Signed-off-by: chenht2022 <chenht2022@gmail.com>

Kimi-K2.5 + Qwen3.5 (with the Qwen3-VL vision tower) inference via SMG gateway inputs (lightseekorg/smg#1515). OCRBench-validated. Signed-off-by: chenht2022 <chenht2022@gmail.com>

lightseek-bot · 2026-05-25T06:28:02Z

@codex review

chatgpt-codex-connector · 2026-05-25T06:33:19Z

Codex Review: Didn't find any major issues. Nice work!

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0bc7e0fdb2

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-25T06:59:39Z

+from tokenspeed.runtime.multimodal.inputs import (
+    Modality,
+    MultimodalDataItem,
+    MultimodalInputs,
+)


Move TokenSpeed multimodal imports behind lazy loading

This file explicitly documents that it should remain importable in environments that stub TokenSpeed (_lazy_generate_req_input comment), but these new top-level imports force tokenspeed to be installed at module import time. In test/tooling contexts that only exercise request-conversion code (and intentionally don’t install the full TokenSpeed runtime), importing this module now fails with ModuleNotFoundError before any lazy path can run. Please load these symbols lazily inside the multimodal conversion path (or gate them similarly to other type-only/lazy imports) to preserve the existing import-time contract.

Useful? React with 👍 / 👎.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bc99de62c9

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-25T07:08:41Z

+            t = torch.from_numpy(
+                np.frombuffer(tensor_data.data, dtype=np.dtype(tensor_data.dtype)).reshape(shape)
+            )


Handle uint32 TensorData without torch.from_numpy

_tensor_from_proto decodes non-bfloat tensors via np.frombuffer(..., dtype=np.dtype(tensor_data.dtype)) and then calls torch.from_numpy(...). In this commit, TensorData explicitly allows "uint32", and the gateway can forward uint32 model-specific tensors, but torch.from_numpy does not accept np.uint32; that path raises a TypeError and the RPC is surfaced as INTERNAL instead of successfully building the request. Any multimodal model that includes unsigned tensor metadata will fail at request conversion unless uint32 is converted through a supported torch dtype path.

Useful? React with 👍 / 👎.

Addresses the open review comments on the multimodal servicer/server: - ``servicer.py``: pull ``torch``, ``numpy``, and the ``tokenspeed.runtime.multimodal.inputs`` symbols out of in-function / ``TYPE_CHECKING`` lazy blocks and up to the regular top-of-file import section. The remaining ``TYPE_CHECKING`` block keeps only ``AsyncLLM`` / ``ServerArgs`` since those are still type-only. - ``server.py``: replace the two open-coded ``(2 << 30) - 1`` gRPC ``max_send/receive_message_length`` literals with a single ``_grpc_max_message_bytes`` helper backed by ``TOKENSPEED_GRPC_MAX_MESSAGE_BYTES``. Default still tops out at the gRPC hard ceiling (2 GiB - 1) so VLM ``pixel_values`` aren't refused, but deployments can now lower it from the outside; bad values fall back to the default with a warning. Signed-off-by: zhyncs <46627482+zhyncs@users.noreply.github.com>

The simplified install path runs ``uv pip install -e ... --no-build-isolation`` on three TokenSpeed packages, but neither ``./python`` nor ``tokenspeed-kernel`` lists ``setuptools`` in their ``build-system.requires``. Without build isolation the build backend resolves those imports from the active venv — uv fails with ``No module named 'setuptools'``. Preseed the three build-time tooling packages (``setuptools``, ``wheel``, ``pybind11``) before the editable installs. Signed-off-by: zhyncs <46627482+zhyncs@users.noreply.github.com>

The previous commit installed ``./python`` first, but its runtime dependencies transitively pull in ``tokenspeed-triton``, populating ``site-packages/tokenspeed_triton/backends/nvidia/include`` before the kernel build runs. tokenspeed-kernel's ``setup.py`` walks site-packages and adds that path to nvcc's ``-I`` list — and the bundled ``crt/host_runtime.h`` defines ``__cudaLaunch`` as a CUDA-12 1-arg macro that conflicts with the 2-arg call nvcc 13.0 emits in its stub (``error: macro "__cudaLaunch" passed 2 arguments, but takes just 1``). Build the kernel first so its setup.py walk sees a clean site-packages, then install scheduler + ``./python``. Mirrors the ``kernel → scheduler → engine`` ordering in tokenspeed's own ``test/ci_system/install_deps.sh``. Signed-off-by: zhyncs <46627482+zhyncs@users.noreply.github.com>

github-actions Bot added grpc gRPC client and router changes model-gateway Model gateway crate changes labels May 21, 2026

gemini-code-assist Bot reviewed May 21, 2026

View reviewed changes

chenht2022 force-pushed the feat/tokenspeed-multimodal branch from 6bc626f to 5fad09e Compare May 21, 2026 16:17

chenht2022 marked this pull request as ready for review May 21, 2026 16:17

chenht2022 requested review from CatherineSue and slin1237 as code owners May 21, 2026 16:17

Copilot AI review requested due to automatic review settings May 21, 2026 16:17

chenht2022 requested a review from key4ng as a code owner May 21, 2026 16:17

Copilot started reviewing on behalf of chenht2022 May 21, 2026 16:18 View session

coderabbitai Bot requested changes May 21, 2026

View reviewed changes

Comment thread grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py Outdated

Copilot AI reviewed May 21, 2026

View reviewed changes

chenht2022 force-pushed the feat/tokenspeed-multimodal branch from 5fad09e to abf25a3 Compare May 21, 2026 16:41

coderabbitai Bot requested changes May 21, 2026

View reviewed changes

Comment thread grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py

chenht2022 marked this pull request as draft May 21, 2026 17:38

chenht2022 force-pushed the feat/tokenspeed-multimodal branch 3 times, most recently from f98be10 to 5e0e753 Compare May 22, 2026 07:22

chenht2022 marked this pull request as ready for review May 22, 2026 07:24

coderabbitai Bot requested changes May 22, 2026

View reviewed changes

lightseek-bot approved these changes May 22, 2026

View reviewed changes

chenht2022 force-pushed the feat/tokenspeed-multimodal branch from 5e0e753 to 73e54b5 Compare May 24, 2026 08:02

coderabbitai Bot requested changes May 24, 2026

View reviewed changes

Comment thread grpc_servicer/smg_grpc_servicer/tokenspeed/server.py Outdated

chenht2022 mentioned this pull request May 24, 2026

feat(runtime): support multimodal VLM lightseekorg/tokenspeed#236

Merged

1 task

Merge branch 'main' into feat/tokenspeed-multimodal

ca040d0

lightseek-bot assigned CatherineSue and slin1237 May 25, 2026

lightseek-bot added the priority:high High priority label May 25, 2026

CatherineSue reviewed May 25, 2026

View reviewed changes

Comment thread grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py Outdated

Comment thread grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py Outdated

Comment thread grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py Outdated

lightseek-bot requested a review from CatherineSue May 25, 2026 06:57

chatgpt-codex-connector Bot reviewed May 25, 2026

View reviewed changes

zhyncs force-pushed the feat/tokenspeed-multimodal branch from 0bc7e0f to bc99de6 Compare May 25, 2026 07:01

chatgpt-codex-connector Bot reviewed May 25, 2026

View reviewed changes

zhyncs force-pushed the feat/tokenspeed-multimodal branch 2 times, most recently from 01f39bb to a3aa2be Compare May 25, 2026 07:23

zhyncs requested review from XinyueZhang369 and gongwei-130 as code owners May 25, 2026 07:23

zhyncs force-pushed the feat/tokenspeed-multimodal branch 4 times, most recently from 7dd0452 to 58fb125 Compare May 25, 2026 08:13

zhyncs force-pushed the feat/tokenspeed-multimodal branch from 58fb125 to 818d022 Compare May 25, 2026 08:18

zhyncs added 2 commits May 25, 2026 08:28

Conversation

chenht2022 commented May 21, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Problem

Solution

Changes

Test Plan

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Suggested labels

Suggested reviewers

Poem

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 21, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 22, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

lightseek-bot commented May 25, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

chatgpt-codex-connector Bot commented May 25, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 25, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 25, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

chenht2022 commented May 21, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 21, 2026 •

edited

Loading