Skip to content

feat(grpc): add TokenSpeed multimodal (VLM) input support#1515

Open
chenht2022 wants to merge 5 commits into
lightseekorg:mainfrom
chenht2022:feat/tokenspeed-multimodal
Open

feat(grpc): add TokenSpeed multimodal (VLM) input support#1515
chenht2022 wants to merge 5 commits into
lightseekorg:mainfrom
chenht2022:feat/tokenspeed-multimodal

Conversation

@chenht2022
Copy link
Copy Markdown

@chenht2022 chenht2022 commented May 21, 2026

Description

Problem

TokenSpeed is wired into the SMG gateway for text generation, but its gRPC path carries no multimodal inputs: an image request routed to a TokenSpeed worker is rejected at the router ("TokenSpeed backend does not support multimodal inputs").

Solution

Extend the TokenSpeed gRPC path to carry the gateway's already-preprocessed multimodal tensors. The gateway performs per-model image preprocessing in crates/multimodal; the servicer reconstructs the tensors and sets precomputed_multimodal_inputs so the engine consumes them directly and skips its own preprocessing.

Changes

  • crates/grpc_client/proto/tokenspeed_scheduler.proto: add MultimodalInputs / TensorData / PlaceholderRange, GenerateRequest.mm_inputs, and GetModelInfoResponse.supports_vision.
  • crates/grpc_client/src/tokenspeed_scheduler.rs, model_gateway/src/routers/grpc/{client,multimodal,proto_wrapper}.rs: assemble and forward TokenSpeed multimodal requests via a new MultimodalData::TokenSpeed variant.
  • grpc_servicer/smg_grpc_servicer/tokenspeed/{servicer,server}.py: decode the precomputed multimodal payload into the engine's inputs; raise the gRPC message-size ceiling for large VLM pixel_values.

Test Plan

  • cargo +nightly fmt --all and cargo clippy --workspace --all-targets --all-features -- -D warnings pass; Python ruff format / ruff check are clean. (cargo test runs in CI.)
  • The Python proto stubs regenerate cleanly and the servicer imports and round-trips the new MultimodalInputs message.
  • Scoped to TokenSpeed: only the TokenSpeed router arm, proto, and servicer change; the other backends' assembly paths and the shared crates/multimodal crate are untouched.
Checklist
  • cargo +nightly fmt passes
  • cargo clippy --all-targets --all-features -- -D warnings passes
  • (Optional) Documentation updated
  • (Optional) Please join us on Slack #sig-smg to discuss, review, and merge PRs

Summary by CodeRabbit

  • New Features

    • End-to-end multimodal support for TokenSpeed: send preprocessed image/audio tensors with placeholder offset handling.
    • Chat and messages requests can include multimodal payloads for vision/audio generation.
    • Models now report a supports_vision flag so routers can route multimodal-capable models.
  • Chores

    • Increased gRPC message size limits to handle large image/video payloads.

Review Change Stack

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 21, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds TokenSpeed multimodal proto types and GetModelInfo vision flag; propagates mm_inputs through Rust client builders and gateway assembly; implements gateway->tokenspeed proto conversion and routing; servicer decodes precomputed tensor payloads (including bfloat16 handling) and increases gRPC message limits to 2 GiB.

Changes

TokenSpeed Multimodal End-to-End Support

Layer / File(s) Summary
Proto schema: multimodal messages and fields
crates/grpc_client/proto/tokenspeed_scheduler.proto
Adds MultimodalInputs, TensorData, PlaceholderRange message types and mm_inputs field to GenerateRequest; adds supports_vision boolean to GetModelInfoResponse.
Rust client request builders: mm_inputs wiring
crates/grpc_client/src/tokenspeed_scheduler.rs
build_generate_request_from_chat and build_generate_request_from_messages now accept multimodal_inputs and populate GenerateRequest.mm_inputs; build_plain_generate_request explicitly sets mm_inputs = None.
Gateway multimodal assembly: TokenSpeed-specific routing and proto conversion
model_gateway/src/routers/grpc/proto_wrapper.rs, model_gateway/src/routers/grpc/multimodal.rs
Adds TokenSpeedMultimodalData and MultimodalData::TokenSpeed variant; implements into_proto to create tokenspeed::MultimodalInputs; adds assemble_tokenspeed to serialize pixel/model-specific tensors and compute placeholder ranges; updates dispatch to route TokenSpeed to assembler and clears mm_inputs appropriately.
Gateway request builders: multimodal proto passing
model_gateway/src/routers/grpc/client.rs, model_gateway/src/routers/grpc/harmony/stages/request_building.rs
build_chat_request and build_messages_request construct TokenSpeed multimodal proto from assembled multimodal data and pass it into client request builders; Harmony path passes explicit None where not wired.
Servicer: multimodal decoding and model info vision flag
grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py
_build_generate_req reconstructs precomputed multimodal inputs from request.mm_inputs using _mm_inputs_from_proto and passes them as precomputed_multimodal_inputs; GetModelInfo sets supports_vision from model config; adds _tensor_from_proto helper (supports bfloat16 reinterpretation and dtype casting).
Infrastructure: gRPC message size limits increased to 2GB
grpc_servicer/smg_grpc_servicer/tokenspeed/server.py
Server and warmup channel options updated to set grpc.max_send_message_length and grpc.max_receive_message_length to (2 << 30) - 1 bytes.

Sequence Diagram(s)

sequenceDiagram
  participant ModelGateway
  participant GrpcClient
  participant TokenSpeedServicer._build_generate_req
  participant TokenSpeedServicer._mm_inputs_from_proto
  participant TokenSpeedServicer._tensor_from_proto
  participant Engine
  ModelGateway->>GrpcClient: send GenerateRequest(mm_inputs)
  GrpcClient->>TokenSpeedServicer._build_generate_req: gRPC Generate call
  _build_generate_req->>_mm_inputs_from_proto: pass mm_inputs proto
  _mm_inputs_from_proto->>_tensor_from_proto: decode TensorData -> tensors
  _tensor_from_proto->>Engine: provide tensors for engine call
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

multimodal

Suggested reviewers

  • key4ng
  • slin1237
  • CatherineSue

Poem

🐰 In buffers snug and shapes that softly hum,

Pixel hops join tokens, now together they come.
Gateway packs the tensors, the client sends the call,
Servicer decodes and wakes the engine for all.
A rabbit twitches whiskers — multimodal for all!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main change: adding TokenSpeed multimodal (VLM) input support to the gRPC integration.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions github-actions Bot added grpc gRPC client and router changes model-gateway Model gateway crate changes labels May 21, 2026
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces multimodal (text and image) support for the TokenSpeed scheduler. Key changes include updating gRPC proto definitions to include multimodal input structures, increasing gRPC message size limits to accommodate large image payloads, and implementing the necessary serialization and deserialization logic in both the Rust gateway and the Python servicer. A critical feedback point was raised regarding memory safety in the Python servicer, where tensors reconstructed from gRPC buffers could potentially alias transient memory, leading to corruption; a fix using explicit cloning was suggested.

Comment on lines +769 to +772
view = np.frombuffer(tensor_data.data, dtype=np.dtype(tensor_data.dtype)).reshape(shape)
if cast_to is not None and np.issubdtype(view.dtype, np.floating):
return torch.from_numpy(view).to(cast_to)
return torch.from_numpy(view.copy())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The current implementation of _tensor_from_proto can return a tensor that aliases the transient bytes buffer of the gRPC request message. Specifically, when cast_to is provided and matches the input dtype, torch.from_numpy(view).to(cast_to) returns a view of the original memory. If the proto message is garbage collected before the engine finishes processing the tensor, this will lead to memory corruption or crashes.

To ensure safety as stated in the docstring ("the buffer is copied so it never aliases the transient proto bytes"), you should always ensure a copy is made. Using t.clone() is an idiomatic way to achieve this in PyTorch.

Suggested change
view = np.frombuffer(tensor_data.data, dtype=np.dtype(tensor_data.dtype)).reshape(shape)
if cast_to is not None and np.issubdtype(view.dtype, np.floating):
return torch.from_numpy(view).to(cast_to)
return torch.from_numpy(view.copy())
view = np.frombuffer(tensor_data.data, dtype=np.dtype(tensor_data.dtype)).reshape(shape)
t = torch.from_numpy(view)
if cast_to is not None and t.dtype != cast_to and np.issubdtype(view.dtype, np.floating):
return t.to(cast_to)
return t.clone()

@chenht2022 chenht2022 force-pushed the feat/tokenspeed-multimodal branch from 6bc626f to 5fad09e Compare May 21, 2026 16:17
@chenht2022 chenht2022 marked this pull request as ready for review May 21, 2026 16:17
Copilot AI review requested due to automatic review settings May 21, 2026 16:17
@chenht2022 chenht2022 requested a review from key4ng as a code owner May 21, 2026 16:17
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py`:
- Around line 726-728: The conversion from (offset, length) to inclusive ranges
in the mm_inputs.mm_placeholders handling can produce invalid ranges when a
placeholder has length == 0 (end = offset - 1); validate each placeholder before
converting by checking p.length > 0 and if any zero-length placeholder is found,
abort the RPC with INVALID_ARGUMENT (use the servicer's gRPC context.abort or
equivalent) so malformed payloads fail fast; update the logic around the offsets
= [...] comprehension in servicer.py to perform this validation on
mm_inputs.mm_placeholders and only build offsets after confirming all lengths
are positive.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: fca3ad61-f637-49f7-ba95-7be1c95f15d9

📥 Commits

Reviewing files that changed from the base of the PR and between 9ed4954 and 5fad09e.

📒 Files selected for processing (8)
  • crates/grpc_client/proto/tokenspeed_scheduler.proto
  • crates/grpc_client/src/tokenspeed_scheduler.rs
  • grpc_servicer/smg_grpc_servicer/tokenspeed/server.py
  • grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py
  • model_gateway/src/routers/grpc/client.rs
  • model_gateway/src/routers/grpc/harmony/stages/request_building.rs
  • model_gateway/src/routers/grpc/multimodal.rs
  • model_gateway/src/routers/grpc/proto_wrapper.rs

Comment thread grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py Outdated
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds multimodal (vision) request forwarding for the TokenSpeed gRPC backend by extending the TokenSpeed proto to carry preprocessed tensors from the gateway and decoding them in the TokenSpeed Python servicer so the engine can consume precomputed_multimodal_inputs.

Changes:

  • Extend TokenSpeed scheduler proto with MultimodalInputs (tensor payload + placeholders) and supports_vision in GetModelInfoResponse.
  • Add a TokenSpeed-specific multimodal assembly path in the gateway and plumb mm_inputs through the TokenSpeed gRPC client request builders.
  • Decode multimodal tensors in the TokenSpeed servicer and raise gRPC message-size limits to accommodate large pixel tensors.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
model_gateway/src/routers/grpc/proto_wrapper.rs Adds MultimodalData::TokenSpeed and converts TokenSpeed multimodal payloads into TokenSpeed proto types.
model_gateway/src/routers/grpc/multimodal.rs Assembles TokenSpeed multimodal tensors/placeholders from the gateway’s intermediate representation.
model_gateway/src/routers/grpc/harmony/stages/request_building.rs Updates TokenSpeed Harmony request building call site to pass the new multimodal argument (currently None).
model_gateway/src/routers/grpc/client.rs Plumbs TokenSpeed multimodal proto into TokenSpeed request builders for chat/messages.
grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py Reconstructs engine multimodal inputs from proto and passes them via precomputed_multimodal_inputs.
grpc_servicer/smg_grpc_servicer/tokenspeed/server.py Raises gRPC max send/receive message sizes to support large VLM payloads.
crates/grpc_client/src/tokenspeed_scheduler.rs Adds mm_inputs parameter support to TokenSpeed request builder APIs and populates the proto field.
crates/grpc_client/proto/tokenspeed_scheduler.proto Defines multimodal messages (TensorData, MultimodalInputs, PlaceholderRange) and adds supports_vision.
Comments suppressed due to low confidence (1)

grpc_servicer/smg_grpc_servicer/tokenspeed/server.py:129

  • The warmup probe channel also sets grpc.max_send/receive_message_length to 2 GiB - 1. If you keep the higher ceiling, it should be driven by the same configuration as the server options; otherwise warmup and runtime can diverge and the higher limit still increases memory risk in the client process.
    channel = grpc.insecure_channel(
        grpc_url,
        options=[
            ("grpc.max_send_message_length", (2 << 30) - 1),
            ("grpc.max_receive_message_length", (2 << 30) - 1),
        ],

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +770 to +772
if cast_to is not None and np.issubdtype(view.dtype, np.floating):
return torch.from_numpy(view).to(cast_to)
return torch.from_numpy(view.copy())
Comment on lines 33 to 41
server = grpc.aio.server(
futures.ThreadPoolExecutor(max_workers=10),
options=[
("grpc.max_send_message_length", 1024 * 1024 * 256),
("grpc.max_receive_message_length", 1024 * 1024 * 256),
# VLM pixel_values can be large; raise to gRPC's max (2 GiB - 1).
("grpc.max_send_message_length", (2 << 30) - 1),
("grpc.max_receive_message_length", (2 << 30) - 1),
# Permissive keepalive so long prefill stalls don't trip GOAWAY.
("grpc.http2.min_recv_ping_interval_without_data_ms", 10000),
("grpc.keepalive_permit_without_calls", True),
@chenht2022 chenht2022 force-pushed the feat/tokenspeed-multimodal branch from 5fad09e to abf25a3 Compare May 21, 2026 16:41
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py (1)

726-728: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate placeholder length before converting to inclusive ranges.

(offset, length)(start, end) can produce an invalid end when length == 0. Reject malformed placeholders as INVALID_ARGUMENT before building offsets.

Suggested fix
-        offsets = [(p.offset, p.offset + p.length - 1) for p in mm_inputs.mm_placeholders]
+        offsets = []
+        for p in mm_inputs.mm_placeholders:
+            if p.length == 0:
+                raise ValueError("mm_placeholders.length must be > 0")
+            offsets.append((p.offset, p.offset + p.length - 1))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py` around lines 726 -
728, Before converting placeholders to inclusive ranges in the block that builds
offsets from mm_inputs.mm_placeholders, validate each placeholder's length
(p.length) to ensure it's > 0 and reject any malformed placeholder with a gRPC
INVALID_ARGUMENT response; specifically, check mm_inputs.mm_placeholders for any
p.length <= 0 and abort/return with grpc.StatusCode.INVALID_ARGUMENT and a clear
message before executing the offsets = [(p.offset, p.offset + p.length - 1) ...]
comprehension so you never produce an invalid end value.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py`:
- Around line 748-773: In _tensor_from_proto, validate tensor_data.dtype,
tensor_data.shape and tensor_data.data length before deserializing: ensure dtype
is one of the supported types (handle "bfloat16" specially), ensure shape is a
sequence of non-negative ints and compute the element count (product of shape,
treating empty shape as 1), determine the expected byte size (use numpy
dtype.itemsize for normal dtypes; for "bfloat16" use uint16/itemsize=2), and
compare expected_bytes to len(tensor_data.data) raising a clear ValueError if
mismatched or dtype unsupported; keep the existing bfloat16 path and casting
behavior but only proceed to np.frombuffer/torch.from_numpy after the checks
pass to avoid late failures or unexpected large allocations.

---

Duplicate comments:
In `@grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py`:
- Around line 726-728: Before converting placeholders to inclusive ranges in the
block that builds offsets from mm_inputs.mm_placeholders, validate each
placeholder's length (p.length) to ensure it's > 0 and reject any malformed
placeholder with a gRPC INVALID_ARGUMENT response; specifically, check
mm_inputs.mm_placeholders for any p.length <= 0 and abort/return with
grpc.StatusCode.INVALID_ARGUMENT and a clear message before executing the
offsets = [(p.offset, p.offset + p.length - 1) ...] comprehension so you never
produce an invalid end value.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: ac5b7923-bbbe-48d9-bb74-174937fb3e1b

📥 Commits

Reviewing files that changed from the base of the PR and between 5fad09e and abf25a3.

📒 Files selected for processing (8)
  • crates/grpc_client/proto/tokenspeed_scheduler.proto
  • crates/grpc_client/src/tokenspeed_scheduler.rs
  • grpc_servicer/smg_grpc_servicer/tokenspeed/server.py
  • grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py
  • model_gateway/src/routers/grpc/client.rs
  • model_gateway/src/routers/grpc/harmony/stages/request_building.rs
  • model_gateway/src/routers/grpc/multimodal.rs
  • model_gateway/src/routers/grpc/proto_wrapper.rs

Comment thread grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py
@chenht2022 chenht2022 marked this pull request as draft May 21, 2026 17:38
@chenht2022 chenht2022 force-pushed the feat/tokenspeed-multimodal branch 3 times, most recently from f98be10 to 5e0e753 Compare May 22, 2026 07:22
@chenht2022 chenht2022 marked this pull request as ready for review May 22, 2026 07:24
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py (1)

750-772: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate tensor dtype and byte-size before np.frombuffer/reshape.

Malformed dtype/shape/data currently fails late and may surface as INTERNAL instead of a clear client error. Add explicit checks and raise ValueError for invalid payloads.

Proposed fix
     def _tensor_from_proto(
         tensor_data: tokenspeed_scheduler_pb2.TensorData,
         cast_to: torch.dtype | None = None,
     ):
@@
         import numpy as np
         import torch
 
         shape = list(tensor_data.shape)
+        numel = int(np.prod(shape, dtype=np.int64)) if shape else 1
 
         if tensor_data.dtype == "bfloat16":
+            expected = numel * 2
+            if len(tensor_data.data) != expected:
+                raise ValueError(
+                    f"bfloat16 tensor byte-size mismatch: got {len(tensor_data.data)}, expected {expected}"
+                )
             # numpy has no bfloat16 — read the raw bits as uint16, reinterpret.
             t = torch.from_numpy(
-                np.frombuffer(tensor_data.data, dtype=np.uint16).reshape(shape)
+                np.frombuffer(tensor_data.data, dtype=np.uint16, count=numel).reshape(shape)
             ).view(torch.bfloat16)
         else:
+            try:
+                np_dtype = np.dtype(tensor_data.dtype)
+            except TypeError as e:
+                raise ValueError(f"unsupported tensor dtype: {tensor_data.dtype}") from e
+            expected = numel * np_dtype.itemsize
+            if len(tensor_data.data) != expected:
+                raise ValueError(
+                    f"tensor byte-size mismatch: got {len(tensor_data.data)}, expected {expected}"
+                )
             t = torch.from_numpy(
-                np.frombuffer(tensor_data.data, dtype=np.dtype(tensor_data.dtype)).reshape(shape)
+                np.frombuffer(tensor_data.data, dtype=np_dtype, count=numel).reshape(shape)
             )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py` around lines 750 -
772, The _tensor_from_proto function should validate tensor_data.dtype,
tensor_data.shape, and the byte-length of tensor_data.data before calling
np.frombuffer/reshape: ensure shape is a sequence of non-negative ints and
compute the element count (product of shape); map the declared dtype
(special-case "bfloat16" → uint16 element size 2) to a numpy dtype and determine
expected_bytes = element_size * element_count, then verify len(tensor_data.data)
== expected_bytes and raise a ValueError with a clear message if the dtype is
unknown, shape is invalid, or the byte length mismatches; perform these checks
before any np.frombuffer/reshape calls in _tensor_from_proto so malformed
payloads raise deterministic ValueError client errors.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py`:
- Around line 579-583: The handler currently treats a present mm_inputs without
pixel_values as text-only; instead, in the block that checks
request.HasField("mm_inputs") you should detect the malformed partial multimodal
payload (i.e., request.mm_inputs is present but not
request.mm_inputs.HasField("pixel_values")) and raise a ValueError so the RPC
returns INVALID_ARGUMENT; otherwise, continue to call
_mm_inputs_from_proto(request.mm_inputs,
model_dtype=getattr(self.async_llm.model_config, "dtype", None)) as before.

---

Duplicate comments:
In `@grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py`:
- Around line 750-772: The _tensor_from_proto function should validate
tensor_data.dtype, tensor_data.shape, and the byte-length of tensor_data.data
before calling np.frombuffer/reshape: ensure shape is a sequence of non-negative
ints and compute the element count (product of shape); map the declared dtype
(special-case "bfloat16" → uint16 element size 2) to a numpy dtype and determine
expected_bytes = element_size * element_count, then verify len(tensor_data.data)
== expected_bytes and raise a ValueError with a clear message if the dtype is
unknown, shape is invalid, or the byte length mismatches; perform these checks
before any np.frombuffer/reshape calls in _tensor_from_proto so malformed
payloads raise deterministic ValueError client errors.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 1893392f-f6e6-4a57-b70f-7908560af4fc

📥 Commits

Reviewing files that changed from the base of the PR and between abf25a3 and 5e0e753.

📒 Files selected for processing (8)
  • crates/grpc_client/proto/tokenspeed_scheduler.proto
  • crates/grpc_client/src/tokenspeed_scheduler.rs
  • grpc_servicer/smg_grpc_servicer/tokenspeed/server.py
  • grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py
  • model_gateway/src/routers/grpc/client.rs
  • model_gateway/src/routers/grpc/harmony/stages/request_building.rs
  • model_gateway/src/routers/grpc/multimodal.rs
  • model_gateway/src/routers/grpc/proto_wrapper.rs

Comment on lines +579 to +583
if request.HasField("mm_inputs") and request.mm_inputs.HasField("pixel_values"):
precomputed_mm = self._mm_inputs_from_proto(
request.mm_inputs,
model_dtype=getattr(self.async_llm.model_config, "dtype", None),
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reject partial multimodal payloads instead of silently dropping them.

When mm_inputs is present but pixel_values is missing, the request currently falls back to text-only execution. That hides malformed payloads and can produce incorrect results; fail fast with INVALID_ARGUMENT by raising ValueError.

Proposed fix
-        precomputed_mm = None
-        if request.HasField("mm_inputs") and request.mm_inputs.HasField("pixel_values"):
-            precomputed_mm = self._mm_inputs_from_proto(
-                request.mm_inputs,
-                model_dtype=getattr(self.async_llm.model_config, "dtype", None),
-            )
+        precomputed_mm = None
+        if request.HasField("mm_inputs"):
+            if not request.mm_inputs.HasField("pixel_values"):
+                raise ValueError(
+                    "GenerateRequest.mm_inputs.pixel_values is required when mm_inputs is set"
+                )
+            precomputed_mm = self._mm_inputs_from_proto(
+                request.mm_inputs,
+                model_dtype=getattr(self.async_llm.model_config, "dtype", None),
+            )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py` around lines 579 -
583, The handler currently treats a present mm_inputs without pixel_values as
text-only; instead, in the block that checks request.HasField("mm_inputs") you
should detect the malformed partial multimodal payload (i.e., request.mm_inputs
is present but not request.mm_inputs.HasField("pixel_values")) and raise a
ValueError so the RPC returns INVALID_ARGUMENT; otherwise, continue to call
_mm_inputs_from_proto(request.mm_inputs,
model_dtype=getattr(self.async_llm.model_config, "dtype", None)) as before.

Extend the TokenSpeed gRPC path to carry the gateway's preprocessed
multimodal tensors, so an image request routed to a TokenSpeed worker is
served instead of rejected at the router.

- proto: add MultimodalInputs (pixel_values + model-specific tensors +
  placeholder offsets + im_token_id), GenerateRequest.mm_inputs, and
  GetModelInfoResponse.supports_vision.
- gateway: route the per-engine multimodal assembly to a new
  MultimodalData::TokenSpeed variant; the client forwards the preprocessed
  tensors instead of returning an error.
- servicer: reconstruct the proto into the engine's MultimodalInputs and set
  precomputed_multimodal_inputs so the engine skips its own preprocessing;
  raise the gRPC message-size ceiling for large VLM pixel_values.

Signed-off-by: chenht2022 <chenht2022@gmail.com>
@chenht2022 chenht2022 force-pushed the feat/tokenspeed-multimodal branch from 5e0e753 to 73e54b5 Compare May 24, 2026 08:02
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@grpc_servicer/smg_grpc_servicer/tokenspeed/server.py`:
- Around line 36-38: Replace the duplicated hardcoded near-2GiB values for
"grpc.max_send_message_length" and "grpc.max_receive_message_length" with a
single config-backed constant and validator used by both the server and warmup
channel; add a module-level constant (e.g., MAX_GRPC_MESSAGE_BYTES) and a helper
function (e.g., validate_grpc_message_size(size)) that enforces a sane upper
bound (<= (2 << 30) - 1) and a reasonable default, load/override it from
configuration/env, and replace the two tuple entries in server.py with the
validated constant so both call sites reference the same shared value and
validation logic.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 68eb0c6e-7f9a-43b1-bcaa-b5bbec4d22d6

📥 Commits

Reviewing files that changed from the base of the PR and between 5e0e753 and 73e54b5.

📒 Files selected for processing (8)
  • crates/grpc_client/proto/tokenspeed_scheduler.proto
  • crates/grpc_client/src/tokenspeed_scheduler.rs
  • grpc_servicer/smg_grpc_servicer/tokenspeed/server.py
  • grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py
  • model_gateway/src/routers/grpc/client.rs
  • model_gateway/src/routers/grpc/harmony/stages/request_building.rs
  • model_gateway/src/routers/grpc/multimodal.rs
  • model_gateway/src/routers/grpc/proto_wrapper.rs

Comment thread grpc_servicer/smg_grpc_servicer/tokenspeed/server.py Outdated
chenht2022 added a commit to lightseekorg/tokenspeed that referenced this pull request May 24, 2026
Kimi-K2.5 + Qwen3-VL inference via SMG gateway inputs
(lightseekorg/smg#1515). OCRBench-validated.

Signed-off-by: chenht2022 <chenht2022@gmail.com>
chenht2022 added a commit to lightseekorg/tokenspeed that referenced this pull request May 24, 2026
Kimi-K2.5 + Qwen3.5 (with the Qwen3-VL vision tower) inference via
SMG gateway inputs (lightseekorg/smg#1515). OCRBench-validated.

Signed-off-by: chenht2022 <chenht2022@gmail.com>
chenht2022 added a commit to lightseekorg/tokenspeed that referenced this pull request May 24, 2026
Kimi-K2.5 + Qwen3.5 (with the Qwen3-VL vision tower) inference via
SMG gateway inputs (lightseekorg/smg#1515). OCRBench-validated.

Signed-off-by: chenht2022 <chenht2022@gmail.com>
chenht2022 added a commit to lightseekorg/tokenspeed that referenced this pull request May 24, 2026
Kimi-K2.5 + Qwen3.5 (with the Qwen3-VL vision tower) inference via
SMG gateway inputs (lightseekorg/smg#1515). OCRBench-validated.

Signed-off-by: chenht2022 <chenht2022@gmail.com>
chenht2022 added a commit to lightseekorg/tokenspeed that referenced this pull request May 24, 2026
Kimi-K2.5 + Qwen3.5 (with the Qwen3-VL vision tower) inference via
SMG gateway inputs (lightseekorg/smg#1515). OCRBench-validated.

Signed-off-by: chenht2022 <chenht2022@gmail.com>
@lightseek-bot
Copy link
Copy Markdown
Collaborator

@codex review

Comment thread grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py Outdated
Comment thread grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py Outdated
Comment thread grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py Outdated
@chatgpt-codex-connector
Copy link
Copy Markdown

Codex Review: Didn't find any major issues. Nice work!

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0bc7e0fdb2

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +30 to +34
from tokenspeed.runtime.multimodal.inputs import (
Modality,
MultimodalDataItem,
MultimodalInputs,
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Move TokenSpeed multimodal imports behind lazy loading

This file explicitly documents that it should remain importable in environments that stub TokenSpeed (_lazy_generate_req_input comment), but these new top-level imports force tokenspeed to be installed at module import time. In test/tooling contexts that only exercise request-conversion code (and intentionally don’t install the full TokenSpeed runtime), importing this module now fails with ModuleNotFoundError before any lazy path can run. Please load these symbols lazily inside the multimodal conversion path (or gate them similarly to other type-only/lazy imports) to preserve the existing import-time contract.

Useful? React with 👍 / 👎.

@zhyncs zhyncs force-pushed the feat/tokenspeed-multimodal branch from 0bc7e0f to bc99de6 Compare May 25, 2026 07:01
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bc99de62c9

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +766 to +768
t = torch.from_numpy(
np.frombuffer(tensor_data.data, dtype=np.dtype(tensor_data.dtype)).reshape(shape)
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Handle uint32 TensorData without torch.from_numpy

_tensor_from_proto decodes non-bfloat tensors via np.frombuffer(..., dtype=np.dtype(tensor_data.dtype)) and then calls torch.from_numpy(...). In this commit, TensorData explicitly allows "uint32", and the gateway can forward uint32 model-specific tensors, but torch.from_numpy does not accept np.uint32; that path raises a TypeError and the RPC is surfaced as INTERNAL instead of successfully building the request. Any multimodal model that includes unsigned tensor metadata will fail at request conversion unless uint32 is converted through a supported torch dtype path.

Useful? React with 👍 / 👎.

@zhyncs zhyncs force-pushed the feat/tokenspeed-multimodal branch 2 times, most recently from 01f39bb to a3aa2be Compare May 25, 2026 07:23
@zhyncs zhyncs force-pushed the feat/tokenspeed-multimodal branch 4 times, most recently from 7dd0452 to 58fb125 Compare May 25, 2026 08:13
Addresses the open review comments on the multimodal servicer/server:

- ``servicer.py``: pull ``torch``, ``numpy``, and the
  ``tokenspeed.runtime.multimodal.inputs`` symbols out of in-function /
  ``TYPE_CHECKING`` lazy blocks and up to the regular top-of-file
  import section. The remaining ``TYPE_CHECKING`` block keeps only
  ``AsyncLLM`` / ``ServerArgs`` since those are still type-only.
- ``server.py``: replace the two open-coded ``(2 << 30) - 1`` gRPC
  ``max_send/receive_message_length`` literals with a single
  ``_grpc_max_message_bytes`` helper backed by
  ``TOKENSPEED_GRPC_MAX_MESSAGE_BYTES``. Default still tops out at the
  gRPC hard ceiling (2 GiB - 1) so VLM ``pixel_values`` aren't refused,
  but deployments can now lower it from the outside; bad values fall
  back to the default with a warning.

Signed-off-by: zhyncs <46627482+zhyncs@users.noreply.github.com>
@zhyncs zhyncs force-pushed the feat/tokenspeed-multimodal branch from 58fb125 to 818d022 Compare May 25, 2026 08:18
zhyncs added 2 commits May 25, 2026 08:28
The simplified install path runs ``uv pip install -e ... --no-build-isolation``
on three TokenSpeed packages, but neither ``./python`` nor
``tokenspeed-kernel`` lists ``setuptools`` in their
``build-system.requires``. Without build isolation the build backend
resolves those imports from the active venv — uv fails with
``No module named 'setuptools'``.

Preseed the three build-time tooling packages (``setuptools``,
``wheel``, ``pybind11``) before the editable installs.

Signed-off-by: zhyncs <46627482+zhyncs@users.noreply.github.com>
The previous commit installed ``./python`` first, but its runtime
dependencies transitively pull in ``tokenspeed-triton``, populating
``site-packages/tokenspeed_triton/backends/nvidia/include`` before the
kernel build runs. tokenspeed-kernel's ``setup.py`` walks site-packages
and adds that path to nvcc's ``-I`` list — and the bundled
``crt/host_runtime.h`` defines ``__cudaLaunch`` as a CUDA-12 1-arg
macro that conflicts with the 2-arg call nvcc 13.0 emits in its stub
(``error: macro "__cudaLaunch" passed 2 arguments, but takes just 1``).

Build the kernel first so its setup.py walk sees a clean site-packages,
then install scheduler + ``./python``. Mirrors the ``kernel → scheduler
→ engine`` ordering in tokenspeed's own
``test/ci_system/install_deps.sh``.

Signed-off-by: zhyncs <46627482+zhyncs@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

grpc gRPC client and router changes model-gateway Model gateway crate changes priority:high High priority

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants