feat(grpc): add TokenSpeed multimodal (VLM) input support#1515
feat(grpc): add TokenSpeed multimodal (VLM) input support#1515chenht2022 wants to merge 5 commits into
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds TokenSpeed multimodal proto types and GetModelInfo vision flag; propagates mm_inputs through Rust client builders and gateway assembly; implements gateway->tokenspeed proto conversion and routing; servicer decodes precomputed tensor payloads (including bfloat16 handling) and increases gRPC message limits to 2 GiB. ChangesTokenSpeed Multimodal End-to-End Support
Sequence Diagram(s)sequenceDiagram
participant ModelGateway
participant GrpcClient
participant TokenSpeedServicer._build_generate_req
participant TokenSpeedServicer._mm_inputs_from_proto
participant TokenSpeedServicer._tensor_from_proto
participant Engine
ModelGateway->>GrpcClient: send GenerateRequest(mm_inputs)
GrpcClient->>TokenSpeedServicer._build_generate_req: gRPC Generate call
_build_generate_req->>_mm_inputs_from_proto: pass mm_inputs proto
_mm_inputs_from_proto->>_tensor_from_proto: decode TensorData -> tensors
_tensor_from_proto->>Engine: provide tensors for engine call
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces multimodal (text and image) support for the TokenSpeed scheduler. Key changes include updating gRPC proto definitions to include multimodal input structures, increasing gRPC message size limits to accommodate large image payloads, and implementing the necessary serialization and deserialization logic in both the Rust gateway and the Python servicer. A critical feedback point was raised regarding memory safety in the Python servicer, where tensors reconstructed from gRPC buffers could potentially alias transient memory, leading to corruption; a fix using explicit cloning was suggested.
| view = np.frombuffer(tensor_data.data, dtype=np.dtype(tensor_data.dtype)).reshape(shape) | ||
| if cast_to is not None and np.issubdtype(view.dtype, np.floating): | ||
| return torch.from_numpy(view).to(cast_to) | ||
| return torch.from_numpy(view.copy()) |
There was a problem hiding this comment.
The current implementation of _tensor_from_proto can return a tensor that aliases the transient bytes buffer of the gRPC request message. Specifically, when cast_to is provided and matches the input dtype, torch.from_numpy(view).to(cast_to) returns a view of the original memory. If the proto message is garbage collected before the engine finishes processing the tensor, this will lead to memory corruption or crashes.
To ensure safety as stated in the docstring ("the buffer is copied so it never aliases the transient proto bytes"), you should always ensure a copy is made. Using t.clone() is an idiomatic way to achieve this in PyTorch.
| view = np.frombuffer(tensor_data.data, dtype=np.dtype(tensor_data.dtype)).reshape(shape) | |
| if cast_to is not None and np.issubdtype(view.dtype, np.floating): | |
| return torch.from_numpy(view).to(cast_to) | |
| return torch.from_numpy(view.copy()) | |
| view = np.frombuffer(tensor_data.data, dtype=np.dtype(tensor_data.dtype)).reshape(shape) | |
| t = torch.from_numpy(view) | |
| if cast_to is not None and t.dtype != cast_to and np.issubdtype(view.dtype, np.floating): | |
| return t.to(cast_to) | |
| return t.clone() |
6bc626f to
5fad09e
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py`:
- Around line 726-728: The conversion from (offset, length) to inclusive ranges
in the mm_inputs.mm_placeholders handling can produce invalid ranges when a
placeholder has length == 0 (end = offset - 1); validate each placeholder before
converting by checking p.length > 0 and if any zero-length placeholder is found,
abort the RPC with INVALID_ARGUMENT (use the servicer's gRPC context.abort or
equivalent) so malformed payloads fail fast; update the logic around the offsets
= [...] comprehension in servicer.py to perform this validation on
mm_inputs.mm_placeholders and only build offsets after confirming all lengths
are positive.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: fca3ad61-f637-49f7-ba95-7be1c95f15d9
📒 Files selected for processing (8)
crates/grpc_client/proto/tokenspeed_scheduler.protocrates/grpc_client/src/tokenspeed_scheduler.rsgrpc_servicer/smg_grpc_servicer/tokenspeed/server.pygrpc_servicer/smg_grpc_servicer/tokenspeed/servicer.pymodel_gateway/src/routers/grpc/client.rsmodel_gateway/src/routers/grpc/harmony/stages/request_building.rsmodel_gateway/src/routers/grpc/multimodal.rsmodel_gateway/src/routers/grpc/proto_wrapper.rs
There was a problem hiding this comment.
Pull request overview
Adds multimodal (vision) request forwarding for the TokenSpeed gRPC backend by extending the TokenSpeed proto to carry preprocessed tensors from the gateway and decoding them in the TokenSpeed Python servicer so the engine can consume precomputed_multimodal_inputs.
Changes:
- Extend TokenSpeed scheduler proto with
MultimodalInputs(tensor payload + placeholders) andsupports_visioninGetModelInfoResponse. - Add a TokenSpeed-specific multimodal assembly path in the gateway and plumb
mm_inputsthrough the TokenSpeed gRPC client request builders. - Decode multimodal tensors in the TokenSpeed servicer and raise gRPC message-size limits to accommodate large pixel tensors.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| model_gateway/src/routers/grpc/proto_wrapper.rs | Adds MultimodalData::TokenSpeed and converts TokenSpeed multimodal payloads into TokenSpeed proto types. |
| model_gateway/src/routers/grpc/multimodal.rs | Assembles TokenSpeed multimodal tensors/placeholders from the gateway’s intermediate representation. |
| model_gateway/src/routers/grpc/harmony/stages/request_building.rs | Updates TokenSpeed Harmony request building call site to pass the new multimodal argument (currently None). |
| model_gateway/src/routers/grpc/client.rs | Plumbs TokenSpeed multimodal proto into TokenSpeed request builders for chat/messages. |
| grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py | Reconstructs engine multimodal inputs from proto and passes them via precomputed_multimodal_inputs. |
| grpc_servicer/smg_grpc_servicer/tokenspeed/server.py | Raises gRPC max send/receive message sizes to support large VLM payloads. |
| crates/grpc_client/src/tokenspeed_scheduler.rs | Adds mm_inputs parameter support to TokenSpeed request builder APIs and populates the proto field. |
| crates/grpc_client/proto/tokenspeed_scheduler.proto | Defines multimodal messages (TensorData, MultimodalInputs, PlaceholderRange) and adds supports_vision. |
Comments suppressed due to low confidence (1)
grpc_servicer/smg_grpc_servicer/tokenspeed/server.py:129
- The warmup probe channel also sets grpc.max_send/receive_message_length to 2 GiB - 1. If you keep the higher ceiling, it should be driven by the same configuration as the server options; otherwise warmup and runtime can diverge and the higher limit still increases memory risk in the client process.
channel = grpc.insecure_channel(
grpc_url,
options=[
("grpc.max_send_message_length", (2 << 30) - 1),
("grpc.max_receive_message_length", (2 << 30) - 1),
],
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if cast_to is not None and np.issubdtype(view.dtype, np.floating): | ||
| return torch.from_numpy(view).to(cast_to) | ||
| return torch.from_numpy(view.copy()) |
| server = grpc.aio.server( | ||
| futures.ThreadPoolExecutor(max_workers=10), | ||
| options=[ | ||
| ("grpc.max_send_message_length", 1024 * 1024 * 256), | ||
| ("grpc.max_receive_message_length", 1024 * 1024 * 256), | ||
| # VLM pixel_values can be large; raise to gRPC's max (2 GiB - 1). | ||
| ("grpc.max_send_message_length", (2 << 30) - 1), | ||
| ("grpc.max_receive_message_length", (2 << 30) - 1), | ||
| # Permissive keepalive so long prefill stalls don't trip GOAWAY. | ||
| ("grpc.http2.min_recv_ping_interval_without_data_ms", 10000), | ||
| ("grpc.keepalive_permit_without_calls", True), |
5fad09e to
abf25a3
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py (1)
726-728:⚠️ Potential issue | 🟠 Major | ⚡ Quick winValidate placeholder length before converting to inclusive ranges.
(offset, length)→(start, end)can produce an invalid end whenlength == 0. Reject malformed placeholders asINVALID_ARGUMENTbefore buildingoffsets.Suggested fix
- offsets = [(p.offset, p.offset + p.length - 1) for p in mm_inputs.mm_placeholders] + offsets = [] + for p in mm_inputs.mm_placeholders: + if p.length == 0: + raise ValueError("mm_placeholders.length must be > 0") + offsets.append((p.offset, p.offset + p.length - 1))🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py` around lines 726 - 728, Before converting placeholders to inclusive ranges in the block that builds offsets from mm_inputs.mm_placeholders, validate each placeholder's length (p.length) to ensure it's > 0 and reject any malformed placeholder with a gRPC INVALID_ARGUMENT response; specifically, check mm_inputs.mm_placeholders for any p.length <= 0 and abort/return with grpc.StatusCode.INVALID_ARGUMENT and a clear message before executing the offsets = [(p.offset, p.offset + p.length - 1) ...] comprehension so you never produce an invalid end value.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py`:
- Around line 748-773: In _tensor_from_proto, validate tensor_data.dtype,
tensor_data.shape and tensor_data.data length before deserializing: ensure dtype
is one of the supported types (handle "bfloat16" specially), ensure shape is a
sequence of non-negative ints and compute the element count (product of shape,
treating empty shape as 1), determine the expected byte size (use numpy
dtype.itemsize for normal dtypes; for "bfloat16" use uint16/itemsize=2), and
compare expected_bytes to len(tensor_data.data) raising a clear ValueError if
mismatched or dtype unsupported; keep the existing bfloat16 path and casting
behavior but only proceed to np.frombuffer/torch.from_numpy after the checks
pass to avoid late failures or unexpected large allocations.
---
Duplicate comments:
In `@grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py`:
- Around line 726-728: Before converting placeholders to inclusive ranges in the
block that builds offsets from mm_inputs.mm_placeholders, validate each
placeholder's length (p.length) to ensure it's > 0 and reject any malformed
placeholder with a gRPC INVALID_ARGUMENT response; specifically, check
mm_inputs.mm_placeholders for any p.length <= 0 and abort/return with
grpc.StatusCode.INVALID_ARGUMENT and a clear message before executing the
offsets = [(p.offset, p.offset + p.length - 1) ...] comprehension so you never
produce an invalid end value.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: ac5b7923-bbbe-48d9-bb74-174937fb3e1b
📒 Files selected for processing (8)
crates/grpc_client/proto/tokenspeed_scheduler.protocrates/grpc_client/src/tokenspeed_scheduler.rsgrpc_servicer/smg_grpc_servicer/tokenspeed/server.pygrpc_servicer/smg_grpc_servicer/tokenspeed/servicer.pymodel_gateway/src/routers/grpc/client.rsmodel_gateway/src/routers/grpc/harmony/stages/request_building.rsmodel_gateway/src/routers/grpc/multimodal.rsmodel_gateway/src/routers/grpc/proto_wrapper.rs
f98be10 to
5e0e753
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py (1)
750-772:⚠️ Potential issue | 🟠 Major | ⚡ Quick winValidate tensor dtype and byte-size before
np.frombuffer/reshape.Malformed
dtype/shape/datacurrently fails late and may surface as INTERNAL instead of a clear client error. Add explicit checks and raiseValueErrorfor invalid payloads.Proposed fix
def _tensor_from_proto( tensor_data: tokenspeed_scheduler_pb2.TensorData, cast_to: torch.dtype | None = None, ): @@ import numpy as np import torch shape = list(tensor_data.shape) + numel = int(np.prod(shape, dtype=np.int64)) if shape else 1 if tensor_data.dtype == "bfloat16": + expected = numel * 2 + if len(tensor_data.data) != expected: + raise ValueError( + f"bfloat16 tensor byte-size mismatch: got {len(tensor_data.data)}, expected {expected}" + ) # numpy has no bfloat16 — read the raw bits as uint16, reinterpret. t = torch.from_numpy( - np.frombuffer(tensor_data.data, dtype=np.uint16).reshape(shape) + np.frombuffer(tensor_data.data, dtype=np.uint16, count=numel).reshape(shape) ).view(torch.bfloat16) else: + try: + np_dtype = np.dtype(tensor_data.dtype) + except TypeError as e: + raise ValueError(f"unsupported tensor dtype: {tensor_data.dtype}") from e + expected = numel * np_dtype.itemsize + if len(tensor_data.data) != expected: + raise ValueError( + f"tensor byte-size mismatch: got {len(tensor_data.data)}, expected {expected}" + ) t = torch.from_numpy( - np.frombuffer(tensor_data.data, dtype=np.dtype(tensor_data.dtype)).reshape(shape) + np.frombuffer(tensor_data.data, dtype=np_dtype, count=numel).reshape(shape) )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py` around lines 750 - 772, The _tensor_from_proto function should validate tensor_data.dtype, tensor_data.shape, and the byte-length of tensor_data.data before calling np.frombuffer/reshape: ensure shape is a sequence of non-negative ints and compute the element count (product of shape); map the declared dtype (special-case "bfloat16" → uint16 element size 2) to a numpy dtype and determine expected_bytes = element_size * element_count, then verify len(tensor_data.data) == expected_bytes and raise a ValueError with a clear message if the dtype is unknown, shape is invalid, or the byte length mismatches; perform these checks before any np.frombuffer/reshape calls in _tensor_from_proto so malformed payloads raise deterministic ValueError client errors.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py`:
- Around line 579-583: The handler currently treats a present mm_inputs without
pixel_values as text-only; instead, in the block that checks
request.HasField("mm_inputs") you should detect the malformed partial multimodal
payload (i.e., request.mm_inputs is present but not
request.mm_inputs.HasField("pixel_values")) and raise a ValueError so the RPC
returns INVALID_ARGUMENT; otherwise, continue to call
_mm_inputs_from_proto(request.mm_inputs,
model_dtype=getattr(self.async_llm.model_config, "dtype", None)) as before.
---
Duplicate comments:
In `@grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py`:
- Around line 750-772: The _tensor_from_proto function should validate
tensor_data.dtype, tensor_data.shape, and the byte-length of tensor_data.data
before calling np.frombuffer/reshape: ensure shape is a sequence of non-negative
ints and compute the element count (product of shape); map the declared dtype
(special-case "bfloat16" → uint16 element size 2) to a numpy dtype and determine
expected_bytes = element_size * element_count, then verify len(tensor_data.data)
== expected_bytes and raise a ValueError with a clear message if the dtype is
unknown, shape is invalid, or the byte length mismatches; perform these checks
before any np.frombuffer/reshape calls in _tensor_from_proto so malformed
payloads raise deterministic ValueError client errors.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 1893392f-f6e6-4a57-b70f-7908560af4fc
📒 Files selected for processing (8)
crates/grpc_client/proto/tokenspeed_scheduler.protocrates/grpc_client/src/tokenspeed_scheduler.rsgrpc_servicer/smg_grpc_servicer/tokenspeed/server.pygrpc_servicer/smg_grpc_servicer/tokenspeed/servicer.pymodel_gateway/src/routers/grpc/client.rsmodel_gateway/src/routers/grpc/harmony/stages/request_building.rsmodel_gateway/src/routers/grpc/multimodal.rsmodel_gateway/src/routers/grpc/proto_wrapper.rs
| if request.HasField("mm_inputs") and request.mm_inputs.HasField("pixel_values"): | ||
| precomputed_mm = self._mm_inputs_from_proto( | ||
| request.mm_inputs, | ||
| model_dtype=getattr(self.async_llm.model_config, "dtype", None), | ||
| ) |
There was a problem hiding this comment.
Reject partial multimodal payloads instead of silently dropping them.
When mm_inputs is present but pixel_values is missing, the request currently falls back to text-only execution. That hides malformed payloads and can produce incorrect results; fail fast with INVALID_ARGUMENT by raising ValueError.
Proposed fix
- precomputed_mm = None
- if request.HasField("mm_inputs") and request.mm_inputs.HasField("pixel_values"):
- precomputed_mm = self._mm_inputs_from_proto(
- request.mm_inputs,
- model_dtype=getattr(self.async_llm.model_config, "dtype", None),
- )
+ precomputed_mm = None
+ if request.HasField("mm_inputs"):
+ if not request.mm_inputs.HasField("pixel_values"):
+ raise ValueError(
+ "GenerateRequest.mm_inputs.pixel_values is required when mm_inputs is set"
+ )
+ precomputed_mm = self._mm_inputs_from_proto(
+ request.mm_inputs,
+ model_dtype=getattr(self.async_llm.model_config, "dtype", None),
+ )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py` around lines 579 -
583, The handler currently treats a present mm_inputs without pixel_values as
text-only; instead, in the block that checks request.HasField("mm_inputs") you
should detect the malformed partial multimodal payload (i.e., request.mm_inputs
is present but not request.mm_inputs.HasField("pixel_values")) and raise a
ValueError so the RPC returns INVALID_ARGUMENT; otherwise, continue to call
_mm_inputs_from_proto(request.mm_inputs,
model_dtype=getattr(self.async_llm.model_config, "dtype", None)) as before.
Extend the TokenSpeed gRPC path to carry the gateway's preprocessed multimodal tensors, so an image request routed to a TokenSpeed worker is served instead of rejected at the router. - proto: add MultimodalInputs (pixel_values + model-specific tensors + placeholder offsets + im_token_id), GenerateRequest.mm_inputs, and GetModelInfoResponse.supports_vision. - gateway: route the per-engine multimodal assembly to a new MultimodalData::TokenSpeed variant; the client forwards the preprocessed tensors instead of returning an error. - servicer: reconstruct the proto into the engine's MultimodalInputs and set precomputed_multimodal_inputs so the engine skips its own preprocessing; raise the gRPC message-size ceiling for large VLM pixel_values. Signed-off-by: chenht2022 <chenht2022@gmail.com>
5e0e753 to
73e54b5
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@grpc_servicer/smg_grpc_servicer/tokenspeed/server.py`:
- Around line 36-38: Replace the duplicated hardcoded near-2GiB values for
"grpc.max_send_message_length" and "grpc.max_receive_message_length" with a
single config-backed constant and validator used by both the server and warmup
channel; add a module-level constant (e.g., MAX_GRPC_MESSAGE_BYTES) and a helper
function (e.g., validate_grpc_message_size(size)) that enforces a sane upper
bound (<= (2 << 30) - 1) and a reasonable default, load/override it from
configuration/env, and replace the two tuple entries in server.py with the
validated constant so both call sites reference the same shared value and
validation logic.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 68eb0c6e-7f9a-43b1-bcaa-b5bbec4d22d6
📒 Files selected for processing (8)
crates/grpc_client/proto/tokenspeed_scheduler.protocrates/grpc_client/src/tokenspeed_scheduler.rsgrpc_servicer/smg_grpc_servicer/tokenspeed/server.pygrpc_servicer/smg_grpc_servicer/tokenspeed/servicer.pymodel_gateway/src/routers/grpc/client.rsmodel_gateway/src/routers/grpc/harmony/stages/request_building.rsmodel_gateway/src/routers/grpc/multimodal.rsmodel_gateway/src/routers/grpc/proto_wrapper.rs
Kimi-K2.5 + Qwen3-VL inference via SMG gateway inputs (lightseekorg/smg#1515). OCRBench-validated. Signed-off-by: chenht2022 <chenht2022@gmail.com>
Kimi-K2.5 + Qwen3.5 (with the Qwen3-VL vision tower) inference via SMG gateway inputs (lightseekorg/smg#1515). OCRBench-validated. Signed-off-by: chenht2022 <chenht2022@gmail.com>
Kimi-K2.5 + Qwen3.5 (with the Qwen3-VL vision tower) inference via SMG gateway inputs (lightseekorg/smg#1515). OCRBench-validated. Signed-off-by: chenht2022 <chenht2022@gmail.com>
Kimi-K2.5 + Qwen3.5 (with the Qwen3-VL vision tower) inference via SMG gateway inputs (lightseekorg/smg#1515). OCRBench-validated. Signed-off-by: chenht2022 <chenht2022@gmail.com>
Kimi-K2.5 + Qwen3.5 (with the Qwen3-VL vision tower) inference via SMG gateway inputs (lightseekorg/smg#1515). OCRBench-validated. Signed-off-by: chenht2022 <chenht2022@gmail.com>
|
@codex review |
|
Codex Review: Didn't find any major issues. Nice work! ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0bc7e0fdb2
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| from tokenspeed.runtime.multimodal.inputs import ( | ||
| Modality, | ||
| MultimodalDataItem, | ||
| MultimodalInputs, | ||
| ) |
There was a problem hiding this comment.
Move TokenSpeed multimodal imports behind lazy loading
This file explicitly documents that it should remain importable in environments that stub TokenSpeed (_lazy_generate_req_input comment), but these new top-level imports force tokenspeed to be installed at module import time. In test/tooling contexts that only exercise request-conversion code (and intentionally don’t install the full TokenSpeed runtime), importing this module now fails with ModuleNotFoundError before any lazy path can run. Please load these symbols lazily inside the multimodal conversion path (or gate them similarly to other type-only/lazy imports) to preserve the existing import-time contract.
Useful? React with 👍 / 👎.
0bc7e0f to
bc99de6
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: bc99de62c9
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| t = torch.from_numpy( | ||
| np.frombuffer(tensor_data.data, dtype=np.dtype(tensor_data.dtype)).reshape(shape) | ||
| ) |
There was a problem hiding this comment.
Handle uint32 TensorData without torch.from_numpy
_tensor_from_proto decodes non-bfloat tensors via np.frombuffer(..., dtype=np.dtype(tensor_data.dtype)) and then calls torch.from_numpy(...). In this commit, TensorData explicitly allows "uint32", and the gateway can forward uint32 model-specific tensors, but torch.from_numpy does not accept np.uint32; that path raises a TypeError and the RPC is surfaced as INTERNAL instead of successfully building the request. Any multimodal model that includes unsigned tensor metadata will fail at request conversion unless uint32 is converted through a supported torch dtype path.
Useful? React with 👍 / 👎.
01f39bb to
a3aa2be
Compare
7dd0452 to
58fb125
Compare
Addresses the open review comments on the multimodal servicer/server: - ``servicer.py``: pull ``torch``, ``numpy``, and the ``tokenspeed.runtime.multimodal.inputs`` symbols out of in-function / ``TYPE_CHECKING`` lazy blocks and up to the regular top-of-file import section. The remaining ``TYPE_CHECKING`` block keeps only ``AsyncLLM`` / ``ServerArgs`` since those are still type-only. - ``server.py``: replace the two open-coded ``(2 << 30) - 1`` gRPC ``max_send/receive_message_length`` literals with a single ``_grpc_max_message_bytes`` helper backed by ``TOKENSPEED_GRPC_MAX_MESSAGE_BYTES``. Default still tops out at the gRPC hard ceiling (2 GiB - 1) so VLM ``pixel_values`` aren't refused, but deployments can now lower it from the outside; bad values fall back to the default with a warning. Signed-off-by: zhyncs <46627482+zhyncs@users.noreply.github.com>
58fb125 to
818d022
Compare
The simplified install path runs ``uv pip install -e ... --no-build-isolation`` on three TokenSpeed packages, but neither ``./python`` nor ``tokenspeed-kernel`` lists ``setuptools`` in their ``build-system.requires``. Without build isolation the build backend resolves those imports from the active venv — uv fails with ``No module named 'setuptools'``. Preseed the three build-time tooling packages (``setuptools``, ``wheel``, ``pybind11``) before the editable installs. Signed-off-by: zhyncs <46627482+zhyncs@users.noreply.github.com>
The previous commit installed ``./python`` first, but its runtime dependencies transitively pull in ``tokenspeed-triton``, populating ``site-packages/tokenspeed_triton/backends/nvidia/include`` before the kernel build runs. tokenspeed-kernel's ``setup.py`` walks site-packages and adds that path to nvcc's ``-I`` list — and the bundled ``crt/host_runtime.h`` defines ``__cudaLaunch`` as a CUDA-12 1-arg macro that conflicts with the 2-arg call nvcc 13.0 emits in its stub (``error: macro "__cudaLaunch" passed 2 arguments, but takes just 1``). Build the kernel first so its setup.py walk sees a clean site-packages, then install scheduler + ``./python``. Mirrors the ``kernel → scheduler → engine`` ordering in tokenspeed's own ``test/ci_system/install_deps.sh``. Signed-off-by: zhyncs <46627482+zhyncs@users.noreply.github.com>
Description
Problem
TokenSpeed is wired into the SMG gateway for text generation, but its gRPC path carries no multimodal inputs: an image request routed to a TokenSpeed worker is rejected at the router ("TokenSpeed backend does not support multimodal inputs").
Solution
Extend the TokenSpeed gRPC path to carry the gateway's already-preprocessed multimodal tensors. The gateway performs per-model image preprocessing in
crates/multimodal; the servicer reconstructs the tensors and setsprecomputed_multimodal_inputsso the engine consumes them directly and skips its own preprocessing.Changes
crates/grpc_client/proto/tokenspeed_scheduler.proto: addMultimodalInputs/TensorData/PlaceholderRange,GenerateRequest.mm_inputs, andGetModelInfoResponse.supports_vision.crates/grpc_client/src/tokenspeed_scheduler.rs,model_gateway/src/routers/grpc/{client,multimodal,proto_wrapper}.rs: assemble and forward TokenSpeed multimodal requests via a newMultimodalData::TokenSpeedvariant.grpc_servicer/smg_grpc_servicer/tokenspeed/{servicer,server}.py: decode the precomputed multimodal payload into the engine's inputs; raise the gRPC message-size ceiling for large VLM pixel_values.Test Plan
cargo +nightly fmt --allandcargo clippy --workspace --all-targets --all-features -- -D warningspass; Pythonruff format/ruff checkare clean. (cargo testruns in CI.)MultimodalInputsmessage.crates/multimodalcrate are untouched.Checklist
cargo +nightly fmtpassescargo clippy --all-targets --all-features -- -D warningspassesSummary by CodeRabbit
New Features
Chores