feat(mlx-grpc): String stop sequence support for MLX on all 6 pipeline/path combinations#1524
feat(mlx-grpc): String stop sequence support for MLX on all 6 pipeline/path combinations#1524zach-li-sudo wants to merge 14 commits into
Conversation
…lightseekorg#1099) Signed-off-by: Zhuo Li <zhuo.li.ca@outlook.com>
Signed-off-by: Zhuo Li <zhuo.li.ca@outlook.com>
…d no-ops on non-MLX Signed-off-by: Zhuo Li <zhuo.li.ca@outlook.com>
Signed-off-by: Zhuo Li <zhuo.li.ca@outlook.com>
Signed-off-by: Zhuo Li <zhuo.li.ca@outlook.com>
Signed-off-by: Zhuo Li <zhuo.li.ca@outlook.com>
Signed-off-by: Zhuo Li <zhuo.li.ca@outlook.com>
Signed-off-by: Zhuo Li <zhuo.li.ca@outlook.com>
… path MLX Signed-off-by: Zhuo Li <zhuo.li.ca@outlook.com>
Signed-off-by: Zhuo Li <zhuo.li.ca@outlook.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (4)
📝 WalkthroughWalkthroughAdds MLX string stop support: tokenizes user stop strings into MLX stop_token_ids, resolves MLX matched-stop token IDs back into user-facing values via request context and tokenizer, wires this into request builders and response/streaming paths, and removes legacy MLX stop-string rejection. ChangesMLX Stop Sequence Processing Pipeline
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
Hi @zach-li-sudo, the DCO sign-off check has failed. All commits must include a To fix existing commits: # Sign off the last N commits (replace N with the number of unsigned commits)
git rebase HEAD~N --signoff
git push --force-with-leaseTo sign off future commits automatically:
|
There was a problem hiding this comment.
Code Review
This pull request enables support for string stop sequences in the MLX backend by tokenizing them into single-token IDs during the request preparation stage. It also introduces logic to map the matched stop token ID back to its original string or numeric representation in API responses for both regular and streaming workflows. Feedback was provided regarding the efficiency of tokenizing stop strings within the response processing loop, suggesting that pre-tokenizing or caching these values could improve performance in high-throughput scenarios.
| // Check stop strings first: find the string that tokenizes to this single token. | ||
| if let Some(stop_strings) = stop { | ||
| for s in stop_strings.iter() { | ||
| if let Ok(enc) = tokenizer.encode(s, false) { |
There was a problem hiding this comment.
Tokenizing stop strings in a loop for every completion response can be inefficient, especially in high-throughput scenarios. While the number of stop sequences is typically small (OpenAI limits to 4), consider pre-tokenizing these strings during the request building stage and passing the mapping down to the response processor, or at least caching the results if the tokenizer is shared.
…prevent misuse Signed-off-by: Zhuo Li <zhuo.li.ca@outlook.com>
…emand loading Signed-off-by: Zhuo Li <zhuo.li.ca@outlook.com>
Signed-off-by: Zhuo Li <zhuo.li.ca@outlook.com>
Signed-off-by: Zhuo Li <zhuo.li.ca@outlook.com>
cd5ebaa to
6e76676
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6e76676189
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| Self::reject_constraint(constraint.as_ref())?; | ||
| Self::reject_n(body.n)?; | ||
| Self::reject_stop_strings(body.stop.as_ref().is_some_and(|s| !s.is_empty()))?; | ||
| Self::reject_response_format(body.response_format.is_some())?; | ||
|
|
||
| let sampling_params = Self::build_sampling_params_from_chat(body); |
There was a problem hiding this comment.
Reject unconverted stop strings in MLX request builders
Removing stop-string rejection in this builder allows direct callers to pass stop values that are never converted into sampling_params.stop_token_ids. That conversion now happens only in the pipeline stages, but some call paths still build requests directly (for example via GrpcClient::build_generate_request_from_chat in the Go policy binding), so MLX will silently ignore those string stops instead of enforcing them. This is a behavior regression from fail-fast (400) to silent no-op, which can produce longer-than-requested outputs and wrong stop semantics.
Useful? React with 👍 / 👎.
MLX Stop Sequence Support: Full Pipeline Test GuideBranch: ScopeSix pipeline/path combinations. All string stop sequence support is new in this branch.
Revision comparison
Switch between builds# Baseline
git checkout 9a93938a && cargo build
# HEAD
git checkout stream-all-backend && cargo buildSetupMLX (Apple Silicon only)Install Python deps once: source .venv/bin/activate
pip install -e ./crates/grpc_client/python
pip install -e "./grpc_servicer[mlx]"MLX worker — regular model (tests 1–4): source .venv/bin/activate && python -m smg_grpc_servicer.mlx.server \
--model mlx-community/Qwen3-0.6B-4bit --port 50051MLX worker — Harmony model (tests 5–6; stop the regular worker first): source .venv/bin/activate && python -m smg_grpc_servicer.mlx.server \
--model mlx-community/gpt-oss-20b-MXFP4-Q4 --port 50051vLLMvLLM worker — regular model (tests 1–4): python -m vllm.entrypoints.grpc_server --model Qwen/Qwen2.5-1.5B-Instruct --port 50051vLLM worker — Harmony model (tests 5–6): python -m vllm.entrypoints.grpc_server --model openai/gpt-oss-20b --port 50051Gateway (same for both backends)./target/debug/smg --worker-urls grpc://localhost:50051 --port 3000Smoke testcurl http://localhost:3000/v1/models | jq '.data[].id'Token reference (Qwen tokenizer — shared by Qwen3-0.6B and GPT-OSS)
Baseline quick-checkMLXBuild at # 1. Chat
curl http://localhost:3000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"mlx-community/Qwen3-0.6B-4bit","messages":[{"role":"user","content":"Count 1-10"}],"stop":["6"],"max_tokens":200,"thinking":{"type":"disabled"}}' | jq .
# 2. Completion
curl http://localhost:3000/v1/completions \
-H "Content-Type: application/json" \
-d '{"model":"mlx-community/Qwen3-0.6B-4bit","prompt":"1\n2\n3\n4\n","stop":["6"],"max_tokens":200}' | jq .
# 3. Messages
curl http://localhost:3000/v1/messages \
-H "Content-Type: application/json" \
-d '{"model":"mlx-community/Qwen3-0.6B-4bit","messages":[{"role":"user","content":"Count 1-10"}],"stop_sequences":["6"],"max_tokens":200,"thinking":{"type":"disabled"}}' | jq .
# 4. Generate
curl http://localhost:3000/generate \
-H "Content-Type: application/json" \
-d '{"model":"mlx-community/Qwen3-0.6B-4bit","text":"1\n2\n3\n4\n","sampling_params":{"stop":["6"],"max_new_tokens":200}}' | jq .
# 5. Harmony Chat [requires Harmony model worker]
curl http://localhost:3000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"mlx-community/gpt-oss-20b-MXFP4-Q4","messages":[{"role":"user","content":"Count 1-10"}],"stop":["6"],"max_tokens":200}' | jq .
# 6. Harmony Responses [requires Harmony model worker]
curl http://localhost:3000/v1/responses \
-H "Content-Type: application/json" \
-d '{"model":"mlx-community/gpt-oss-20b-MXFP4-Q4","input":"Count 1-10","stop":["6"],"max_output_tokens":200}' | jq .Baseline result (all six): HEAD result (all six): vLLMRun these six commands at either revision — all should return # 1. Chat
curl http://localhost:3000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"Qwen/Qwen2.5-1.5B-Instruct","messages":[{"role":"user","content":"Count 1-10"}],"stop":["6"],"max_tokens":200}' | jq .
# 2. Completion
curl http://localhost:3000/v1/completions \
-H "Content-Type: application/json" \
-d '{"model":"Qwen/Qwen2.5-1.5B-Instruct","prompt":"1\n2\n3\n4\n","stop":["6"],"max_tokens":200}' | jq .
# 3. Messages
curl http://localhost:3000/v1/messages \
-H "Content-Type: application/json" \
-d '{"model":"Qwen/Qwen2.5-1.5B-Instruct","messages":[{"role":"user","content":"Count 1-10"}],"stop_sequences":["6"],"max_tokens":200}' | jq .
# 4. Generate
curl http://localhost:3000/generate \
-H "Content-Type: application/json" \
-d '{"model":"Qwen/Qwen2.5-1.5B-Instruct","text":"1\n2\n3\n4\n","sampling_params":{"stop":["6"],"max_new_tokens":200}}' | jq .
# 5. Harmony Chat [requires Harmony model worker]
curl http://localhost:3000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"openai/gpt-oss-20b","messages":[{"role":"user","content":"Count 1-10"}],"stop":["6"],"max_tokens":200}' | jq .
# 6. Harmony Responses [requires Harmony model worker]
curl http://localhost:3000/v1/responses \
-H "Content-Type: application/json" \
-d '{"model":"openai/gpt-oss-20b","input":"Count 1-10","stop":["6"],"max_output_tokens":200}' | jq .vLLM result (all six): 1. Regular Chat (
|
| Symbol | Meaning |
|---|---|
| ✅ | HTTP 200, correct behavior |
| ❌ | HTTP 4xx/5xx |
| HTTP 200 but incorrect behavior |
String stop tests — core before/after comparison
| # | Pipeline | Path | Stop | MLX baseline | MLX HEAD | vLLM |
|---|---|---|---|---|---|---|
| 1.1.1 | Regular Chat | non-stream | single-token "6" |
❌ 400 | ✅ matched_stop:"6" |
✅ |
| 1.1.2 | Regular Chat | non-stream | multi-token "hello world" |
❌ 400 | ❌ 400 (diff msg) | ✅ |
| 1.1.4 | Regular Chat | non-stream | "5" + ids [21] |
❌ 400 | ✅ matched_stop:"5" |
✅ |
| 1.2.1 | Regular Chat | stream | single-token "6" |
❌ 400 | ✅ SSE matched_stop:"6" |
✅ |
| 1.2.4 | Regular Chat | stream | "5" + ids [21] |
❌ 400 | ✅ matched_stop:"5" |
✅ |
| 2.1.1 | Regular Completion | non-stream | single-token "6" |
❌ 400 | ✅ matched_stop:"6" |
✅ |
| 2.1.4 | Regular Completion | non-stream | "5" + ids [21] |
❌ 400 | ✅ matched_stop:"5" |
✅ |
| 2.2.1 | Regular Completion | stream | single-token "6" |
❌ 400 | ✅ (no matched_stop) |
✅ |
| 3.1.1 | Regular Messages | non-stream | single-token "6" |
❌ 400 | ✅ stop_sequence:"6" |
✅ |
| 3.2.1 | Regular Messages | stream | single-token "6" |
❌ 400 | ✅ SSE stop_sequence:"6" |
✅ |
| 4.1.1 | Regular Generate | non-stream | single-token "6" |
❌ 400 | ✅ matched_stop:21 (int) |
✅ (string) |
| 4.2.1 | Regular Generate | stream | single-token "6" |
❌ 400 | ✅ (stop token in text, no matched_stop) |
✅ |
| 5.1.1 | Harmony Chat | non-stream | single-token "6" |
❌ 400 | ✅ matched_stop:21 (int) |
✅ (string) |
| 5.2.1 | Harmony Chat | stream | single-token "6" |
❌ 400 | ✅ SSE matched_stop:21 |
✅ |
| 6.1.1 | Harmony Responses | non-stream | single-token "6" |
❌ 400 | ✅ status:"completed" |
|
| 6.2.1 | Harmony Responses | stream | single-token "6" |
❌ 400 | ✅ SSE response.completed |
stop_token_ids regression — must pass at both revisions
| # | Pipeline | Path | Stop | MLX baseline | MLX HEAD |
|---|---|---|---|---|---|
| 1.1.3 | Regular Chat | non-stream | ids [20,21] |
✅ matched_stop:20 |
✅ matched_stop:20 |
| 1.2.3 | Regular Chat | stream | ids [20,21] |
✅ matched_stop:20 |
✅ matched_stop:20 |
| 2.1.3 | Regular Completion | non-stream | ids [20,21] |
✅ matched_stop:20 |
✅ matched_stop:20 |
| 4.1.3 | Regular Generate | non-stream | ids [20,21] |
✅ matched_stop:20 |
✅ matched_stop:20 |
| 5.1.3 | Harmony Chat | non-stream | ids [20,21] |
✅ matched_stop:20 |
✅ matched_stop:20 |
|
Here's the e2e test PR for this feature: #1538 |
Description
Problem
Follow-up on this PR #1447
Major feat: String stop sequence support for MLX on all 6 pipeline/path combinations
Minor fix: add the missing
matched_stopfield in regular completion with stream of all backends (vLLM, MLX etc)/v1/chat/completions/v1/completions/v1/messages/generate/v1/chat/completions+ GPT-OSS/v1/responses+ GPT-OSSSolution
Major feat: convert stop strings into stop token ids, then pass to MLX backend
Minor fix: add
matched_stopfield in the last stream chunksChanges
see diff
Test Plan
1. MLX string stop sequence support (all 6 pipeline/path combinations)
✅= HTTP 200 correct result ·❌= HTTP 400/v1/chat/completions"stop": ["6"](single-token)matched_stop: "6"/v1/chat/completions"stop": ["hello world"](multi-token)unsupported_stop_string/v1/chat/completions"stop_token_ids": [20, 21]matched_stop: 20/v1/completions"stop": ["6"](single-token)matched_stop: "6"/v1/completions"stop": ["hello world"](multi-token)unsupported_stop_string/v1/completions"stop_token_ids": [20, 21]matched_stop: 20/v1/messages"stop_sequences": ["6"](single-token)stop_sequence: "6"/v1/messages"stop_sequences": ["hello world"](multi-token)unsupported_stop_string/generate"stop": ["6"](single-token)matched_stop: 21¹/generate"stop": ["hello world"](multi-token)unsupported_stop_string/v1/chat/completions"stop": ["6"](single-token)matched_stop: 21¹/v1/chat/completions"stop_token_ids": [20]matched_stop: 20/v1/responses"stop": ["6"](single-token)¹
matched_stopon/generateand Harmony paths returns the raw token ID integer, not the original string. The tokenizer is lazy-loaded into the pipeline context (ctx.state.tokenizer) during request building to convert stop strings → token IDs, but the response processors on these paths do not receive the pipeline context and therefore cannot reverse-map the token ID back to the original string.² Harmony Responses API has no top-level
matched_stopfield; correct stop is confirmed viastatus: "completed".2. Streaming
matched_stopwas previously absent from all streaming/v1/completionschunks for all backends — fixed. Other paths are MLX-only new support./v1/chat/completions"stop": ["6"](single-token)matched_stop: "6"/v1/chat/completions"stop_token_ids": [20, 21]matched_stop: 20/v1/completions"stop": ["6"]matched_stop: "6"(was missing)/v1/completions"stop_token_ids": [20, 21]matched_stop: 20(was missing)/v1/completions"stop": ["5"]+"stop_token_ids": [21]matched_stop: "5"(was missing)/v1/completions"stop": ["6"]matched_stop: "6"(was missing)/v1/completions"stop_token_ids": [20, 21]matched_stop: 20(was missing)/v1/completions"stop": ["5"]+"stop_token_ids": [21]matched_stop: "5"(was missing)/v1/messages"stop_sequences": ["6"](single-token)message_deltawithstop_sequence: "6"/v1/chat/completions"stop": ["6"](single-token)matched_stop: 21¹Checklist
cargo +nightly fmtpassescargo clippy --all-targets --all-features -- -D warningspassesSummary by CodeRabbit
Release Notes
Bug Fixes
Tests