feat(mlx-grpc): String stop sequence support for MLX on all 6 pipeline/path combinations by zach-li-sudo · Pull Request #1524 · lightseekorg/smg

zach-li-sudo · 2026-05-23T21:11:45Z

Description

Problem

Follow-up to PR #1447.

Major feat: String stop sequence support for MLX on all 6 pipeline/path combinations
Minor fix: add the missing matched_stop field to streaming regular completion responses for all backends (vLLM, MLX, etc.)

#	Pipeline	API path	Change
1	Regular Chat	`/v1/chat/completions`	string stop: was HTTP 400 → now single-token supported
2	Regular Completion	`/v1/completions`	string stop: was HTTP 400 → now single-token supported
3	Regular Messages	`/v1/messages`	string stop: was HTTP 400 → now single-token supported
4	Regular Generate	`/generate`	string stop: was HTTP 400 → now single-token supported
5	Harmony Chat	`/v1/chat/completions` + GPT-OSS	string stop: was HTTP 400 → now single-token supported
6	Harmony Responses	`/v1/responses` + GPT-OSS	string stop: was HTTP 400 → now single-token supported

Solution

Major feat: convert stop strings into stop token IDs, then pass them to the MLX backend
Minor fix: add the matched_stop field to the final stream chunk
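
As a rough illustration of the conversion step, the sketch below turns each stop string into exactly one token ID and rejects anything that encodes to a different number of tokens, reusing the error text quoted later in this PR. The `Tokenizer` trait, `ToyTokenizer`, `toy_tokenizer`, and the IDs for "hello" and "world" are hypothetical stand-ins, and the real `stop_strings_to_token_ids` helper may well have a different signature.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the gateway tokenizer; only `encode` is needed here.
trait Tokenizer {
    fn encode(&self, text: &str) -> Result<Vec<u32>, String>;
}

/// Toy tokenizer: a whole-string vocabulary hit is one token,
/// otherwise the text is split on whitespace and each piece is looked up.
struct ToyTokenizer {
    vocab: HashMap<String, u32>,
}

impl Tokenizer for ToyTokenizer {
    fn encode(&self, text: &str) -> Result<Vec<u32>, String> {
        if let Some(id) = self.vocab.get(text) {
            return Ok(vec![*id]);
        }
        text.split_whitespace()
            .map(|w| {
                self.vocab
                    .get(w)
                    .copied()
                    .ok_or_else(|| format!("unknown piece {:?}", w))
            })
            .collect()
    }
}

/// IDs 20 and 21 follow the token reference in this PR; 1 and 2 are made up.
fn toy_tokenizer() -> ToyTokenizer {
    let mut vocab = HashMap::new();
    for (piece, id) in vec![("5", 20u32), ("6", 21), ("hello", 1), ("world", 2)] {
        vocab.insert(piece.to_string(), id);
    }
    ToyTokenizer { vocab }
}

/// Convert every stop string to exactly one token ID, or fail on the first
/// string that does not encode to a single token.
fn stop_strings_to_token_ids(tok: &dyn Tokenizer, stops: &[String]) -> Result<Vec<u32>, String> {
    let mut ids = Vec::with_capacity(stops.len());
    for s in stops {
        let enc = tok.encode(s)?;
        if enc.len() != 1 {
            return Err(format!(
                "stop string {:?} encodes to {} tokens; MLX backend only supports single-token stop strings",
                s,
                enc.len()
            ));
        }
        ids.push(enc[0]);
    }
    Ok(ids)
}
```

In this sketch the resulting IDs would simply be appended to whatever `stop_token_ids` the caller already supplied before the request is sent to the backend.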

Changes

see diff

Test Plan

Unit tests for newly added helper functions for string/token id conversion
Deployed MLX gRPC + SMG with regular model (Qwen3-4B) and GPT-OSS harmony model (GPT-OSS-20B) and tested with the following scenarios:

1. MLX string stop sequence support (all 6 pipeline/path combinations)

✅ = HTTP 200 correct result · ❌ = HTTP 400

#	Pipeline	Path	Stop input	Result
1	Regular Chat	`/v1/chat/completions`	`"stop": ["6"]` (single-token)	✅ `matched_stop: "6"`
1	Regular Chat	`/v1/chat/completions`	`"stop": ["hello world"]` (multi-token)	❌ `unsupported_stop_string`
1	Regular Chat	`/v1/chat/completions`	`"stop_token_ids": [20, 21]`	✅ `matched_stop: 20`
2	Regular Completion	`/v1/completions`	`"stop": ["6"]` (single-token)	✅ `matched_stop: "6"`
2	Regular Completion	`/v1/completions`	`"stop": ["hello world"]` (multi-token)	❌ `unsupported_stop_string`
2	Regular Completion	`/v1/completions`	`"stop_token_ids": [20, 21]`	✅ `matched_stop: 20`
3	Regular Messages	`/v1/messages`	`"stop_sequences": ["6"]` (single-token)	✅ `stop_sequence: "6"`
3	Regular Messages	`/v1/messages`	`"stop_sequences": ["hello world"]` (multi-token)	❌ `unsupported_stop_string`
4	Regular Generate	`/generate`	`"stop": ["6"]` (single-token)	✅ `matched_stop: 21` ¹
4	Regular Generate	`/generate`	`"stop": ["hello world"]` (multi-token)	❌ `unsupported_stop_string`
5	Harmony Chat	`/v1/chat/completions`	`"stop": ["6"]` (single-token)	✅ `matched_stop: 21` ¹
5	Harmony Chat	`/v1/chat/completions`	`"stop_token_ids": [20]`	✅ `matched_stop: 20`
6	Harmony Responses	`/v1/responses`	`"stop": ["6"]` (single-token)	✅ stop fires correctly ²

¹ matched_stop on /generate and Harmony paths returns the raw token ID integer, not the original string. The tokenizer is lazy-loaded into the pipeline context (ctx.state.tokenizer) during request building to convert stop strings → token IDs, but the response processors on these paths do not receive the pipeline context and therefore cannot reverse-map the token ID back to the original string.
² Harmony Responses API has no top-level matched_stop field; correct stop is confirmed via status: "completed".
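
The reverse mapping that footnote ¹ refers to can be pictured roughly as follows: given the token ID the backend reports, prefer a user-supplied stop string that encodes to exactly that token, and otherwise fall back to the integer if it was listed in `stop_token_ids`. The `MatchedStop` enum, `resolve_matched_stop`, and `toy_encode` are hypothetical names for this sketch, and returning `None` when nothing matches is an assumption rather than the PR's confirmed behaviour.

```rust
/// User-facing value for `matched_stop`: the original stop string when one
/// maps to the matched token, otherwise the raw token ID.
#[derive(Debug, PartialEq)]
enum MatchedStop {
    Text(String),
    TokenId(u32),
}

/// Toy encoder using the token reference from this PR ("5" -> 20, "6" -> 21).
fn toy_encode(text: &str) -> Option<Vec<u32>> {
    match text {
        "5" => Some(vec![20]),
        "6" => Some(vec![21]),
        _ => None,
    }
}

/// `encode` is a hypothetical tokenizer callback (text -> token IDs).
fn resolve_matched_stop(
    matched_id: u32,
    stop_strings: &[String],
    stop_token_ids: &[u32],
    encode: impl Fn(&str) -> Option<Vec<u32>>,
) -> Option<MatchedStop> {
    // Stop strings take precedence: look for one that encodes to this single token.
    for s in stop_strings {
        if let Some(ids) = encode(s) {
            if ids.len() == 1 && ids[0] == matched_id {
                return Some(MatchedStop::Text(s.clone()));
            }
        }
    }
    // Otherwise report the integer if the caller listed it explicitly.
    if stop_token_ids.contains(&matched_id) {
        return Some(MatchedStop::TokenId(matched_id));
    }
    // Assumption: anything else (for example an end-of-sequence token) is not reported.
    None
}
```

With `"stop": ["5"]` and `"stop_token_ids": [21]`, a match on token 20 resolves to the string "5" while a match on token 21 resolves to the integer 21. Without a tokenizer in scope the first loop cannot run, which is the situation footnote ¹ describes for `/generate` and the Harmony paths.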

2. Streaming

matched_stop was previously absent from all streaming /v1/completions chunks on every backend; this is now fixed. On the other paths, string stop support in streaming is new and MLX-only.

#	Pipeline	Path	Backend	Stop input	Result
1	Regular Chat	`/v1/chat/completions`	MLX	`"stop": ["6"]` (single-token)	✅ final chunk `matched_stop: "6"`
1	Regular Chat	`/v1/chat/completions`	MLX	`"stop_token_ids": [20, 21]`	✅ `matched_stop: 20`
2	Regular Completion	`/v1/completions`	MLX	`"stop": ["6"]`	✅ `matched_stop: "6"` (was missing)
2	Regular Completion	`/v1/completions`	MLX	`"stop_token_ids": [20, 21]`	✅ `matched_stop: 20` (was missing)
2	Regular Completion	`/v1/completions`	MLX	`"stop": ["5"]` + `"stop_token_ids": [21]`	✅ `matched_stop: "5"` (was missing)
2	Regular Completion	`/v1/completions`	vLLM	`"stop": ["6"]`	✅ `matched_stop: "6"` (was missing)
2	Regular Completion	`/v1/completions`	vLLM	`"stop_token_ids": [20, 21]`	✅ `matched_stop: 20` (was missing)
2	Regular Completion	`/v1/completions`	vLLM	`"stop": ["5"]` + `"stop_token_ids": [21]`	✅ `matched_stop: "5"` (was missing)
3	Regular Messages	`/v1/messages`	MLX	`"stop_sequences": ["6"]` (single-token)	✅ `message_delta` with `stop_sequence: "6"`
5	Harmony Chat	`/v1/chat/completions`	MLX	`"stop": ["6"]` (single-token)	✅ final chunk `matched_stop: 21` ¹

Checklist

cargo +nightly fmt passes
cargo clippy --all-targets --all-features -- -D warnings passes
(Optional) Documentation updated
(Optional) Please join us on Slack #sig-smg to discuss, review, and merge PRs

Summary by CodeRabbit

Release Notes

Bug Fixes
- Single-token string stop sequences are now supported on the MLX backend (multi-token stop strings still return HTTP 400)
- Matched stop sequence reporting in API responses now accurately reflects user-provided stop conditions across chat, completion, and messages endpoints
Tests
- Enhanced mock tokenizer with failure simulation capabilities

…lightseekorg#1099) Signed-off-by: Zhuo Li <zhuo.li.ca@outlook.com>

…d no-ops on non-MLX Signed-off-by: Zhuo Li <zhuo.li.ca@outlook.com>

… path MLX Signed-off-by: Zhuo Li <zhuo.li.ca@outlook.com>

coderabbitai · 2026-05-23T21:11:57Z

No actionable comments were generated in the recent review. 🎉

📥 Commits

Reviewing files that changed from the base of the PR and between cd5ebaa and 6e76676.

📒 Files selected for processing (4)

model_gateway/src/routers/grpc/harmony/stages/request_building.rs
model_gateway/src/routers/grpc/proto_wrapper.rs
model_gateway/src/routers/grpc/regular/processor.rs
model_gateway/src/routers/grpc/regular/streaming.rs

Walkthrough

Adds MLX string stop support: tokenizes user stop strings into MLX stop_token_ids, resolves MLX matched-stop token IDs back into user-facing values via request context and tokenizer, wires this into request builders and response/streaming paths, and removes legacy MLX stop-string rejection.

Changes

MLX Stop Sequence Processing Pipeline

Layer / File(s)	Summary
Stop conversion and resolution utilities `model_gateway/src/routers/grpc/utils/chat_utils.rs`, `model_gateway/src/routers/grpc/utils/mod.rs`	`stop_strings_to_token_ids`, `resolve_mlx_matched_stop_json`, and `resolve_mlx_stop_ids` convert stop strings to single-token IDs, map matched MLX token IDs back to user JSON (string preferred), and validate tokenizer availability; includes unit tests and HTTP 400 error mapping.
Proto wrapper context-aware matching `model_gateway/src/routers/grpc/proto_wrapper.rs`, `model_gateway/src/routers/grpc/regular/processor.rs`	Introduces `matched_stop_json_with_context(...)` that resolves MLX matched-stop token IDs using stop strings/stop_token_ids and a tokenizer; processor paths now call this method for chat, messages, and completion responses.
Request-building integration `model_gateway/src/routers/grpc/common/stages/helpers.rs`, `model_gateway/src/routers/grpc/harmony/stages/request_building.rs`, `model_gateway/src/routers/grpc/regular/stages/chat/request_building.rs`, `model_gateway/src/routers/grpc/regular/stages/completion/request_building.rs`, `model_gateway/src/routers/grpc/regular/stages/generate/request_building.rs`, `model_gateway/src/routers/grpc/regular/stages/messages/request_building.rs`	Adds `apply_mlx_stop_sequences` to tokenize optional string stops and append token IDs to MLX `sampling_params.stop_token_ids`; chat, completion, generate, messages, and Harmony builders call this helper using the cached tokenizer in context.
Completion streaming finalization `model_gateway/src/routers/grpc/regular/streaming.rs`	Defers final `finish_reason` emission when local stop decoder fires in Chunk events so the subsequent Complete event can include backend `matched_stop_json_with_context()`; simplifies `CompletionStreamChoice` construction using `Default`.
Legacy cleanup and testing support `crates/grpc_client/src/mlx_engine.rs`, `crates/protocols/src/completion.rs`, `crates/tokenizer/src/mock.rs`	Removes `reject_stop_strings()` checks/TODO from MLX engine builders. `CompletionStreamChoice` derives `Default`. `MockTokenizer` adds `fail_encode: bool` and `failing()` for negative tokenization tests.
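
The "Completion streaming finalization" row can be pictured with a small state machine: when the local stop decoder fires on a chunk, the finish marker is held back, and it is emitted only once the completion event arrives carrying the backend's matched stop. The `Event` and `OutChunk` types below are simplified assumptions for illustration (the real `matched_stop` is a JSON value that may be a string or an integer), not the gateway's actual types.

```rust
/// Simplified stream events; the variant names mirror the Chunk and Complete
/// events in the walkthrough, but the fields are assumptions.
enum Event {
    Chunk { text: String, local_stop_fired: bool },
    Complete { finish_reason: String, matched_stop: Option<String> },
}

/// Simplified outgoing SSE chunk.
#[derive(Debug, PartialEq)]
struct OutChunk {
    text: String,
    finish_reason: Option<String>,
    matched_stop: Option<String>,
}

fn stream_chunks(events: Vec<Event>) -> Vec<OutChunk> {
    let mut out = Vec::new();
    // Remembers that the local stop decoder fired but the finish chunk was deferred.
    let mut stop_pending = false;
    for ev in events {
        match ev {
            Event::Chunk { text, local_stop_fired } => {
                if local_stop_fired {
                    stop_pending = true;
                }
                // Emit text only; finish_reason waits for the Complete event.
                if !text.is_empty() {
                    out.push(OutChunk { text, finish_reason: None, matched_stop: None });
                }
            }
            Event::Complete { finish_reason, matched_stop } => {
                // The final chunk carries finish_reason and matched_stop together.
                let reason = if stop_pending { "stop".to_string() } else { finish_reason };
                out.push(OutChunk {
                    text: String::new(),
                    finish_reason: Some(reason),
                    matched_stop,
                });
            }
        }
    }
    out
}
```

Emitting the finish marker as soon as the local decoder fired would close the choice before the backend's matched stop is known, which is one plausible reading of why the field was missing from streaming `/v1/completions` chunks before this change.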

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

lightseekorg/smg#978: Related changes to completion streaming SSE and finish_reason handling.
lightseekorg/smg#915: Related modifications to completions request-building paths and sampling/stop handling.
lightseekorg/smg#602: Prior changes to proto matched_stop JSON handling across backends.

Suggested labels

tests

Suggested reviewers

CatherineSue
key4ng
slin1237

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title directly and specifically describes the main feature: string stop sequence support for MLX across six pipeline/path combinations, matching the core changeset.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

mergify · 2026-05-23T21:12:20Z

Hi @zach-li-sudo, the DCO sign-off check has failed. All commits must include a Signed-off-by line.

To fix existing commits:

# Sign off the last N commits (replace N with the number of unsigned commits)
git rebase HEAD~N --signoff
git push --force-with-lease

To sign off future commits automatically:

Use git commit -s every time, or
VSCode: enable Git: Always Sign Off in Settings
PyCharm: enable Sign-off commit in the Commit tool window

gemini-code-assist

Code Review

This pull request enables support for string stop sequences in the MLX backend by tokenizing them into single-token IDs during the request preparation stage. It also introduces logic to map the matched stop token ID back to its original string or numeric representation in API responses for both regular and streaming workflows. Feedback was provided regarding the efficiency of tokenizing stop strings within the response processing loop, suggesting that pre-tokenizing or caching these values could improve performance in high-throughput scenarios.

gemini-code-assist · 2026-05-23T21:17:46Z

+    // Check stop strings first: find the string that tokenizes to this single token.
+    if let Some(stop_strings) = stop {
+        for s in stop_strings.iter() {
+            if let Ok(enc) = tokenizer.encode(s, false) {


Tokenizing stop strings in a loop for every completion response can be inefficient, especially in high-throughput scenarios. While the number of stop sequences is typically small (OpenAI limits to 4), consider pre-tokenizing these strings during the request building stage and passing the mapping down to the response processor, or at least caching the results if the tokenizer is shared.
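
One way to read this suggestion is to build a token-ID-to-string map once during request building and hand it to the response processor, so the response path never calls the tokenizer. The `StopLookup` type and `toy_encode` callback below are hypothetical; this is a sketch of the idea, not code from the PR.

```rust
use std::collections::HashMap;

/// Built once during request building: token ID -> original stop string.
struct StopLookup {
    by_token: HashMap<u32, String>,
}

/// Toy encoder: "5" -> 20 and "6" -> 21 follow the PR's token reference;
/// the two-token encoding of "hello world" uses made-up IDs.
fn toy_encode(text: &str) -> Option<Vec<u32>> {
    match text {
        "5" => Some(vec![20]),
        "6" => Some(vec![21]),
        "hello world" => Some(vec![1, 2]),
        _ => None,
    }
}

impl StopLookup {
    /// `encode` is a hypothetical tokenizer callback. Strings that fail to
    /// encode or that span several tokens are skipped in this sketch.
    fn build(stop_strings: &[String], encode: impl Fn(&str) -> Option<Vec<u32>>) -> Self {
        let mut by_token = HashMap::new();
        for s in stop_strings {
            if let Some(ids) = encode(s) {
                if let [id] = ids.as_slice() {
                    // Keep the first string that maps to a given token.
                    by_token.entry(*id).or_insert_with(|| s.clone());
                }
            }
        }
        StopLookup { by_token }
    }

    /// Hash lookup on the response path; no tokenizer call is needed.
    fn resolve(&self, matched_id: u32) -> Option<&str> {
        self.by_token.get(&matched_id).map(String::as_str)
    }
}
```

The map would need to travel with the request context, which is the same plumbing the PR description says is missing on the `/generate` and Harmony response paths.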

…prevent misuse Signed-off-by: Zhuo Li <zhuo.li.ca@outlook.com>

…emand loading Signed-off-by: Zhuo Li <zhuo.li.ca@outlook.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6e76676189

chatgpt-codex-connector · 2026-05-23T21:52:47Z

        Self::reject_constraint(constraint.as_ref())?;
        Self::reject_n(body.n)?;
-        Self::reject_stop_strings(body.stop.as_ref().is_some_and(|s| !s.is_empty()))?;
        Self::reject_response_format(body.response_format.is_some())?;

        let sampling_params = Self::build_sampling_params_from_chat(body);


Reject unconverted stop strings in MLX request builders

Removing stop-string rejection in this builder allows direct callers to pass stop values that are never converted into sampling_params.stop_token_ids. That conversion now happens only in the pipeline stages, but some call paths still build requests directly (for example via GrpcClient::build_generate_request_from_chat in the Go policy binding), so MLX will silently ignore those string stops instead of enforcing them. This is a behavior regression from fail-fast (400) to silent no-op, which can produce longer-than-requested outputs and wrong stop semantics.
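
A guard along the lines this comment asks for might look like the sketch below: the builder still fails fast when string stops are present but no pipeline stage has converted them. `MlxRequestDraft`, its `stops_converted` flag, and `check_stops_converted` are hypothetical names invented for illustration; the PR's builders may track conversion differently, or not at all.

```rust
/// Hypothetical, trimmed-down view of an MLX request under construction.
struct MlxRequestDraft {
    /// String stops exactly as the caller supplied them.
    stop_strings: Vec<String>,
    /// Assumed flag, set by the pipeline stage once string stops were tokenized.
    stops_converted: bool,
}

/// Fail fast instead of silently ignoring string stops on direct builder calls.
fn check_stops_converted(draft: &MlxRequestDraft) -> Result<(), String> {
    if !draft.stop_strings.is_empty() && !draft.stops_converted {
        // Same wording as the baseline rejection message.
        return Err("MLX backend does not support string stop sequences".to_string());
    }
    Ok(())
}
```

Callers that go through the pipeline stages would set the flag after tokenizing, while direct callers that never convert would keep getting the fail-fast error the baseline produced.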

zach-li-sudo · 2026-05-24T20:10:34Z

MLX Stop Sequence Support: Full Pipeline Test Guide

Branch: stream-all-backend
Companion doc: MLX-string-stop-all-paths.md (narrative/what changed)
Purpose: Pre-review before/after comparison — run against both the baseline revision and HEAD.

Scope

Six pipeline/path combinations. All string stop sequence support is new in this branch.

#	Pipeline	API path	Change
1	Regular Chat	`/v1/chat/completions`	string stop: was HTTP 400 → now single-token supported
2	Regular Completion	`/v1/completions`	string stop: was HTTP 400 → now single-token supported
3	Regular Messages	`/v1/messages`	string stop: was HTTP 400 → now single-token supported
4	Regular Generate	`/generate`	string stop: was HTTP 400 → now single-token supported
5	Harmony Chat	`/v1/chat/completions` + GPT-OSS	string stop: was HTTP 400 → now single-token supported
6	Harmony Responses	`/v1/responses` + GPT-OSS	string stop: was HTTP 400 → now single-token supported

Revision comparison

Build	Revision	String stop on MLX
Baseline	`9a93938a`	HTTP 400, `invalid_request_parameters` — `"MLX backend does not support string stop sequences"`
HEAD	current branch tip	Single-token accepted; multi-token: HTTP 400 `unsupported_stop_string` — `"stop string \"…\" encodes to N tokens; MLX backend only supports single-token stop strings"`

Switch between builds

# Baseline
git checkout 9a93938a && cargo build

# HEAD
git checkout stream-all-backend && cargo build

Setup

MLX (Apple Silicon only)

Install Python deps once:

source .venv/bin/activate
pip install -e ./crates/grpc_client/python
pip install -e "./grpc_servicer[mlx]"

MLX worker — regular model (tests 1–4):

source .venv/bin/activate && python -m smg_grpc_servicer.mlx.server \
  --model mlx-community/Qwen3-0.6B-4bit --port 50051

MLX worker — Harmony model (tests 5–6; stop the regular worker first):

source .venv/bin/activate && python -m smg_grpc_servicer.mlx.server \
  --model mlx-community/gpt-oss-20b-MXFP4-Q4 --port 50051

vLLM

vLLM worker — regular model (tests 1–4):

python -m vllm.entrypoints.grpc_server --model Qwen/Qwen2.5-1.5B-Instruct --port 50051

vLLM worker — Harmony model (tests 5–6):

python -m vllm.entrypoints.grpc_server --model openai/gpt-oss-20b --port 50051

Gateway (same for both backends)

./target/debug/smg --worker-urls grpc://localhost:50051 --port 3000

Smoke test

curl http://localhost:3000/v1/models | jq '.data[].id'

Token reference (Qwen tokenizer — shared by Qwen3-0.6B and GPT-OSS)

Token ID	Text
20	`"5"`
21	`"6"`
198	`"\n"`

Qwen3 thinking mode: /v1/messages and /v1/chat/completions with Qwen3-0.6B-4bit need
"thinking": {"type": "disabled"} to avoid spending the max_tokens budget on <think> tokens.
This is not needed for /generate (no chat template), for GPT-OSS models, or for vLLM (Qwen2.5-1.5B).

Baseline quick-check

MLX

Build at 9a93938a and run these six commands — all must return 400.
Switch to HEAD build — all must return 200.

# 1. Chat
curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"mlx-community/Qwen3-0.6B-4bit","messages":[{"role":"user","content":"Count 1-10"}],"stop":["6"],"max_tokens":200,"thinking":{"type":"disabled"}}' | jq .

# 2. Completion
curl http://localhost:3000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"mlx-community/Qwen3-0.6B-4bit","prompt":"1\n2\n3\n4\n","stop":["6"],"max_tokens":200}' | jq .

# 3. Messages
curl http://localhost:3000/v1/messages \
  -H "Content-Type: application/json" \
  -d '{"model":"mlx-community/Qwen3-0.6B-4bit","messages":[{"role":"user","content":"Count 1-10"}],"stop_sequences":["6"],"max_tokens":200,"thinking":{"type":"disabled"}}' | jq .

# 4. Generate
curl http://localhost:3000/generate \
  -H "Content-Type: application/json" \
  -d '{"model":"mlx-community/Qwen3-0.6B-4bit","text":"1\n2\n3\n4\n","sampling_params":{"stop":["6"],"max_new_tokens":200}}' | jq .

# 5. Harmony Chat  [requires Harmony model worker]
curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"mlx-community/gpt-oss-20b-MXFP4-Q4","messages":[{"role":"user","content":"Count 1-10"}],"stop":["6"],"max_tokens":200}' | jq .

# 6. Harmony Responses  [requires Harmony model worker]
curl http://localhost:3000/v1/responses \
  -H "Content-Type: application/json" \
  -d '{"model":"mlx-community/gpt-oss-20b-MXFP4-Q4","input":"Count 1-10","stop":["6"],"max_output_tokens":200}' | jq .

Baseline result (all six):

# 1. Chat
{
  "error": {
    "type": "Bad Request",
    "code": "invalid_request_parameters",
    "message": "Invalid request parameters: MLX backend does not support string stop sequences",
    "param": null
  }
}

# 2. Completion
{
  "error": {
    "type": "Bad Request",
    "code": "invalid_request_parameters",
    "message": "Invalid request parameters: MLX backend does not support string stop sequences",
    "param": null
  }
}

# 3. Messages
{
  "error": {
    "type": "Bad Request",
    "code": "invalid_request_parameters",
    "message": "Invalid request parameters: MLX backend does not support string stop sequences",
    "param": null
  }
}

# 4. Generate
{
  "error": {
    "type": "Bad Request",
    "code": "build_request_failed",
    "message": "MLX backend does not support string stop sequences",
    "param": null
  }
}

# 5. Harmony Chat
{
  "error": {
    "type": "Bad Request",
    "code": "invalid_request_parameters",
    "message": "Invalid request parameters: MLX backend does not support string stop sequences",
    "param": null
  }
}

# 6. Harmony Responses
{
  "error": {
    "type": "Bad Request",
    "code": "invalid_request_parameters",
    "message": "Invalid request parameters: MLX backend does not support string stop sequences",
    "param": null
  }
}

HEAD result (all six):

# 1. Chat
{
  "id": "chatcmpl-019e51a2-26d8-7cb1-a462-52195b649218",
  "object": "chat.completion",
  "created": 1779486041,
  "model": "mlx-community/Qwen3-0.6B-4bit",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "reasoning_content": "Okay, the user wants me to count from 1 to 10. Let me start by writing down the numbers in order: 1, 2, 3, 4, 5,"
      },
      "finish_reason": "stop",
      "matched_stop": "6"
    }
  ],
  "usage": {
    "prompt_tokens": 14,
    "completion_tokens": 46,
    "total_tokens": 60
  },
  "system_fingerprint": "default"
}

# 2. Completion
{
  "id": "cmpl_019e51a2-2788-7c43-8b42-8d5c9e62e07a",
  "object": "text_completion",
  "created": 1779486041,
  "model": "mlx-community/Qwen3-0.6B-4bit",
  "choices": [
    {
      "text": "5\n",
      "index": 0,
      "finish_reason": "stop",
      "matched_stop": "6"
    }
  ],
  "usage": {
    "prompt_tokens": 8,
    "completion_tokens": 3,
    "total_tokens": 11
  },
  "system_fingerprint": "default"
}

# 3. Messages
{
  "id": "msg_019e51a2-27ad-7d02-800d-bd98b8ac390b",
  "type": "message",
  "role": "assistant",
  "content": [
    {
      "type": "text",
      "text": "Okay, so I need to count from 1 to 10. Let me start with 1. I'm going to count one after another. So, 1, 2, 3, 4, 5, "
    }
  ],
  "model": "mlx-community/Qwen3-0.6B-4bit",
  "stop_reason": "stop_sequence",
  "stop_sequence": "6",
  "usage": {
    "input_tokens": 18,
    "output_tokens": 50
  }
}

# 4. Generate
[
  {
    "text": "5\n",
    "output_ids": [
      20,
      198,
      21
    ],
    "meta_info": {
      "id": "gen-019e51a2-284e-7ae1-8e8e-9e69343cb6b5",
      "finish_reason": {
        "type": "stop"
      },
      "prompt_tokens": 8,
      "weight_version": "default",
      "completion_tokens": 3,
      "cached_tokens": 0,
      "e2e_latency": 0.000055709,
      "matched_stop": 21
    }
  }
]
NOTE: matched_stop is integer 21 (token ID for "6"), not the string "6" — known limitation L1.

# 5. Harmony Chat   — SKIPPED (Harmony model worker not running)
# 6. Harmony Responses — SKIPPED (Harmony model worker not running)

vLLM

Run these six commands at either revision — all should return 200 (vLLM supports string stops natively, no change across revisions).

# 1. Chat
curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"Qwen/Qwen2.5-1.5B-Instruct","messages":[{"role":"user","content":"Count 1-10"}],"stop":["6"],"max_tokens":200}' | jq .

# 2. Completion
curl http://localhost:3000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"Qwen/Qwen2.5-1.5B-Instruct","prompt":"1\n2\n3\n4\n","stop":["6"],"max_tokens":200}' | jq .

# 3. Messages
curl http://localhost:3000/v1/messages \
  -H "Content-Type: application/json" \
  -d '{"model":"Qwen/Qwen2.5-1.5B-Instruct","messages":[{"role":"user","content":"Count 1-10"}],"stop_sequences":["6"],"max_tokens":200}' | jq .

# 4. Generate
curl http://localhost:3000/generate \
  -H "Content-Type: application/json" \
  -d '{"model":"Qwen/Qwen2.5-1.5B-Instruct","text":"1\n2\n3\n4\n","sampling_params":{"stop":["6"],"max_new_tokens":200}}' | jq .

# 5. Harmony Chat  [requires Harmony model worker]
curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"openai/gpt-oss-20b","messages":[{"role":"user","content":"Count 1-10"}],"stop":["6"],"max_tokens":200}' | jq .

# 6. Harmony Responses  [requires Harmony model worker]
curl http://localhost:3000/v1/responses \
  -H "Content-Type: application/json" \
  -d '{"model":"openai/gpt-oss-20b","input":"Count 1-10","stop":["6"],"max_output_tokens":200}' | jq .

vLLM result (all six):

# 1. Chat
{
  "id": "chatcmpl-019e4d80-ccd6-7011-abbc-e9d1eb55444d",
  "object": "chat.completion",
  "created": 1779416747,
  "model": "Qwen/Qwen2.5-1.5B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Here is the count from 1 to 10:\n\n1, 2, 3, 4, 5, ",
        "reasoning_content": null
      },
      "finish_reason": "stop",
      "matched_stop": "6"
    }
  ],
  "usage": {
    "prompt_tokens": 35,
    "completion_tokens": 28,
    "total_tokens": 63,
    "prompt_tokens_details": { "cached_tokens": 32 }
  },
  "system_fingerprint": "default"
}

# 2. Completion
{
  "id": "cmpl_019e4d80-ddcd-7493-a43c-136420ddf72b",
  "object": "text_completion",
  "created": 1779416751,
  "model": "Qwen/Qwen2.5-1.5B-Instruct",
  "choices": [
    {
      "text": "5\n",
      "index": 0,
      "finish_reason": "stop",
      "matched_stop": "6"
    }
  ],
  "usage": { "prompt_tokens": 8, "completion_tokens": 3, "total_tokens": 11 },
  "system_fingerprint": "default"
}

# 3. Messages
{
  "id": "msg_019e4d80-f0ac-7b23-98b4-36628e0614c9",
  "type": "message",
  "role": "assistant",
  "content": [ { "type": "text", "text": "Sure! Here's the count from 1 to 10:\n\n1, 2, 3, 4, 5, " } ],
  "model": "Qwen/Qwen2.5-1.5B-Instruct",
  "stop_reason": "stop_sequence",
  "stop_sequence": "6",
  "usage": { "input_tokens": 35, "output_tokens": 30 }
}

# 4. Generate
[
  {
    "text": "5\n",
    "output_ids": [ 20, 198, 21 ],
    "meta_info": {
      "id": "gen-019e4d80-fed5-7bd2-a183-cfb745f181a7",
      "finish_reason": { "type": "stop" },
      "prompt_tokens": 8,
      "completion_tokens": 3,
      "matched_stop": "6"
    }
  }
]

# 5. Harmony Chat
{
  "id": "chatcmpl-019e51a4-f163-7dc3-ba6a-fef5e76a11d4",
  "object": "chat.completion",
  "created": 1779486224,
  "model": "mlx-community/gpt-oss-20b-MXFP4-Q4",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "reasoning_content": "We need to respond to \"Count 1-10\". The user says \"Count 1-10\". Likely they want us to count from 1 to 10. So we should output numbers 1 to 10. Probably each number on new line. So answer: 1 2 3 4 5 6"
      },
      "finish_reason": "stop",
      "matched_stop": 21
    }
  ],
  "usage": {
    "prompt_tokens": 74,
    "completion_tokens": 72,
    "total_tokens": 146,
    "completion_tokens_details": {
      "reasoning_tokens": 70
    }
  },
  "system_fingerprint": "default"
}
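NOTE: matched_stop is integer 21 (token ID for "6"), not the string "6" (known limitation L1).
The "model" field shows that this response was captured from the MLX Harmony worker (mlx-community/gpt-oss-20b-MXFP4-Q4), not from vLLM.
Stop fired during reasoning content before actual output text was produced.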

# 6. Harmony Responses
{
  "id": "responses-019e51a4-fa2d-7442-9bb0-12d758f42b50",
  "object": "response",
  "created_at": 1779486226,
  "status": "completed",
  "max_output_tokens": 200,
  "model": "mlx-community/gpt-oss-20b-MXFP4-Q4",
  "output": [
    {
      "type": "reasoning",
      "id": "reasoning_responses-019e51a4-fa2d-7442-9bb0-12d758f42b50",
      "content": [
        {
          "type": "reasoning_text",
          "text": "We need to interpret the user request. They say: \"Count 1-10\". They want us to count from 1 to 10. Possibly they want us to count. They might want us to count inclusive of both ends. So expected output: \"1, 2, 3, 4, 5, 6"
        }
      ],
      "status": "completed"
    }
  ],
  "parallel_tool_calls": true,
  "store": true,
  "temperature": 1.0,
  "tool_choice": "auto",
  "tools": [],
  "usage": {
    "input_tokens": 70,
    "output_tokens": 72,
    "total_tokens": 142,
    "output_tokens_details": {
      "reasoning_tokens": 70
    }
  },
  "metadata": {}
}
NOTE: status "completed" confirms stop fired correctly. Stop triggered during reasoning block
before actual output message was emitted — consistent with known limitation L1 (integer matched_stop).

1. Regular Chat (`/v1/chat/completions`)

MLX model: mlx-community/Qwen3-0.6B-4bit
vLLM model: Qwen/Qwen2.5-1.5B-Instruct

1.1 Non-streaming

1.1.1 Single-token string stop (`"stop": ["6"]`)

MLX:

curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-0.6B-4bit",
    "messages": [{"role": "user", "content": "Repeat: 1 2 3 hello world 4 5 6 7"}],
    "stop": ["6"],
    "stream": false,
    "max_tokens": 100,
    "thinking": {"type": "disabled"}
  }' | jq

vLLM:

curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
    "stop": ["6"],
    "stream": false,
    "max_tokens": 100
  }' | jq

Expected:

MLX baseline: HTTP 400 invalid_request_parameters
MLX HEAD: HTTP 200, finish_reason: "stop", matched_stop: "6", content ends before 6
vLLM: HTTP 200, finish_reason: "stop", matched_stop: "6", content ends before 6

MLX baseline result:

{
  "error": {
    "type": "Bad Request",
    "code": "invalid_request_parameters",
    "message": "Invalid request parameters: MLX backend does not support string stop sequences",
    "param": null
  }
}

MLX HEAD result:

{
  "id": "chatcmpl-019e5219-ac15-7453-8e07-76ba0e5c1a61",
  "object": "chat.completion",
  "created": 1779493874,
  "model": "mlx-community/Qwen3-0.6B-4bit",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "reasoning_content": "Okay, let's see. The user is asking me to repeat the string \"1 2 3 hello world 4 5"
      },
      "finish_reason": "stop",
      "matched_stop": "6"
    }
  ],
  "usage": {
    "prompt_tokens": 26,
    "completion_tokens": 31,
    "total_tokens": 57
  },
  "system_fingerprint": "default"
}

NOTE: Stop fired during reasoning content — content is null, reasoning_content truncates at "6". Prompt changed to "Repeat: 1 2 3 hello world 4 5 6 7" to ensure "6" appears early in reasoning within the 100-token budget.

vLLM result:

{
  "id": "chatcmpl-019e4db2-5127-7c53-bc6b-090fd23eb7c3",
  "object": "chat.completion",
  "created": 1779419992,
  "model": "Qwen/Qwen2.5-1.5B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Here is the count from 1 to 10:\n\n1  \n2  \n3  \n4  \n5  \n",
        "reasoning_content": null
      },
      "finish_reason": "stop",
      "matched_stop": "6"
    }
  ],
  "usage": {
    "prompt_tokens": 42,
    "completion_tokens": 23,
    "total_tokens": 65,
    "prompt_tokens_details": {
      "cached_tokens": 32
    }
  },
  "system_fingerprint": "default"
}

1.1.2 Multi-token string stop (`"stop": ["hello world"]`)

MLX:

curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-0.6B-4bit",
    "messages": [{"role": "user", "content": "Repeat exactly: 1 2 3 hello world 4 5"}],
    "stop": ["hello world"],
    "stream": false,
    "max_tokens": 100,
    "thinking": {"type": "disabled"}
  }' | jq

vLLM:

curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [{"role": "user", "content": "Repeat exactly: 1 2 3 hello world 4 5"}],
    "stop": ["hello world"],
    "stream": false,
    "max_tokens": 100
  }' | jq

Expected:

MLX baseline: HTTP 400 invalid_request_parameters
MLX HEAD: HTTP 400 unsupported_stop_string (still 400, different error)
vLLM: HTTP 200, finish_reason: "stop", matched_stop: "hello world"

MLX baseline result:

{
  "error": {
    "type": "Bad Request",
    "code": "invalid_request_parameters",
    "message": "Invalid request parameters: MLX backend does not support string stop sequences",
    "param": null
  }
}

MLX HEAD result:

{
  "error": {
    "type": "Bad Request",
    "code": "unsupported_stop_string",
    "message": "stop string \"hello world\" encodes to 2 tokens; MLX backend only supports single-token stop strings",
    "param": null
  }
}

vLLM result:

{
  "id": "chatcmpl-019e50a4-50ce-73c2-8afd-e445936604de",
  "object": "chat.completion",
  "created": 1779469406,
  "model": "Qwen/Qwen2.5-1.5B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "1 2 3 ",
        "reasoning_content": null
      },
      "finish_reason": "stop",
      "matched_stop": "hello world"
    }
  ],
  "usage": {
    "prompt_tokens": 44,
    "completion_tokens": 7,
    "total_tokens": 51,
    "prompt_tokens_details": {
      "cached_tokens": 16
    }
  },
  "system_fingerprint": "default"
}
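
The `unsupported_stop_string` error above follows from the approach described in the Solution section: each stop string is converted to a stop token ID, and a string that does not encode to exactly one token is rejected. The following is a minimal sketch of that check, not the PR's actual code. The function name, the exception class, and the toy vocabulary are hypothetical; only the IDs for "5" (20) and "6" (21) and the error message text are taken from this test plan.

```python
from typing import Callable, Dict, List


class UnsupportedStopString(ValueError):
    """Raised when a stop string does not encode to exactly one token."""


def stop_strings_to_token_ids(
    stops: List[str],
    encode: Callable[[str], List[int]],
) -> Dict[int, str]:
    """Map each single-token stop string to its token ID.

    `encode` stands in for the model tokenizer (an assumption here).
    """
    mapping: Dict[int, str] = {}
    for stop in stops:
        ids = encode(stop)
        if len(ids) != 1:
            raise UnsupportedStopString(
                f'stop string "{stop}" encodes to {len(ids)} tokens; '
                "MLX backend only supports single-token stop strings"
            )
        mapping[ids[0]] = stop
    return mapping


# Toy vocabulary for illustration only. The IDs for "5" and "6" mirror the
# test plan; the entries for "hello" and " world" are made up.
TOY_VOCAB = {"5": 20, "6": 21, "hello": 1000, " world": 1001}


def toy_encode(text: str) -> List[int]:
    """Greedy longest-match encoder over TOY_VOCAB (not a real tokenizer)."""
    ids: List[int] = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in TOY_VOCAB:
                ids.append(TOY_VOCAB[text[i:j]])
                i = j
                break
        else:
            raise KeyError(f"no token for {text[i:]!r}")
    return ids
```

With this toy encoder, `stop_strings_to_token_ids(["6"], toy_encode)` yields a one-entry mapping for token 21, while `["hello world"]` raises `UnsupportedStopString`. Keeping the ID-to-string mapping makes it possible to report the original string as matched_stop later.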

1.1.3 stop_token_ids (`[20, 21]`)

MLX:

curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-0.6B-4bit",
    "messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
    "stop_token_ids": [20, 21],
    "stream": false,
    "max_tokens": 100,
    "thinking": {"type": "disabled"}
  }' | jq '{finish_reason: .choices[0].finish_reason, matched_stop: .choices[0].matched_stop}'

vLLM:

curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
    "stop_token_ids": [20, 21],
    "stream": false,
    "max_tokens": 100
  }' | jq '{finish_reason: .choices[0].finish_reason, matched_stop: .choices[0].matched_stop}'

Expected:

MLX baseline: HTTP 200, matched_stop: 20
MLX HEAD: HTTP 200, matched_stop: 20
vLLM: HTTP 200, matched_stop: 20

MLX baseline result:

{
  "finish_reason": "stop",
  "matched_stop": 20
}

MLX HEAD result:

{
  "finish_reason": "stop",
  "matched_stop": 20
}

vLLM result:

{
  "finish_reason": "stop",
  "matched_stop": 20
}

1.1.4 String + stop_token_ids (`"stop": ["5"]`, `"stop_token_ids": [21]`)

String "5" (token 20) fires before token ID 21 ("6") — matched_stop should be the string.

MLX:

curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-0.6B-4bit",
    "messages": [{"role": "user", "content": "Repeat: 1 2 3 4 5 6 7"}],
    "stop": ["5"],
    "stop_token_ids": [21],
    "stream": false,
    "max_tokens": 100,
    "thinking": {"type": "disabled"}
  }' | jq '{finish_reason: .choices[0].finish_reason, matched_stop: .choices[0].matched_stop}'

vLLM:

curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
    "stop": ["5"],
    "stop_token_ids": [21],
    "stream": false,
    "max_tokens": 100
  }' | jq '{finish_reason: .choices[0].finish_reason, matched_stop: .choices[0].matched_stop}'

Expected:

MLX baseline: HTTP 400 (string stop present → rejected)
MLX HEAD: HTTP 200, matched_stop: "5" (string wins)
vLLM: HTTP 200, matched_stop: "5"

MLX baseline result:

{
  "error": {
    "type": "Bad Request",
    "code": "invalid_request_parameters",
    "message": "Invalid request parameters: MLX backend does not support string stop sequences",
    "param": null
  }
}

MLX HEAD result:

{
  "finish_reason": "stop",
  "matched_stop": "5"
}

NOTE: Prompt changed to "Repeat: 1 2 3 4 5 6 7" to ensure "5" appears early in reasoning before "6" (token ID 21). String stop "5" fires first — matched_stop is the string.

vLLM result:

{
  "finish_reason": "stop",
  "matched_stop": "5"
}
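
The "string wins" outcome in 1.1.4 can be read as a first-match rule over the generated token stream: whichever stop token appears first ends generation, and a token that originated from a stop string is reported as that string rather than as an integer. The sketch below illustrates that rule under stated assumptions. The function name and the tie-break (string mapping checked before explicit IDs) are hypothetical and not confirmed by the PR.

```python
from typing import Dict, Iterable, List, Optional, Union

MatchedStop = Union[str, int]


def first_matched_stop(
    generated: Iterable[int],
    string_stop_ids: Dict[int, str],
    stop_token_ids: List[int],
) -> Optional[MatchedStop]:
    """Return the stop that fires first, or None if no stop token appears.

    A token that came from a stop string is reported as the string;
    a token listed in `stop_token_ids` is reported as the integer ID.
    """
    explicit = set(stop_token_ids)
    for tok in generated:
        if tok in string_stop_ids:
            # Assumed tie-break: the string form takes precedence.
            return string_stop_ids[tok]
        if tok in explicit:
            return tok
    return None
```

For a stream in which token 20 precedes token 21, calling `first_matched_stop(stream, {20: "5"}, [21])` reports the string, which matches the MLX HEAD and vLLM results above.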

1.1.5 Multi-token string + stop_token_ids (`"stop": ["hello world"]`, `"stop_token_ids": [20]`)

MLX:

curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-0.6B-4bit",
    "messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
    "stop": ["hello world"],
    "stop_token_ids": [20],
    "stream": false,
    "max_tokens": 100,
    "thinking": {"type": "disabled"}
  }' | jq .

vLLM:

curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
    "stop": ["hello world"],
    "stop_token_ids": [20],
    "stream": false,
    "max_tokens": 100
  }' | jq '{finish_reason: .choices[0].finish_reason, matched_stop: .choices[0].matched_stop}'

Expected:

MLX baseline: HTTP 400
MLX HEAD: HTTP 400 unsupported_stop_string (multi-token still rejected)
vLLM: HTTP 200, matched_stop: 20 (token ID fires first)

MLX baseline result:

{
  "error": {
    "type": "Bad Request",
    "code": "invalid_request_parameters",
    "message": "Invalid request parameters: MLX backend does not support string stop sequences",
    "param": null
  }
}

MLX HEAD result:

{
  "error": {
    "type": "Bad Request",
    "code": "unsupported_stop_string",
    "message": "stop string \"hello world\" encodes to 2 tokens; MLX backend only supports single-token stop strings",
    "param": null
  }
}

vLLM result:

{
  "finish_reason": "stop",
  "matched_stop": 20
}

1.2 Streaming

1.2.1 Single-token string stop

MLX:

curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-0.6B-4bit",
    "messages": [{"role": "user", "content": "Repeat: 1 2 3 hello world 4 5 6 7"}],
    "stop": ["6"],
    "stream": true,
    "max_tokens": 100,
    "thinking": {"type": "disabled"}
  }' | grep "^data:" | grep -v "\[DONE\]" | tail -1 | sed 's/^data: //' | jq .

vLLM:

curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
    "stop": ["6"],
    "stream": true,
    "max_tokens": 100
  }' | grep "^data:" | grep -v "\[DONE\]" | tail -1 | sed 's/^data: //' | jq .

Expected:

MLX baseline: HTTP 400 (no SSE stream)
MLX HEAD: SSE — final chunk finish_reason: "stop", matched_stop: "6"
vLLM: SSE — final chunk finish_reason: "stop", matched_stop: "6"

MLX baseline result:

(no output — HTTP 400 before SSE stream; grep pipeline produces no matching lines)

MLX HEAD result:

{
  "id": "chatcmpl-019e521d-ba2c-7352-98d6-d5818f68a59b",
  "object": "chat.completion.chunk",
  "created": 1779494140,
  "model": "mlx-community/Qwen3-0.6B-4bit",
  "system_fingerprint": "default",
  "choices": [
    {
      "index": 0,
      "delta": {
        "reasoning_content": null
      },
      "logprobs": null,
      "finish_reason": "stop",
      "matched_stop": "6"
    }
  ]
}

NOTE: Prompt changed to "Repeat: 1 2 3 hello world 4 5 6 7" to ensure "6" appears early in reasoning within the 100-token budget.

vLLM result:

{
  "id": "chatcmpl-019e4d86-aa85-7132-8507-59dcdb557fd0",
  "object": "chat.completion.chunk",
  "created": 1779417131,
  "model": "Qwen/Qwen2.5-1.5B-Instruct",
  "system_fingerprint": "default",
  "choices": [
    {
      "index": 0,
      "delta": {
        "reasoning_content": null
      },
      "logprobs": null,
      "finish_reason": "stop",
      "matched_stop": "6"
    }
  ]
}
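
The streaming checks in this section rely on a `grep | tail | sed | jq` pipeline to isolate the final SSE chunk. A rough Python equivalent is sketched below for cases where the response body has already been captured as text. The helper name is hypothetical. It also shows why an HTTP 400 JSON error body produces no output: it contains no `data:` lines.

```python
import json
from typing import Optional


def last_sse_chunk(body: str) -> Optional[dict]:
    """Return the last JSON `data:` payload before `[DONE]`, or None."""
    last = None
    for line in body.splitlines():
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            continue  # the terminator carries no JSON
        last = payload
    return json.loads(last) if last is not None else None
```

Reading `choices[0].finish_reason` and `choices[0].matched_stop` from the returned dict corresponds to inspecting the `jq .` output shown in the results.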

1.2.2 Multi-token string stop

MLX:

curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-0.6B-4bit",
    "messages": [{"role": "user", "content": "Repeat exactly: 1 2 3 hello world 4 5"}],
    "stop": ["hello world"],
    "stream": true,
    "max_tokens": 100,
    "thinking": {"type": "disabled"}
  }' | grep "^data:" | grep -v "\[DONE\]" | tail -1 | sed 's/^data: //' | jq .

vLLM:

curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [{"role": "user", "content": "Repeat exactly: 1 2 3 hello world 4 5"}],
    "stop": ["hello world"],
    "stream": true,
    "max_tokens": 100
  }' | grep "^data:" | grep -v "\[DONE\]" | tail -1 | sed 's/^data: //' | jq .

Expected:

MLX (both revisions): HTTP 400 (different error codes)
vLLM: SSE — finish_reason: "stop", matched_stop: "hello world"

MLX baseline result:

(no output — HTTP 400 before SSE stream; grep pipeline produces no matching lines)

MLX HEAD result:

(no output — HTTP 400 before SSE stream; grep pipeline produces no matching lines)

vLLM result:

{
  "id": "chatcmpl-019e4d86-ae13-7273-aa55-07722534d888",
  "object": "chat.completion.chunk",
  "created": 1779417132,
  "model": "Qwen/Qwen2.5-1.5B-Instruct",
  "system_fingerprint": "default",
  "choices": [
    {
      "index": 0,
      "delta": {
        "reasoning_content": null
      },
      "logprobs": null,
      "finish_reason": "stop",
      "matched_stop": "hello world"
    }
  ]
}

1.2.3 stop_token_ids

MLX:

curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-0.6B-4bit",
    "messages": [{"role": "user", "content": "Repeat: 1 2 3 4 5 6 7"}],
    "stop_token_ids": [20, 21],
    "stream": true,
    "max_tokens": 100,
    "thinking": {"type": "disabled"}
  }' | grep "^data:" | grep -v "\[DONE\]" | tail -1 | sed 's/^data: //' | jq .

vLLM:

curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
    "stop_token_ids": [20, 21],
    "stream": true,
    "max_tokens": 100
  }' | grep "^data:" | grep -v "\[DONE\]" | tail -1 | sed 's/^data: //' | jq .

Expected:

MLX (both revisions): SSE — finish_reason: "stop", matched_stop: 20
vLLM: SSE — finish_reason: "stop", matched_stop: 20

MLX baseline result:

{
  "id": "chatcmpl-019e50df-2873-7c60-b792-8f731327b269",
  "object": "chat.completion.chunk",
  "created": 1779473262,
  "model": "mlx-community/Qwen3-0.6B-4bit",
  "system_fingerprint": "default",
  "choices": [
    {
      "index": 0,
      "delta": {
        "reasoning_content": null
      },
      "logprobs": null,
      "finish_reason": "length"
    }
  ]
}

NOTE: This result is non-deterministic. The streaming stop_token_ids mechanism works correctly (a re-run returned finish_reason: "stop", matched_stop: 20), but `thinking: {"type": "disabled"}` has no effect in the baseline, so Qwen3 always enters thinking mode. Whether token 20 appears within the 100-token budget varies per run; if it does not, generation hits the length limit instead.

MLX HEAD result:

{
  "id": "chatcmpl-019e521d-ba2d-78a3-95ba-c943209df5a8",
  "object": "chat.completion.chunk",
  "created": 1779494140,
  "model": "mlx-community/Qwen3-0.6B-4bit",
  "system_fingerprint": "default",
  "choices": [
    {
      "index": 0,
      "delta": {
        "reasoning_content": null
      },
      "logprobs": null,
      "finish_reason": "stop",
      "matched_stop": 20
    }
  ]
}

NOTE: Prompt changed to "Repeat: 1 2 3 4 5 6 7" to ensure token 20 ("5") appears early in reasoning within the 100-token budget.

vLLM result:

{
  "id": "chatcmpl-019e4d86-b142-7712-9b96-46031c8cb5e0",
  "object": "chat.completion.chunk",
  "created": 1779417133,
  "model": "Qwen/Qwen2.5-1.5B-Instruct",
  "system_fingerprint": "default",
  "choices": [
    {
      "index": 0,
      "delta": {
        "reasoning_content": null
      },
      "logprobs": null,
      "finish_reason": "stop",
      "matched_stop": 20
    }
  ]
}
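
The baseline NOTE in 1.2.3 describes a budget effect: the same stop_token_ids can yield finish_reason "stop" or "length" depending on whether a stop token shows up within max_tokens. A small sketch of that interaction follows. The function name is hypothetical, and the model is assumed to keep generating until it hits a stop token or the budget (end-of-sequence handling is left out).

```python
from typing import Optional, Sequence, Tuple


def finish_within_budget(
    tokens: Sequence[int],
    stop_token_ids: Sequence[int],
    max_tokens: int,
) -> Tuple[str, Optional[int]]:
    """Return (finish_reason, matched_stop) for a budget-capped generation."""
    stops = set(stop_token_ids)
    # Only the first `max_tokens` tokens are ever generated.
    for tok in tokens[:max_tokens]:
        if tok in stops:
            return "stop", tok
    # No stop token inside the budget: generation is cut off by length.
    return "length", None
```

If thinking-mode output pushes token 20 past position 100, the call reports "length" with no matched_stop, which is the shape of the baseline result above.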

1.2.4 String + stop_token_ids (`"stop": ["5"]`, `"stop_token_ids": [21]`)

MLX:

curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-0.6B-4bit",
    "messages": [{"role": "user", "content": "Repeat: 1 2 3 4 5 6 7"}],
    "stop": ["5"],
    "stop_token_ids": [21],
    "stream": true,
    "max_tokens": 100,
    "thinking": {"type": "disabled"}
  }' | grep "^data:" | grep -v "\[DONE\]" | tail -1 | sed 's/^data: //' | jq .

vLLM:

curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
    "stop": ["5"],
    "stop_token_ids": [21],
    "stream": true,
    "max_tokens": 100
  }' | grep "^data:" | grep -v "\[DONE\]" | tail -1 | sed 's/^data: //' | jq .

Expected:

MLX baseline: HTTP 400
MLX HEAD: SSE — matched_stop: "5"
vLLM: SSE — matched_stop: "5"

MLX baseline result:

(no output — HTTP 400 before SSE stream; grep pipeline produces no matching lines)

MLX HEAD result:

{
  "id": "chatcmpl-019e521d-ba30-7ae0-bd0f-447bccc20bba",
  "object": "chat.completion.chunk",
  "created": 1779494140,
  "model": "mlx-community/Qwen3-0.6B-4bit",
  "system_fingerprint": "default",
  "choices": [
    {
      "index": 0,
      "delta": {
        "reasoning_content": null
      },
      "logprobs": null,
      "finish_reason": "stop",
      "matched_stop": "5"
    }
  ]
}

NOTE: Prompt changed to "Repeat: 1 2 3 4 5 6 7" to ensure "5" appears early in reasoning before token ID 21 ("6").

vLLM result:

{
  "id": "chatcmpl-019e4d86-b9d9-7293-8c31-33a5ff2c57d9",
  "object": "chat.completion.chunk",
  "created": 1779417135,
  "model": "Qwen/Qwen2.5-1.5B-Instruct",
  "system_fingerprint": "default",
  "choices": [
    {
      "index": 0,
      "delta": {
        "reasoning_content": null
      },
      "logprobs": null,
      "finish_reason": "stop",
      "matched_stop": "5"
    }
  ]
}

1.2.5 Multi-token string + stop_token_ids (`"stop": ["hello world"]`, `"stop_token_ids": [20]`)

MLX:

curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-0.6B-4bit",
    "messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
    "stop": ["hello world"],
    "stop_token_ids": [20],
    "stream": true,
    "max_tokens": 100,
    "thinking": {"type": "disabled"}
  }' | grep "^data:" | grep -v "\[DONE\]" | tail -1 | sed 's/^data: //' | jq .

vLLM:

curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
    "stop": ["hello world"],
    "stop_token_ids": [20],
    "stream": true,
    "max_tokens": 100
  }' | grep "^data:" | grep -v "\[DONE\]" | tail -1 | sed 's/^data: //' | jq .

Expected:

MLX (both revisions): HTTP 400 (different error codes)
vLLM: SSE — matched_stop: 20

MLX baseline result:

(no output — HTTP 400 before SSE stream; grep pipeline produces no matching lines)

MLX HEAD result:

(no output — HTTP 400 before SSE stream; grep pipeline produces no matching lines)

vLLM result:

{
  "id": "chatcmpl-019e4d86-bcea-75c2-97a6-7e34ec843dfb",
  "object": "chat.completion.chunk",
  "created": 1779417136,
  "model": "Qwen/Qwen2.5-1.5B-Instruct",
  "system_fingerprint": "default",
  "choices": [
    {
      "index": 0,
      "delta": {
        "reasoning_content": null
      },
      "logprobs": null,
      "finish_reason": "stop",
      "matched_stop": 20
    }
  ]
}

2. Regular Completion (`/v1/completions`)

MLX model: mlx-community/Qwen3-0.6B-4bit
vLLM model: Qwen/Qwen2.5-1.5B-Instruct

/v1/completions passes the prompt as raw text — no chat template, no thinking flag needed.

matched_stop in streaming: On the baseline, streaming /v1/completions does not include matched_stop
in any SSE chunk for either backend (known limitation L3). HEAD adds matched_stop to the final chunk for both backends; see the results in 2.2.
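
The streaming results in 2.2 show final `text_completion` chunks both with and without a matched_stop field. A minimal sketch of a final-chunk builder that attaches the field only when a stop actually fired is given below. The function name is hypothetical, and the field layout is copied from the result payloads in this section rather than from the PR's code.

```python
from typing import Optional, Union


def final_completion_choice(
    finish_reason: str,
    matched_stop: Optional[Union[str, int]] = None,
) -> dict:
    """Build the `choices[0]` entry of a final streaming completion chunk."""
    choice = {"text": "", "index": 0, "finish_reason": finish_reason}
    if matched_stop is not None:
        # Attach the field only when a stop string or token ID fired.
        choice["matched_stop"] = matched_stop
    return choice
```

Omitting the key rather than emitting `null` keeps chunks that end on "length" identical in shape to the earlier payloads.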

2.1 Non-streaming

2.1.1 Single-token string stop

MLX:

curl http://localhost:3000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-0.6B-4bit",
    "prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
    "stop": ["6"],
    "stream": false,
    "max_tokens": 100
  }' | jq '{text: .choices[0].text, finish_reason: .choices[0].finish_reason, matched_stop: .choices[0].matched_stop}'

vLLM:

curl http://localhost:3000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
    "stop": ["6"],
    "stream": false,
    "max_tokens": 100
  }' | jq '{text: .choices[0].text, finish_reason: .choices[0].finish_reason, matched_stop: .choices[0].matched_stop}'

Expected:

MLX baseline: HTTP 400
MLX HEAD: HTTP 200, text: "5\n", matched_stop: "6"
vLLM: HTTP 200, text: "5\n", matched_stop: "6"

MLX baseline result:

{
  "error": {
    "type": "Bad Request",
    "code": "invalid_request_parameters",
    "message": "Invalid request parameters: MLX backend does not support string stop sequences",
    "param": null
  }
}

MLX HEAD result:

{
  "text": "5\n",
  "finish_reason": "stop",
  "matched_stop": "6"
}

vLLM result:

{
  "text": "5\n",
  "finish_reason": "stop",
  "matched_stop": "6"
}

2.1.2 Multi-token string stop

MLX:

curl http://localhost:3000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-0.6B-4bit",
    "prompt": "Repeat exactly: 1 2 3 hello world 4 5",
    "stop": ["hello world"],
    "stream": false,
    "max_tokens": 100
  }' | jq .

vLLM:

curl http://localhost:3000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "prompt": "Repeat exactly: 1 2 3 hello world 4 5",
    "stop": ["hello world"],
    "stream": false,
    "max_tokens": 100
  }' | jq '{text: .choices[0].text, matched_stop: .choices[0].matched_stop}'

Expected:

MLX (both revisions): HTTP 400 (different error codes)
vLLM: HTTP 200, matched_stop: "hello world"

MLX baseline result:

{
  "error": {
    "type": "Bad Request",
    "code": "invalid_request_parameters",
    "message": "Invalid request parameters: MLX backend does not support string stop sequences",
    "param": null
  }
}

MLX HEAD result:

{
  "error": {
    "type": "Bad Request",
    "code": "unsupported_stop_string",
    "message": "stop string \"hello world\" encodes to 2 tokens; MLX backend only supports single-token stop strings",
    "param": null
  }
}

vLLM result:

{
  "text": " 6\n\nSure, here is the repeated text:\n\n1 2 3 ",
  "matched_stop": "hello world"
}

2.1.3 stop_token_ids

MLX:

curl http://localhost:3000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-0.6B-4bit",
    "prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
    "stop_token_ids": [20, 21],
    "stream": false,
    "max_tokens": 100
  }' | jq '{text: .choices[0].text, matched_stop: .choices[0].matched_stop}'

vLLM:

curl http://localhost:3000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
    "stop_token_ids": [20, 21],
    "stream": false,
    "max_tokens": 100
  }' | jq '{text: .choices[0].text, matched_stop: .choices[0].matched_stop}'

Expected:

MLX (both revisions): HTTP 200, text: "", matched_stop: 20
vLLM: HTTP 200, text: "", matched_stop: 20

MLX baseline result:

{
  "text": "",
  "matched_stop": 20
}

MLX HEAD result:

{
  "text": "",
  "matched_stop": 20
}

vLLM result:

{
  "text": "",
  "matched_stop": 20
}

2.1.4 String + stop_token_ids (`"stop": ["5"]`, `"stop_token_ids": [21]`)

MLX:

curl http://localhost:3000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-0.6B-4bit",
    "prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
    "stop": ["5"],
    "stop_token_ids": [21],
    "stream": false,
    "max_tokens": 100
  }' | jq '{text: .choices[0].text, matched_stop: .choices[0].matched_stop}'

vLLM:

curl http://localhost:3000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
    "stop": ["5"],
    "stop_token_ids": [21],
    "stream": false,
    "max_tokens": 100
  }' | jq '{text: .choices[0].text, matched_stop: .choices[0].matched_stop}'

Expected:

MLX baseline: HTTP 400
MLX HEAD: HTTP 200, text: "", matched_stop: "5"
vLLM: HTTP 200, text: "", matched_stop: "5"

MLX baseline result:

{
  "error": {
    "type": "Bad Request",
    "code": "invalid_request_parameters",
    "message": "Invalid request parameters: MLX backend does not support string stop sequences",
    "param": null
  }
}

MLX HEAD result:

{
  "text": "",
  "matched_stop": "5"
}

vLLM result:

{
  "text": "",
  "matched_stop": "5"
}

2.1.5 Multi-token string + stop_token_ids (`"stop": ["hello world"]`, `"stop_token_ids": [20]`)

MLX:

curl http://localhost:3000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-0.6B-4bit",
    "prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
    "stop": ["hello world"],
    "stop_token_ids": [20],
    "stream": false,
    "max_tokens": 100
  }' | jq .

vLLM:

curl http://localhost:3000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
    "stop": ["hello world"],
    "stop_token_ids": [20],
    "stream": false,
    "max_tokens": 100
  }' | jq '{text: .choices[0].text, matched_stop: .choices[0].matched_stop}'

Expected:

MLX (both revisions): HTTP 400
vLLM: HTTP 200, text: "", matched_stop: 20

MLX baseline result:

{
  "error": {
    "type": "Bad Request",
    "code": "invalid_request_parameters",
    "message": "Invalid request parameters: MLX backend does not support string stop sequences",
    "param": null
  }
}

MLX HEAD result:

{
  "error": {
    "type": "Bad Request",
    "code": "unsupported_stop_string",
    "message": "stop string \"hello world\" encodes to 2 tokens; MLX backend only supports single-token stop strings",
    "param": null
  }
}

vLLM result:

{
  "text": "",
  "matched_stop": 20
}

2.2 Streaming

2.2.1 Single-token string stop

MLX:

curl http://localhost:3000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-0.6B-4bit",
    "prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
    "stop": ["6"],
    "stream": true,
    "max_tokens": 100
  }' | grep "^data:" | grep -v "\[DONE\]" | tail -1 | sed 's/^data: //' | jq .

vLLM:

curl http://localhost:3000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
    "stop": ["6"],
    "stream": true,
    "max_tokens": 100
  }' | grep "^data:" | grep -v "\[DONE\]" | tail -1 | sed 's/^data: //' | jq .

Expected:

MLX baseline: HTTP 400
MLX HEAD: SSE, final chunk finish_reason: "stop", matched_stop: "6" (known limitation L3 is fixed in HEAD)
vLLM: SSE, final chunk finish_reason: "stop"; no matched_stop before the fix, matched_stop: "6" after the fix

MLX baseline result:

(no output — HTTP 400 before SSE stream; grep pipeline produces no matching lines)

MLX HEAD result:

{
  "id": "cmpl_019e51cc-33ba-7872-afe1-b94b86f63b2a",
  "object": "text_completion",
  "created": 1779488797,
  "choices": [
    {
      "text": "",
      "index": 0,
      "finish_reason": "stop",
      "matched_stop": "6"
    }
  ],
  "model": "mlx-community/Qwen3-0.6B-4bit",
  "system_fingerprint": "default"
}

NOTE: `matched_stop: "6"` is present in the final streaming chunk, so L3 is fixed in HEAD for MLX.

vLLM result (before fix):

{
  "id": "cmpl_019e4d86-dace-7061-a150-f55edf957d8d",
  "object": "text_completion",
  "created": 1779416744,
  "choices": [
    {
      "text": "",
      "index": 0,
      "finish_reason": "stop"
    }
  ],
  "model": "Qwen/Qwen2.5-1.5B-Instruct",
  "system_fingerprint": "default"
}

vLLM result (after fix):

{
  "id": "cmpl_019e50c5-da31-7052-8dc8-34799b86f78f",
  "object": "text_completion",
  "created": 1779471604,
  "choices": [
    {
      "text": "",
      "index": 0,
      "finish_reason": "stop",
      "matched_stop": "6"
    }
  ],
  "model": "Qwen/Qwen2.5-1.5B-Instruct",
  "system_fingerprint": "default"
}

2.2.2 Multi-token string stop

MLX:

curl http://localhost:3000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-0.6B-4bit",
    "prompt": "Repeat exactly: 1 2 3 hello world 4 5",
    "stop": ["hello world"],
    "stream": true,
    "max_tokens": 200
  }' | grep "^data:" | grep -v "\[DONE\]" | tail -1 | sed 's/^data: //' | jq .

vLLM:

curl http://localhost:3000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "prompt": "Repeat exactly: 1 2 3 hello world 4 5",
    "stop": ["hello world"],
    "stream": true,
    "max_tokens": 200
  }' | grep "^data:" | grep -v "\[DONE\]" | tail -1 | sed 's/^data: //' | jq .

Expected:

MLX (both revisions): HTTP 400
vLLM: SSE, finish_reason: "stop"; no matched_stop before the fix, matched_stop: "hello world" after the fix

MLX baseline result:

(no output — HTTP 400 before SSE stream; grep pipeline produces no matching lines)

MLX HEAD result:

(no output — HTTP 400 before SSE stream; grep pipeline produces no matching lines)

vLLM result (before fix):

{
  "id": "cmpl_019e50a7-f313-78a3-bc48-95cf42d086f7",
  "object": "text_completion",
  "created": 1779469644,
  "choices": [
    {
      "text": "",
      "index": 0,
      "finish_reason": "stop"
    }
  ],
  "model": "Qwen/Qwen2.5-1.5B-Instruct",
  "system_fingerprint": "default"
}

vLLM result (after fix):

{
  "id": "cmpl_019e50c8-f112-7500-9929-9e3719837aef",
  "object": "text_completion",
  "created": 1779471806,
  "choices": [
    {
      "text": "",
      "index": 0,
      "finish_reason": "stop",
      "matched_stop": "hello world"
    }
  ],
  "model": "Qwen/Qwen2.5-1.5B-Instruct",
  "system_fingerprint": "default"
}

2.2.3 stop_token_ids

MLX:

curl http://localhost:3000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-0.6B-4bit",
    "prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
    "stop_token_ids": [20, 21],
    "stream": true,
    "max_tokens": 100
  }' | grep "^data:" | grep -v "\[DONE\]" | tail -1 | sed 's/^data: //' | jq .

vLLM:

curl http://localhost:3000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
    "stop_token_ids": [20, 21],
    "stream": true,
    "max_tokens": 100
  }' | grep "^data:" | grep -v "\[DONE\]" | tail -1 | sed 's/^data: //' | jq .

Expected:

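MLX baseline: SSE, stops immediately, finish_reason: "stop", no matched_stop
MLX HEAD: SSE, stops immediately, finish_reason: "stop", matched_stop: 20
vLLM: SSE, stops immediately, finish_reason: "stop"; no matched_stop before the fix, matched_stop: 20 after the fix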

MLX baseline result:

{
  "id": "cmpl_019e50df-a6c8-7332-ac63-38eabbebdddb",
  "object": "text_completion",
  "created": 1779473295,
  "choices": [
    {
      "text": "",
      "index": 0,
      "finish_reason": "stop"
    }
  ],
  "model": "mlx-community/Qwen3-0.6B-4bit",
  "system_fingerprint": "default"
}

MLX HEAD result:

{
  "id": "cmpl_019e51cc-33c1-75e3-ae52-8e5fb96233e2",
  "object": "text_completion",
  "created": 1779488797,
  "choices": [
    {
      "text": "",
      "index": 0,
      "finish_reason": "stop",
      "matched_stop": 20
    }
  ],
  "model": "mlx-community/Qwen3-0.6B-4bit",
  "system_fingerprint": "default"
}

NOTE: `matched_stop: 20` is present in the final streaming chunk, so L3 is fixed in HEAD.

vLLM result (before fix):

{
  "id": "cmpl_019e4d87-7b29-76b3-a921-c68388c35493",
  "object": "text_completion",
  "created": 1779417185,
  "choices": [
    {
      "text": "",
      "index": 0,
      "finish_reason": "stop"
    }
  ],
  "model": "Qwen/Qwen2.5-1.5B-Instruct",
  "system_fingerprint": "default"
}

vLLM result (after fix):

{
  "id": "cmpl_019e50c5-e7d3-72c0-97fc-afff84a7d92d",
  "object": "text_completion",
  "created": 1779471607,
  "choices": [
    {
      "text": "",
      "index": 0,
      "finish_reason": "stop",
      "matched_stop": 20
    }
  ],
  "model": "Qwen/Qwen2.5-1.5B-Instruct",
  "system_fingerprint": "default"
}

2.2.4 String + stop_token_ids (`"stop": ["5"]`, `"stop_token_ids": [21]`)

MLX:

curl http://localhost:3000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-0.6B-4bit",
    "prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
    "stop": ["5"],
    "stop_token_ids": [21],
    "stream": true,
    "max_tokens": 100
  }' | grep "^data:" | grep -v "\[DONE\]" | tail -1 | sed 's/^data: //' | jq .

vLLM:

curl http://localhost:3000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
    "stop": ["5"],
    "stop_token_ids": [21],
    "stream": true,
    "max_tokens": 100
  }' | grep "^data:" | grep -v "\[DONE\]" | tail -1 | sed 's/^data: //' | jq .

Expected:

MLX baseline: HTTP 400
MLX HEAD: SSE, stops immediately, matched_stop: "5"
vLLM: SSE, stops immediately; no matched_stop before the fix, matched_stop: "5" after the fix

MLX baseline result:

(no output — HTTP 400 before SSE stream; grep pipeline produces no matching lines)

MLX HEAD result:

{
  "id": "cmpl_019e51cc-33c5-7660-a9aa-1ce5e36f8d08",
  "object": "text_completion",
  "created": 1779488797,
  "choices": [
    {
      "text": "",
      "index": 0,
      "finish_reason": "stop",
      "matched_stop": "5"
    }
  ],
  "model": "mlx-community/Qwen3-0.6B-4bit",
  "system_fingerprint": "default"
}

NOTE: `matched_stop: "5"` is present in the final streaming chunk, so L3 is fixed in HEAD.

vLLM result (before fix):

{
  "id": "cmpl_019e4d87-7d74-78b2-992d-6ee168192a78",
  "object": "text_completion",
  "created": 1779417185,
  "choices": [
    {
      "text": "",
      "index": 0,
      "finish_reason": "stop"
    }
  ],
  "model": "Qwen/Qwen2.5-1.5B-Instruct",
  "system_fingerprint": "default"
}

vLLM result (after fix):

{
  "id": "cmpl_019e50c5-efa1-7b31-bfa3-24e4d1fc6fc1",
  "object": "text_completion",
  "created": 1779471609,
  "choices": [
    {
      "text": "",
      "index": 0,
      "finish_reason": "stop",
      "matched_stop": "5"
    }
  ],
  "model": "Qwen/Qwen2.5-1.5B-Instruct",
  "system_fingerprint": "default"
}

2.2.5 Multi-token string + stop_token_ids (`"stop": ["hello world"]`, `"stop_token_ids": [20]`)

MLX:

curl http://localhost:3000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-0.6B-4bit",
    "prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
    "stop": ["hello world"],
    "stop_token_ids": [20],
    "stream": true,
    "max_tokens": 100
  }' | grep "^data:" | grep -v "\[DONE\]" | tail -1 | sed 's/^data: //' | jq .

vLLM:

curl http://localhost:3000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
    "stop": ["hello world"],
    "stop_token_ids": [20],
    "stream": true,
    "max_tokens": 100
  }' | grep "^data:" | grep -v "\[DONE\]" | tail -1 | sed 's/^data: //' | jq .

Expected:

MLX (both revisions): HTTP 400
vLLM: SSE, stops immediately; no matched_stop before the fix, matched_stop: 20 after the fix

MLX baseline result:

(no output — HTTP 400 before SSE stream; grep pipeline produces no matching lines)

MLX HEAD result:

(no output — HTTP 400 before SSE stream; grep pipeline produces no matching lines)

vLLM result (before fix):

{
  "id": "cmpl_019e4d87-7f04-72e2-aabe-bc0a42b80a44",
  "object": "text_completion",
  "created": 1779417186,
  "choices": [
    {
      "text": "",
      "index": 0,
      "finish_reason": "stop"
    }
  ],
  "model": "Qwen/Qwen2.5-1.5B-Instruct",
  "system_fingerprint": "default"
}

vLLM result (after fix):

{
  "id": "cmpl_019e50c5-f570-7710-83b7-dc1f75e26910",
  "object": "text_completion",
  "created": 1779471611,
  "choices": [
    {
      "text": "",
      "index": 0,
      "finish_reason": "stop",
      "matched_stop": 20
    }
  ],
  "model": "Qwen/Qwen2.5-1.5B-Instruct",
  "system_fingerprint": "default"
}

3. Regular Messages (`/v1/messages`) — NEW for MLX

MLX model: mlx-community/Qwen3-0.6B-4bit
vLLM model: Qwen/Qwen2.5-1.5B-Instruct

The Messages API uses stop_sequences (string array only); there is no stop_token_ids field.
To test integer stop IDs, use the smg_sampling_params extension on this path or use /generate instead.
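
Because the Messages API reports stops as stop_reason and stop_sequence rather than finish_reason and matched_stop, some translation between the two shapes is implied. The sketch below shows one plausible mapping. It is an assumption, not the router's confirmed logic, and the function name is hypothetical; "stop_sequence", "max_tokens", and "end_turn" are stop_reason values from the Messages API.

```python
from typing import Optional, Tuple, Union


def to_messages_stop(
    finish_reason: str,
    matched_stop: Optional[Union[str, int]],
) -> Tuple[str, Optional[str]]:
    """Translate (finish_reason, matched_stop) to (stop_reason, stop_sequence)."""
    if finish_reason == "length":
        return "max_tokens", None
    if finish_reason == "stop" and isinstance(matched_stop, str):
        # A matched stop string is surfaced verbatim as stop_sequence.
        return "stop_sequence", matched_stop
    # Natural end of generation, or an integer stop with no string form.
    return "end_turn", None
```

Under this mapping, a generation that stopped on the string "6" is reported as stop_reason "stop_sequence" with stop_sequence "6", the shape seen in the 3.1.1 results.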

3.1 Non-streaming

3.1.1 Single-token stop string

MLX:

curl http://localhost:3000/v1/messages \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-0.6B-4bit",
    "messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
    "max_tokens": 100,
    "stop_sequences": ["6"],
    "thinking": {"type": "disabled"}
  }' | jq '{stop_reason, stop_sequence, content: .content[0].text}'

vLLM:

curl http://localhost:3000/v1/messages \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
    "max_tokens": 100,
    "stop_sequences": ["6"]
  }' | jq '{stop_reason, stop_sequence, content: .content[0].text}'

Expected:

MLX baseline: HTTP 400
MLX HEAD: HTTP 200, stop_reason: "stop_sequence", stop_sequence: "6", content ends before 6
vLLM: HTTP 200, stop_reason: "stop_sequence", stop_sequence: "6"

MLX baseline result:

{
  "error": {
    "type": "Bad Request",
    "code": "invalid_request_parameters",
    "message": "Invalid request parameters: MLX backend does not support string stop sequences",
    "param": null
  }
}

MLX HEAD result:

{
  "stop_reason": "stop_sequence",
  "stop_sequence": "6",
  "content": "Starting from 1 to 10, one number per line:\n\n1  \n2  \n3  \n4  \n5  \n"
}

vLLM result:

{
  "stop_reason": "stop_sequence",
  "stop_sequence": "6",
  "content": "Here is the count from 1 to 10:\n\n1  \n2  \n3  \n4  \n5  \n"
}

3.1.2 Multi-token stop string

MLX:

curl http://localhost:3000/v1/messages \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-0.6B-4bit",
    "messages": [{"role": "user", "content": "Say: hi there and hello world!"}],
    "max_tokens": 100,
    "stop_sequences": ["hello world"]
  }' | jq .

vLLM:

curl http://localhost:3000/v1/messages \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [{"role": "user", "content": "Repeat exactly: 1 2 3 hello world 4 5"}],
    "max_tokens": 100,
    "stop_sequences": ["hello world"]
  }' | jq '{stop_reason, stop_sequence}'

Expected:

MLX (both revisions): HTTP 400 (different error codes)
vLLM: HTTP 200, stop_reason: "stop_sequence", stop_sequence: "hello world"

MLX baseline result:

{
  "error": {
    "type": "Bad Request",
    "code": "invalid_request_parameters",
    "message": "Invalid request parameters: MLX backend does not support string stop sequences",
    "param": null
  }
}

MLX HEAD result:

{
  "error": {
    "type": "Bad Request",
    "code": "unsupported_stop_string",
    "message": "stop string \"hello world\" encodes to 2 tokens; MLX backend only supports single-token stop strings",
    "param": null
  }
}

vLLM result:

{
  "stop_reason": "stop_sequence",
  "stop_sequence": "hello world"
}
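
The single-token rule behind the `unsupported_stop_string` error can be illustrated as follows. This is a sketch under stated assumptions, not the PR's Rust implementation: `stop_strings_to_token_ids`, `UnsupportedStopString`, and the toy tokenizer are hypothetical names, and the word IDs in the toy vocabulary are made up (only 20 and 21 for "5" and "6" come from the tests above).

```python
from typing import Callable, List


class UnsupportedStopString(ValueError):
    """Raised when a stop string does not encode to exactly one token."""


def stop_strings_to_token_ids(
    stops: List[str], encode: Callable[[str], List[int]]
) -> List[int]:
    """Convert each stop string to a single stop token ID, or raise."""
    ids = []
    for stop in stops:
        tokens = encode(stop)
        if len(tokens) != 1:
            raise UnsupportedStopString(
                f'stop string "{stop}" encodes to {len(tokens)} tokens; '
                "MLX backend only supports single-token stop strings"
            )
        ids.append(tokens[0])
    return ids


# Toy vocabulary standing in for a real tokenizer (assumption for illustration).
TOY_VOCAB = {"5": 20, "6": 21, "hello": 1000, "world": 1001}


def toy_encode(text: str) -> List[int]:
    """Whitespace tokenizer over TOY_VOCAB; real tokenizers behave differently."""
    return [TOY_VOCAB[piece] for piece in text.split()]


print(stop_strings_to_token_ids(["6"], toy_encode))

try:
    stop_strings_to_token_ids(["hello world"], toy_encode)
except UnsupportedStopString as err:
    print(err)
```

Rejecting the request up front, rather than silently ignoring the stop string, keeps the failure visible to the caller as an HTTP 400.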

3.2 Streaming

3.2.1 Single-token stop string

MLX:

curl http://localhost:3000/v1/messages \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-0.6B-4bit",
    "messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
    "max_tokens": 100,
    "stop_sequences": ["6"],
    "thinking": {"type": "disabled"},
    "stream": true
  }' | grep "^data:" | grep "message_delta" | tail -1 | sed 's/^data: //' | jq .

vLLM:

curl http://localhost:3000/v1/messages \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
    "max_tokens": 100,
    "stop_sequences": ["6"],
    "stream": true
  }' | grep "^data:" | grep "message_delta" | tail -1 | sed 's/^data: //' | jq .

Expected:

MLX baseline: HTTP 400 (no SSE)
MLX HEAD: SSE — message_delta with stop_reason: "stop_sequence", stop_sequence: "6"
vLLM: SSE — same shape

MLX baseline result:

(no output — HTTP 400 before SSE stream; grep pipeline produces no matching lines)

MLX HEAD result:

{
  "type": "message_delta",
  "delta": {
    "stop_reason": "stop_sequence",
    "stop_sequence": "6"
  },
  "usage": {
    "output_tokens": 25
  }
}

vLLM result:

{
  "type": "message_delta",
  "delta": {
    "stop_reason": "stop_sequence",
    "stop_sequence": "6"
  },
  "usage": {
    "output_tokens": 23
  }
}

4. Regular Generate (`/generate`) — NEW for MLX

MLX model: mlx-community/Qwen3-0.6B-4bit
vLLM model: Qwen/Qwen2.5-1.5B-Instruct

Response is a JSON array; use jq 'if type == "array" then .[0] else . end | ...'.

matched_stop on MLX Generate: Raw integer token ID (not the original string).
vLLM returns the original string. This is a known limitation — see L1.
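
For scripted checks, the jq filter above can be reproduced in Python. A minimal sketch, assuming the response shape shown in the results below; `first_result` and `summarize` are hypothetical helper names.

```python
import json
from typing import Any, Dict


def first_result(body: Any) -> Dict[str, Any]:
    """Mirror of jq 'if type == "array" then .[0] else . end'."""
    return body[0] if isinstance(body, list) else body


def summarize(body: Any) -> Dict[str, Any]:
    """Pick the fields the tests compare: text, finish reason, matched_stop."""
    item = first_result(body)
    meta = item.get("meta_info", {})
    return {
        "text": item.get("text"),
        "finish": meta.get("finish_reason"),
        "matched_stop": meta.get("matched_stop"),
    }


# Sample body shaped like a /generate response (values are illustrative).
body = json.loads(
    '[{"text": "5\\n", "meta_info": {"finish_reason": {"type": "stop"}, "matched_stop": 21}}]'
)
summary = summarize(body)
print(summary)
# An integer here indicates a raw token ID; a string is the original stop string.
print(type(summary["matched_stop"]).__name__)
```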

4.1 Non-streaming

4.1.1 Single-token string stop

MLX:

curl http://localhost:3000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-0.6B-4bit",
    "text": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
    "sampling_params": {"stop": ["6"], "max_new_tokens": 50}
  }' | jq 'if type == "array" then .[0] else . end |
         {text, finish: .meta_info.finish_reason, matched_stop: .meta_info.matched_stop}'

vLLM:

curl http://localhost:3000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "text": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
    "sampling_params": {"stop": ["6"], "max_new_tokens": 50}
  }' | jq 'if type == "array" then .[0] else . end |
         {text, finish: .meta_info.finish_reason, matched_stop: .meta_info.matched_stop}'

Expected:

MLX baseline: HTTP 400
MLX HEAD: HTTP 200, text: "5\n", matched_stop: 21 (integer — known limitation L1)
vLLM: HTTP 200, text: "5\n", matched_stop: "6" (string)

MLX baseline result:

{
  "error": {
    "type": "Bad Request",
    "code": "build_request_failed",
    "message": "MLX backend does not support string stop sequences",
    "param": null
  }
}

MLX HEAD result:

{
  "text": "5\n",
  "finish": {
    "type": "stop"
  },
  "matched_stop": 21
}
NOTE: `matched_stop` is integer 21 (token ID for "6") — known limitation L1.

vLLM result:

{
  "text": "5\n",
  "finish": {
    "type": "stop"
  },
  "matched_stop": "6"
}

4.1.2 Multi-token string stop

MLX:

curl http://localhost:3000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-0.6B-4bit",
    "text": "Repeat: 1 2 hello world 3 4 5",
    "sampling_params": {"stop": ["hello world"], "max_new_tokens": 50}
  }' | jq .

vLLM:

curl http://localhost:3000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "text": "Repeat: 1 2 hello world 3 4 5",
    "sampling_params": {"stop": ["hello world"], "temperature": 0, "max_new_tokens": 50}
  }' | jq


Expected:

MLX (both revisions): HTTP 400 (different error codes)
vLLM: HTTP 200, matched_stop: "hello world"

MLX baseline result:

{
  "error": {
    "type": "Bad Request",
    "code": "build_request_failed",
    "message": "MLX backend does not support string stop sequences",
    "param": null
  }
}

MLX HEAD result:

{
  "error": {
    "type": "Bad Request",
    "code": "unsupported_stop_string",
    "message": "stop string \"hello world\" encodes to 2 tokens; MLX backend only supports single-token stop strings",
    "param": null
  }
}

vLLM result:

[
  {
    "text": "\n\nSure! Here's the text repeated three times:\n\n1. 2. ",
    "output_ids": [271, 39814, 0, 5692, 594, 279, 1467, 11504, 2326, 3039, 1447, 16, 13, 220, 17, 13, 23811, 1879],
    "meta_info": {
      "id": "gen-019e50a9-f23b-7231-80a6-72e76e9dca87",
      "finish_reason": {
        "type": "stop"
      },
      "prompt_tokens": 14,
      "weight_version": "default",
      "completion_tokens": 18,
      "cached_tokens": 0,
      "e2e_latency": 0.000112795,
      "matched_stop": "hello world"
    }
  }
]

4.1.3 stop_token_ids

MLX:

curl http://localhost:3000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-0.6B-4bit",
    "text": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
    "sampling_params": {"stop_token_ids": [20, 21], "max_new_tokens": 50}
  }' | jq 'if type == "array" then .[0] else . end | {text, matched_stop: .meta_info.matched_stop}'

vLLM:

curl http://localhost:3000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "text": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
    "sampling_params": {"stop_token_ids": [20, 21], "max_new_tokens": 50}
  }' | jq 'if type == "array" then .[0] else . end | {text, matched_stop: .meta_info.matched_stop}'

Expected:

MLX (both revisions): HTTP 200, text: "", matched_stop: 20
vLLM: HTTP 200, text: "", matched_stop: 20

MLX baseline result:

{
  "text": "",
  "matched_stop": 20
}

MLX HEAD result:

{
  "text": "",
  "matched_stop": 20
}

vLLM result:

{
  "text": "",
  "matched_stop": 20
}

4.2 Streaming

Known limitations (pre-existing, not regressions): In streaming mode the stop token is not
stripped from the final text chunk and matched_stop is absent from the final chunk.
Affects all inference backends (confirmed on vLLM). Non-streaming handles both correctly.
Fix committed; validation pending. See known limitations L2.
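
To make limitation L2 concrete, the sketch below shows the kind of post-processing a final streaming chunk would need. It is an illustration only and not the committed fix: `finalize_stream_text` is a hypothetical name, and it assumes the router already knows which stop string matched.

```python
from typing import Optional, Tuple


def finalize_stream_text(
    text: str, matched_stop: Optional[str]
) -> Tuple[str, Optional[str]]:
    """Strip a trailing stop string from the final chunk text, if present."""
    if matched_stop and text.endswith(matched_stop):
        return text[: -len(matched_stop)], matched_stop
    return text, matched_stop


# Final chunk text from the streaming results in 4.2.1, with stop string "6".
print(finalize_stream_text("5\n6", "6"))
```

With this applied, the streamed text would end at "5\n", which is what the non-streaming path already returns.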

4.2.1 Single-token string stop

MLX:

curl http://localhost:3000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-0.6B-4bit",
    "text": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
    "sampling_params": {"stop": ["6"], "max_new_tokens": 50},
    "stream": true
  }' | tail -1 | jq .

vLLM:

curl http://localhost:3000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "text": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
    "sampling_params": {"stop": ["6"], "max_new_tokens": 50},
    "stream": true
  }' | tail -1 | jq .

Expected:

MLX baseline: HTTP 400
MLX HEAD: SSE halts at "6"; stop token present in final text chunk, no matched_stop (known limitation L2)
vLLM: SSE halts at "6"; stop token present in final text chunk, no matched_stop (known limitation L2 — affects all backends)

MLX baseline result:

{
  "error": {
    "type": "Bad Request",
    "code": "build_request_failed",
    "message": "MLX backend does not support string stop sequences",
    "param": null
  }
}

MLX HEAD result:

{
  "text": "5\n6",
  "output_ids": [],
  "meta_info": {
    "id": "gen-019e51cc-33ce-75c3-9cd8-d37723a85cf2-0",
    "finish_reason": "stop",
    "prompt_tokens": 22,
    "weight_version": "default",
    "completion_tokens": 3,
    "cached_tokens": 0,
    "e2e_latency": 0.041989209
  },
  "index": 0
}
NOTE: L2 applies — stop token "6" present in text, `matched_stop` absent from streaming chunk. Consistent with vLLM behaviour.

vLLM result (before fix):

{
  "text": "5\n6",
  "output_ids": [21],
  "meta_info": {
    "id": "gen-019e4d91-4f89-7e12-8ea6-81b8a6cdfe93-0",
    "finish_reason": "stop",
    "prompt_tokens": 22,
    "completion_tokens": 3,
    "cached_tokens": 16,
    "e2e_latency": 0.02720481
  },
  "index": 0
}
NOTE: the command as documented (`| tail -1 | jq .`) captures the `data: [DONE]` line, which is not valid JSON. The result above was obtained with `grep "^data:" | grep -v "\[DONE\]" | tail -1 | sed 's/^data: //' | jq .` instead. The stop token "6" is present in the text (known limitation L2 applies to vLLM streaming /generate too), and matched_stop is absent from the streaming chunks.

vLLM result (after fix):

{
  "text": "5\n6",
  "output_ids": [
    21
  ],
  "meta_info": {
    "id": "gen-019e50c5-fbca-7141-9fc2-364e2c61b323-0",
    "finish_reason": "stop",
    "prompt_tokens": 22,
    "weight_version": "default",
    "completion_tokens": 3,
    "cached_tokens": 16,
    "e2e_latency": 0.027308119
  },
  "index": 0
}
NOTE: L2 not fixed by this commit — stop token still present in text, matched_stop still absent.

5. Harmony Chat (`/v1/chat/completions` + GPT-OSS) — BUG FIX + NEW for MLX

MLX model: mlx-community/gpt-oss-20b-MXFP4-Q4
vLLM model: openai/gpt-oss-20b

Setup: stop the regular-model worker and start the GPT-OSS model.

Harmony stop token behavior: Harmony does not strip the matched stop token from content.
The stop token appears at the end of content but generation halts — no further tokens produced.
Regular Chat excludes the stop string from the output — Harmony Chat does not.

matched_stop on MLX Harmony Chat: Raw integer token ID.
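
One way a raw token ID could be mapped back to the caller's stop string is a reverse lookup built when the stop strings are converted. This is a hedged sketch of that idea, not a description of how SMG does it on any path; `restore_matched_stop` and `id_to_stop` are hypothetical names.

```python
from typing import Dict, Optional, Union


def restore_matched_stop(
    matched: Optional[int], id_to_stop: Dict[int, str]
) -> Optional[Union[int, str]]:
    """Return the original stop string for a token ID, or the ID itself."""
    if matched is None:
        return None
    # IDs that came from stop_token_ids have no string form and pass through.
    return id_to_stop.get(matched, matched)


# Built at request time: stop string "6" was encoded to token 21.
id_to_stop = {21: "6"}

print(restore_matched_stop(21, id_to_stop))
print(restore_matched_stop(20, id_to_stop))
```

The first call yields the string "6"; the second yields the integer 20, since that ID was supplied directly through stop_token_ids.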

5.1 Non-streaming

5.1.1 Single-token string stop — was HTTP 400 at baseline, now fixed

MLX:

curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/gpt-oss-20b-MXFP4-Q4",
    "messages": [{"role": "user", "content": "Repeat: hi 1 2 3 4 5 6 7"}],
    "stop": ["6"],
    "stream": false,
    "max_tokens": 1400
  }' | jq .

vLLM:

curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-20b",
    "messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
    "stop": ["6"],
    "stream": false,
    "max_tokens": 100
  }' | jq '{finish_reason: .choices[0].finish_reason, matched_stop: .choices[0].matched_stop, content: .choices[0].message.content}'

Expected:

MLX baseline: HTTP 400
MLX HEAD: HTTP 200, matched_stop: 21 (integer), content ends with "6" (Harmony includes the stop token — known limitation L4)
vLLM: HTTP 200, matched_stop: "6" (string), content ends before "6"

MLX baseline result:

{
  "error": {
    "type": "Bad Request",
    "code": "invalid_request_parameters",
    "message": "Invalid request parameters: MLX backend does not support string stop sequences",
    "param": null
  }
}

MLX HEAD result:

{
  "id": "chatcmpl-019e51af-5e84-7b10-86f5-eec87710863e",
  "object": "chat.completion",
  "created": 1779486908,
  "model": "mlx-community/gpt-oss-20b-MXFP4-Q4",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "reasoning_content": "We need to interpret user request: \"Repeat: hi 1 2 3 4 5 6"
      },
      "finish_reason": "stop",
      "matched_stop": 21
    }
  ],
  "usage": {
    "prompt_tokens": 86,
    "completion_tokens": 26,
    "total_tokens": 112,
    "completion_tokens_details": {
      "reasoning_tokens": 24
    }
  },
  "system_fingerprint": "default"
}

NOTE: matched_stop is integer 21 (token ID for "6") — known limitation L1. Stop fired during reasoning content. Prompt changed to "Repeat: hi 1 2 3 4 5 6 7" and max_tokens raised to 1400 to ensure "6" appears within the reasoning budget.

vLLM result:

SKIPPED — vLLM Harmony pipeline requires a dedicated GPU with sufficient VRAM to load GPT-OSS models. Hardware not available for this test run.

5.1.2 Multi-token string stop

MLX:

curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/gpt-oss-20b-MXFP4-Q4",
    "messages": [{"role": "user", "content": "Say: hi there and hello world!"}],
    "stop": ["hello world"],
    "stream": false,
    "max_tokens": 100
  }' | jq .

vLLM:

curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-20b",
    "messages": [{"role": "user", "content": "Say: hi there and hello world!"}],
    "stop": ["hello world"],
    "stream": false,
    "max_tokens": 100
  }' | jq '{finish_reason: .choices[0].finish_reason, matched_stop: .choices[0].matched_stop}'

Expected:

MLX (both revisions): HTTP 400 (different error codes)
vLLM: HTTP 200, matched_stop: "hello world"

MLX baseline result:

{
  "error": {
    "type": "Bad Request",
    "code": "invalid_request_parameters",
    "message": "Invalid request parameters: MLX backend does not support string stop sequences",
    "param": null
  }
}

MLX HEAD result:

{
  "error": {
    "type": "Bad Request",
    "code": "unsupported_stop_string",
    "message": "stop string \"hello world\" encodes to 2 tokens; MLX backend only supports single-token stop strings",
    "param": null
  }
}

vLLM result:

SKIPPED — vLLM Harmony pipeline requires a dedicated GPU with sufficient VRAM to load GPT-OSS models. Hardware not available for this test run.

5.1.3 stop_token_ids

MLX:

curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/gpt-oss-20b-MXFP4-Q4",
    "messages": [{"role": "user", "content": "Repeat: hi 1 2 3 4 5 6 7"}],
    "stop_token_ids": [20, 21],
    "stream": false,
    "max_tokens": 1400
  }' | jq .

vLLM:

curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-20b",
    "messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
    "stop_token_ids": [20, 21],
    "stream": false,
    "max_tokens": 100
  }' | jq '{finish_reason: .choices[0].finish_reason, matched_stop: .choices[0].matched_stop}'

Expected:

MLX (both revisions): HTTP 200, matched_stop: 20
vLLM: HTTP 200, matched_stop: 20

MLX baseline result:

{
  "finish_reason": "stop",
  "matched_stop": 20
}

MLX HEAD result:

{
  "id": "chatcmpl-019e51b7-d14c-7ab2-934b-a5198c26dc00",
  "object": "chat.completion",
  "created": 1779487461,
  "model": "mlx-community/gpt-oss-20b-MXFP4-Q4",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "reasoning_content": "We need to interpret user request: \"Repeat: hi 1 2 3 4 5"
      },
      "finish_reason": "stop",
      "matched_stop": 20
    }
  ],
  "usage": {
    "prompt_tokens": 86,
    "completion_tokens": 24,
    "total_tokens": 110,
    "completion_tokens_details": {
      "reasoning_tokens": 22
    }
  },
  "system_fingerprint": "default"
}

NOTE: matched_stop is integer 20 (token ID for "5"), as expected for stop_token_ids. Stop fired during reasoning content. Prompt changed to "Repeat: hi 1 2 3 4 5 6 7" and max_tokens raised to 1400 to ensure the stop token appears within the reasoning budget.

vLLM result:

SKIPPED — vLLM Harmony pipeline requires a dedicated GPU with sufficient VRAM to load GPT-OSS models. Hardware not available for this test run.

5.2 Streaming

5.2.1 Single-token string stop

MLX:

curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/gpt-oss-20b-MXFP4-Q4",
    "messages": [{"role": "user", "content": "Repeat: hi 1 2 3 4 5 6 7"}],
    "stop": ["6"],
    "stream": true,
    "max_tokens": 1400
  }' | grep "^data:" | grep -v "\[DONE\]" | tail -2 | sed 's/^data: //' | jq .

vLLM:

curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-20b",
    "messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
    "stop": ["6"],
    "stream": true,
    "max_tokens": 100
  }' | grep "^data:" | grep -v "\[DONE\]" | tail -2 | sed 's/^data: //' | jq .

Expected:

MLX baseline: HTTP 400
MLX HEAD: SSE — final chunk finish_reason: "stop", matched_stop: 21 (integer)
vLLM: SSE — final chunk finish_reason: "stop", matched_stop: "6" (string)

MLX baseline result:

(no output — HTTP 400 before SSE stream; grep pipeline produces no matching lines)

MLX HEAD result:

// second-to-last chunk — stop token emitted in delta
{"id":"chatcmpl-019e51bb-31fd-7f41-982f-99215064cbde","object":"chat.completion.chunk","created":1779487683,"model":"mlx-community/gpt-oss-20b-MXFP4-Q4","system_fingerprint":"default","choices":[{"index":0,"delta":{"reasoning_content":"6"},"logprobs":null,"finish_reason":null}]}

// last chunk — terminal signal
{"id":"chatcmpl-019e51bb-31fd-7f41-982f-99215064cbde","object":"chat.completion.chunk","created":1779487683,"model":"mlx-community/gpt-oss-20b-MXFP4-Q4","system_fingerprint":"default","choices":[{"index":0,"delta":{"reasoning_content":null},"logprobs":null,"finish_reason":"stop","matched_stop":21}]}

NOTE: Second-to-last chunk emits the stop token "6" in reasoning_content; final chunk has finish_reason: "stop", matched_stop: 21 (integer — known limitation L1). Stop fired during reasoning content. Prompt changed to "Repeat: hi 1 2 3 4 5 6 7" and max_tokens raised to 1400 to ensure "6" appears within the reasoning budget.

vLLM result:

SKIPPED — vLLM Harmony pipeline requires a dedicated GPU with sufficient VRAM to load GPT-OSS models. Hardware not available for this test run.

6. Harmony Responses (`/v1/responses` + GPT-OSS) — NEW for MLX

MLX model: mlx-community/gpt-oss-20b-MXFP4-Q4
vLLM model: openai/gpt-oss-20b

Responses API: Only stop (string array) — no stop_token_ids field.

vLLM note: vLLM silently drops stop on Harmony Responses (upstream gap — stop: vec![] in
build_grpc_sampling_params_from_responses). MLX now handles this path where vLLM does not.

status: Responses API reports stop-sequence termination as "completed".

6.1 Non-streaming

6.1.1 Single-token string stop — was HTTP 400 at baseline, now fixed

MLX:

curl http://localhost:3000/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/gpt-oss-20b-MXFP4-Q4",
    "input": "Repeat: hi 1 2 3 4 5 6 7",
    "stop": ["6"],
    "max_output_tokens": 1400
  }' | jq .

vLLM:

curl http://localhost:3000/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-20b",
    "input": "Count from 1 to 10, one number per line",
    "stop": ["6"],
    "max_output_tokens": 100
  }' | jq '{status, output_text: (.output[] | select(.type == "message") | .content[] | select(.type == "output_text") | .text)}'

Expected:

MLX baseline: HTTP 400
MLX HEAD: HTTP 200, status: "completed", output ends at "6"
vLLM: ⚠️ HTTP 200 but stop silently dropped — model generates all 10 numbers

MLX baseline result:

{
  "error": {
    "type": "Bad Request",
    "code": "invalid_request_parameters",
    "message": "Invalid request parameters: MLX backend does not support string stop sequences",
    "param": null
  }
}

MLX HEAD result:

{
  "id": "responses-019e51bd-9929-7720-9e21-63c3efb35f01",
  "object": "response",
  "created_at": 1779487840,
  "status": "completed",
  "max_output_tokens": 1400,
  "model": "mlx-community/gpt-oss-20b-MXFP4-Q4",
  "output": [
    {
      "type": "reasoning",
      "id": "reasoning_responses-019e51bd-9929-7720-9e21-63c3efb35f01",
      "content": [
        {
          "type": "reasoning_text",
          "text": "We need to interpret user request. They say: \"Repeat: hi 1 2 3 4 5 6"
        }
      ],
      "status": "completed"
    }
  ],
  "parallel_tool_calls": true,
  "store": true,
  "temperature": 1.0,
  "tool_choice": "auto",
  "tools": [],
  "usage": {
    "input_tokens": 82,
    "output_tokens": 29,
    "total_tokens": 111,
    "output_tokens_details": {
      "reasoning_tokens": 27
    }
  },
  "metadata": {}
}

NOTE: status: "completed", reasoning content stops at "6" — stop sequence fired correctly. No matched_stop field in Responses API response. Stop fired during reasoning before any message output block was produced. Prompt changed to "Repeat: hi 1 2 3 4 5 6 7" and max_output_tokens raised to 1400.

vLLM result:

(not captured; expected full 1–10 output, which would confirm the vLLM gap)

6.1.2 Multi-token string stop

MLX:

curl http://localhost:3000/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/gpt-oss-20b-MXFP4-Q4",
    "input": "Say: hi there and hello world!",
    "stop": ["hello world"],
    "max_output_tokens": 100
  }' | jq .

vLLM:

curl http://localhost:3000/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-20b",
    "input": "Say: hi there and hello world!",
    "stop": ["hello world"],
    "max_output_tokens": 100
  }' | jq '{status}'

Expected:

MLX (both revisions): HTTP 400 (different error codes)
vLLM: ⚠️ HTTP 200, stop silently dropped

MLX baseline result:

{
  "error": {
    "type": "Bad Request",
    "code": "invalid_request_parameters",
    "message": "Invalid request parameters: MLX backend does not support string stop sequences",
    "param": null
  }
}

MLX HEAD result:

{
  "error": {
    "type": "Bad Request",
    "code": "unsupported_stop_string",
    "message": "stop string \"hello world\" encodes to 2 tokens; MLX backend only supports single-token stop strings",
    "param": null
  }
}

vLLM result:

(not captured; expected HTTP 200 with full output, which would confirm the vLLM gap)

6.2 Streaming

6.2.1 Single-token string stop

MLX:

curl http://localhost:3000/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/gpt-oss-20b-MXFP4-Q4",
    "input": "Repeat: hi 1 2 3 4 5 6 7",
    "stop": ["6"],
    "max_output_tokens": 1400,
    "stream": true
  }' | grep "^data:" | grep -v "\[DONE\]" | tail -2 | sed 's/^data: //' | jq .

vLLM:

curl http://localhost:3000/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-20b",
    "input": "Count from 1 to 10, one number per line",
    "stop": ["6"],
    "max_output_tokens": 100,
    "stream": true
  }' | grep "^data:" | grep -v "\[DONE\]" | tail -2 | sed 's/^data: //' | jq .

Expected:

MLX baseline: HTTP 400
MLX HEAD: SSE — response.content_part.done with text ending at "6", then response.completed
vLLM: ⚠️ SSE — stop silently dropped, output contains all 10 numbers

MLX baseline result:

{
  "type": "error",
  "code": "pipeline_error",
  "message": "Pipeline execution failed: Response { status: 400, version: HTTP/1.1, headers: {\"content-type\": \"application/json\", \"x-smg-error-code\": \"invalid_request_parameters\"}, body: Body(UnsyncBoxBody) }",
  "param": null,
  "sequence_number": 2
}
NOTE: Responses API streaming emits the 400 as an SSE error event, so grep captures it (unlike other streaming endpoints).
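
Because the failure arrives inside the stream on this endpoint, a scripted check has to inspect event types rather than rely on the HTTP status alone. A minimal sketch, assuming the event shape shown above; `find_stream_error` is a hypothetical helper and the sample message is abbreviated.

```python
import json
from typing import Iterable, Optional


def find_stream_error(lines: Iterable[str]) -> Optional[dict]:
    """Return the first SSE event whose type is "error", or None."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            continue
        event = json.loads(payload)
        if event.get("type") == "error":
            return event
    return None


# Abbreviated stream modeled on the baseline result above.
sample = [
    'data: {"type": "response.created", "sequence_number": 0}',
    'data: {"type": "error", "code": "pipeline_error", "message": "Pipeline execution failed", "sequence_number": 2}',
]

error = find_stream_error(sample)
print(error["code"] if error else "no error event")
```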

MLX HEAD result:

// second-to-last event — reasoning item finalised
{"type":"response.output_item.done","sequence_number":3,"output_index":0,"item":{"id":"rs_019e51c553807dd0a7e580e6714b69d4","type":"reasoning","summary":[],"content":null,"encrypted_content":null,"status":null}}

// last event — response completed
{"type":"response.completed","sequence_number":4,"response":{"id":"resp_019e51c5-48df-7043-a6a2-b933db7dca73","object":"response","created_at":1779488344,"status":"completed","model":"mlx-community/gpt-oss-20b-MXFP4-Q4","output":[{"id":"rs_019e51c553807dd0a7e580e6714b69d4","type":"reasoning","summary":[],"content":null,"encrypted_content":null,"status":null}],"usage":{"input_tokens":82,"output_tokens":29,"total_tokens":111},"max_output_tokens":1400,"temperature":1.0,"parallel_tool_calls":true,"store":true,"tools":[],"metadata":{},"tool_choice":"auto"}}

NOTE: status: "completed", output_tokens: 29 matches the non-streaming 6.1.1 result — stop fired correctly. No matched_stop in the Responses API streaming events (known limitation L10). Reasoning content is not surfaced in the final response.output_item.done event (content: null). Prompt changed to "Repeat: hi 1 2 3 4 5 6 7" and max_output_tokens raised to 1400.

vLLM result:

(not captured; expected full 1–10 output, which would confirm the vLLM gap)

Quick test matrix

Symbol key

Symbol	Meaning
✅	HTTP 200, correct behavior
❌	HTTP 4xx/5xx
⚠️	HTTP 200 but incorrect behavior

String stop tests — core before/after comparison

#	Pipeline	Mode	Stop	MLX baseline	MLX HEAD	vLLM
1.1.1	Regular Chat	non-stream	single-token `"6"`	❌ 400	✅ `matched_stop:"6"`	✅
1.1.2	Regular Chat	non-stream	multi-token `"hello world"`	❌ 400	❌ 400 (different error code)	✅
1.1.4	Regular Chat	non-stream	`"5"` + ids `[21]`	❌ 400	✅ `matched_stop:"5"`	✅
1.2.1	Regular Chat	stream	single-token `"6"`	❌ 400	✅ SSE `matched_stop:"6"`	✅
1.2.4	Regular Chat	stream	`"5"` + ids `[21]`	❌ 400	✅ `matched_stop:"5"`	✅
2.1.1	Regular Completion	non-stream	single-token `"6"`	❌ 400	✅ `matched_stop:"6"`	✅
2.1.4	Regular Completion	non-stream	`"5"` + ids `[21]`	❌ 400	✅ `matched_stop:"5"`	✅
2.2.1	Regular Completion	stream	single-token `"6"`	❌ 400	✅ (no `matched_stop`)	✅
3.1.1	Regular Messages	non-stream	single-token `"6"`	❌ 400	✅ `stop_sequence:"6"`	✅
3.2.1	Regular Messages	stream	single-token `"6"`	❌ 400	✅ SSE `stop_sequence:"6"`	✅
4.1.1	Regular Generate	non-stream	single-token `"6"`	❌ 400	✅ `matched_stop:21` (int)	✅ (string)
4.2.1	Regular Generate	stream	single-token `"6"`	❌ 400	✅ (stop token in text, no `matched_stop`)	✅
5.1.1	Harmony Chat	non-stream	single-token `"6"`	❌ 400	✅ `matched_stop:21` (int)	✅ (string)
5.2.1	Harmony Chat	stream	single-token `"6"`	❌ 400	✅ SSE `matched_stop:21`	✅
6.1.1	Harmony Responses	non-stream	single-token `"6"`	❌ 400	✅ `status:"completed"`	⚠️ dropped
6.2.1	Harmony Responses	stream	single-token `"6"`	❌ 400	✅ SSE `response.completed`	⚠️ dropped

stop_token_ids regression — must pass at both revisions

#	Pipeline	Mode	Stop	MLX baseline	MLX HEAD
1.1.3	Regular Chat	non-stream	ids `[20,21]`	✅ `matched_stop:20`	✅ `matched_stop:20`
1.2.3	Regular Chat	stream	ids `[20,21]`	✅ `matched_stop:20`	✅ `matched_stop:20`
2.1.3	Regular Completion	non-stream	ids `[20,21]`	✅ `matched_stop:20`	✅ `matched_stop:20`
4.1.3	Regular Generate	non-stream	ids `[20,21]`	✅ `matched_stop:20`	✅ `matched_stop:20`
5.1.3	Harmony Chat	non-stream	ids `[20,21]`	✅ `matched_stop:20`	✅ `matched_stop:20`
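
The rule for this table (stop_token_ids must behave identically at both revisions) lends itself to a scripted comparison. A minimal sketch using the matched_stop values from the table; `find_regressions` and the dictionary layout are illustrative only.

```python
from typing import Dict, List, Optional


def find_regressions(
    baseline: Dict[str, Optional[int]], head: Dict[str, Optional[int]]
) -> List[str]:
    """Return the test IDs whose matched_stop differs between revisions."""
    return sorted(
        case for case, expected in baseline.items() if head.get(case) != expected
    )


# matched_stop per test ID, copied from the table above.
baseline = {"1.1.3": 20, "1.2.3": 20, "2.1.3": 20, "4.1.3": 20, "5.1.3": 20}
head = dict(baseline)

print(find_regressions(baseline, head))
```

An empty list means no stop_token_ids case changed between the baseline and HEAD.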

zach-li-sudo · 2026-05-25T01:28:21Z

Here's the e2e test PR for this feature: #1538

zach-li-sudo added 10 commits May 23, 2026 13:27

feat(mlx-grpc): support string stop sequences for chat and completion (…

5d47203

…lightseekorg#1099) Signed-off-by: Zhuo Li <zhuo.li.ca@outlook.com>

move mlx match_stop processing logic into proto wrapper

3c043ec

Signed-off-by: Zhuo Li <zhuo.li.ca@outlook.com>

fix double gated apply_mlx_stop_sequences: helper is unconditional an…

c9ea694

…d no-ops on non-MLX Signed-off-by: Zhuo Li <zhuo.li.ca@outlook.com>

fix silent encode-error: zero token and failed tokenizer throw 400 error

507099e

Signed-off-by: Zhuo Li <zhuo.li.ca@outlook.com>

fix: remove test-case dep and refactor unit tests in chat utils

54ad397

Signed-off-by: Zhuo Li <zhuo.li.ca@outlook.com>

use default values for CompletionStreamChoice fields

5aef640

Signed-off-by: Zhuo Li <zhuo.li.ca@outlook.com>

rebase before pushing

c0407c2

Signed-off-by: Zhuo Li <zhuo.li.ca@outlook.com>

fix fmt and clippy issues

379bfc0

Signed-off-by: Zhuo Li <zhuo.li.ca@outlook.com>

feat(mlx-grpc): support matched stop for generate/message and harmony…

fbaa571

… path MLX Signed-off-by: Zhuo Li <zhuo.li.ca@outlook.com>

fix rebase conflicts

8b62c54

Signed-off-by: Zhuo Li <zhuo.li.ca@outlook.com>

zach-li-sudo requested review from CatherineSue, key4ng and slin1237 as code owners May 23, 2026 21:11

github-actions Bot added labels May 23, 2026: tokenizer (Tokenizer related changes), grpc (gRPC client and router changes), protocols (Protocols crate changes), model-gateway (Model gateway crate changes)

zach-li-sudo mentioned this pull request May 23, 2026

feat(mlx-grpc): support string stop sequences for chat and completion #1447

Open

4 tasks

coderabbitai Bot approved these changes May 23, 2026

gemini-code-assist Bot reviewed May 23, 2026

zach-li-sudo changed the title ~~String stop sequence support for MLX on all 6 pipeline/path combinations~~ feat(mlx-grpc)String stop sequence support for MLX on all 6 pipeline/path combinations May 23, 2026

zach-li-sudo changed the title ~~feat(mlx-grpc)String stop sequence support for MLX on all 6 pipeline/path combinations~~ feat(mlx-grpc): String stop sequence support for MLX on all 6 pipeline/path combinations May 23, 2026

zach-li-sudo added 4 commits May 23, 2026 14:46

refactor(proto_wrapper): restrict matched_stop_json to pub(crate) to …

03e5866

…prevent misuse Signed-off-by: Zhuo Li <zhuo.li.ca@outlook.com>

fix(harmony): remove redundant tokenizer assignment and document on-d…

2aef1f5

…emand loading Signed-off-by: Zhuo Li <zhuo.li.ca@outlook.com>

minor refactor

a748c56

Signed-off-by: Zhuo Li <zhuo.li.ca@outlook.com>

fix(streaming) add missing matched stop field in final sse chuck

6e76676

Signed-off-by: Zhuo Li <zhuo.li.ca@outlook.com>

zach-li-sudo force-pushed the stream-all-backend branch from cd5ebaa to 6e76676 May 23, 2026 21:48

chatgpt-codex-connector Bot reviewed May 23, 2026

zach-li-sudo mentioned this pull request May 25, 2026

test(e2e): add basic string stop and matched stop test coverage for MLX #1538

Open

4 tasks

feat(mlx-grpc): String stop sequence support for MLX on all 6 pipeline/path combinations #1524
zach-li-sudo wants to merge 14 commits into lightseekorg:main from zach-li-sudo:stream-all-backend

zach-li-sudo commented May 24, 2026

MLX Stop Sequence Support: Full Pipeline Test Guide

Scope

Revision comparison

Switch between builds

Setup

MLX (Apple Silicon only)

vLLM

Gateway (same for both backends)

Smoke test

Token reference (Qwen tokenizer — shared by Qwen3-0.6B and GPT-OSS)

Baseline quick-check

MLX

vLLM

1. Regular Chat (/v1/chat/completions)

1.1 Non-streaming

1.1.1 Single-token string stop ("stop": ["6"])

1.1.2 Multi-token string stop ("stop": ["hello world"])

1.1.3 stop_token_ids ([20, 21])

1.1.4 String + stop_token_ids ("stop": ["5"], "stop_token_ids": [21])

1.1.5 Multi-token string + stop_token_ids ("stop": ["hello world"], "stop_token_ids": [20])

1.2 Streaming

1.2.1 Single-token string stop

1.2.2 Multi-token string stop

1.2.3 stop_token_ids

1.2.4 String + stop_token_ids ("stop": ["5"], "stop_token_ids": [21])

1.2.5 Multi-token string + stop_token_ids ("stop": ["hello world"], "stop_token_ids": [20])

2. Regular Completion (/v1/completions)

2.1 Non-streaming

2.1.1 Single-token string stop

2.1.2 Multi-token string stop

2.1.3 stop_token_ids

2.1.4 String + stop_token_ids ("stop": ["5"], "stop_token_ids": [21])

2.1.5 Multi-token string + stop_token_ids ("stop": ["hello world"], "stop_token_ids": [20])

2.2 Streaming

2.2.1 Single-token string stop

2.2.2 Multi-token string stop

2.2.3 stop_token_ids

2.2.4 String + stop_token_ids ("stop": ["5"], "stop_token_ids": [21])

2.2.5 Multi-token string + stop_token_ids ("stop": ["hello world"], "stop_token_ids": [20])

3. Regular Messages (/v1/messages) — NEW for MLX

4. Regular Generate (`/generate`) — NEW for MLX

5. Harmony Chat (`/v1/chat/completions` + GPT-OSS) — BUG FIX + NEW for MLX

6. Harmony Responses (`/v1/responses` + GPT-OSS) — NEW for MLX
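
The scenarios listed in this guide (single-token string stop, multi-token string stop, `stop_token_ids`, and the combined cases) all exercise the conversion described in the PR: stop strings are turned into stop token IDs before being passed to the MLX backend, and a string that does not map to exactly one token is rejected with `unsupported_stop_string`. The following is a minimal sketch of that idea, not the PR's actual code. The names `resolve_stop_token_ids`, `UnsupportedStopString`, and `toy_encode` are hypothetical, the toy vocabulary (`"5"` → 20, `"6"` → 21) is chosen only to echo the IDs used in the test cases, and rejecting the whole request when any string is multi-token is an assumption about the combined cases.

```rust
/// Hypothetical error standing in for the `unsupported_stop_string` rejection.
#[derive(Debug, PartialEq)]
struct UnsupportedStopString(String);

/// Toy stand-in for a real tokenizer (assumption, for illustration only):
/// "5" and "6" each encode to one token; any other text encodes to one
/// token per whitespace-separated word.
fn toy_encode(text: &str) -> Vec<u32> {
    match text {
        "5" => vec![20],
        "6" => vec![21],
        other => other
            .split_whitespace()
            .map(|word| 1000 + word.len() as u32)
            .collect(),
    }
}

/// Merge explicit stop token IDs with IDs derived from stop strings.
/// A stop string is accepted only if it encodes to exactly one token.
fn resolve_stop_token_ids<F>(
    encode: F,
    stop: &[&str],
    stop_token_ids: &[u32],
) -> Result<Vec<u32>, UnsupportedStopString>
where
    F: Fn(&str) -> Vec<u32>,
{
    // Start from the caller-supplied token IDs.
    let mut ids: Vec<u32> = stop_token_ids.to_vec();
    for &s in stop {
        let encoded = encode(s);
        match encoded.as_slice() {
            // Exactly one token: usable as a stop token ID (skip duplicates).
            [single] => {
                if !ids.contains(single) {
                    ids.push(*single);
                }
            }
            // Zero or several tokens: not expressible as a single stop token.
            _ => return Err(UnsupportedStopString(s.to_string())),
        }
    }
    Ok(ids)
}

fn main() {
    match resolve_stop_token_ids(toy_encode, &["hello world"], &[20]) {
        Ok(ids) => println!("stop token ids: {ids:?}"),
        Err(UnsupportedStopString(s)) => println!("unsupported_stop_string: {s:?}"),
    }
}
```

Under these assumptions, a request with `"stop": ["5"]` and `"stop_token_ids": [21]` resolves to the merged set of both IDs, while any request containing `"hello world"` is rejected before reaching the backend.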