feat(mlx-grpc): String stop sequence support for MLX on all 6 pipeline/path combinations #1524

Open
zach-li-sudo wants to merge 14 commits into lightseekorg:main from zach-li-sudo:stream-all-backend

@zach-li-sudo zach-li-sudo commented May 23, 2026

Description

Problem

Follow-up to PR #1447.

Major feat: String stop sequence support for MLX on all 6 pipeline/path combinations
Minor fix: add the missing matched_stop field to streaming regular completion responses on all backends (vLLM, MLX, etc.)

| # | Pipeline | API path | Change |
|---|----------|----------|--------|
| 1 | Regular Chat | /v1/chat/completions | string stop: was HTTP 400 → now single-token supported |
| 2 | Regular Completion | /v1/completions | string stop: was HTTP 400 → now single-token supported |
| 3 | Regular Messages | /v1/messages | string stop: was HTTP 400 → now single-token supported |
| 4 | Regular Generate | /generate | string stop: was HTTP 400 → now single-token supported |
| 5 | Harmony Chat | /v1/chat/completions + GPT-OSS | string stop: was HTTP 400 → now single-token supported |
| 6 | Harmony Responses | /v1/responses + GPT-OSS | string stop: was HTTP 400 → now single-token supported |

Solution

Major feat: convert stop strings into stop token ids, then pass to MLX backend
Minor fix: add matched_stop field in the last stream chunks
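
As a rough illustration of the conversion step, the sketch below encodes each stop string and accepts it only if it maps to exactly one token. `ToyTokenizer`, its vocabulary, and `to_single_token_stop_ids` are hypothetical stand-ins, not the gateway's tokenizer or the helper added in this PR. Only the token IDs for "5", "6", and "\n" and the wording of the error message are taken from the test results in this thread.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a real tokenizer: a tiny fixed vocabulary
/// with greedy longest-prefix encoding. The IDs 100 and 101 are made up.
struct ToyTokenizer {
    vocab: HashMap<&'static str, u32>,
}

impl ToyTokenizer {
    fn new() -> Self {
        let vocab = HashMap::from([
            ("5", 20),
            ("6", 21),
            ("\n", 198),
            ("hello", 100),
            (" world", 101),
        ]);
        Self { vocab }
    }

    /// Encode by repeatedly taking the longest vocabulary entry that
    /// prefixes the remaining text; fail on text outside the vocabulary.
    fn encode(&self, text: &str) -> Result<Vec<u32>, String> {
        let mut ids = Vec::new();
        let mut rest = text;
        while !rest.is_empty() {
            let best = self
                .vocab
                .iter()
                .filter(|(piece, _)| rest.starts_with(**piece))
                .max_by_key(|(piece, _)| piece.len());
            match best {
                Some((piece, id)) => {
                    ids.push(*id);
                    rest = &rest[piece.len()..];
                }
                None => return Err(format!("cannot encode {rest:?}")),
            }
        }
        Ok(ids)
    }
}

/// Convert stop strings to stop token IDs, rejecting any string that
/// does not encode to exactly one token.
fn to_single_token_stop_ids(tok: &ToyTokenizer, stops: &[&str]) -> Result<Vec<u32>, String> {
    let mut ids = Vec::with_capacity(stops.len());
    for s in stops {
        let enc = tok.encode(s)?;
        if enc.len() != 1 {
            return Err(format!(
                "stop string {s:?} encodes to {} tokens; MLX backend only supports single-token stop strings",
                enc.len()
            ));
        }
        ids.push(enc[0]);
    }
    Ok(ids)
}
```

With this toy vocabulary, `["6"]` converts to `[21]`, while `["hello world"]` is rejected because it encodes to two tokens.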

Changes

see diff

Test Plan

  1. Unit tests for newly added helper functions for string/token id conversion
  2. Deployed MLX gRPC + SMG with regular model (Qwen3-4B) and GPT-OSS harmony model (GPT-OSS-20B) and tested with the following scenarios:

1. MLX string stop sequence support (all 6 pipeline/path combinations)

| # | Pipeline | Path | Stop input | Result |
|---|----------|------|------------|--------|
| 1 | Regular Chat | /v1/chat/completions | "stop": ["6"] (single-token) | HTTP 200, matched_stop: "6" |
| 1 | Regular Chat | /v1/chat/completions | "stop": ["hello world"] (multi-token) | HTTP 400, unsupported_stop_string |
| 1 | Regular Chat | /v1/chat/completions | "stop_token_ids": [20, 21] | HTTP 200, matched_stop: 20 |
| 2 | Regular Completion | /v1/completions | "stop": ["6"] (single-token) | HTTP 200, matched_stop: "6" |
| 2 | Regular Completion | /v1/completions | "stop": ["hello world"] (multi-token) | HTTP 400, unsupported_stop_string |
| 2 | Regular Completion | /v1/completions | "stop_token_ids": [20, 21] | HTTP 200, matched_stop: 20 |
| 3 | Regular Messages | /v1/messages | "stop_sequences": ["6"] (single-token) | HTTP 200, stop_sequence: "6" |
| 3 | Regular Messages | /v1/messages | "stop_sequences": ["hello world"] (multi-token) | HTTP 400, unsupported_stop_string |
| 4 | Regular Generate | /generate | "stop": ["6"] (single-token) | HTTP 200, matched_stop: 21 ¹ |
| 4 | Regular Generate | /generate | "stop": ["hello world"] (multi-token) | HTTP 400, unsupported_stop_string |
| 5 | Harmony Chat | /v1/chat/completions | "stop": ["6"] (single-token) | HTTP 200, matched_stop: 21 ¹ |
| 5 | Harmony Chat | /v1/chat/completions | "stop_token_ids": [20] | HTTP 200, matched_stop: 20 |
| 6 | Harmony Responses | /v1/responses | "stop": ["6"] (single-token) | HTTP 200, stop fires correctly ² |

¹ matched_stop on /generate and Harmony paths returns the raw token ID integer, not the original string. The tokenizer is lazy-loaded into the pipeline context (ctx.state.tokenizer) during request building to convert stop strings → token IDs, but the response processors on these paths do not receive the pipeline context and therefore cannot reverse-map the token ID back to the original string.
² Harmony Responses API has no top-level matched_stop field; correct stop is confirmed via status: "completed".

2. Streaming

matched_stop was previously absent from all streaming /v1/completions chunks on every backend; this is now fixed. Streaming stop support on the other paths is new and MLX-only.

| # | Pipeline | Path | Backend | Stop input | Result |
|---|----------|------|---------|------------|--------|
| 1 | Regular Chat | /v1/chat/completions | MLX | "stop": ["6"] (single-token) | ✅ final chunk matched_stop: "6" |
| 1 | Regular Chat | /v1/chat/completions | MLX | "stop_token_ids": [20, 21] | matched_stop: 20 |
| 2 | Regular Completion | /v1/completions | MLX | "stop": ["6"] | matched_stop: "6" (was missing) |
| 2 | Regular Completion | /v1/completions | MLX | "stop_token_ids": [20, 21] | matched_stop: 20 (was missing) |
| 2 | Regular Completion | /v1/completions | MLX | "stop": ["5"] + "stop_token_ids": [21] | matched_stop: "5" (was missing) |
| 2 | Regular Completion | /v1/completions | vLLM | "stop": ["6"] | matched_stop: "6" (was missing) |
| 2 | Regular Completion | /v1/completions | vLLM | "stop_token_ids": [20, 21] | matched_stop: 20 (was missing) |
| 2 | Regular Completion | /v1/completions | vLLM | "stop": ["5"] + "stop_token_ids": [21] | matched_stop: "5" (was missing) |
| 3 | Regular Messages | /v1/messages | MLX | "stop_sequences": ["6"] (single-token) | message_delta with stop_sequence: "6" |
| 5 | Harmony Chat | /v1/chat/completions | MLX | "stop": ["6"] (single-token) | ✅ final chunk matched_stop: 21 ¹ |

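
The streaming fix for /v1/completions can be pictured roughly as follows: when the local stop decoder fires on a chunk, the gateway holds back the finish reason and emits it only with the backend's final event, which carries the matched stop. This is a simplified sketch under assumptions. `BackendEvent`, `StreamChoice`, `MatchedStop`, and `to_stream_choices` are hypothetical types and names, not the ones in `streaming.rs`.

```rust
/// Hypothetical user-facing matched-stop value: a string or a raw token ID.
#[derive(Debug, Clone, PartialEq)]
enum MatchedStop {
    Str(String),
    TokenId(u32),
}

/// Hypothetical backend events: text chunks followed by one final event.
enum BackendEvent {
    Chunk { text: String, local_stop_hit: bool },
    Complete { finish_reason: String, matched_stop: Option<MatchedStop> },
}

/// Hypothetical streamed choice; deriving Default keeps construction short.
#[derive(Debug, Default, PartialEq)]
struct StreamChoice {
    text: String,
    finish_reason: Option<String>,
    matched_stop: Option<MatchedStop>,
}

/// Turn backend events into streamed choices. The finish reason is never
/// attached to a Chunk, even if the local stop decoder fired there; it is
/// deferred to the Complete event so matched_stop can travel with it.
fn to_stream_choices(events: Vec<BackendEvent>) -> Vec<StreamChoice> {
    let mut out = Vec::new();
    let mut stopped_locally = false;
    for ev in events {
        match ev {
            BackendEvent::Chunk { text, local_stop_hit } => {
                if stopped_locally {
                    continue; // drop any text produced after the local stop
                }
                if !text.is_empty() {
                    out.push(StreamChoice { text, ..Default::default() });
                }
                stopped_locally = local_stop_hit;
            }
            BackendEvent::Complete { finish_reason, matched_stop } => {
                out.push(StreamChoice {
                    finish_reason: Some(finish_reason),
                    matched_stop,
                    ..Default::default()
                });
            }
        }
    }
    out
}
```

In this sketch only the last emitted choice has a `finish_reason`, and it is the same choice that carries `matched_stop`.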
Checklist
  • cargo +nightly fmt passes
  • cargo clippy --all-targets --all-features -- -D warnings passes
  • (Optional) Documentation updated
  • (Optional) Please join us on Slack #sig-smg to discuss, review, and merge PRs

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • String stop sequences are now fully supported on the MLX backend
    • Matched stop sequence reporting in API responses now accurately reflects user-provided stop conditions across chat, completion, and messages endpoints
  • Tests

    • Enhanced mock tokenizer with failure simulation capabilities

Signed-off-by: Zhuo Li <zhuo.li.ca@outlook.com>
@github-actions (bot) added labels on May 23, 2026: tokenizer (Tokenizer related changes), grpc (gRPC client and router changes), protocols (Protocols crate changes), model-gateway (Model gateway crate changes)
coderabbitai Bot commented May 23, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 4154af76-213d-404e-9427-e6ae4805a57b

📥 Commits

Reviewing files that changed from the base of the PR and between cd5ebaa and 6e76676.

📒 Files selected for processing (4)
  • model_gateway/src/routers/grpc/harmony/stages/request_building.rs
  • model_gateway/src/routers/grpc/proto_wrapper.rs
  • model_gateway/src/routers/grpc/regular/processor.rs
  • model_gateway/src/routers/grpc/regular/streaming.rs

📝 Walkthrough

Adds MLX string stop support: tokenizes user stop strings into MLX stop_token_ids, resolves MLX matched-stop token IDs back into user-facing values via request context and tokenizer, wires this into request builders and response/streaming paths, and removes legacy MLX stop-string rejection.

Changes

MLX Stop Sequence Processing Pipeline

| Layer | File(s) | Summary |
|-------|---------|---------|
| Stop conversion and resolution utilities | model_gateway/src/routers/grpc/utils/chat_utils.rs, model_gateway/src/routers/grpc/utils/mod.rs | stop_strings_to_token_ids, resolve_mlx_matched_stop_json, and resolve_mlx_stop_ids convert stop strings to single-token IDs, map matched MLX token IDs back to user JSON (string preferred), and validate tokenizer availability; includes unit tests and HTTP 400 error mapping. |
| Proto wrapper context-aware matching | model_gateway/src/routers/grpc/proto_wrapper.rs, model_gateway/src/routers/grpc/regular/processor.rs | Introduces matched_stop_json_with_context(...) that resolves MLX matched-stop token IDs using stop strings/stop_token_ids and a tokenizer; processor paths now call this method for chat, messages, and completion responses. |
| Request-building integration | model_gateway/src/routers/grpc/common/stages/helpers.rs, model_gateway/src/routers/grpc/harmony/stages/request_building.rs, model_gateway/src/routers/grpc/regular/stages/chat/request_building.rs, model_gateway/src/routers/grpc/regular/stages/completion/request_building.rs, model_gateway/src/routers/grpc/regular/stages/generate/request_building.rs, model_gateway/src/routers/grpc/regular/stages/messages/request_building.rs | Adds apply_mlx_stop_sequences to tokenize optional string stops and append token IDs to MLX sampling_params.stop_token_ids; chat, completion, generate, messages, and Harmony builders call this helper using the cached tokenizer in context. |
| Completion streaming finalization | model_gateway/src/routers/grpc/regular/streaming.rs | Defers final finish_reason emission when local stop decoder fires in Chunk events so the subsequent Complete event can include backend matched_stop_json_with_context(); simplifies CompletionStreamChoice construction using Default. |
| Legacy cleanup and testing support | crates/grpc_client/src/mlx_engine.rs, crates/protocols/src/completion.rs, crates/tokenizer/src/mock.rs | Removes reject_stop_strings() checks/TODO from MLX engine builders. CompletionStreamChoice derives Default. MockTokenizer adds fail_encode: bool and failing() for negative tokenization tests. |
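
To make the "string preferred" resolution in the walkthrough concrete, here is a minimal sketch of mapping a matched token ID back to a user-facing value. It is not the PR's `resolve_mlx_matched_stop_json`: the names `MatchedStop`, `resolve_matched_stop`, and `toy_encode` are hypothetical, and falling back to the raw token ID when nothing matches is an assumption made for this example.

```rust
/// Hypothetical user-facing matched-stop value.
#[derive(Debug, PartialEq)]
enum MatchedStop {
    Str(String),
    TokenId(u32),
}

/// Hypothetical encoder covering only the token IDs quoted in this thread.
fn toy_encode(text: &str) -> Option<Vec<u32>> {
    match text {
        "5" => Some(vec![20]),
        "6" => Some(vec![21]),
        "\n" => Some(vec![198]),
        _ => None,
    }
}

/// Resolve the token ID reported by the backend into what the user sent.
/// A stop string that encodes to exactly this token wins over a numeric
/// stop_token_ids entry; otherwise the raw ID is returned (assumption).
fn resolve_matched_stop(
    matched: u32,
    stop_strings: &[&str],
    stop_token_ids: &[u32],
    encode: impl Fn(&str) -> Option<Vec<u32>>,
) -> MatchedStop {
    for &s in stop_strings {
        if let Some(ids) = encode(s) {
            if ids.len() == 1 && ids[0] == matched {
                return MatchedStop::Str(s.to_string());
            }
        }
    }
    if stop_token_ids.contains(&matched) {
        return MatchedStop::TokenId(matched);
    }
    // Assumption for this sketch: surface the raw ID if nothing matches.
    MatchedStop::TokenId(matched)
}
```

For example, with `"stop": ["5"]` plus `"stop_token_ids": [21]` and a backend match on token 20, the sketch returns the string "5", which mirrors the combined-input rows in the streaming table.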

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested labels

tests

Suggested reviewers

  • CatherineSue
  • key4ng
  • slin1237

Poem

🐰 In tunnels of code I hop and sing,
Stops once banned now wear a ring.
Tokenize, resolve, stitch the flow,
From request to finish—matched stops show.
A tiny hop, a testing spring.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|------------|--------|-------------|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly and specifically describes the main feature: string stop sequence support for MLX across six pipeline/path combinations, matching the core changeset. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


mergify Bot commented May 23, 2026

Hi @zach-li-sudo, the DCO sign-off check has failed. All commits must include a Signed-off-by line.

To fix existing commits:

# Sign off the last N commits (replace N with the number of unsigned commits)
git rebase HEAD~N --signoff
git push --force-with-lease

To sign off future commits automatically:

  • Use git commit -s every time, or
  • VSCode: enable Git: Always Sign Off in Settings
  • PyCharm: enable Sign-off commit in the Commit tool window


@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request enables support for string stop sequences in the MLX backend by tokenizing them into single-token IDs during the request preparation stage. It also introduces logic to map the matched stop token ID back to its original string or numeric representation in API responses for both regular and streaming workflows. Feedback was provided regarding the efficiency of tokenizing stop strings within the response processing loop, suggesting that pre-tokenizing or caching these values could improve performance in high-throughput scenarios.

// Check stop strings first: find the string that tokenizes to this single token.
if let Some(stop_strings) = stop {
for s in stop_strings.iter() {
if let Ok(enc) = tokenizer.encode(s, false) {

medium

Tokenizing stop strings in a loop for every completion response can be inefficient, especially in high-throughput scenarios. While the number of stop sequences is typically small (OpenAI limits to 4), consider pre-tokenizing these strings during the request building stage and passing the mapping down to the response processor, or at least caching the results if the tokenizer is shared.
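
One way to picture the suggestion above is a lookup table built once during request building and consulted when the response arrives, so no stop string is re-encoded per response. This is only a sketch of the idea, not code from the PR. `StopLookup` and its methods are hypothetical names, and the `encode` closure stands in for whatever tokenizer call the gateway actually uses.

```rust
use std::collections::HashMap;

/// Hypothetical per-request table from stop token ID to the original
/// stop string, built once and reused by the response processor.
struct StopLookup {
    by_token: HashMap<u32, String>,
}

impl StopLookup {
    /// Encode each stop string exactly once; keep only single-token
    /// results. If two strings map to the same token, the first wins.
    fn build(stops: &[&str], encode: impl Fn(&str) -> Option<Vec<u32>>) -> Self {
        let mut by_token = HashMap::new();
        for &s in stops {
            if let Some(ids) = encode(s) {
                if let [id] = ids.as_slice() {
                    by_token.entry(*id).or_insert_with(|| s.to_string());
                }
            }
        }
        Self { by_token }
    }

    /// Constant-time reverse lookup with no tokenizer call.
    fn resolve(&self, matched: u32) -> Option<&str> {
        self.by_token.get(&matched).map(String::as_str)
    }
}
```

The trade-off is that the table has to be carried from the request stage to the response processor, which is the same plumbing the footnote on /generate and Harmony paths describes as missing.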

@zach-li-sudo zach-li-sudo changed the title String stop sequence support for MLX on all 6 pipeline/path combinations feat(mlx-grpc)String stop sequence support for MLX on all 6 pipeline/path combinations May 23, 2026
@zach-li-sudo zach-li-sudo changed the title feat(mlx-grpc)String stop sequence support for MLX on all 6 pipeline/path combinations feat(mlx-grpc): String stop sequence support for MLX on all 6 pipeline/path combinations May 23, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6e76676189

Comment on lines 152 to 156
Self::reject_constraint(constraint.as_ref())?;
Self::reject_n(body.n)?;
Self::reject_stop_strings(body.stop.as_ref().is_some_and(|s| !s.is_empty()))?;
Self::reject_response_format(body.response_format.is_some())?;

let sampling_params = Self::build_sampling_params_from_chat(body);

P1: Reject unconverted stop strings in MLX request builders

Removing stop-string rejection in this builder allows direct callers to pass stop values that are never converted into sampling_params.stop_token_ids. That conversion now happens only in the pipeline stages, but some call paths still build requests directly (for example via GrpcClient::build_generate_request_from_chat in the Go policy binding), so MLX will silently ignore those string stops instead of enforcing them. This is a behavior regression from fail-fast (400) to silent no-op, which can produce longer-than-requested outputs and wrong stop semantics.
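
A possible shape for keeping the fail-fast behavior for direct callers is to make the builder distinguish raw stop strings from ones that have already been tokenized. The sketch below is hypothetical: `StopInput` and `mlx_stop_token_ids` do not exist in the PR, and the error text simply reuses the baseline message quoted elsewhere in this thread.

```rust
/// Hypothetical input type: the builder can tell whether stop strings
/// were already converted by a pipeline stage.
enum StopInput<'a> {
    Absent,
    /// Raw user strings that nobody has tokenized yet.
    Raw(&'a [String]),
    /// Token IDs produced by a conversion step upstream.
    Tokenized(Vec<u32>),
}

/// Build the final stop_token_ids list for an MLX request. Raw,
/// non-empty stop strings are rejected instead of being ignored.
fn mlx_stop_token_ids(explicit_ids: &[u32], stops: StopInput<'_>) -> Result<Vec<u32>, String> {
    let mut ids = explicit_ids.to_vec();
    match stops {
        StopInput::Absent => {}
        StopInput::Raw(raw) if raw.is_empty() => {}
        StopInput::Raw(_) => {
            return Err("MLX backend does not support string stop sequences".to_string());
        }
        StopInput::Tokenized(extra) => {
            // Append converted IDs, skipping ones the user already listed.
            for id in extra {
                if !ids.contains(&id) {
                    ids.push(id);
                }
            }
        }
    }
    Ok(ids)
}
```

With this shape, a pipeline stage passes `StopInput::Tokenized(...)`, while a direct caller that forwards raw strings still gets an error rather than a silent no-op.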


@zach-li-sudo

MLX Stop Sequence Support: Full Pipeline Test Guide

Branch: stream-all-backend
Companion doc: MLX-string-stop-all-paths.md (narrative/what changed)
Purpose: Pre-review before/after comparison — run against both the baseline revision and HEAD.


Scope

Six pipeline/path combinations. All string stop sequence support is new in this branch.

| # | Pipeline | API path | Change |
|---|----------|----------|--------|
| 1 | Regular Chat | /v1/chat/completions | string stop: was HTTP 400 → now single-token supported |
| 2 | Regular Completion | /v1/completions | string stop: was HTTP 400 → now single-token supported |
| 3 | Regular Messages | /v1/messages | string stop: was HTTP 400 → now single-token supported |
| 4 | Regular Generate | /generate | string stop: was HTTP 400 → now single-token supported |
| 5 | Harmony Chat | /v1/chat/completions + GPT-OSS | string stop: was HTTP 400 → now single-token supported |
| 6 | Harmony Responses | /v1/responses + GPT-OSS | string stop: was HTTP 400 → now single-token supported |

Revision comparison

| Build | Revision | String stop on MLX |
|-------|----------|--------------------|
| Baseline | 9a93938a | HTTP 400, invalid_request_parameters: "MLX backend does not support string stop sequences" |
| HEAD | current branch tip | Single-token accepted; multi-token: HTTP 400, unsupported_stop_string: "stop string \"…\" encodes to N tokens; MLX backend only supports single-token stop strings" |

Switch between builds

# Baseline
git checkout 9a93938a && cargo build

# HEAD
git checkout stream-all-backend && cargo build

Setup

MLX (Apple Silicon only)

Install Python deps once:

source .venv/bin/activate
pip install -e ./crates/grpc_client/python
pip install -e "./grpc_servicer[mlx]"

MLX worker — regular model (tests 1–4):

source .venv/bin/activate && python -m smg_grpc_servicer.mlx.server \
  --model mlx-community/Qwen3-0.6B-4bit --port 50051

MLX worker — Harmony model (tests 5–6; stop the regular worker first):

source .venv/bin/activate && python -m smg_grpc_servicer.mlx.server \
  --model mlx-community/gpt-oss-20b-MXFP4-Q4 --port 50051

vLLM

vLLM worker — regular model (tests 1–4):

python -m vllm.entrypoints.grpc_server --model Qwen/Qwen2.5-1.5B-Instruct --port 50051

vLLM worker — Harmony model (tests 5–6):

python -m vllm.entrypoints.grpc_server --model openai/gpt-oss-20b --port 50051

Gateway (same for both backends)

./target/debug/smg --worker-urls grpc://localhost:50051 --port 3000

Smoke test

curl http://localhost:3000/v1/models | jq '.data[].id'

Token reference (Qwen tokenizer; GPT-OSS uses a different tokenizer, but the results in this guide show the same ID, 21, for "6")

| Token ID | Text |
|----------|------|
| 20 | "5" |
| 21 | "6" |
| 198 | "\n" |

Qwen3 thinking mode: with Qwen3-0.6B-4bit, /v1/messages and /v1/chat/completions need
"thinking": {"type": "disabled"} so that max_tokens is not spent on <think> tokens.
This is not needed for /generate (no chat template), for GPT-OSS models, or for vLLM (Qwen2.5-1.5B).


Baseline quick-check

MLX

Build at 9a93938a and run these six commands — all must return 400.
Switch to HEAD build — all must return 200.

# 1. Chat
curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"mlx-community/Qwen3-0.6B-4bit","messages":[{"role":"user","content":"Count 1-10"}],"stop":["6"],"max_tokens":200,"thinking":{"type":"disabled"}}' | jq .

# 2. Completion
curl http://localhost:3000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"mlx-community/Qwen3-0.6B-4bit","prompt":"1\n2\n3\n4\n","stop":["6"],"max_tokens":200}' | jq .

# 3. Messages
curl http://localhost:3000/v1/messages \
  -H "Content-Type: application/json" \
  -d '{"model":"mlx-community/Qwen3-0.6B-4bit","messages":[{"role":"user","content":"Count 1-10"}],"stop_sequences":["6"],"max_tokens":200,"thinking":{"type":"disabled"}}' | jq .

# 4. Generate
curl http://localhost:3000/generate \
  -H "Content-Type: application/json" \
  -d '{"model":"mlx-community/Qwen3-0.6B-4bit","text":"1\n2\n3\n4\n","sampling_params":{"stop":["6"],"max_new_tokens":200}}' | jq .

# 5. Harmony Chat  [requires Harmony model worker]
curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"mlx-community/gpt-oss-20b-MXFP4-Q4","messages":[{"role":"user","content":"Count 1-10"}],"stop":["6"],"max_tokens":200}' | jq .

# 6. Harmony Responses  [requires Harmony model worker]
curl http://localhost:3000/v1/responses \
  -H "Content-Type: application/json" \
  -d '{"model":"mlx-community/gpt-oss-20b-MXFP4-Q4","input":"Count 1-10","stop":["6"],"max_output_tokens":200}' | jq .

Baseline result (all six):

# 1. Chat
{
  "error": {
    "type": "Bad Request",
    "code": "invalid_request_parameters",
    "message": "Invalid request parameters: MLX backend does not support string stop sequences",
    "param": null
  }
}

# 2. Completion
{
  "error": {
    "type": "Bad Request",
    "code": "invalid_request_parameters",
    "message": "Invalid request parameters: MLX backend does not support string stop sequences",
    "param": null
  }
}

# 3. Messages
{
  "error": {
    "type": "Bad Request",
    "code": "invalid_request_parameters",
    "message": "Invalid request parameters: MLX backend does not support string stop sequences",
    "param": null
  }
}

# 4. Generate
{
  "error": {
    "type": "Bad Request",
    "code": "build_request_failed",
    "message": "MLX backend does not support string stop sequences",
    "param": null
  }
}

# 5. Harmony Chat
{
  "error": {
    "type": "Bad Request",
    "code": "invalid_request_parameters",
    "message": "Invalid request parameters: MLX backend does not support string stop sequences",
    "param": null
  }
}

# 6. Harmony Responses
{
  "error": {
    "type": "Bad Request",
    "code": "invalid_request_parameters",
    "message": "Invalid request parameters: MLX backend does not support string stop sequences",
    "param": null
  }
}

HEAD result (1–4; 5 and 6 were skipped):

# 1. Chat
{
  "id": "chatcmpl-019e51a2-26d8-7cb1-a462-52195b649218",
  "object": "chat.completion",
  "created": 1779486041,
  "model": "mlx-community/Qwen3-0.6B-4bit",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "reasoning_content": "Okay, the user wants me to count from 1 to 10. Let me start by writing down the numbers in order: 1, 2, 3, 4, 5,"
      },
      "finish_reason": "stop",
      "matched_stop": "6"
    }
  ],
  "usage": {
    "prompt_tokens": 14,
    "completion_tokens": 46,
    "total_tokens": 60
  },
  "system_fingerprint": "default"
}

# 2. Completion
{
  "id": "cmpl_019e51a2-2788-7c43-8b42-8d5c9e62e07a",
  "object": "text_completion",
  "created": 1779486041,
  "model": "mlx-community/Qwen3-0.6B-4bit",
  "choices": [
    {
      "text": "5\n",
      "index": 0,
      "finish_reason": "stop",
      "matched_stop": "6"
    }
  ],
  "usage": {
    "prompt_tokens": 8,
    "completion_tokens": 3,
    "total_tokens": 11
  },
  "system_fingerprint": "default"
}

# 3. Messages
{
  "id": "msg_019e51a2-27ad-7d02-800d-bd98b8ac390b",
  "type": "message",
  "role": "assistant",
  "content": [
    {
      "type": "text",
      "text": "Okay, so I need to count from 1 to 10. Let me start with 1. I'm going to count one after another. So, 1, 2, 3, 4, 5, "
    }
  ],
  "model": "mlx-community/Qwen3-0.6B-4bit",
  "stop_reason": "stop_sequence",
  "stop_sequence": "6",
  "usage": {
    "input_tokens": 18,
    "output_tokens": 50
  }
}

# 4. Generate
[
  {
    "text": "5\n",
    "output_ids": [
      20,
      198,
      21
    ],
    "meta_info": {
      "id": "gen-019e51a2-284e-7ae1-8e8e-9e69343cb6b5",
      "finish_reason": {
        "type": "stop"
      },
      "prompt_tokens": 8,
      "weight_version": "default",
      "completion_tokens": 3,
      "cached_tokens": 0,
      "e2e_latency": 0.000055709,
      "matched_stop": 21
    }
  }
]
NOTE: matched_stop is integer 21 (token ID for "6"), not the string "6" — known limitation L1.

# 5. Harmony Chat   — SKIPPED (Harmony model worker not running)
# 6. Harmony Responses — SKIPPED (Harmony model worker not running)

vLLM

Run these six commands at either revision — all should return 200 (vLLM supports string stops natively, no change across revisions).

# 1. Chat
curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"Qwen/Qwen2.5-1.5B-Instruct","messages":[{"role":"user","content":"Count 1-10"}],"stop":["6"],"max_tokens":200}' | jq .

# 2. Completion
curl http://localhost:3000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"Qwen/Qwen2.5-1.5B-Instruct","prompt":"1\n2\n3\n4\n","stop":["6"],"max_tokens":200}' | jq .

# 3. Messages
curl http://localhost:3000/v1/messages \
  -H "Content-Type: application/json" \
  -d '{"model":"Qwen/Qwen2.5-1.5B-Instruct","messages":[{"role":"user","content":"Count 1-10"}],"stop_sequences":["6"],"max_tokens":200}' | jq .

# 4. Generate
curl http://localhost:3000/generate \
  -H "Content-Type: application/json" \
  -d '{"model":"Qwen/Qwen2.5-1.5B-Instruct","text":"1\n2\n3\n4\n","sampling_params":{"stop":["6"],"max_new_tokens":200}}' | jq .

# 5. Harmony Chat  [requires Harmony model worker]
curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"openai/gpt-oss-20b","messages":[{"role":"user","content":"Count 1-10"}],"stop":["6"],"max_tokens":200}' | jq .

# 6. Harmony Responses  [requires Harmony model worker]
curl http://localhost:3000/v1/responses \
  -H "Content-Type: application/json" \
  -d '{"model":"openai/gpt-oss-20b","input":"Count 1-10","stop":["6"],"max_output_tokens":200}' | jq .

vLLM result (all six):

# 1. Chat
{
  "id": "chatcmpl-019e4d80-ccd6-7011-abbc-e9d1eb55444d",
  "object": "chat.completion",
  "created": 1779416747,
  "model": "Qwen/Qwen2.5-1.5B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Here is the count from 1 to 10:\n\n1, 2, 3, 4, 5, ",
        "reasoning_content": null
      },
      "finish_reason": "stop",
      "matched_stop": "6"
    }
  ],
  "usage": {
    "prompt_tokens": 35,
    "completion_tokens": 28,
    "total_tokens": 63,
    "prompt_tokens_details": { "cached_tokens": 32 }
  },
  "system_fingerprint": "default"
}

# 2. Completion
{
  "id": "cmpl_019e4d80-ddcd-7493-a43c-136420ddf72b",
  "object": "text_completion",
  "created": 1779416751,
  "model": "Qwen/Qwen2.5-1.5B-Instruct",
  "choices": [
    {
      "text": "5\n",
      "index": 0,
      "finish_reason": "stop",
      "matched_stop": "6"
    }
  ],
  "usage": { "prompt_tokens": 8, "completion_tokens": 3, "total_tokens": 11 },
  "system_fingerprint": "default"
}

# 3. Messages
{
  "id": "msg_019e4d80-f0ac-7b23-98b4-36628e0614c9",
  "type": "message",
  "role": "assistant",
  "content": [ { "type": "text", "text": "Sure! Here's the count from 1 to 10:\n\n1, 2, 3, 4, 5, " } ],
  "model": "Qwen/Qwen2.5-1.5B-Instruct",
  "stop_reason": "stop_sequence",
  "stop_sequence": "6",
  "usage": { "input_tokens": 35, "output_tokens": 30 }
}

# 4. Generate
[
  {
    "text": "5\n",
    "output_ids": [ 20, 198, 21 ],
    "meta_info": {
      "id": "gen-019e4d80-fed5-7bd2-a183-cfb745f181a7",
      "finish_reason": { "type": "stop" },
      "prompt_tokens": 8,
      "completion_tokens": 3,
      "matched_stop": "6"
    }
  }
]

# 5. Harmony Chat
{
  "id": "chatcmpl-019e51a4-f163-7dc3-ba6a-fef5e76a11d4",
  "object": "chat.completion",
  "created": 1779486224,
  "model": "mlx-community/gpt-oss-20b-MXFP4-Q4",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "reasoning_content": "We need to respond to \"Count 1-10\". The user says \"Count 1-10\". Likely they want us to count from 1 to 10. So we should output numbers 1 to 10. Probably each number on new line. So answer: 1 2 3 4 5 6"
      },
      "finish_reason": "stop",
      "matched_stop": 21
    }
  ],
  "usage": {
    "prompt_tokens": 74,
    "completion_tokens": 72,
    "total_tokens": 146,
    "completion_tokens_details": {
      "reasoning_tokens": 70
    }
  },
  "system_fingerprint": "default"
}
NOTE: matched_stop is integer 21 (token ID for "6"), not the string "6" — known limitation L1.
Stop fired during reasoning content before actual output text was produced.

# 6. Harmony Responses
{
  "id": "responses-019e51a4-fa2d-7442-9bb0-12d758f42b50",
  "object": "response",
  "created_at": 1779486226,
  "status": "completed",
  "max_output_tokens": 200,
  "model": "mlx-community/gpt-oss-20b-MXFP4-Q4",
  "output": [
    {
      "type": "reasoning",
      "id": "reasoning_responses-019e51a4-fa2d-7442-9bb0-12d758f42b50",
      "content": [
        {
          "type": "reasoning_text",
          "text": "We need to interpret the user request. They say: \"Count 1-10\". They want us to count from 1 to 10. Possibly they want us to count. They might want us to count inclusive of both ends. So expected output: \"1, 2, 3, 4, 5, 6"
        }
      ],
      "status": "completed"
    }
  ],
  "parallel_tool_calls": true,
  "store": true,
  "temperature": 1.0,
  "tool_choice": "auto",
  "tools": [],
  "usage": {
    "input_tokens": 70,
    "output_tokens": 72,
    "total_tokens": 142,
    "output_tokens_details": {
      "reasoning_tokens": 70
    }
  },
  "metadata": {}
}
NOTE: status "completed" confirms stop fired correctly. Stop triggered during reasoning block
before actual output message was emitted — consistent with known limitation L1 (integer matched_stop).

1. Regular Chat (/v1/chat/completions)

MLX model: mlx-community/Qwen3-0.6B-4bit
vLLM model: Qwen/Qwen2.5-1.5B-Instruct

1.1 Non-streaming


1.1.1 Single-token string stop ("stop": ["6"])

MLX:

curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-0.6B-4bit",
    "messages": [{"role": "user", "content": "Repeat: 1 2 3 hello world 4 5 6 7"}],
    "stop": ["6"],
    "stream": false,
    "max_tokens": 100,
    "thinking": {"type": "disabled"}
  }' | jq

vLLM:

curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
    "stop": ["6"],
    "stream": false,
    "max_tokens": 100
  }' | jq

Expected:

  • MLX baseline: HTTP 400 invalid_request_parameters
  • MLX HEAD: HTTP 200, finish_reason: "stop", matched_stop: "6", content ends before 6
  • vLLM: HTTP 200, finish_reason: "stop", matched_stop: "6", content ends before 6

MLX baseline result:

{
  "error": {
    "type": "Bad Request",
    "code": "invalid_request_parameters",
    "message": "Invalid request parameters: MLX backend does not support string stop sequences",
    "param": null
  }
}

MLX HEAD result:

{
  "id": "chatcmpl-019e5219-ac15-7453-8e07-76ba0e5c1a61",
  "object": "chat.completion",
  "created": 1779493874,
  "model": "mlx-community/Qwen3-0.6B-4bit",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "reasoning_content": "Okay, let's see. The user is asking me to repeat the string \"1 2 3 hello world 4 5"
      },
      "finish_reason": "stop",
      "matched_stop": "6"
    }
  ],
  "usage": {
    "prompt_tokens": 26,
    "completion_tokens": 31,
    "total_tokens": 57
  },
  "system_fingerprint": "default"
}

NOTE: The stop fired during reasoning, so content is null and reasoning_content is cut off just before "6". The prompt was changed to "Repeat: 1 2 3 hello world 4 5 6 7" so that "6" appears early in the reasoning, within the 100-token budget.

vLLM result:

{
  "id": "chatcmpl-019e4db2-5127-7c53-bc6b-090fd23eb7c3",
  "object": "chat.completion",
  "created": 1779419992,
  "model": "Qwen/Qwen2.5-1.5B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Here is the count from 1 to 10:\n\n1  \n2  \n3  \n4  \n5  \n",
        "reasoning_content": null
      },
      "finish_reason": "stop",
      "matched_stop": "6"
    }
  ],
  "usage": {
    "prompt_tokens": 42,
    "completion_tokens": 23,
    "total_tokens": 65,
    "prompt_tokens_details": {
      "cached_tokens": 32
    }
  },
  "system_fingerprint": "default"
}

1.1.2 Multi-token string stop ("stop": ["hello world"])

MLX:

curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-0.6B-4bit",
    "messages": [{"role": "user", "content": "Repeat exactly: 1 2 3 hello world 4 5"}],
    "stop": ["hello world"],
    "stream": false,
    "max_tokens": 100,
    "thinking": {"type": "disabled"}
  }' | jq

vLLM:

curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [{"role": "user", "content": "Repeat exactly: 1 2 3 hello world 4 5"}],
    "stop": ["hello world"],
    "stream": false,
    "max_tokens": 100
  }' | jq

Expected:

  • MLX baseline: HTTP 400 invalid_request_parameters
  • MLX HEAD: HTTP 400 unsupported_stop_string (still 400, different error)
  • vLLM: HTTP 200, finish_reason: "stop", matched_stop: "hello world"

MLX baseline result:

{
  "error": {
    "type": "Bad Request",
    "code": "invalid_request_parameters",
    "message": "Invalid request parameters: MLX backend does not support string stop sequences",
    "param": null
  }
}

MLX HEAD result:

{
  "error": {
    "type": "Bad Request",
    "code": "unsupported_stop_string",
    "message": "stop string \"hello world\" encodes to 2 tokens; MLX backend only supports single-token stop strings",
    "param": null
  }
}

vLLM result:

{
  "id": "chatcmpl-019e50a4-50ce-73c2-8afd-e445936604de",
  "object": "chat.completion",
  "created": 1779469406,
  "model": "Qwen/Qwen2.5-1.5B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "1 2 3 ",
        "reasoning_content": null
      },
      "finish_reason": "stop",
      "matched_stop": "hello world"
    }
  ],
  "usage": {
    "prompt_tokens": 44,
    "completion_tokens": 7,
    "total_tokens": 51,
    "prompt_tokens_details": {
      "cached_tokens": 16
    }
  },
  "system_fingerprint": "default"
}
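
The single-token rule behind the unsupported_stop_string error shown above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the PR's implementation: `TOY_VOCAB`, `toy_encode`, `UnsupportedStopString`, and `stop_strings_to_token_ids` are hypothetical names, and the IDs for "hello" and " world" are invented (only "5" as token 20 and "6" as token 21 follow the test notes in this description).

```python
from typing import Callable, Dict, List

# Toy vocabulary for illustration only. "5" -> 20 and "6" -> 21 follow the
# test notes in this description; the other IDs are invented.
TOY_VOCAB: Dict[str, int] = {"5": 20, "6": 21, "hello": 101, " world": 102}


class UnsupportedStopString(ValueError):
    """Raised when a stop string does not encode to exactly one token."""


def toy_encode(text: str) -> List[int]:
    """Greedy longest-match encoder over TOY_VOCAB (stand-in for a real tokenizer)."""
    ids: List[int] = []
    pos = 0
    while pos < len(text):
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in TOY_VOCAB:
                ids.append(TOY_VOCAB[piece])
                pos = end
                break
        else:
            raise KeyError(f"no token for {text[pos:]!r}")
    return ids


def stop_strings_to_token_ids(
    stops: List[str], encode: Callable[[str], List[int]]
) -> Dict[int, str]:
    """Map each stop string to its single token ID, or reject the request."""
    mapping: Dict[int, str] = {}
    for stop in stops:
        ids = encode(stop)
        if len(ids) != 1:
            raise UnsupportedStopString(
                f'stop string "{stop}" encodes to {len(ids)} tokens; '
                "only single-token stop strings are supported"
            )
        mapping[ids[0]] = stop
    return mapping
```

With this toy encoder, `["6"]` maps to `{21: "6"}`, while `["hello world"]` is rejected because it encodes to two tokens. Returning an ID-to-string map (rather than a bare list of IDs) is a design choice made here so the original string can be recovered later for matched_stop.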

1.1.3 stop_token_ids ([20, 21])

MLX:

curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-0.6B-4bit",
    "messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
    "stop_token_ids": [20, 21],
    "stream": false,
    "max_tokens": 100,
    "thinking": {"type": "disabled"}
  }' | jq '{finish_reason: .choices[0].finish_reason, matched_stop: .choices[0].matched_stop}'

vLLM:

curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
    "stop_token_ids": [20, 21],
    "stream": false,
    "max_tokens": 100
  }' | jq '{finish_reason: .choices[0].finish_reason, matched_stop: .choices[0].matched_stop}'

Expected:

  • MLX baseline: HTTP 200, matched_stop: 20
  • MLX HEAD: HTTP 200, matched_stop: 20
  • vLLM: HTTP 200, matched_stop: 20

MLX baseline result:

{
  "finish_reason": "stop",
  "matched_stop": 20
}

MLX HEAD result:

{
  "finish_reason": "stop",
  "matched_stop": 20
}

vLLM result:

{
  "finish_reason": "stop",
  "matched_stop": 20
}

1.1.4 String + stop_token_ids ("stop": ["5"], "stop_token_ids": [21])

String "5" (token 20) fires before token ID 21 ("6") — matched_stop should be the string.

MLX:

curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-0.6B-4bit",
    "messages": [{"role": "user", "content": "Repeat: 1 2 3 4 5 6 7"}],
    "stop": ["5"],
    "stop_token_ids": [21],
    "stream": false,
    "max_tokens": 100,
    "thinking": {"type": "disabled"}
  }' | jq '{finish_reason: .choices[0].finish_reason, matched_stop: .choices[0].matched_stop}'

vLLM:

curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
    "stop": ["5"],
    "stop_token_ids": [21],
    "stream": false,
    "max_tokens": 100
  }' | jq '{finish_reason: .choices[0].finish_reason, matched_stop: .choices[0].matched_stop}'

Expected:

  • MLX baseline: HTTP 400 (string stop present → rejected)
  • MLX HEAD: HTTP 200, matched_stop: "5" (string wins)
  • vLLM: HTTP 200, matched_stop: "5"

MLX baseline result:

{
  "error": {
    "type": "Bad Request",
    "code": "invalid_request_parameters",
    "message": "Invalid request parameters: MLX backend does not support string stop sequences",
    "param": null
  }
}

MLX HEAD result:

{
  "finish_reason": "stop",
  "matched_stop": "5"
}

NOTE: Prompt changed to "Repeat: 1 2 3 4 5 6 7" to ensure "5" appears early in reasoning before "6" (token ID 21). String stop "5" fires first — matched_stop is the string.

vLLM result:

{
  "finish_reason": "stop",
  "matched_stop": "5"
}
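
The results above show matched_stop as a string when a "stop" string fired and as an integer when a stop_token_ids entry fired. One way this mapping could work is sketched below. `resolve_matched_stop` and its `string_stops` argument are hypothetical, and the rule that an ID derived from a stop string is always reported as the string is an assumption, not something confirmed by the diff.

```python
from typing import Dict, Optional, Union


def resolve_matched_stop(
    stop_token_id: Optional[int], string_stops: Dict[int, str]
) -> Optional[Union[str, int]]:
    """Translate the token ID that ended generation into a matched_stop value.

    string_stops maps token IDs derived from "stop" strings back to the
    original strings; IDs passed via "stop_token_ids" are not in the map.
    """
    if stop_token_id is None:
        return None  # generation ended for another reason, e.g. length
    # Assumption: an ID that came from a stop string is reported as the string.
    return string_stops.get(stop_token_id, stop_token_id)
```

For the request in this subsection ("stop": ["5"], "stop_token_ids": [21]), the map would be `{20: "5"}`, so token 20 resolves to "5" and token 21 would resolve to the integer 21.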

1.1.5 Multi-token string + stop_token_ids ("stop": ["hello world"], "stop_token_ids": [20])

MLX:

curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-0.6B-4bit",
    "messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
    "stop": ["hello world"],
    "stop_token_ids": [20],
    "stream": false,
    "max_tokens": 100,
    "thinking": {"type": "disabled"}
  }' | jq .

vLLM:

curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
    "stop": ["hello world"],
    "stop_token_ids": [20],
    "stream": false,
    "max_tokens": 100
  }' | jq '{finish_reason: .choices[0].finish_reason, matched_stop: .choices[0].matched_stop}'

Expected:

  • MLX baseline: HTTP 400
  • MLX HEAD: HTTP 400 unsupported_stop_string (multi-token still rejected)
  • vLLM: HTTP 200, matched_stop: 20 (token ID fires first)

MLX baseline result:

{
  "error": {
    "type": "Bad Request",
    "code": "invalid_request_parameters",
    "message": "Invalid request parameters: MLX backend does not support string stop sequences",
    "param": null
  }
}

MLX HEAD result:

{
  "error": {
    "type": "Bad Request",
    "code": "unsupported_stop_string",
    "message": "stop string \"hello world\" encodes to 2 tokens; MLX backend only supports single-token stop strings",
    "param": null
  }
}

vLLM result:

{
  "finish_reason": "stop",
  "matched_stop": 20
}

1.2 Streaming


1.2.1 Single-token string stop

MLX:

curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-0.6B-4bit",
    "messages": [{"role": "user", "content": "Repeat: 1 2 3 hello world 4 5 6 7"}],
    "stop": ["6"],
    "stream": true,
    "max_tokens": 100,
    "thinking": {"type": "disabled"}
  }' | grep "^data:" | grep -v "\[DONE\]" | tail -1 | sed 's/^data: //' | jq .

vLLM:

curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
    "stop": ["6"],
    "stream": true,
    "max_tokens": 100
  }' | grep "^data:" | grep -v "\[DONE\]" | tail -1 | sed 's/^data: //' | jq .

Expected:

  • MLX baseline: HTTP 400 (no SSE stream)
  • MLX HEAD: SSE — final chunk finish_reason: "stop", matched_stop: "6"
  • vLLM: SSE — final chunk finish_reason: "stop", matched_stop: "6"

MLX baseline result:

(no output — HTTP 400 before SSE stream; grep pipeline produces no matching lines)

MLX HEAD result:

{
  "id": "chatcmpl-019e521d-ba2c-7352-98d6-d5818f68a59b",
  "object": "chat.completion.chunk",
  "created": 1779494140,
  "model": "mlx-community/Qwen3-0.6B-4bit",
  "system_fingerprint": "default",
  "choices": [
    {
      "index": 0,
      "delta": {
        "reasoning_content": null
      },
      "logprobs": null,
      "finish_reason": "stop",
      "matched_stop": "6"
    }
  ]
}

NOTE: Prompt changed to "Repeat: 1 2 3 hello world 4 5 6 7" to ensure "6" appears early in reasoning within the 100-token budget.

vLLM result:

{
  "id": "chatcmpl-019e4d86-aa85-7132-8507-59dcdb557fd0",
  "object": "chat.completion.chunk",
  "created": 1779417131,
  "model": "Qwen/Qwen2.5-1.5B-Instruct",
  "system_fingerprint": "default",
  "choices": [
    {
      "index": 0,
      "delta": {
        "reasoning_content": null
      },
      "logprobs": null,
      "finish_reason": "stop",
      "matched_stop": "6"
    }
  ]
}
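
For readers who prefer to script these checks, the grep/tail/sed pipeline used in the streaming commands can be approximated in Python. This is a hedged sketch: `last_sse_chunk` is a hypothetical helper, and it assumes each SSE event carries a single `data:` line, as in the output shown here.

```python
import json
from typing import Iterable, Optional


def last_sse_chunk(lines: Iterable[str]) -> Optional[dict]:
    """Return the last JSON data chunk of an SSE stream, ignoring [DONE]."""
    last = None
    for line in lines:
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data lines (e.g. an error body)
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            continue
        last = payload
    return json.loads(last) if last is not None else None
```

When the server answers HTTP 400 before any SSE stream starts, no line begins with `data:`, so the helper returns None, which mirrors the "no output" results recorded for the MLX baseline.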

1.2.2 Multi-token string stop

MLX:

curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-0.6B-4bit",
    "messages": [{"role": "user", "content": "Repeat exactly: 1 2 3 hello world 4 5"}],
    "stop": ["hello world"],
    "stream": true,
    "max_tokens": 100,
    "thinking": {"type": "disabled"}
  }' | grep "^data:" | grep -v "\[DONE\]" | tail -1 | sed 's/^data: //' | jq .

vLLM:

curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [{"role": "user", "content": "Repeat exactly: 1 2 3 hello world 4 5"}],
    "stop": ["hello world"],
    "stream": true,
    "max_tokens": 100
  }' | grep "^data:" | grep -v "\[DONE\]" | tail -1 | sed 's/^data: //' | jq .

Expected:

  • MLX (both revisions): HTTP 400 (different error codes)
  • vLLM: SSE — finish_reason: "stop", matched_stop: "hello world"

MLX baseline result:

(no output — HTTP 400 before SSE stream; grep pipeline produces no matching lines)

MLX HEAD result:

(no output — HTTP 400 before SSE stream; grep pipeline produces no matching lines)

vLLM result:

{
  "id": "chatcmpl-019e4d86-ae13-7273-aa55-07722534d888",
  "object": "chat.completion.chunk",
  "created": 1779417132,
  "model": "Qwen/Qwen2.5-1.5B-Instruct",
  "system_fingerprint": "default",
  "choices": [
    {
      "index": 0,
      "delta": {
        "reasoning_content": null
      },
      "logprobs": null,
      "finish_reason": "stop",
      "matched_stop": "hello world"
    }
  ]
}

1.2.3 stop_token_ids

MLX:

curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-0.6B-4bit",
    "messages": [{"role": "user", "content": "Repeat: 1 2 3 4 5 6 7"}],
    "stop_token_ids": [20, 21],
    "stream": true,
    "max_tokens": 100,
    "thinking": {"type": "disabled"}
  }' | grep "^data:" | grep -v "\[DONE\]" | tail -1 | sed 's/^data: //' | jq .

vLLM:

curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
    "stop_token_ids": [20, 21],
    "stream": true,
    "max_tokens": 100
  }' | grep "^data:" | grep -v "\[DONE\]" | tail -1 | sed 's/^data: //' | jq .

Expected:

  • MLX (both revisions): SSE — finish_reason: "stop", matched_stop: 20
  • vLLM: SSE — finish_reason: "stop", matched_stop: 20

MLX baseline result:

{
  "id": "chatcmpl-019e50df-2873-7c60-b792-8f731327b269",
  "object": "chat.completion.chunk",
  "created": 1779473262,
  "model": "mlx-community/Qwen3-0.6B-4bit",
  "system_fingerprint": "default",
  "choices": [
    {
      "index": 0,
      "delta": {
        "reasoning_content": null
      },
      "logprobs": null,
      "finish_reason": "length"
    }
  ]
}

NOTE: This result is non-deterministic. The streaming stop_token_ids mechanism works correctly (a re-run returned finish_reason "stop", matched_stop: 20), but `thinking: {"type": "disabled"}` has no effect in the baseline, so Qwen3 always enters thinking mode. Whether token 20 appears within the 100-token budget varies per run; if it does not, generation hits the length limit instead.

MLX HEAD result:

{
  "id": "chatcmpl-019e521d-ba2d-78a3-95ba-c943209df5a8",
  "object": "chat.completion.chunk",
  "created": 1779494140,
  "model": "mlx-community/Qwen3-0.6B-4bit",
  "system_fingerprint": "default",
  "choices": [
    {
      "index": 0,
      "delta": {
        "reasoning_content": null
      },
      "logprobs": null,
      "finish_reason": "stop",
      "matched_stop": 20
    }
  ]
}

NOTE: Prompt changed to "Repeat: 1 2 3 4 5 6 7" to ensure token 20 ("5") appears early in reasoning within the 100-token budget.

vLLM result:

{
  "id": "chatcmpl-019e4d86-b142-7712-9b96-46031c8cb5e0",
  "object": "chat.completion.chunk",
  "created": 1779417133,
  "model": "Qwen/Qwen2.5-1.5B-Instruct",
  "system_fingerprint": "default",
  "choices": [
    {
      "index": 0,
      "delta": {
        "reasoning_content": null
      },
      "logprobs": null,
      "finish_reason": "stop",
      "matched_stop": 20
    }
  ]
}

1.2.4 String + stop_token_ids ("stop": ["5"], "stop_token_ids": [21])

MLX:

curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-0.6B-4bit",
    "messages": [{"role": "user", "content": "Repeat: 1 2 3 4 5 6 7"}],
    "stop": ["5"],
    "stop_token_ids": [21],
    "stream": true,
    "max_tokens": 100,
    "thinking": {"type": "disabled"}
  }' | grep "^data:" | grep -v "\[DONE\]" | tail -1 | sed 's/^data: //' | jq .

vLLM:

curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
    "stop": ["5"],
    "stop_token_ids": [21],
    "stream": true,
    "max_tokens": 100
  }' | grep "^data:" | grep -v "\[DONE\]" | tail -1 | sed 's/^data: //' | jq .

Expected:

  • MLX baseline: HTTP 400
  • MLX HEAD: SSE — matched_stop: "5"
  • vLLM: SSE — matched_stop: "5"

MLX baseline result:

(no output — HTTP 400 before SSE stream; grep pipeline produces no matching lines)

MLX HEAD result:

{
  "id": "chatcmpl-019e521d-ba30-7ae0-bd0f-447bccc20bba",
  "object": "chat.completion.chunk",
  "created": 1779494140,
  "model": "mlx-community/Qwen3-0.6B-4bit",
  "system_fingerprint": "default",
  "choices": [
    {
      "index": 0,
      "delta": {
        "reasoning_content": null
      },
      "logprobs": null,
      "finish_reason": "stop",
      "matched_stop": "5"
    }
  ]
}

NOTE: Prompt changed to "Repeat: 1 2 3 4 5 6 7" to ensure "5" appears early in reasoning before token ID 21 ("6").

vLLM result:

{
  "id": "chatcmpl-019e4d86-b9d9-7293-8c31-33a5ff2c57d9",
  "object": "chat.completion.chunk",
  "created": 1779417135,
  "model": "Qwen/Qwen2.5-1.5B-Instruct",
  "system_fingerprint": "default",
  "choices": [
    {
      "index": 0,
      "delta": {
        "reasoning_content": null
      },
      "logprobs": null,
      "finish_reason": "stop",
      "matched_stop": "5"
    }
  ]
}

1.2.5 Multi-token string + stop_token_ids ("stop": ["hello world"], "stop_token_ids": [20])

MLX:

curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-0.6B-4bit",
    "messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
    "stop": ["hello world"],
    "stop_token_ids": [20],
    "stream": true,
    "max_tokens": 100,
    "thinking": {"type": "disabled"}
  }' | grep "^data:" | grep -v "\[DONE\]" | tail -1 | sed 's/^data: //' | jq .

vLLM:

curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
    "stop": ["hello world"],
    "stop_token_ids": [20],
    "stream": true,
    "max_tokens": 100
  }' | grep "^data:" | grep -v "\[DONE\]" | tail -1 | sed 's/^data: //' | jq .

Expected:

  • MLX (both revisions): HTTP 400 (different error codes)
  • vLLM: SSE — matched_stop: 20

MLX baseline result:

(no output — HTTP 400 before SSE stream; grep pipeline produces no matching lines)

MLX HEAD result:

(no output — HTTP 400 before SSE stream; grep pipeline produces no matching lines)

vLLM result:

{
  "id": "chatcmpl-019e4d86-bcea-75c2-97a6-7e34ec843dfb",
  "object": "chat.completion.chunk",
  "created": 1779417136,
  "model": "Qwen/Qwen2.5-1.5B-Instruct",
  "system_fingerprint": "default",
  "choices": [
    {
      "index": 0,
      "delta": {
        "reasoning_content": null
      },
      "logprobs": null,
      "finish_reason": "stop",
      "matched_stop": 20
    }
  ]
}

2. Regular Completion (/v1/completions)

MLX model: mlx-community/Qwen3-0.6B-4bit
vLLM model: Qwen/Qwen2.5-1.5B-Instruct

/v1/completions passes the prompt as raw text — no chat template, no thinking flag needed.

matched_stop in streaming: Before this PR, streaming /v1/completions did not include matched_stop
in any SSE chunk for either backend (known limitation L3). HEAD adds matched_stop to the final chunk; the streaming results in 2.2 show both the old and the new behavior.
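
The shape of the final streaming chunk, with and without matched_stop, can be sketched as below. `final_completion_chunk` is a hypothetical helper written for illustration; the field names follow the JSON results in this section, and omitting the key when no stop matched is an assumption based on those results.

```python
from typing import Any, Dict, Optional, Union


def final_completion_chunk(
    chunk_id: str,
    model: str,
    created: int,
    finish_reason: str,
    matched_stop: Optional[Union[str, int]] = None,
) -> Dict[str, Any]:
    """Build a last /v1/completions SSE chunk; matched_stop is omitted when unset."""
    choice: Dict[str, Any] = {"text": "", "index": 0, "finish_reason": finish_reason}
    if matched_stop is not None:
        # Stop strings are reported as strings, stop token IDs as integers.
        choice["matched_stop"] = matched_stop
    return {
        "id": chunk_id,
        "object": "text_completion",
        "created": created,
        "choices": [choice],
        "model": model,
        "system_fingerprint": "default",
    }
```

A chunk built without `matched_stop` matches the pre-fix output shape, and one built with `matched_stop=20` or `matched_stop="6"` matches the post-fix shape.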

2.1 Non-streaming


2.1.1 Single-token string stop

MLX:

curl http://localhost:3000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-0.6B-4bit",
    "prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
    "stop": ["6"],
    "stream": false,
    "max_tokens": 100
  }' | jq '{text: .choices[0].text, finish_reason: .choices[0].finish_reason, matched_stop: .choices[0].matched_stop}'

vLLM:

curl http://localhost:3000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
    "stop": ["6"],
    "stream": false,
    "max_tokens": 100
  }' | jq '{text: .choices[0].text, finish_reason: .choices[0].finish_reason, matched_stop: .choices[0].matched_stop}'

Expected:

  • MLX baseline: HTTP 400
  • MLX HEAD: HTTP 200, text: "5\n", matched_stop: "6"
  • vLLM: HTTP 200, text: "5\n", matched_stop: "6"

MLX baseline result:

{
  "error": {
    "type": "Bad Request",
    "code": "invalid_request_parameters",
    "message": "Invalid request parameters: MLX backend does not support string stop sequences",
    "param": null
  }
}

MLX HEAD result:

{
  "text": "5\n",
  "finish_reason": "stop",
  "matched_stop": "6"
}

vLLM result:

{
  "text": "5\n",
  "finish_reason": "stop",
  "matched_stop": "6"
}

2.1.2 Multi-token string stop

MLX:

curl http://localhost:3000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-0.6B-4bit",
    "prompt": "Repeat exactly: 1 2 3 hello world 4 5",
    "stop": ["hello world"],
    "stream": false,
    "max_tokens": 100
  }' | jq .

vLLM:

curl http://localhost:3000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "prompt": "Repeat exactly: 1 2 3 hello world 4 5",
    "stop": ["hello world"],
    "stream": false,
    "max_tokens": 100
  }' | jq '{text: .choices[0].text, matched_stop: .choices[0].matched_stop}'

Expected:

  • MLX (both revisions): HTTP 400 (different error codes)
  • vLLM: HTTP 200, matched_stop: "hello world"

MLX baseline result:

{
  "error": {
    "type": "Bad Request",
    "code": "invalid_request_parameters",
    "message": "Invalid request parameters: MLX backend does not support string stop sequences",
    "param": null
  }
}

MLX HEAD result:

{
  "error": {
    "type": "Bad Request",
    "code": "unsupported_stop_string",
    "message": "stop string \"hello world\" encodes to 2 tokens; MLX backend only supports single-token stop strings",
    "param": null
  }
}

vLLM result:

{
  "text": " 6\n\nSure, here is the repeated text:\n\n1 2 3 ",
  "matched_stop": "hello world"
}

2.1.3 stop_token_ids

MLX:

curl http://localhost:3000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-0.6B-4bit",
    "prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
    "stop_token_ids": [20, 21],
    "stream": false,
    "max_tokens": 100
  }' | jq '{text: .choices[0].text, matched_stop: .choices[0].matched_stop}'

vLLM:

curl http://localhost:3000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
    "stop_token_ids": [20, 21],
    "stream": false,
    "max_tokens": 100
  }' | jq '{text: .choices[0].text, matched_stop: .choices[0].matched_stop}'

Expected:

  • MLX (both revisions): HTTP 200, text: "", matched_stop: 20
  • vLLM: HTTP 200, text: "", matched_stop: 20

MLX baseline result:

{
  "text": "",
  "matched_stop": 20
}

MLX HEAD result:

{
  "text": "",
  "matched_stop": 20
}

vLLM result:

{
  "text": "",
  "matched_stop": 20
}

2.1.4 String + stop_token_ids ("stop": ["5"], "stop_token_ids": [21])

MLX:

curl http://localhost:3000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-0.6B-4bit",
    "prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
    "stop": ["5"],
    "stop_token_ids": [21],
    "stream": false,
    "max_tokens": 100
  }' | jq '{text: .choices[0].text, matched_stop: .choices[0].matched_stop}'

vLLM:

curl http://localhost:3000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
    "stop": ["5"],
    "stop_token_ids": [21],
    "stream": false,
    "max_tokens": 100
  }' | jq '{text: .choices[0].text, matched_stop: .choices[0].matched_stop}'

Expected:

  • MLX baseline: HTTP 400
  • MLX HEAD: HTTP 200, text: "", matched_stop: "5"
  • vLLM: HTTP 200, text: "", matched_stop: "5"

MLX baseline result:

{
  "error": {
    "type": "Bad Request",
    "code": "invalid_request_parameters",
    "message": "Invalid request parameters: MLX backend does not support string stop sequences",
    "param": null
  }
}

MLX HEAD result:

{
  "text": "",
  "matched_stop": "5"
}

vLLM result:

{
  "text": "",
  "matched_stop": "5"
}

2.1.5 Multi-token string + stop_token_ids ("stop": ["hello world"], "stop_token_ids": [20])

MLX:

curl http://localhost:3000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-0.6B-4bit",
    "prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
    "stop": ["hello world"],
    "stop_token_ids": [20],
    "stream": false,
    "max_tokens": 100
  }' | jq .

vLLM:

curl http://localhost:3000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
    "stop": ["hello world"],
    "stop_token_ids": [20],
    "stream": false,
    "max_tokens": 100
  }' | jq '{text: .choices[0].text, matched_stop: .choices[0].matched_stop}'

Expected:

  • MLX (both revisions): HTTP 400
  • vLLM: HTTP 200, text: "", matched_stop: 20

MLX baseline result:

{
  "error": {
    "type": "Bad Request",
    "code": "invalid_request_parameters",
    "message": "Invalid request parameters: MLX backend does not support string stop sequences",
    "param": null
  }
}

MLX HEAD result:

{
  "error": {
    "type": "Bad Request",
    "code": "unsupported_stop_string",
    "message": "stop string \"hello world\" encodes to 2 tokens; MLX backend only supports single-token stop strings",
    "param": null
  }
}

vLLM result:

{
  "text": "",
  "matched_stop": 20
}

2.2 Streaming


2.2.1 Single-token string stop

MLX:

curl http://localhost:3000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-0.6B-4bit",
    "prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
    "stop": ["6"],
    "stream": true,
    "max_tokens": 100
  }' | grep "^data:" | grep -v "\[DONE\]" | tail -1 | sed 's/^data: //' | jq .

vLLM:

curl http://localhost:3000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
    "stop": ["6"],
    "stream": true,
    "max_tokens": 100
  }' | grep "^data:" | grep -v "\[DONE\]" | tail -1 | sed 's/^data: //' | jq .

Expected:

  • MLX baseline: HTTP 400
  • MLX HEAD: SSE, final chunk finish_reason: "stop", matched_stop: "6" (known limitation L3 fixed)
  • vLLM: SSE, final chunk finish_reason: "stop"; matched_stop absent before the fix, "6" after

MLX baseline result:

(no output — HTTP 400 before SSE stream; grep pipeline produces no matching lines)

MLX HEAD result:

{
  "id": "cmpl_019e51cc-33ba-7872-afe1-b94b86f63b2a",
  "object": "text_completion",
  "created": 1779488797,
  "choices": [
    {
      "text": "",
      "index": 0,
      "finish_reason": "stop",
      "matched_stop": "6"
    }
  ],
  "model": "mlx-community/Qwen3-0.6B-4bit",
  "system_fingerprint": "default"
}

NOTE: `matched_stop: "6"` is present in the final streaming chunk, so L3 is fixed in HEAD for MLX.

vLLM result (before fix):

{
  "id": "cmpl_019e4d86-dace-7061-a150-f55edf957d8d",
  "object": "text_completion",
  "created": 1779416744,
  "choices": [
    {
      "text": "",
      "index": 0,
      "finish_reason": "stop"
    }
  ],
  "model": "Qwen/Qwen2.5-1.5B-Instruct",
  "system_fingerprint": "default"
}

vLLM result (after fix):

{
  "id": "cmpl_019e50c5-da31-7052-8dc8-34799b86f78f",
  "object": "text_completion",
  "created": 1779471604,
  "choices": [
    {
      "text": "",
      "index": 0,
      "finish_reason": "stop",
      "matched_stop": "6"
    }
  ],
  "model": "Qwen/Qwen2.5-1.5B-Instruct",
  "system_fingerprint": "default"
}

2.2.2 Multi-token string stop

MLX:

curl http://localhost:3000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-0.6B-4bit",
    "prompt": "Repeat exactly: 1 2 3 hello world 4 5",
    "stop": ["hello world"],
    "stream": true,
    "max_tokens": 200
  }' | grep "^data:" | grep -v "\[DONE\]" | tail -1 | sed 's/^data: //' | jq .

vLLM:

curl http://localhost:3000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "prompt": "Repeat exactly: 1 2 3 hello world 4 5",
    "stop": ["hello world"],
    "stream": true,
    "max_tokens": 200
  }' | grep "^data:" | grep -v "\[DONE\]" | tail -1 | sed 's/^data: //' | jq .

Expected:

  • MLX (both revisions): HTTP 400
  • vLLM: SSE, finish_reason: "stop"; matched_stop absent before the fix, "hello world" after

MLX baseline result:

(no output — HTTP 400 before SSE stream; grep pipeline produces no matching lines)

MLX HEAD result:

(no output — HTTP 400 before SSE stream; grep pipeline produces no matching lines)

vLLM result (before fix):

{
  "id": "cmpl_019e50a7-f313-78a3-bc48-95cf42d086f7",
  "object": "text_completion",
  "created": 1779469644,
  "choices": [
    {
      "text": "",
      "index": 0,
      "finish_reason": "stop"
    }
  ],
  "model": "Qwen/Qwen2.5-1.5B-Instruct",
  "system_fingerprint": "default"
}

vLLM result (after fix):

{
  "id": "cmpl_019e50c8-f112-7500-9929-9e3719837aef",
  "object": "text_completion",
  "created": 1779471806,
  "choices": [
    {
      "text": "",
      "index": 0,
      "finish_reason": "stop",
      "matched_stop": "hello world"
    }
  ],
  "model": "Qwen/Qwen2.5-1.5B-Instruct",
  "system_fingerprint": "default"
}

2.2.3 stop_token_ids

MLX:

curl http://localhost:3000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-0.6B-4bit",
    "prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
    "stop_token_ids": [20, 21],
    "stream": true,
    "max_tokens": 100
  }' | grep "^data:" | grep -v "\[DONE\]" | tail -1 | sed 's/^data: //' | jq .

vLLM:

curl http://localhost:3000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
    "stop_token_ids": [20, 21],
    "stream": true,
    "max_tokens": 100
  }' | grep "^data:" | grep -v "\[DONE\]" | tail -1 | sed 's/^data: //' | jq .

Expected:

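  • MLX baseline: SSE, stops immediately, finish_reason: "stop", no matched_stop
  • MLX HEAD: SSE, stops immediately, finish_reason: "stop", matched_stop: 20
  • vLLM: same as MLX (no matched_stop before the fix, matched_stop: 20 after)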

MLX baseline result:

{
  "id": "cmpl_019e50df-a6c8-7332-ac63-38eabbebdddb",
  "object": "text_completion",
  "created": 1779473295,
  "choices": [
    {
      "text": "",
      "index": 0,
      "finish_reason": "stop"
    }
  ],
  "model": "mlx-community/Qwen3-0.6B-4bit",
  "system_fingerprint": "default"
}

MLX HEAD result:

{
  "id": "cmpl_019e51cc-33c1-75e3-ae52-8e5fb96233e2",
  "object": "text_completion",
  "created": 1779488797,
  "choices": [
    {
      "text": "",
      "index": 0,
      "finish_reason": "stop",
      "matched_stop": 20
    }
  ],
  "model": "mlx-community/Qwen3-0.6B-4bit",
  "system_fingerprint": "default"
}

NOTE: `matched_stop: 20` is present in the final streaming chunk, so L3 is fixed in HEAD.

vLLM result (before fix):

{
  "id": "cmpl_019e4d87-7b29-76b3-a921-c68388c35493",
  "object": "text_completion",
  "created": 1779417185,
  "choices": [
    {
      "text": "",
      "index": 0,
      "finish_reason": "stop"
    }
  ],
  "model": "Qwen/Qwen2.5-1.5B-Instruct",
  "system_fingerprint": "default"
}

vLLM result (after fix):

{
  "id": "cmpl_019e50c5-e7d3-72c0-97fc-afff84a7d92d",
  "object": "text_completion",
  "created": 1779471607,
  "choices": [
    {
      "text": "",
      "index": 0,
      "finish_reason": "stop",
      "matched_stop": 20
    }
  ],
  "model": "Qwen/Qwen2.5-1.5B-Instruct",
  "system_fingerprint": "default"
}

2.2.4 String + stop_token_ids ("stop": ["5"], "stop_token_ids": [21])

MLX:

curl http://localhost:3000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-0.6B-4bit",
    "prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
    "stop": ["5"],
    "stop_token_ids": [21],
    "stream": true,
    "max_tokens": 100
  }' | grep "^data:" | grep -v "\[DONE\]" | tail -1 | sed 's/^data: //' | jq .

vLLM:

curl http://localhost:3000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
    "stop": ["5"],
    "stop_token_ids": [21],
    "stream": true,
    "max_tokens": 100
  }' | grep "^data:" | grep -v "\[DONE\]" | tail -1 | sed 's/^data: //' | jq .

Expected:

  • MLX baseline: HTTP 400
  • MLX HEAD: SSE, stops immediately, matched_stop: "5"
  • vLLM: SSE, stops immediately; matched_stop absent before the fix, "5" after

MLX baseline result:

(no output — HTTP 400 before SSE stream; grep pipeline produces no matching lines)

MLX HEAD result:

{
  "id": "cmpl_019e51cc-33c5-7660-a9aa-1ce5e36f8d08",
  "object": "text_completion",
  "created": 1779488797,
  "choices": [
    {
      "text": "",
      "index": 0,
      "finish_reason": "stop",
      "matched_stop": "5"
    }
  ],
  "model": "mlx-community/Qwen3-0.6B-4bit",
  "system_fingerprint": "default"
}

NOTE: `matched_stop: "5"` is present in the final streaming chunk, so L3 is fixed in HEAD.

vLLM result (before fix):

{
  "id": "cmpl_019e4d87-7d74-78b2-992d-6ee168192a78",
  "object": "text_completion",
  "created": 1779417185,
  "choices": [
    {
      "text": "",
      "index": 0,
      "finish_reason": "stop"
    }
  ],
  "model": "Qwen/Qwen2.5-1.5B-Instruct",
  "system_fingerprint": "default"
}

vLLM result (after fix):

{
  "id": "cmpl_019e50c5-efa1-7b31-bfa3-24e4d1fc6fc1",
  "object": "text_completion",
  "created": 1779471609,
  "choices": [
    {
      "text": "",
      "index": 0,
      "finish_reason": "stop",
      "matched_stop": "5"
    }
  ],
  "model": "Qwen/Qwen2.5-1.5B-Instruct",
  "system_fingerprint": "default"
}

2.2.5 Multi-token string + stop_token_ids ("stop": ["hello world"], "stop_token_ids": [20])

MLX:

curl http://localhost:3000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-0.6B-4bit",
    "prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
    "stop": ["hello world"],
    "stop_token_ids": [20],
    "stream": true,
    "max_tokens": 100
  }' | grep "^data:" | grep -v "\[DONE\]" | tail -1 | sed 's/^data: //' | jq .

vLLM:

curl http://localhost:3000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
    "stop": ["hello world"],
    "stop_token_ids": [20],
    "stream": true,
    "max_tokens": 100
  }' | grep "^data:" | grep -v "\[DONE\]" | tail -1 | sed 's/^data: //' | jq .

Expected:

  • MLX (both revisions): HTTP 400
  • vLLM: SSE, stops immediately; no matched_stop before the fix, matched_stop: 20 after it

MLX baseline result:

(no output — HTTP 400 before SSE stream; grep pipeline produces no matching lines)

MLX HEAD result:

(no output — HTTP 400 before SSE stream; grep pipeline produces no matching lines)

vLLM result (before fix):

{
  "id": "cmpl_019e4d87-7f04-72e2-aabe-bc0a42b80a44",
  "object": "text_completion",
  "created": 1779417186,
  "choices": [
    {
      "text": "",
      "index": 0,
      "finish_reason": "stop"
    }
  ],
  "model": "Qwen/Qwen2.5-1.5B-Instruct",
  "system_fingerprint": "default"
}

vLLM result (after fix):

{
  "id": "cmpl_019e50c5-f570-7710-83b7-dc1f75e26910",
  "object": "text_completion",
  "created": 1779471611,
  "choices": [
    {
      "text": "",
      "index": 0,
      "finish_reason": "stop",
      "matched_stop": 20
    }
  ],
  "model": "Qwen/Qwen2.5-1.5B-Instruct",
  "system_fingerprint": "default"
}

3. Regular Messages (/v1/messages) — NEW for MLX

MLX model: mlx-community/Qwen3-0.6B-4bit
vLLM model: Qwen/Qwen2.5-1.5B-Instruct

The Messages API uses stop_sequences (string array only). There is no stop_token_ids field.
Test integer stop IDs on this path via the smg_sampling_params extension or use /generate.

3.1 Non-streaming


3.1.1 Single-token stop string

MLX:

curl http://localhost:3000/v1/messages \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-0.6B-4bit",
    "messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
    "max_tokens": 100,
    "stop_sequences": ["6"],
    "thinking": {"type": "disabled"}
  }' | jq '{stop_reason, stop_sequence, content: .content[0].text}'

vLLM:

curl http://localhost:3000/v1/messages \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
    "max_tokens": 100,
    "stop_sequences": ["6"]
  }' | jq '{stop_reason, stop_sequence, content: .content[0].text}'

Expected:

  • MLX baseline: HTTP 400
  • MLX HEAD: HTTP 200, stop_reason: "stop_sequence", stop_sequence: "6", content ends before 6
  • vLLM: HTTP 200, stop_reason: "stop_sequence", stop_sequence: "6"

MLX baseline result:

{
  "error": {
    "type": "Bad Request",
    "code": "invalid_request_parameters",
    "message": "Invalid request parameters: MLX backend does not support string stop sequences",
    "param": null
  }
}

MLX HEAD result:

{
  "stop_reason": "stop_sequence",
  "stop_sequence": "6",
  "content": "Starting from 1 to 10, one number per line:\n\n1  \n2  \n3  \n4  \n5  \n"
}

vLLM result:

{
  "stop_reason": "stop_sequence",
  "stop_sequence": "6",
  "content": "Here is the count from 1 to 10:\n\n1  \n2  \n3  \n4  \n5  \n"
}

3.1.2 Multi-token stop string

MLX:

curl http://localhost:3000/v1/messages \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-0.6B-4bit",
    "messages": [{"role": "user", "content": "Say: hi there and hello world!"}],
    "max_tokens": 100,
    "stop_sequences": ["hello world"]
  }' | jq .

vLLM:

curl http://localhost:3000/v1/messages \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [{"role": "user", "content": "Repeat exactly: 1 2 3 hello world 4 5"}],
    "max_tokens": 100,
    "stop_sequences": ["hello world"]
  }' | jq '{stop_reason, stop_sequence}'

Expected:

  • MLX (both revisions): HTTP 400 (different error codes)
  • vLLM: HTTP 200, stop_reason: "stop_sequence", stop_sequence: "hello world"

MLX baseline result:

{
  "error": {
    "type": "Bad Request",
    "code": "invalid_request_parameters",
    "message": "Invalid request parameters: MLX backend does not support string stop sequences",
    "param": null
  }
}

MLX HEAD result:

{
  "error": {
    "type": "Bad Request",
    "code": "unsupported_stop_string",
    "message": "stop string \"hello world\" encodes to 2 tokens; MLX backend only supports single-token stop strings",
    "param": null
  }
}

vLLM result:

{
  "stop_reason": "stop_sequence",
  "stop_sequence": "hello world"
}
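
Both MLX revisions return HTTP 400 for a multi-token stop string and differ only in error.code, so a quick way to tell them apart is to pull that field out of the body. The helper below is a minimal sketch and not part of the PR: the name error_code is hypothetical, and it relies on sed pattern matching rather than a JSON parser, so it assumes the body carries a single "code" key as in the results above. Where jq is available, jq -r '.error.code' is the sturdier choice.

```shell
# Hypothetical helper: print the value of the first "code" key found on stdin.
# Works on pretty-printed and compact JSON, but is not a real JSON parser.
error_code() {
  sed -n 's/.*"code": *"\([^"]*\)".*/\1/p' | head -n 1
}
```

Piping the MLX baseline body through error_code prints invalid_request_parameters; the MLX HEAD body prints unsupported_stop_string.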

3.2 Streaming


3.2.1 Single-token stop string

MLX:

curl http://localhost:3000/v1/messages \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-0.6B-4bit",
    "messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
    "max_tokens": 100,
    "stop_sequences": ["6"],
    "thinking": {"type": "disabled"},
    "stream": true
  }' | grep "^data:" | grep "message_delta" | tail -1 | sed 's/^data: //' | jq .

vLLM:

curl http://localhost:3000/v1/messages \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
    "max_tokens": 100,
    "stop_sequences": ["6"],
    "stream": true
  }' | grep "^data:" | grep "message_delta" | tail -1 | sed 's/^data: //' | jq .

Expected:

  • MLX baseline: HTTP 400 (no SSE)
  • MLX HEAD: SSE — message_delta with stop_reason: "stop_sequence", stop_sequence: "6"
  • vLLM: SSE — same shape

MLX baseline result:

(no output — HTTP 400 before SSE stream; grep pipeline produces no matching lines)

MLX HEAD result:

{
  "type": "message_delta",
  "delta": {
    "stop_reason": "stop_sequence",
    "stop_sequence": "6"
  },
  "usage": {
    "output_tokens": 25
  }
}

vLLM result:

{
  "type": "message_delta",
  "delta": {
    "stop_reason": "stop_sequence",
    "stop_sequence": "6"
  },
  "usage": {
    "output_tokens": 23
  }
}

4. Regular Generate (/generate) — NEW for MLX

MLX model: mlx-community/Qwen3-0.6B-4bit
vLLM model: Qwen/Qwen2.5-1.5B-Instruct

Response is a JSON array; use jq 'if type == "array" then .[0] else . end | ...'.

matched_stop on MLX Generate: Raw integer token ID (not the original string).
vLLM returns the original string. This is a known limitation — see L1.
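
While L1 stands, a script that compares MLX and vLLM output can normalise the integer form on the client side. The following is a sketch under stated assumptions, not behaviour provided by the PR: matched_stop_to_string is a hypothetical name, and the id-to-string table (20 for "5", 21 for "6") is taken only from the notes in this test plan, so other tokenizers will need their own table.

```shell
# Hypothetical client-side helper: map an MLX integer matched_stop to the
# string form that vLLM reports. The table below is an assumption based on
# the token ids quoted in this test plan (20 is "5", 21 is "6").
matched_stop_to_string() {
  case "$1" in
    20) printf '%s\n' "5" ;;
    21) printf '%s\n' "6" ;;
    *)  printf '%s\n' "$1" ;;  # unknown id or already a string: pass through
  esac
}
```

The pass-through branch means a value that is already a string, such as the vLLM result, comes back unchanged.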

4.1 Non-streaming


4.1.1 Single-token string stop

MLX:

curl http://localhost:3000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-0.6B-4bit",
    "text": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
    "sampling_params": {"stop": ["6"], "max_new_tokens": 50}
  }' | jq 'if type == "array" then .[0] else . end |
         {text, finish: .meta_info.finish_reason, matched_stop: .meta_info.matched_stop}'

vLLM:

curl http://localhost:3000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "text": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
    "sampling_params": {"stop": ["6"], "max_new_tokens": 50}
  }' | jq 'if type == "array" then .[0] else . end |
         {text, finish: .meta_info.finish_reason, matched_stop: .meta_info.matched_stop}'

Expected:

  • MLX baseline: HTTP 400
  • MLX HEAD: HTTP 200, text: "5\n", matched_stop: 21 (integer — known limitation L1)
  • vLLM: HTTP 200, text: "5\n", matched_stop: "6" (string)

MLX baseline result:

{
  "error": {
    "type": "Bad Request",
    "code": "build_request_failed",
    "message": "MLX backend does not support string stop sequences",
    "param": null
  }
}

MLX HEAD result:

{
  "text": "5\n",
  "finish": {
    "type": "stop"
  },
  "matched_stop": 21
}
NOTE: `matched_stop` is integer 21 (token ID for "6") — known limitation L1.

vLLM result:

{
  "text": "5\n",
  "finish": {
    "type": "stop"
  },
  "matched_stop": "6"
}

4.1.2 Multi-token string stop

MLX:

curl http://localhost:3000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-0.6B-4bit",
    "text": "Repeat: 1 2 hello world 3 4 5",
    "sampling_params": {"stop": ["hello world"], "max_new_tokens": 50}
  }' | jq .

vLLM:

curl http://localhost:3000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "text": "Repeat: 1 2 hello world 3 4 5",
    "sampling_params": {"stop": ["hello world"], "temperature": 0, "max_new_tokens": 50}
  }' | jq

Expected:

  • MLX (both revisions): HTTP 400 (different error codes)
  • vLLM: HTTP 200, matched_stop: "hello world"

MLX baseline result:

{
  "error": {
    "type": "Bad Request",
    "code": "build_request_failed",
    "message": "MLX backend does not support string stop sequences",
    "param": null
  }
}

MLX HEAD result:

{
  "error": {
    "type": "Bad Request",
    "code": "unsupported_stop_string",
    "message": "stop string \"hello world\" encodes to 2 tokens; MLX backend only supports single-token stop strings",
    "param": null
  }
}

vLLM result:

[
  {
    "text": "\n\nSure! Here's the text repeated three times:\n\n1. 2. ",
    "output_ids": [271, 39814, 0, 5692, 594, 279, 1467, 11504, 2326, 3039, 1447, 16, 13, 220, 17, 13, 23811, 1879],
    "meta_info": {
      "id": "gen-019e50a9-f23b-7231-80a6-72e76e9dca87",
      "finish_reason": {
        "type": "stop"
      },
      "prompt_tokens": 14,
      "weight_version": "default",
      "completion_tokens": 18,
      "cached_tokens": 0,
      "e2e_latency": 0.000112795,
      "matched_stop": "hello world"
    }
  }
]

4.1.3 stop_token_ids

MLX:

curl http://localhost:3000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-0.6B-4bit",
    "text": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
    "sampling_params": {"stop_token_ids": [20, 21], "max_new_tokens": 50}
  }' | jq 'if type == "array" then .[0] else . end | {text, matched_stop: .meta_info.matched_stop}'

vLLM:

curl http://localhost:3000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "text": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
    "sampling_params": {"stop_token_ids": [20, 21], "max_new_tokens": 50}
  }' | jq 'if type == "array" then .[0] else . end | {text, matched_stop: .meta_info.matched_stop}'

Expected:

  • MLX (both revisions): HTTP 200, text: "", matched_stop: 20
  • vLLM: HTTP 200, text: "", matched_stop: 20

MLX baseline result:

{
  "text": "",
  "matched_stop": 20
}

MLX HEAD result:

{
  "text": "",
  "matched_stop": 20
}

vLLM result:

{
  "text": "",
  "matched_stop": 20
}

4.2 Streaming

Known limitations (pre-existing, not regressions): In streaming mode the stop token is not
stripped from the final text chunk and matched_stop is absent from the final chunk.
Affects all inference backends (confirmed on vLLM). Non-streaming handles both correctly.
Fix committed; validation pending. See known limitations L2.


4.2.1 Single-token string stop

MLX:

curl http://localhost:3000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-0.6B-4bit",
    "text": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
    "sampling_params": {"stop": ["6"], "max_new_tokens": 50},
    "stream": true
  }' | tail -1 | jq .

vLLM:

curl http://localhost:3000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "text": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
    "sampling_params": {"stop": ["6"], "max_new_tokens": 50},
    "stream": true
  }' | grep "^data:" | grep -v "\[DONE\]" | tail -1 | sed 's/^data: //' | jq .

Expected:

  • MLX baseline: HTTP 400
  • MLX HEAD: SSE halts at "6"; stop token present in final text chunk, no matched_stop (known limitation L2)
  • vLLM: SSE halts at "6"; stop token present in final text chunk, no matched_stop (known limitation L2 — affects all backends)

MLX baseline result:

{
  "error": {
    "type": "Bad Request",
    "code": "build_request_failed",
    "message": "MLX backend does not support string stop sequences",
    "param": null
  }
}

MLX HEAD result:

{
  "text": "5\n6",
  "output_ids": [],
  "meta_info": {
    "id": "gen-019e51cc-33ce-75c3-9cd8-d37723a85cf2-0",
    "finish_reason": "stop",
    "prompt_tokens": 22,
    "weight_version": "default",
    "completion_tokens": 3,
    "cached_tokens": 0,
    "e2e_latency": 0.041989209
  },
  "index": 0
}
NOTE: L2 applies — stop token "6" present in text, `matched_stop` absent from streaming chunk. Consistent with vLLM behaviour.

vLLM result (before fix):

{
  "text": "5\n6",
  "output_ids": [21],
  "meta_info": {
    "id": "gen-019e4d91-4f89-7e12-8ea6-81b8a6cdfe93-0",
    "finish_reason": "stop",
    "prompt_tokens": 22,
    "completion_tokens": 3,
    "cached_tokens": 16,
    "e2e_latency": 0.02720481
  },
  "index": 0
}
NOTE: a bare `| tail -1 | jq .` captures the trailing `data: [DONE]` line, which is not valid JSON, so the result above was obtained with `grep "^data:" | grep -v "\[DONE\]" | tail -1 | sed 's/^data: //' | jq .`. The stop token "6" is present in text (known limitation L2 also applies to vLLM streaming /generate), and no streaming chunk carries matched_stop.
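
The grep pipeline used for the streaming checks in this plan can be wrapped once and reused. This is a small sketch with a hypothetical function name (last_sse_json); it assumes each SSE event arrives as a single "data: ..." line and that the stream ends with "data: [DONE]", as in the outputs shown here.

```shell
# Hypothetical helper: print the JSON payload of the last SSE data event,
# ignoring the terminating "data: [DONE]" line. Prints nothing when the
# response is not an SSE stream (for example a plain HTTP 400 JSON body).
last_sse_json() {
  grep '^data:' | grep -v '\[DONE\]' | tail -n 1 | sed 's/^data: //'
}
```

Usage would look like curl ... | last_sse_json | jq . in place of the longer pipeline. An empty result is the same "no output" case recorded for the MLX baseline on the OpenAI-style streaming paths.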

vLLM result (after fix):

{
  "text": "5\n6",
  "output_ids": [
    21
  ],
  "meta_info": {
    "id": "gen-019e50c5-fbca-7141-9fc2-364e2c61b323-0",
    "finish_reason": "stop",
    "prompt_tokens": 22,
    "weight_version": "default",
    "completion_tokens": 3,
    "cached_tokens": 16,
    "e2e_latency": 0.027308119
  },
  "index": 0
}
NOTE: L2 not fixed by this commit — stop token still present in text, matched_stop still absent.

5. Harmony Chat (/v1/chat/completions + GPT-OSS) — BUG FIX + NEW for MLX

MLX model: mlx-community/gpt-oss-20b-MXFP4-Q4
vLLM model: openai/gpt-oss-20b

Setup: stop the regular-model worker and start the GPT-OSS model.

Harmony stop token behavior: Harmony does not strip the matched stop token from content.
The stop token appears at the end of content but generation halts — no further tokens produced.
Regular Chat excludes the stop string from the output — Harmony Chat does not.
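
A client that needs Harmony output to line up with Regular Chat can trim the stop token locally. The sketch below is illustrative only: strip_trailing_stop is a hypothetical name, and it assumes the caller knows the stop string and that it appears as a literal suffix of the content.

```shell
# Hypothetical helper: remove one trailing occurrence of the stop string
# from the text. If the text does not end with the stop string it is
# returned unchanged.
strip_trailing_stop() {
  text=$1
  stop=$2
  # ${var%pattern} drops the shortest matching suffix; quoting "$stop"
  # makes the shell treat it as a literal string, not a glob.
  printf '%s' "${text%"$stop"}"
}
```

For example, strip_trailing_stop 'hi 1 2 3 4 5 6' '6' yields 'hi 1 2 3 4 5 ' with the trailing space kept.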

matched_stop on MLX Harmony Chat: Raw integer token ID.

5.1 Non-streaming


5.1.1 Single-token string stop — was HTTP 400 at baseline, now fixed

MLX:

curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/gpt-oss-20b-MXFP4-Q4",
    "messages": [{"role": "user", "content": "Repeat: hi 1 2 3 4 5 6 7"}],
    "stop": ["6"],
    "stream": false,
    "max_tokens": 1400
  }' | jq

vLLM:

curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-20b",
    "messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
    "stop": ["6"],
    "stream": false,
    "max_tokens": 100
  }' | jq '{finish_reason: .choices[0].finish_reason, matched_stop: .choices[0].matched_stop, content: .choices[0].message.content}'

Expected:

  • MLX baseline: HTTP 400
  • MLX HEAD: HTTP 200, matched_stop: 21 (integer), content ends with "6" (Harmony includes the stop token — known limitation L4)
  • vLLM: HTTP 200, matched_stop: "6" (string), content ends before "6"

MLX baseline result:

{
  "error": {
    "type": "Bad Request",
    "code": "invalid_request_parameters",
    "message": "Invalid request parameters: MLX backend does not support string stop sequences",
    "param": null
  }
}

MLX HEAD result:

{
  "id": "chatcmpl-019e51af-5e84-7b10-86f5-eec87710863e",
  "object": "chat.completion",
  "created": 1779486908,
  "model": "mlx-community/gpt-oss-20b-MXFP4-Q4",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "reasoning_content": "We need to interpret user request: \"Repeat: hi 1 2 3 4 5 6"
      },
      "finish_reason": "stop",
      "matched_stop": 21
    }
  ],
  "usage": {
    "prompt_tokens": 86,
    "completion_tokens": 26,
    "total_tokens": 112,
    "completion_tokens_details": {
      "reasoning_tokens": 24
    }
  },
  "system_fingerprint": "default"
}

NOTE: matched_stop is integer 21 (token ID for "6") — known limitation L1. Stop fired during reasoning content. Prompt changed to "Repeat: hi 1 2 3 4 5 6 7" and max_tokens raised to 1400 to ensure "6" appears within the reasoning budget.

vLLM result:

SKIPPED — vLLM Harmony pipeline requires a dedicated GPU with sufficient VRAM to load GPT-OSS models. Hardware not available for this test run.

5.1.2 Multi-token string stop

MLX:

curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/gpt-oss-20b-MXFP4-Q4",
    "messages": [{"role": "user", "content": "Say: hi there and hello world!"}],
    "stop": ["hello world"],
    "stream": false,
    "max_tokens": 100
  }' | jq .

vLLM:

curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-20b",
    "messages": [{"role": "user", "content": "Say: hi there and hello world!"}],
    "stop": ["hello world"],
    "stream": false,
    "max_tokens": 100
  }' | jq '{finish_reason: .choices[0].finish_reason, matched_stop: .choices[0].matched_stop}'

Expected:

  • MLX (both revisions): HTTP 400 (different error codes)
  • vLLM: HTTP 200, matched_stop: "hello world"

MLX baseline result:

{
  "error": {
    "type": "Bad Request",
    "code": "invalid_request_parameters",
    "message": "Invalid request parameters: MLX backend does not support string stop sequences",
    "param": null
  }
}

MLX HEAD result:

{
  "error": {
    "type": "Bad Request",
    "code": "unsupported_stop_string",
    "message": "stop string \"hello world\" encodes to 2 tokens; MLX backend only supports single-token stop strings",
    "param": null
  }
}

vLLM result:

SKIPPED — vLLM Harmony pipeline requires a dedicated GPU with sufficient VRAM to load GPT-OSS models. Hardware not available for this test run.

5.1.3 stop_token_ids

MLX:

curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/gpt-oss-20b-MXFP4-Q4",
    "messages": [{"role": "user", "content": "Repeat: hi 1 2 3 4 5 6 7"}],
    "stop_token_ids": [20, 21],
    "stream": false,
    "max_tokens": 1400
  }' | jq

vLLM:

curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-20b",
    "messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
    "stop_token_ids": [20, 21],
    "stream": false,
    "max_tokens": 100
  }' | jq '{finish_reason: .choices[0].finish_reason, matched_stop: .choices[0].matched_stop}'

Expected:

  • MLX (both revisions): HTTP 200, matched_stop: 20
  • vLLM: HTTP 200, matched_stop: 20

MLX baseline result:

{
  "finish_reason": "stop",
  "matched_stop": 20
}

MLX HEAD result:

{
  "id": "chatcmpl-019e51b7-d14c-7ab2-934b-a5198c26dc00",
  "object": "chat.completion",
  "created": 1779487461,
  "model": "mlx-community/gpt-oss-20b-MXFP4-Q4",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "reasoning_content": "We need to interpret user request: \"Repeat: hi 1 2 3 4 5"
      },
      "finish_reason": "stop",
      "matched_stop": 20
    }
  ],
  "usage": {
    "prompt_tokens": 86,
    "completion_tokens": 24,
    "total_tokens": 110,
    "completion_tokens_details": {
      "reasoning_tokens": 22
    }
  },
  "system_fingerprint": "default"
}

NOTE: matched_stop is integer 20 (token ID for "5") — known limitation L1. Stop fired during reasoning content. Prompt changed to "Repeat: hi 1 2 3 4 5 6 7" and max_tokens raised to 1400 to ensure the stop token appears within the reasoning budget.

vLLM result:

SKIPPED — vLLM Harmony pipeline requires a dedicated GPU with sufficient VRAM to load GPT-OSS models. Hardware not available for this test run.

5.2 Streaming


5.2.1 Single-token string stop

MLX:

curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/gpt-oss-20b-MXFP4-Q4",
    "messages": [{"role": "user", "content": "Repeat: hi 1 2 3 4 5 6 7"}],
    "stop": ["6"],
    "stream": true,
    "max_tokens": 1400
  }' | grep "^data:" | grep -v "\[DONE\]" | tail -2 | sed 's/^data: //' | jq .

vLLM:

curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-20b",
    "messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
    "stop": ["6"],
    "stream": true,
    "max_tokens": 100
  }' | grep "^data:" | grep -v "\[DONE\]" | tail -2 | sed 's/^data: //' | jq .

Expected:

  • MLX baseline: HTTP 400
  • MLX HEAD: SSE — final chunk finish_reason: "stop", matched_stop: 21 (integer)
  • vLLM: SSE — final chunk finish_reason: "stop", matched_stop: "6" (string)

MLX baseline result:

(no output — HTTP 400 before SSE stream; grep pipeline produces no matching lines)

MLX HEAD result:

// second-to-last chunk — stop token emitted in delta
{"id":"chatcmpl-019e51bb-31fd-7f41-982f-99215064cbde","object":"chat.completion.chunk","created":1779487683,"model":"mlx-community/gpt-oss-20b-MXFP4-Q4","system_fingerprint":"default","choices":[{"index":0,"delta":{"reasoning_content":"6"},"logprobs":null,"finish_reason":null}]}

// last chunk — terminal signal
{"id":"chatcmpl-019e51bb-31fd-7f41-982f-99215064cbde","object":"chat.completion.chunk","created":1779487683,"model":"mlx-community/gpt-oss-20b-MXFP4-Q4","system_fingerprint":"default","choices":[{"index":0,"delta":{"reasoning_content":null},"logprobs":null,"finish_reason":"stop","matched_stop":21}]}

NOTE: Second-to-last chunk emits the stop token "6" in reasoning_content; final chunk has finish_reason: "stop", matched_stop: 21 (integer — known limitation L1). Stop fired during reasoning content. Prompt changed to "Repeat: hi 1 2 3 4 5 6 7" and max_tokens raised to 1400 to ensure "6" appears within the reasoning budget.

vLLM result:

SKIPPED — vLLM Harmony pipeline requires a dedicated GPU with sufficient VRAM to load GPT-OSS models. Hardware not available for this test run.

6. Harmony Responses (/v1/responses + GPT-OSS) — NEW for MLX

MLX model: mlx-community/gpt-oss-20b-MXFP4-Q4
vLLM model: openai/gpt-oss-20b

Responses API: Only stop (string array) — no stop_token_ids field.

vLLM note: vLLM silently drops stop on Harmony Responses (upstream gap — stop: vec![] in
build_grpc_sampling_params_from_responses). MLX now handles this path where vLLM does not.

status: Responses API reports stop-sequence termination as "completed".

6.1 Non-streaming


6.1.1 Single-token string stop — was HTTP 400 at baseline, now fixed

MLX:

curl http://localhost:3000/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/gpt-oss-20b-MXFP4-Q4",
    "input": "Repeat: hi 1 2 3 4 5 6 7",
    "stop": ["6"],
    "max_output_tokens": 1400
  }' | jq .

vLLM:

curl http://localhost:3000/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-20b",
    "input": "Count from 1 to 10, one number per line",
    "stop": ["6"],
    "max_output_tokens": 100
  }' | jq '{status, output_text: (.output[] | select(.type == "message") | .content[] | select(.type == "output_text") | .text)}'

Expected:

  • MLX baseline: HTTP 400
  • MLX HEAD: HTTP 200, status: "completed", output ends at "6"
  • vLLM: ⚠️ HTTP 200 but stop silently dropped — model generates all 10 numbers

MLX baseline result:

{
  "error": {
    "type": "Bad Request",
    "code": "invalid_request_parameters",
    "message": "Invalid request parameters: MLX backend does not support string stop sequences",
    "param": null
  }
}

MLX HEAD result:

{
  "id": "responses-019e51bd-9929-7720-9e21-63c3efb35f01",
  "object": "response",
  "created_at": 1779487840,
  "status": "completed",
  "max_output_tokens": 1400,
  "model": "mlx-community/gpt-oss-20b-MXFP4-Q4",
  "output": [
    {
      "type": "reasoning",
      "id": "reasoning_responses-019e51bd-9929-7720-9e21-63c3efb35f01",
      "content": [
        {
          "type": "reasoning_text",
          "text": "We need to interpret user request. They say: \"Repeat: hi 1 2 3 4 5 6"
        }
      ],
      "status": "completed"
    }
  ],
  "parallel_tool_calls": true,
  "store": true,
  "temperature": 1.0,
  "tool_choice": "auto",
  "tools": [],
  "usage": {
    "input_tokens": 82,
    "output_tokens": 29,
    "total_tokens": 111,
    "output_tokens_details": {
      "reasoning_tokens": 27
    }
  },
  "metadata": {}
}

NOTE: status: "completed", reasoning content stops at "6" — stop sequence fired correctly. No matched_stop field in Responses API response. Stop fired during reasoning before any message output block was produced. Prompt changed to "Repeat: hi 1 2 3 4 5 6 7" and max_output_tokens raised to 1400.

vLLM result:

(not captured in this test run; expected: full 1–10 output, which would confirm the vLLM gap)

6.1.2 Multi-token string stop

MLX:

curl http://localhost:3000/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/gpt-oss-20b-MXFP4-Q4",
    "input": "Say: hi there and hello world!",
    "stop": ["hello world"],
    "max_output_tokens": 100
  }' | jq .

vLLM:

curl http://localhost:3000/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-20b",
    "input": "Say: hi there and hello world!",
    "stop": ["hello world"],
    "max_output_tokens": 100
  }' | jq '{status}'

Expected:

  • MLX (both revisions): HTTP 400 (different error codes)
  • vLLM: ⚠️ HTTP 200, stop silently dropped

MLX baseline result:

{
  "error": {
    "type": "Bad Request",
    "code": "invalid_request_parameters",
    "message": "Invalid request parameters: MLX backend does not support string stop sequences",
    "param": null
  }
}

MLX HEAD result:

{
  "error": {
    "type": "Bad Request",
    "code": "unsupported_stop_string",
    "message": "stop string \"hello world\" encodes to 2 tokens; MLX backend only supports single-token stop strings",
    "param": null
  }
}

vLLM result:

(not captured in this test run; expected: HTTP 200 with full output, which would confirm the vLLM gap)

6.2 Streaming


6.2.1 Single-token string stop

MLX:

curl http://localhost:3000/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/gpt-oss-20b-MXFP4-Q4",
    "input": "Repeat: hi 1 2 3 4 5 6 7",
    "stop": ["6"],
    "max_output_tokens": 1400,
    "stream": true
  }' | grep "^data:" | grep -v "\[DONE\]" | tail -2 | sed 's/^data: //' | jq .

vLLM:

curl http://localhost:3000/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-20b",
    "input": "Count from 1 to 10, one number per line",
    "stop": ["6"],
    "max_output_tokens": 100,
    "stream": true
  }' | grep "^data:" | grep -v "\[DONE\]" | tail -2 | sed 's/^data: //' | jq .

Expected:

  • MLX baseline: HTTP 400
  • MLX HEAD: SSE — response.content_part.done with text ending at "6", then response.completed
  • vLLM: ⚠️ SSE — stop silently dropped, output contains all 10 numbers

MLX baseline result:

{
  "type": "error",
  "code": "pipeline_error",
  "message": "Pipeline execution failed: Response { status: 400, version: HTTP/1.1, headers: {\"content-type\": \"application/json\", \"x-smg-error-code\": \"invalid_request_parameters\"}, body: Body(UnsyncBoxBody) }",
  "param": null,
  "sequence_number": 2
}
NOTE: Responses API streaming emits the 400 as an SSE error event, so grep captures it (unlike other streaming endpoints).
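
Because this endpoint reports the failure inside the stream, a check on "did grep print anything" is not enough to separate a pass from a fail. One way to detect the error event is sketched below; sse_has_error_event is a hypothetical name, and the match on "type":"error" assumes the event shape shown in the baseline result above.

```shell
# Hypothetical helper: exit 0 if any SSE data line on stdin is an error
# event, exit 1 otherwise. Allows optional spaces after the colon.
sse_has_error_event() {
  grep '^data:' | grep -Eq '"type": *"error"'
}
```

A test script could run the curl command, pipe the raw stream into sse_has_error_event, and treat exit status 0 as the baseline failure case.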

MLX HEAD result:

// second-to-last event — reasoning item finalised
{"type":"response.output_item.done","sequence_number":3,"output_index":0,"item":{"id":"rs_019e51c553807dd0a7e580e6714b69d4","type":"reasoning","summary":[],"content":null,"encrypted_content":null,"status":null}}

// last event — response completed
{"type":"response.completed","sequence_number":4,"response":{"id":"resp_019e51c5-48df-7043-a6a2-b933db7dca73","object":"response","created_at":1779488344,"status":"completed","model":"mlx-community/gpt-oss-20b-MXFP4-Q4","output":[{"id":"rs_019e51c553807dd0a7e580e6714b69d4","type":"reasoning","summary":[],"content":null,"encrypted_content":null,"status":null}],"usage":{"input_tokens":82,"output_tokens":29,"total_tokens":111},"max_output_tokens":1400,"temperature":1.0,"parallel_tool_calls":true,"store":true,"tools":[],"metadata":{},"tool_choice":"auto"}}

NOTE: status: "completed", output_tokens: 29 matches the non-streaming 6.1.1 result — stop fired correctly. No matched_stop in the Responses API streaming events (known limitation L10). Reasoning content is not surfaced in the final response.output_item.done event (content: null). Prompt changed to "Repeat: hi 1 2 3 4 5 6 7" and max_output_tokens raised to 1400.

vLLM result:

(not captured in this test run; expected: full 1–10 output, which would confirm the vLLM gap)

Quick test matrix

Symbol key

| Symbol | Meaning |
|---|---|
| ✅ | HTTP 200, correct behavior |
| ❌ | HTTP 4xx/5xx |
| ⚠️ | HTTP 200 but incorrect behavior |

String stop tests — core before/after comparison

| # | Pipeline | Path | Stop | MLX baseline | MLX HEAD | vLLM |
|---|---|---|---|---|---|---|
| 1.1.1 | Regular Chat | non-stream | single-token "6" | ❌ 400 | matched_stop:"6" | |
| 1.1.2 | Regular Chat | non-stream | multi-token "hello world" | ❌ 400 | ❌ 400 (diff msg) | |
| 1.1.4 | Regular Chat | non-stream | "5" + ids [21] | ❌ 400 | matched_stop:"5" | |
| 1.2.1 | Regular Chat | stream | single-token "6" | ❌ 400 | ✅ SSE matched_stop:"6" | |
| 1.2.4 | Regular Chat | stream | "5" + ids [21] | ❌ 400 | matched_stop:"5" | |
| 2.1.1 | Regular Completion | non-stream | single-token "6" | ❌ 400 | matched_stop:"6" | |
| 2.1.4 | Regular Completion | non-stream | "5" + ids [21] | ❌ 400 | matched_stop:"5" | |
| 2.2.1 | Regular Completion | stream | single-token "6" | ❌ 400 | ✅ (no matched_stop) | |
| 3.1.1 | Regular Messages | non-stream | single-token "6" | ❌ 400 | stop_sequence:"6" | |
| 3.2.1 | Regular Messages | stream | single-token "6" | ❌ 400 | ✅ SSE stop_sequence:"6" | |
| 4.1.1 | Regular Generate | non-stream | single-token "6" | ❌ 400 | matched_stop:21 (int) | ✅ (string) |
| 4.2.1 | Regular Generate | stream | single-token "6" | ❌ 400 | ✅ (stop token in text, no matched_stop) | |
| 5.1.1 | Harmony Chat | non-stream | single-token "6" | ❌ 400 | matched_stop:21 (int) | ✅ (string) |
| 5.2.1 | Harmony Chat | stream | single-token "6" | ❌ 400 | ✅ SSE matched_stop:21 | |
| 6.1.1 | Harmony Responses | non-stream | single-token "6" | ❌ 400 | status:"completed" | ⚠️ dropped |
| 6.2.1 | Harmony Responses | stream | single-token "6" | ❌ 400 | ✅ SSE response.completed | ⚠️ dropped |

stop_token_ids regression — must pass at both revisions

| # | Pipeline | Path | Stop | MLX baseline | MLX HEAD |
|---|---|---|---|---|---|
| 1.1.3 | Regular Chat | non-stream | ids [20,21] | matched_stop:20 | matched_stop:20 |
| 1.2.3 | Regular Chat | stream | ids [20,21] | matched_stop:20 | matched_stop:20 |
| 2.1.3 | Regular Completion | non-stream | ids [20,21] | matched_stop:20 | matched_stop:20 |
| 4.1.3 | Regular Generate | non-stream | ids [20,21] | matched_stop:20 | matched_stop:20 |
| 5.1.3 | Harmony Chat | non-stream | ids [20,21] | matched_stop:20 | matched_stop:20 |

@zach-li-sudo
Contributor Author

Here's the e2e test PR for this feature: #1538

Labels

grpc gRPC client and router changes model-gateway Model gateway crate changes protocols Protocols crate changes tokenizer Tokenizer related changes
