Environment
- omlx main @ 6ae5142
- macOS, Apple Silicon (M5 Pro, 64 GB)
- Model: mlx-community/Qwen3.6-35B-A3B-nvfp4, thinking enabled
Symptom
thinking_budget works on /v1/chat/completions (and budget_tokens on /v1/messages), but on /v1/completions it is silently ignored: send it, get no error, and the model reasons to its natural length anyway.
curl -s http://localhost:1234/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3.6-35B-A3B-nvfp4",
    "prompt": "<|im_start|>user\nExplain why the sky is blue.<|im_end|>\n<|im_start|>assistant\n<think>\n",
    "max_tokens": 2500,
    "thinking_budget": 300
  }'
Expected: thinking bounded around 300 tokens. Actual: ~1450 thinking tokens, exactly like the same request without the parameter.
Cause
CompletionRequest has no thinking_budget field, so Pydantic drops it at parsing, and neither completion path resolves a budget for the engine. Raw completions are otherwise a natural fit: a prompt ending with an open <think> already passes the scheduler's needs_think_prefix gate, so enforcement works as soon as the value reaches the engine.
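The silent drop can be reproduced outside omlx. The sketch below is not the project's code: CompletionRequestSketch and PatchedRequestSketch are hypothetical stand-ins for the real request model, and only the three fields from the curl request above are modeled. It relies on Pydantic's default behaviour of ignoring unknown fields, which is why the request parses without an error.

```python
from typing import Optional

from pydantic import BaseModel


class CompletionRequestSketch(BaseModel):
    """Hypothetical stand-in for CompletionRequest before the fix."""

    model: str
    prompt: str
    max_tokens: Optional[int] = None


class PatchedRequestSketch(CompletionRequestSketch):
    """Same model with the missing field declared (assumed shape of the fix)."""

    thinking_budget: Optional[int] = None


payload = {
    "model": "Qwen3.6-35B-A3B-nvfp4",
    "prompt": "<|im_start|>assistant\n<think>\n",
    "max_tokens": 2500,
    "thinking_budget": 300,
}

# Unknown keys are ignored by default, so parsing succeeds and the value is lost.
before = CompletionRequestSketch(**payload)
# Once the field is declared, the value survives parsing and can be resolved later.
after = PatchedRequestSketch(**payload)

print(hasattr(before, "thinking_budget"))
print(after.thinking_budget)
```

With the field undeclared there is nothing for a later step to resolve, which matches the symptom: no validation error and no budget.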
Fix
Up in #1821: add the field and thread it through both completion paths via the existing _resolve_thinking_budget helper. With it: budget 300 → 299 thinking tokens, deterministic at temperature 0 (100 → 99, 50 → 49).