thinking_budget is silently ignored on /v1/completions #1825

@efortin

Environment

  • omlx main @ 6ae5142
  • macOS, Apple Silicon (M5 Pro, 64 GB)
  • Model: mlx-community/Qwen3.6-35B-A3B-nvfp4, thinking enabled

Symptom

thinking_budget works on /v1/chat/completions (and budget_tokens on /v1/messages), but on /v1/completions it is silently ignored: send it, get no error, and the model reasons to its natural length anyway.

curl -s http://localhost:1234/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3.6-35B-A3B-nvfp4",
    "prompt": "<|im_start|>user\nExplain why the sky is blue.<|im_end|>\n<|im_start|>assistant\n<think>\n",
    "max_tokens": 2500,
    "thinking_budget": 300
  }'

Expected: thinking bounded around 300 tokens. Actual: ~1450 thinking tokens, exactly like the same request without the parameter.
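To compare runs with and without the parameter, the thinking portion of the returned text can be separated at the closing tag. The following is a minimal sketch, not part of omlx: the helper names are hypothetical, the response is assumed to follow the OpenAI-style `choices[0].text` shape, and reasoning is assumed to end at `</think>`. Counting whitespace-separated words is only a rough proxy for tokens.

```python
import json
import urllib.request

URL = "http://localhost:1234/v1/completions"  # endpoint from the report


def build_payload(thinking_budget=None):
    """Request body from the report; thinking_budget is omitted when None."""
    payload = {
        "model": "Qwen3.6-35B-A3B-nvfp4",
        "prompt": (
            "<|im_start|>user\nExplain why the sky is blue.<|im_end|>\n"
            "<|im_start|>assistant\n<think>\n"
        ),
        "max_tokens": 2500,
    }
    if thinking_budget is not None:
        payload["thinking_budget"] = thinking_budget
    return payload


def split_thinking(text, close_tag="</think>"):
    """Split completion text into (thinking, answer) at the first close tag.

    If the tag never appears, the whole text is treated as thinking.
    """
    thinking, _, answer = text.partition(close_tag)
    return thinking.strip(), answer.strip()


def complete(payload):
    """POST the payload and return choices[0].text (assumed response shape)."""
    request = urllib.request.Request(
        URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["choices"][0]["text"]


def compare_runs(budget=300):
    """Return rough thinking lengths (in words) without and with the budget."""
    lengths = []
    for payload in (build_payload(), build_payload(budget)):
        thinking, _ = split_thinking(complete(payload))
        lengths.append(len(thinking.split()))
    return tuple(lengths)
```

Calling `compare_runs(300)` against a running server returns the two rough lengths. On an affected build they should be about the same, matching the symptom above.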

Cause

CompletionRequest has no thinking_budget field, so Pydantic drops it at parsing, and neither completion path resolves a budget for the engine. Raw completions are otherwise a natural fit: a prompt ending with an open <think> already passes the scheduler's needs_think_prefix gate, so enforcement works as soon as the value reaches the engine.
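The parsing behavior can be reproduced in isolation. This sketch uses stand-in models rather than omlx's actual `CompletionRequest`: the class names and field lists are assumptions for illustration. It relies only on Pydantic's default handling of unknown fields, which ignores them without raising.

```python
from typing import Optional

from pydantic import BaseModel


class CompletionRequestBefore(BaseModel):
    """Hypothetical stand-in for the request model without the field."""

    model: str
    prompt: str
    max_tokens: int = 16


class CompletionRequestAfter(CompletionRequestBefore):
    """The same stand-in with thinking_budget declared."""

    thinking_budget: Optional[int] = None


body = {
    "model": "Qwen3.6-35B-A3B-nvfp4",
    "prompt": "<think>\n",
    "max_tokens": 2500,
    "thinking_budget": 300,
}

# Unknown keys are ignored by default: no error is raised and the value is gone.
dropped = CompletionRequestBefore(**body)
print(hasattr(dropped, "thinking_budget"))

# Once the field is declared, the value survives parsing.
kept = CompletionRequestAfter(**body)
print(kept.thinking_budget)
```

Declaring the field is only half of the change. The parsed value still has to be passed on to the engine, which is the second part of the cause described above.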

Fix

Proposed in #1821: add the field to CompletionRequest and thread it through both completion paths via the existing _resolve_thinking_budget helper. With the patch, a budget of 300 yields 299 thinking tokens, deterministic at temperature 0 (100 → 99, 50 → 49).
