Skip to content

fix(reasoning/qwen3): route bare-text 'thinking process:' preamble to reasoning_content (#570)#573

Open
raullenchai wants to merge 2 commits into
mainfrom
fix/qwen3-parser-bare-think-prefix-v2
Open

fix(reasoning/qwen3): route bare-text 'thinking process:' preamble to reasoning_content (#570)#573
raullenchai wants to merge 2 commits into
mainfrom
fix/qwen3-parser-bare-think-prefix-v2

Conversation

@raullenchai

Copy link
Copy Markdown
Owner

Summary

Closes #570.

Qwen3ReasoningParser now recognises bare-text "thinking process" preambles at the start of an assistant turn and routes them to reasoning_content instead of letting them leak into the user-facing content field.

Root cause

Qwen3 chat templates inject <think>\n after the assistant generation marker when enable_thinking=True, putting the model in implicit-think mode — the model is supposed to emit its chain-of-thought followed by </think> and then the user-facing answer.

In practice the model occasionally restates the channel boundary inline with a bare-text preamble like:

Here's a thinking process:

1.  **Analyze User Input:**
   - **Route:** Seattle to San Diego
2.  **Evaluate Each Option:**
   - Portland, OR: World-renowned food scene.

When the request also runs out of max_tokens before the model reaches </think>, neither tag appears anywhere in the model output. The previous Qwen3ReasoningParser.extract_reasoning "no end token, no start token" branch then returned (None, model_output) — routing the whole chain-of-thought into the user-facing content field while reasoning_content stayed empty. Any OpenAI-compatible client consuming /v1/chat/completions saw raw reasoning leak into the answer.

The bug reproduces both on multi-turn and single-turn prompts; the trigger is "model emits bare-text thinking preamble AND output is truncated before </think>", which becomes more likely as conversation length and max_tokens interact.

Fix

Extend Qwen3ReasoningParser to recognise bare-text "thinking process" preambles and route them to reasoning_content with cleaned_text cleared. Applies to both non-streaming extract_reasoning and the finalize_streaming correction path — matching the existing streaming heuristic in BaseThinkingReasoningParser that already routes pre-tag text to reasoning while </think> has not yet been seen.

The regex is scoped narrowly to explicit scratchpad labels — preambles that end with : and use known scratchpad nouns. Conversational openers (Let me think..., I need to analyze..., I'll analyze...) are deliberately NOT matched: they are common direct-answer phrasings when enable_thinking=False, and matching them would blank out valid responses.

Recognised preambles (matched at ^\s*, case-insensitive, all require terminating :):

  • Here's a thinking process:, Here is my reasoning:, Here's the chain-of-thought:, Here is the scratchpad:, Here's the thought process:, Here is the reasoning process:
  • Thinking step by step:, Thinking out loud:, Thinking through this:, Thinking carefully:, Thinking aloud:
  • Step by step: / Step-by-step:
  • My thought process: / My reasoning process:

Why server-side is the right layer

Two alternative paths were considered:

  1. Patch the chat template to keep <think> open across continuation turns. The chat template is owned by the upstream HF tokenizer config (Qwen3.6-A3B), so we'd have to maintain a local override; any HF refresh would silently revert the fix. Captured for future upstream contribution; not blocked on it.
  2. Inject <think> server-side at the prompt boundary. Would mask the symptom but bare-text preambles also occur in single-turn prompts (the trigger is "truncated before </think>", not "continuation turn"). Server-side parser remains the right place to be defensive.

Regression coverage

  • tests/test_reasoning_parsers.py::TestQwen3 — bare-text-fallback positive cases (12 explicit-label variants), the explicit-tag passthrough, the streaming finalizer behavior, and a dedicated test_ambiguous_phrases_not_misclassified negative test that pins the conservative scope (9 conversational openers that must stay as content).
  • tests/test_chat_route_tool_tag_leak.py::TestBareThinkingProcessLeakRegression — drives the real _finalize_content_and_reasoning orchestration used by /v1/chat/completions, including the exact text captured from a failing request against qwen3.6-35b-4bit.

Scope

  • No version bump (parser fix; release SOP 11-gate walk happens at the next bundled cut).
  • Diff limited to one parser file + two test files (+335 / -2).
  • No alias changes, no template changes, no upstream pin changes.

Test plan

  • python3.12 -m pytest tests/test_reasoning_parsers.py tests/test_chat_route_tool_tag_leak.py — 142 passed
  • Broader sweep: python3.12 -m pytest tests/ -k "reason or qwen or think or finalize or tool_tag" --ignore=tests/integrations --ignore=tests/test_streaming_simulator.py — 1003 passed
  • ruff check + ruff format --check on touched files — clean
  • Direct verification on the captured failing payload via _finalize_content_and_reasoningreasoning_content populates with the chain-of-thought, content clears

… reasoning_content (#570)

Qwen3 chat templates inject `<think>\n` after the assistant generation
marker when `enable_thinking=True`, putting the model in implicit-think
mode. The model is supposed to emit its chain-of-thought followed by
`</think>` and then the user-facing answer. In practice the model
occasionally restates the channel boundary inline as a bare-text
preamble (`Here's a thinking process:\n\n1. **Analyze...`); when the
request also runs out of `max_tokens` before the model reaches
`</think>`, neither tag appears anywhere in the output.

The previous `Qwen3ReasoningParser.extract_reasoning` "no end token,
no start token" branch then returned `(None, model_output)` — routing
the whole chain-of-thought into the user-facing `content` field while
`reasoning_content` stayed empty. Any OpenAI-compatible client
consuming `/v1/chat/completions` saw raw reasoning leak into the
answer.

Fix: recognise bare-text "thinking process" preambles at the very
start of the output and route them to `reasoning_content` with
`cleaned_text` cleared. The regex anchors to `^\s*` and is scoped
narrowly to explicit scratchpad labels — preambles that end with `:`
and use known scratchpad nouns (`thinking process`, `reasoning`,
`chain-of-thought`, `scratchpad`, `thought process`, `reasoning
process`) or the labelled `Thinking step by step:` / `Step by step:`
/ `My thought process:` forms. Conversational openers like `Let me
think...` or `I need to analyze...` are deliberately NOT matched —
they are common direct-answer phrasings and matching them would blank
out valid responses when `enable_thinking=False`.

Applies to both non-streaming `extract_reasoning` and the
`finalize_streaming` correction path, matching the existing streaming
heuristic in `BaseThinkingReasoningParser` that already routes
pre-tag text to reasoning while `</think>` has not yet been seen.

Regression coverage:

- `tests/test_reasoning_parsers.py::TestQwen3` — bare-text-fallback
  positive cases, the false-positive guard for normal answers that
  mention `think` mid-sentence, the explicit-tag passthrough, the
  streaming finalizer behavior, and an `test_ambiguous_phrases_not_
  misclassified` negative test that pins the conservative scope.
- `tests/test_chat_route_tool_tag_leak.py::TestBareThinkingProcessLeakRegression`
  — drives the real `_finalize_content_and_reasoning` orchestration
  used by `/v1/chat/completions`, including the exact text captured
  from a failing request against `qwen3.6-35b-4bit`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

github-actions Bot commented Jun 14, 2026

Copy link
Copy Markdown

PR #573 validation scorecard

Title: fix(reasoning/qwen3): route bare-text 'thinking process:' preamble to reasoning_content (#570)
Author: raullenchai
Diff: 3 file(s), +425/-2 LOC, blast radius: medium

Verdict: MERGE-SAFE

step status summary time
fetch PASS 3 files, +425/-2 LOC, blast=medium 0.9s
test_plan_check PASS all 4 test-plan item(s) checked 0.0s
cl_description_quality PASS title OK + body has rationale (4856 chars) 0.0s
codex_review skip codex CLI not found on PATH (install: npm i -g @openai/codex) 0.0s
supply_chain PASS no hooks touched, no suspicious patterns, deps clean 0.0s
lint PASS clean (3 file(s)) 0.0s
targeted_tests PASS exit 2 (in 2 target file(s)) 2.0s

…p bare Step-by-step (codex r2)

Codex round-2 review on PR #573 caught two further BLOCKING items:

* **r2 B1 — tool-call markup leaked into reasoning_content.** When
  ``_finalize_content_and_reasoning`` runs with ``tool_calls`` set, it
  calls ``extract_reasoning(raw_text)`` on the unstripped raw output.
  If that output both starts with a recognised bare-text preamble
  AND embeds ``<tool_call>...`` markup, the bare-text branch would
  return the whole raw output (markup and all) as reasoning_content
  — exposing in the reasoning channel the same tags the tool parser
  already stripped from ``content``. Add a tool-tag detector
  (``<tool_call>``, ``<function=``, ``<|tool_call|>``, ``<invoke ``,
  ``<minimax:tool_call>``) and skip the fallback whenever it fires;
  defer to the upstream tool/text pipeline for that case.

* **r2 B2 — bare ``Step by step:`` / ``Step-by-step:`` is a normal
  answer heading.** Users frequently ask "explain step by step" and
  the canonical answer heading is exactly ``Step by step:`` — matching
  it would clobber legitimate tutorials. Drop those two alternates
  from the regex. The labelled ``Thinking step by step:`` form (which
  is unambiguous scratchpad) is retained.

New negative-coverage tests:

* ``test_bare_think_prefix_with_tool_call_markup_not_routed`` — five
  tool-tag flavors (matching every variant ``test_chat_route_tool_tag_
  leak`` recognises) + the combined "preamble + tool tag" payload.
* ``test_ambiguous_phrases_not_misclassified`` extended with two
  ``Step by step:`` / ``Step-by-step:`` direct-answer cases.
* ``test_bare_thinking_preamble_with_tool_call_does_not_leak_markup``
  drives the full ``_drive_chat_route_pipeline`` (tool parser +
  reasoning parser + finalize helper) to confirm neither sink leaks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

qwen3.6-35b-a3b 4bit: 'Here's a thinking process:' leaks into content body on multi-turn

1 participant