Surface native reasoning from openai-compatible streaming responses #129
simpolism wants to merge 1 commit into
The openai-compatible streaming parser only read delta.content and post-hoc <think> tags from the final text. OpenAI-compatible servers that run a reasoning parser (e.g. vLLM) stream chain-of-thought in a separate delta.reasoning / delta.reasoning_content field, with delta.content null until the reasoning phase ends. Those reasoning deltas were silently dropped, so extended thinking never reached the client even though the model emitted it. Accumulate native reasoning deltas into a thinking content block (index 0) and stream it via onChunk, then merge with any inline <think> blocks at completion. Mirrors the reasoning-passthrough already implemented in the OpenRouter service.
Greptile Summary
This PR surfaces native chain-of-thought from OpenAI-compatible streaming responses by reading
Confidence Score: 4/5
Safe to merge; the change is additive and isolated to the streaming parser, with no impact on models that don't emit native reasoning fields. The implementation correctly follows the pattern already proven in the OpenRouter service. Minor gaps remain: the deprecated-claude-app/backend/src/services/openai-compatible.ts, specifically the

| Filename | Overview |
|---|---|
| deprecated-claude-app/backend/src/services/openai-compatible.ts | Adds native reasoning-delta passthrough; core logic correctly mirrors OpenRouter's implementation, but reasoning array/object formats aren't handled and concurrent native+inline reasoning could produce duplicate thinking blocks. |
Sequence Diagram
sequenceDiagram
participant vLLM as vLLM / OpenAI-Compatible Server
participant Service as openai-compatible.ts
participant onChunk as onChunk Handler
vLLM->>Service: SSE delta (reasoning_content: "Let me think...")
Service->>Service: "reasoningContent += text"
Service->>Service: "reasoningBlocks[0] = {type:'thinking', thinking: reasoningContent}"
Service->>onChunk: onChunk('', false, reasoningBlocks)
vLLM->>Service: SSE delta (reasoning_content: " more thinking")
Service->>Service: "reasoningContent += text"
Service->>Service: reasoningBlocks[0] updated
Service->>onChunk: onChunk('', false, reasoningBlocks)
vLLM->>Service: SSE delta (content: "Hello!")
Service->>Service: "fullContent += text"
Service->>onChunk: onChunk('Hello!', false, reasoningBlocks)
vLLM->>Service: SSE data: [DONE]
Service->>Service: "inlineBlocks = parseThinkingTags(fullContent)"
Service->>Service: "contentBlocks = [...reasoningBlocks, ...inlineBlocks]"
Service->>onChunk: onChunk('', true, contentBlocks)
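The flow in the diagram can be sketched in TypeScript roughly as follows. This is a minimal illustration, not the service's actual code: the `ReasoningAccumulator` class, the `StreamDelta` and `ChunkCall` shapes, and the `accept` method are hypothetical names introduced here. Only the field names (`reasoning_content`, `reasoning`, `content`) and the variable names (`reasoningContent`, `reasoningBlocks`, `fullContent`) come from the PR.

```typescript
// Hypothetical shapes; the real service's types are not shown on this page.
interface ThinkingBlock { type: 'thinking'; thinking: string }
interface StreamDelta {
  content?: string | null;
  reasoning_content?: string | null;
  reasoning?: unknown;
}
// The arguments a caller would forward to onChunk(text, done, blocks).
interface ChunkCall { text: string; done: boolean; blocks?: ThinkingBlock[] }

class ReasoningAccumulator {
  private reasoningContent = '';
  readonly reasoningBlocks: ThinkingBlock[] = [];
  fullContent = '';

  // Processes one SSE delta and returns the onChunk calls it should trigger.
  accept(delta: StreamDelta): ChunkCall[] {
    const calls: ChunkCall[] = [];

    // reasoning_content is preferred; a string `reasoning` is the fallback.
    let reasoningText = '';
    if (typeof delta.reasoning_content === 'string') {
      reasoningText = delta.reasoning_content;
    } else if (typeof delta.reasoning === 'string') {
      reasoningText = delta.reasoning;
    }

    if (reasoningText) {
      this.reasoningContent += reasoningText;
      if (this.reasoningBlocks.length === 0) {
        this.reasoningBlocks.push({ type: 'thinking', thinking: '' });
      }
      // The single native thinking block always lives at index 0.
      const block = this.reasoningBlocks[0];
      block.thinking = this.reasoningContent;
      calls.push({ text: '', done: false, blocks: this.reasoningBlocks });
    }

    if (delta.content) {
      this.fullContent += delta.content;
      calls.push({
        text: delta.content,
        done: false,
        blocks: this.reasoningBlocks.length > 0 ? this.reasoningBlocks : undefined,
      });
    }
    return calls;
  }
}
```

Keeping the accumulator synchronous and returning the pending calls is a choice made for this sketch so it can be exercised without a live stream; per the diagram, the service itself invokes `onChunk` directly.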
Prompt To Fix All With AI
Fix the following 3 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 3
deprecated-claude-app/backend/src/services/openai-compatible.ts:158-160
**Narrower `reasoning` handling than OpenRouter reference**
The OpenRouter service handles `delta.reasoning` as a string, an array of `{text: string}` objects, or a plain object with a `.text` property. This service only handles the string form. Any OpenAI-compatible server that emits `delta.reasoning` in array or object format (e.g. servers following a newer streaming spec) would silently drop that reasoning content. Aligning with OpenRouter's multi-format handling would future-proof this code.
### Issue 2 of 3
deprecated-claude-app/backend/src/services/openai-compatible.ts:164
`reasoningBlocks.unshift()` on an empty array is functionally identical to `push()` here, but `unshift` implies prepending ahead of existing elements, which is misleading when the array is empty. Using `push` communicates intent more clearly.
```suggestion
reasoningBlocks.push({ type: 'thinking', thinking: '' });
```
### Issue 3 of 3
deprecated-claude-app/backend/src/services/openai-compatible.ts:140-145
**Duplicate thinking blocks possible when server emits both fields**
If a server emits native reasoning via `delta.reasoning_content` AND also wraps that same content in `<think>...</think>` inside `delta.content`, `parseThinkingTags(fullContent)` at `[DONE]` would extract those inline tags into `inlineBlocks` and they would be appended to `reasoningBlocks` in the merge. The client would then see the same reasoning content twice — once from the native field and once from the parsed tags. Adding a guard to skip `parseThinkingTags` when `hasReasoningStarted` is true (or when `reasoningBlocks.length > 0`) would prevent this.
Reviews (1): Last reviewed commit: "fix: surface native reasoning from opena..."
    } else if (delta?.reasoning && typeof delta.reasoning === 'string') {
      reasoningText = delta.reasoning;
    }

**Narrower `reasoning` handling than OpenRouter reference**
The OpenRouter service handles `delta.reasoning` as a string, an array of `{text: string}` objects, or a plain object with a `.text` property. This service only handles the string form. Any OpenAI-compatible server that emits `delta.reasoning` in array or object format (e.g. servers following a newer streaming spec) would silently drop that reasoning content. Aligning with OpenRouter's multi-format handling would future-proof this code.
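One way to cover the three shapes named in this comment is a small normalizing helper. The sketch below is an assumption about what such handling could look like; `extractReasoningText` and `textOf` are hypothetical names, and the OpenRouter service's actual implementation is not shown on this page.

```typescript
// Hypothetical helper: returns the `text` property of an object, or '' if absent.
function textOf(value: unknown): string {
  if (typeof value === 'object' && value !== null && 'text' in value) {
    const text = (value as { text: unknown }).text;
    return typeof text === 'string' ? text : '';
  }
  return '';
}

// Hypothetical helper: normalizes delta.reasoning to a plain string.
// Handles a string, an array of { text } objects, or a single { text } object.
function extractReasoningText(reasoning: unknown): string {
  if (typeof reasoning === 'string') {
    return reasoning;
  }
  if (Array.isArray(reasoning)) {
    return reasoning.map((part: unknown) => textOf(part)).join('');
  }
  return textOf(reasoning);
}
```

Unknown shapes fall through to an empty string, so a delta with an unexpected `reasoning` value is skipped rather than throwing mid-stream.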
    const inlineBlocks = this.parseThinkingTags(fullContent);
    const contentBlocks = [...reasoningBlocks, ...inlineBlocks];
    if (hasReasoningStarted) {
      console.log(`[OpenAI-Compatible] 🧠 Reasoning complete: ${reasoningContent.length} chars`);
    }
    await onChunk('', true, contentBlocks.length > 0 ? contentBlocks : undefined);

**Duplicate thinking blocks possible when server emits both fields**
If a server emits native reasoning via `delta.reasoning_content` AND also wraps that same content in `<think>...</think>` inside `delta.content`, `parseThinkingTags(fullContent)` at `[DONE]` would extract those inline tags into `inlineBlocks` and they would be appended to `reasoningBlocks` in the merge. The client would then see the same reasoning content twice: once from the native field and once from the parsed tags. Adding a guard to skip `parseThinkingTags` when `hasReasoningStarted` is true (or when `reasoningBlocks.length > 0`) would prevent this.
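The guard suggested here could look something like the sketch below. `mergeThinkingBlocks` is a hypothetical free function written for illustration; in the service the merge happens inline at `[DONE]`, and the parser is passed in as a parameter here only so the sketch stays self-contained. It also assumes `parseThinkingTags` returns thinking blocks only.

```typescript
interface ThinkingBlock { type: 'thinking'; thinking: string }

// Hypothetical helper: merges native and inline thinking blocks, skipping the
// inline parse whenever native reasoning was already streamed.
function mergeThinkingBlocks(
  reasoningBlocks: ThinkingBlock[],
  fullContent: string,
  parseThinkingTags: (text: string) => ThinkingBlock[],
): ThinkingBlock[] | undefined {
  const inlineBlocks =
    reasoningBlocks.length > 0 ? [] : parseThinkingTags(fullContent);
  const contentBlocks = [...reasoningBlocks, ...inlineBlocks];
  return contentBlocks.length > 0 ? contentBlocks : undefined;
}
```

This only prevents duplicate blocks. Whether `<think>` tags left in `fullContent` should also be stripped from the visible text in that case is a separate decision the sketch does not address.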
The `openai-compatible` streaming parser only read `delta.content` and post-hoc `<think>` tags from the final text. OpenAI-compatible servers that run a reasoning parser (e.g. vLLM) stream chain-of-thought in a separate `delta.reasoning` / `delta.reasoning_content` field, with `delta.content` null until the reasoning phase ends. Those reasoning deltas were silently dropped, so extended thinking never reached the client even though the model emitted it.

### Fix

Accumulate native reasoning deltas into a `thinking` content block (kept at index 0) and stream it via `onChunk`, then merge with any inline `<think>` blocks parsed at completion. This mirrors the reasoning-passthrough already implemented in the OpenRouter service (`reasoning_content` preferred, `reasoning` string as fallback).

Models that inline `<think>...</think>` in `content` are unaffected: that path still runs at `[DONE]` via `parseThinkingTags`, and the two sources are merged (native reasoning at index 0, inline blocks appended).

### Files

`backend/src/services/openai-compatible.ts`
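
For readers unfamiliar with the inline path, the sketch below shows one plausible way `<think>` tags could be parsed and merged behind a native block. The body of the real `parseThinkingTags` is not shown in this PR, so `parseThinkingTagsSketch` is a hypothetical stand-in, and the regex and trimming behaviour are assumptions.

```typescript
interface ThinkingBlock { type: 'thinking'; thinking: string }

// Hypothetical stand-in for parseThinkingTags: extracts each
// <think>...</think> span from the final text as a thinking block.
function parseThinkingTagsSketch(text: string): ThinkingBlock[] {
  const blocks: ThinkingBlock[] = [];
  const pattern = /<think>([\s\S]*?)<\/think>/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    const thinking = match[1].trim();
    if (thinking) {
      blocks.push({ type: 'thinking', thinking });
    }
  }
  return blocks;
}

// Merge order described above: native reasoning first, inline blocks appended.
const nativeBlocks: ThinkingBlock[] = [{ type: 'thinking', thinking: 'native reasoning' }];
const inlineBlocks = parseThinkingTagsSketch('<think>inline reasoning</think>Final answer.');
const contentBlocks = [...nativeBlocks, ...inlineBlocks];
```

With these inputs `contentBlocks` holds two entries, the native block at index 0 followed by the inline one.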