Surface native reasoning from openai-compatible streaming responses #129

Open
simpolism wants to merge 1 commit into anima-research:main from simpolism:jake.openai-compatible-reasoning-passthrough
@simpolism
Contributor

The openai-compatible streaming parser only read delta.content and post-hoc <think> tags from the final text. OpenAI-compatible servers that run a reasoning parser (e.g. vLLM) stream chain-of-thought in a separate delta.reasoning / delta.reasoning_content field, with delta.content null until the reasoning phase ends. Those reasoning deltas were silently dropped, so extended thinking never reached the client even though the model emitted it.
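
To make the failure mode concrete, here is a minimal sketch of a parser that only reads `delta.content`, run against deltas shaped like the ones described above. The `StreamDelta` interface, the sample data, and `collectContentOnly` are illustrative assumptions, not code from this repository.

```typescript
// Hypothetical delta shape; field names follow the description above.
interface StreamDelta {
  content?: string | null;
  reasoning?: string | null;
  reasoning_content?: string | null;
}

// Sample deltas in the order a server with a reasoning parser might send them:
// content stays null until the reasoning phase ends.
const sampleDeltas: StreamDelta[] = [
  { content: null, reasoning_content: 'Let me think...' },
  { content: null, reasoning_content: ' more thinking' },
  { content: 'Hello!' },
];

// Sketch of the old behaviour: only delta.content is accumulated,
// so every reasoning delta is skipped without any error.
function collectContentOnly(deltas: StreamDelta[]): string {
  let fullContent = '';
  for (const delta of deltas) {
    if (delta.content) {
      fullContent += delta.content;
    }
  }
  return fullContent;
}

// The reasoning text never shows up in the collected output.
console.log(collectContentOnly(sampleDeltas));
```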

Fix

Accumulate native reasoning deltas into a thinking content block (kept at index 0) and stream it via onChunk, then merge with any inline <think> blocks parsed at completion. This mirrors the reasoning-passthrough already implemented in the OpenRouter service (reasoning_content preferred, reasoning string as fallback).
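
The accumulation step can be sketched roughly as follows. This is not the code in the diff: `ReasoningState`, `ChunkEmission`, `createState`, and `applyDelta` are hypothetical names, and the sketch returns the `onChunk` arguments instead of awaiting `onChunk` so that it stays synchronous and easy to check. The state field names are taken from the review summary further down.

```typescript
interface ThinkingBlock {
  type: 'thinking';
  thinking: string;
}

interface StreamDelta {
  content?: string | null;
  reasoning?: string | null;
  reasoning_content?: string | null;
}

interface ReasoningState {
  reasoningBlocks: ThinkingBlock[];
  reasoningContent: string;
  hasReasoningStarted: boolean;
  fullContent: string;
}

// Arguments that the real service would pass to onChunk(text, false, blocks).
interface ChunkEmission {
  text: string;
  blocks: ThinkingBlock[] | undefined;
}

function createState(): ReasoningState {
  return { reasoningBlocks: [], reasoningContent: '', hasReasoningStarted: false, fullContent: '' };
}

function applyDelta(state: ReasoningState, delta: StreamDelta): ChunkEmission[] {
  const emissions: ChunkEmission[] = [];

  // reasoning_content is preferred; a plain reasoning string is the fallback.
  let reasoningText = '';
  if (typeof delta.reasoning_content === 'string' && delta.reasoning_content.length > 0) {
    reasoningText = delta.reasoning_content;
  } else if (typeof delta.reasoning === 'string' && delta.reasoning.length > 0) {
    reasoningText = delta.reasoning;
  }

  if (reasoningText) {
    if (!state.hasReasoningStarted) {
      state.hasReasoningStarted = true;
      state.reasoningBlocks.unshift({ type: 'thinking', thinking: '' });
    }
    // Keep the native thinking block at index 0 and refresh it on every delta.
    state.reasoningContent += reasoningText;
    state.reasoningBlocks[0] = { type: 'thinking', thinking: state.reasoningContent };
    emissions.push({ text: '', blocks: state.reasoningBlocks });
  }

  if (delta.content) {
    state.fullContent += delta.content;
    emissions.push({
      text: delta.content,
      blocks: state.reasoningBlocks.length > 0 ? state.reasoningBlocks : undefined,
    });
  }

  return emissions;
}
```

A caller would loop over the parsed SSE deltas, call `applyDelta` for each one, and forward every returned emission to `onChunk`.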

Models that inline <think>...</think> in content are unaffected — that path still runs at [DONE] via parseThinkingTags, and the two sources are merged (native reasoning at index 0, inline blocks appended).
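
For illustration, the merge at `[DONE]` might look like the sketch below. The real `parseThinkingTags` is not shown in this PR page, so `parseThinkingTagsSketch` is a stand-in that assumes a simple non-greedy match on `<think>...</think>`; `mergeAtDone` is likewise a hypothetical helper name.

```typescript
interface ThinkingBlock {
  type: 'thinking';
  thinking: string;
}

// Stand-in for parseThinkingTags (assumption): extract each <think>...</think>
// span from the final text as its own thinking block.
function parseThinkingTagsSketch(text: string): ThinkingBlock[] {
  const blocks: ThinkingBlock[] = [];
  const pattern = /<think>([\s\S]*?)<\/think>/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    const thinking = (match[1] ?? '').trim();
    if (thinking.length > 0) {
      blocks.push({ type: 'thinking', thinking });
    }
  }
  return blocks;
}

// Native reasoning stays first (index 0); inline blocks are appended after it.
function mergeAtDone(
  reasoningBlocks: ThinkingBlock[],
  fullContent: string,
): ThinkingBlock[] | undefined {
  const inlineBlocks = parseThinkingTagsSketch(fullContent);
  const contentBlocks = [...reasoningBlocks, ...inlineBlocks];
  return contentBlocks.length > 0 ? contentBlocks : undefined;
}
```

With no native reasoning and no inline tags, `mergeAtDone` returns `undefined`, which matches the final `onChunk` call passing no content blocks.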

Files

  • backend/src/services/openai-compatible.ts

@greptile-apps

greptile-apps Bot commented Jun 3, 2026

Greptile Summary

This PR surfaces native chain-of-thought from OpenAI-compatible streaming responses by reading delta.reasoning_content / delta.reasoning fields that servers like vLLM emit during the reasoning phase — deltas that were previously silently dropped. The approach mirrors the existing OpenRouter reasoning-passthrough and merges native blocks with any inline <think> blocks at stream completion.

  • Introduces three state variables (reasoningBlocks, reasoningContent, hasReasoningStarted) to track streaming thinking content, streams it via onChunk at each delta, and merges it with post-hoc parseThinkingTags output at [DONE].
  • Content chunks during and after the reasoning phase pass the accumulated reasoningBlocks array alongside each text delta, keeping the client's in-progress thinking block up-to-date throughout streaming.

Confidence Score: 4/5

Safe to merge; the change is additive and isolated to the streaming parser, with no impact on models that don't emit native reasoning fields.

The implementation correctly follows the pattern already proven in the OpenRouter service. Minor gaps remain: the reasoning field is only handled as a string, and a model emitting reasoning in both the native field and inline tags could produce duplicate thinking blocks.

deprecated-claude-app/backend/src/services/openai-compatible.ts — specifically the reasoning field fallback parsing and the [DONE] merge logic.

Important Files Changed

| Filename | Overview |
| --- | --- |
| deprecated-claude-app/backend/src/services/openai-compatible.ts | Adds native reasoning-delta passthrough; core logic correctly mirrors OpenRouter's implementation, but reasoning array/object formats aren't handled and concurrent native+inline reasoning could produce duplicate thinking blocks. |

Sequence Diagram

sequenceDiagram
    participant vLLM as vLLM / OpenAI-Compatible Server
    participant Service as openai-compatible.ts
    participant onChunk as onChunk Handler

    vLLM->>Service: SSE delta (reasoning_content: "Let me think...")
    Service->>Service: "reasoningContent += text"
    Service->>Service: "reasoningBlocks[0] = {type:'thinking', thinking: reasoningContent}"
    Service->>onChunk: onChunk('', false, reasoningBlocks)

    vLLM->>Service: SSE delta (reasoning_content: " more thinking")
    Service->>Service: "reasoningContent += text"
    Service->>Service: reasoningBlocks[0] updated
    Service->>onChunk: onChunk('', false, reasoningBlocks)

    vLLM->>Service: SSE delta (content: "Hello!")
    Service->>Service: "fullContent += text"
    Service->>onChunk: onChunk('Hello!', false, reasoningBlocks)

    vLLM->>Service: SSE data: [DONE]
    Service->>Service: "inlineBlocks = parseThinkingTags(fullContent)"
    Service->>Service: "contentBlocks = [...reasoningBlocks, ...inlineBlocks]"
    Service->>onChunk: onChunk('', true, contentBlocks)
Prompt To Fix All With AI
Fix the following 3 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 3
deprecated-claude-app/backend/src/services/openai-compatible.ts:158-160
**Narrower `reasoning` handling than OpenRouter reference**

The OpenRouter service handles `delta.reasoning` as a string, an array of `{text: string}` objects, or a plain object with a `.text` property. This service only handles the string form. Any OpenAI-compatible server that emits `delta.reasoning` in array or object format (e.g. servers following a newer streaming spec) would silently drop that reasoning content. Aligning with OpenRouter's multi-format handling would future-proof this code.

### Issue 2 of 3
deprecated-claude-app/backend/src/services/openai-compatible.ts:164
`reasoningBlocks.unshift()` on an empty array is functionally identical to `push()` here, but `unshift` implies prepending ahead of existing elements, which is misleading when the array is empty. Using `push` communicates intent more clearly.

```suggestion
                  reasoningBlocks.push({ type: 'thinking', thinking: '' });
```

### Issue 3 of 3
deprecated-claude-app/backend/src/services/openai-compatible.ts:140-145
**Duplicate thinking blocks possible when server emits both fields**

If a server emits native reasoning via `delta.reasoning_content` AND also wraps that same content in `<think>...</think>` inside `delta.content`, `parseThinkingTags(fullContent)` at `[DONE]` would extract those inline tags into `inlineBlocks` and they would be appended to `reasoningBlocks` in the merge. The client would then see the same reasoning content twice — once from the native field and once from the parsed tags. Adding a guard to skip `parseThinkingTags` when `hasReasoningStarted` is true (or when `reasoningBlocks.length > 0`) would prevent this.

Reviews (1): Last reviewed commit: "fix: surface native reasoning from opena..."

Comment on lines +158 to +160
} else if (delta?.reasoning && typeof delta.reasoning === 'string') {
  reasoningText = delta.reasoning;
}
P2 Narrower reasoning handling than OpenRouter reference

The OpenRouter service handles delta.reasoning as a string, an array of {text: string} objects, or a plain object with a .text property. This service only handles the string form. Any OpenAI-compatible server that emits delta.reasoning in array or object format (e.g. servers following a newer streaming spec) would silently drop that reasoning content. Aligning with OpenRouter's multi-format handling would future-proof this code.
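
One way to widen the fallback is sketched below. It covers the three shapes named in this comment (string, array of `{text: string}`, object with `.text`); the OpenRouter service's actual code is not shown here, so `extractReasoningText` and `textOf` are hypothetical helpers rather than a copy of it.

```typescript
// Read a string .text property from an arbitrary value, or return ''.
function textOf(value: unknown): string {
  if (typeof value === 'object' && value !== null && 'text' in value) {
    const text = (value as { text: unknown }).text;
    return typeof text === 'string' ? text : '';
  }
  return '';
}

// Normalise delta.reasoning to a plain string, whatever shape it arrives in.
function extractReasoningText(reasoning: unknown): string {
  if (typeof reasoning === 'string') {
    return reasoning;
  }
  if (Array.isArray(reasoning)) {
    // Array form: concatenate the text of each entry in order.
    return reasoning.map((item) => textOf(item)).join('');
  }
  // Object form with a .text property; anything else yields ''.
  return textOf(reasoning);
}
```

The existing branch could then assign `reasoningText = extractReasoningText(delta.reasoning)` and continue only when the result is non-empty.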

Comment on lines +140 to 145
const inlineBlocks = this.parseThinkingTags(fullContent);
const contentBlocks = [...reasoningBlocks, ...inlineBlocks];
if (hasReasoningStarted) {
  console.log(`[OpenAI-Compatible] 🧠 Reasoning complete: ${reasoningContent.length} chars`);
}
await onChunk('', true, contentBlocks.length > 0 ? contentBlocks : undefined);
P2 Duplicate thinking blocks possible when server emits both fields

If a server emits native reasoning via delta.reasoning_content AND also wraps that same content in <think>...</think> inside delta.content, parseThinkingTags(fullContent) at [DONE] would extract those inline tags into inlineBlocks and they would be appended to reasoningBlocks in the merge. The client would then see the same reasoning content twice — once from the native field and once from the parsed tags. Adding a guard to skip parseThinkingTags when hasReasoningStarted is true (or when reasoningBlocks.length > 0) would prevent this.
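
Two possible shapes for that guard are sketched below, taking already-parsed inline blocks as input. `mergeWithGuard` follows the suggestion in this comment. `mergeDeduplicated` is an alternative that is not part of the review: it drops only the inline blocks whose text matches a native block, so distinct inline reasoning is still merged as the PR description intends. Both function names are hypothetical.

```typescript
interface ThinkingBlock {
  type: 'thinking';
  thinking: string;
}

// Option A (as suggested above): when native reasoning arrived,
// ignore inline <think> blocks entirely.
function mergeWithGuard(
  reasoningBlocks: ThinkingBlock[],
  inlineBlocks: ThinkingBlock[],
): ThinkingBlock[] {
  return reasoningBlocks.length > 0 ? [...reasoningBlocks] : [...inlineBlocks];
}

// Option B (alternative, an assumption): keep inline blocks unless their
// trimmed text duplicates a native block.
function mergeDeduplicated(
  reasoningBlocks: ThinkingBlock[],
  inlineBlocks: ThinkingBlock[],
): ThinkingBlock[] {
  const seen = new Set(reasoningBlocks.map((block) => block.thinking.trim()));
  const uniqueInline = inlineBlocks.filter((block) => !seen.has(block.thinking.trim()));
  return [...reasoningBlocks, ...uniqueInline];
}
```

Option A is simpler but discards inline reasoning that differs from the native field; Option B costs one extra pass over the blocks at `[DONE]`.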
