Thinking
This document describes how Dartantic AI exposes LLM reasoning/thinking capabilities across different providers. "Thinking" or "extended reasoning" refers to models showing their internal reasoning process before providing a final answer.
- Overview
- Generic Architecture
- Provider Implementations
- Usage Patterns
- Testing Strategy
- Implementation Guidelines
Thinking (also called "extended reasoning" or "chain-of-thought") is a capability where LLMs expose their internal reasoning process before generating a final response. This provides:
- Transparency: See how the model approaches problems
- Debugging: Understand why models give certain answers
- Quality: Models often produce better answers when "thinking out loud"
- Education: Learn problem-solving approaches from the model
| Provider | Capability | Status | Configuration |
|---|---|---|---|
| OpenAI Responses | Reasoning Summary | ✅ Implemented | Agent-level enableThinking + optional reasoningSummary |
| xAI Responses | Reasoning Summary | ✅ Implemented | Agent-level enableThinking |
| Anthropic | Extended Thinking | ✅ Implemented | Agent-level enableThinking + optional thinkingBudgetTokens |
| Google | Extended Thinking | ✅ Implemented | enableThinking; thinkingBudgetTokens or thinkingLevel (Gemini 3+), not both |
| Others | N/A | ❌ Not supported | - |
Dartantic AI follows a consistent pattern for thinking across all providers:
graph TD
A[Model generates thinking] --> B{Streaming?}
B -->|Yes| C[Emit thinking deltas in ChatResult.metadata]
B -->|No| D[Accumulate thinking internally]
C --> E[Accumulate in buffer]
D --> E
E --> F[Include accumulated thinking in final ChatResult.metadata]
F --> G{Include in message history?}
G -->|No| H[Filter thinking from ChatMessage.parts]
G -->|Yes| I[Include as ThinkingPart]
style C fill:#bbf,stroke:#333,stroke-width:2px
style F fill:#f9f,stroke:#333,stroke-width:2px
style H fill:#9f9,stroke:#333,stroke-width:2px
- Agent-Level Configuration: Thinking is enabled via the enableThinking: true parameter at the Agent constructor level, not in provider-specific options
- Dedicated Surface: Thinking appears as ThinkingPart instances in message parts (v3.0.0+)
- Streaming Transparency: During streaming, thinking deltas are emitted as ThinkingPart messages for real-time display
- Consolidation: Multiple streaming ThinkingParts are consolidated into a SINGLE ThinkingPart in the final message (same as TextPart consolidation)
- History Isolation: Thinking is typically NOT sent back to the model in conversation history
  - Exception: Anthropic requires thinking blocks (with signatures) to be preserved when tool calls are present
  - This is handled transparently by the provider implementation via _anthropic_thinking_signature metadata
  - Users pay for thinking tokens on every turn when using tools with Anthropic
- Provider-Specific Fine-Tuning: Options classes (e.g., AnthropicChatOptions, GoogleChatModelOptions) contain only provider-specific tuning parameters like token budgets, not the enable flag
- Provider Agnostic: Same consumption pattern works across all providers
ThinkingPart consolidation follows the exact same pattern as TextPart consolidation via MessageAccumulator:
During streaming (MessageAccumulator.accumulate):
- Each streaming chunk adds its parts to the accumulated message
- ThinkingParts and TextParts are collected separately
At end of stream (MessageAccumulator.consolidate):
- All TextParts are joined into a SINGLE TextPart
- All ThinkingParts are joined into a SINGLE ThinkingPart
- Parts are ordered: TextPart first, then ThinkingPart, then other parts (ToolParts, etc.)
Invariants (enforced by tests in thinking_consolidation_test.dart):
- Final message has at most ONE TextPart
- Final message has at most ONE ThinkingPart
- TextPart comes before ThinkingPart in parts list
- Streaming-only ThinkingPart messages are filtered by AgentResponseAccumulator
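The consolidation rules above can be sketched roughly as follows. This is a minimal illustration and not the actual MessageAccumulator code: the Part, TextPart, ThinkingPart, and ToolPart classes here are simplified stand-ins, and consolidateParts is a hypothetical helper name.

```dart
// Simplified stand-ins for the real part types (assumptions for illustration).
abstract class Part {
  const Part();
}

class TextPart extends Part {
  const TextPart(this.text);
  final String text;
}

class ThinkingPart extends Part {
  const ThinkingPart(this.text);
  final String text;
}

class ToolPart extends Part {
  const ToolPart(this.name);
  final String name;
}

/// Joins all TextParts into one and all ThinkingParts into one, then orders
/// the result as: TextPart, ThinkingPart, everything else.
List<Part> consolidateParts(List<Part> parts) {
  final textBuffer = StringBuffer();
  final thinkingBuffer = StringBuffer();
  final others = <Part>[];

  for (final part in parts) {
    if (part is TextPart) {
      textBuffer.write(part.text);
    } else if (part is ThinkingPart) {
      thinkingBuffer.write(part.text);
    } else {
      others.add(part);
    }
  }

  return [
    if (textBuffer.isNotEmpty) TextPart(textBuffer.toString()),
    if (thinkingBuffer.isNotEmpty) ThinkingPart(thinkingBuffer.toString()),
    ...others,
  ];
}

void main() {
  final merged = consolidateParts(const [
    ThinkingPart('Step 1. '),
    TextPart('The answer '),
    ThinkingPart('Step 2.'),
    ToolPart('lookup'),
    TextPart('is 42.'),
  ]);
  // Three parts remain: one TextPart, one ThinkingPart, one ToolPart.
  for (final part in merged) {
    print(part.runtimeType);
  }
}
```

Using a StringBuffer per part type keeps accumulation linear in the number of deltas, which matters for long reasoning streams.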
sequenceDiagram
participant M as Model
participant S as Streaming
participant O as Orchestrator
participant A as Agent
participant U as User
M->>S: Thinking delta
S->>O: Map to ChatResult
O->>A: Yield ChatResult(thinking: delta, output: '', messages: [])
A->>U: chunk.thinking for real-time display
Note over S,O: Accumulate in buffer
M->>S: Text delta
S->>O: Map to ChatResult
O->>A: Yield ChatResult(thinking: null, output: text, messages: [])
A->>U: chunk.output for real-time display
M->>S: Stream complete
S->>O: Consolidate message
O->>A: Yield ChatResult(messages: [consolidated])
A->>U: chunk.messages with ThinkingPart for history
| Aspect | Message Content | Thinking Content |
|---|---|---|
| Streaming Access | chunk.output (String) | chunk.thinking (String?) |
| Final Location | ChatMessage.parts (TextPart) | ChatMessage.parts (ThinkingPart) |
| Sent to Model | ✅ Yes | ❌ No (by default) |
| Purpose | Conversation content | Transparency/debugging |
| Consolidation | Multiple TextParts → 1 TextPart | Multiple ThinkingParts → 1 ThinkingPart |
| History | Persists | Optional (usually filtered) |
OpenAI's Responses API supports reasoning through the reasoning parameter and exposes it via streaming events.
import 'package:dartantic_ai/dartantic_ai.dart';
// Simple configuration - enable thinking with default settings
final agent = Agent(
'openai-responses:gpt-5',
enableThinking: true, // Automatically uses reasoningSummary: detailed
);
// Advanced configuration - customize reasoning options
final agentAdvanced = Agent(
'openai-responses:gpt-5',
enableThinking: true,
chatModelOptions: const OpenAIResponsesChatModelOptions(
reasoningSummary: OpenAIReasoningSummary.brief, // Override default
reasoningEffort: OpenAIReasoningEffort.high,
),
);

enum OpenAIReasoningSummary {
/// Brief reasoning summary (fastest)
brief,
/// Detailed reasoning summary (more comprehensive)
detailed,
/// Provider decides the verbosity level
auto,
}

Streaming Events: ResponseReasoningSummaryTextDelta
// In ReasoningEventHandler
Stream<ChatResult<ChatMessage>> _handleReasoningSummaryDelta(
openai.ResponseReasoningSummaryTextDelta event,
EventMappingState state,
) async* {
// Accumulate in buffer
state.thinkingBuffer.write(event.delta);
// Emit as thinking chunk
yield ChatResult<ChatMessage>(
output: const ChatMessage(role: ChatMessageRole.model, parts: []),
messages: const [],
thinking: event.delta,
usage: null,
);
}

Final Result: Thinking accumulated as ThinkingPart instances in message parts.
Token Accounting:
- OpenAI charges for full reasoning tokens generated
- Reasoning tokens reported separately in usage
- Token budget controlled by model, not user-configurable
Signature: No cryptographic signature provided
Implementation Note: Unlike Anthropic and Google, OpenAI Responses does not store the enableThinking flag in the ChatModel. Instead, the Provider merges thinking configuration into the options object before creating the ChatModel. This allows for more granular control through the options-based reasoning configuration.
// Thinking is NOT included in message parts
final message = ChatMessage(
role: ChatMessageRole.model,
parts: [
TextPart(text: finalAnswer), // Only the answer, not the thinking
],
);
// Thinking available in result metadata
final result = ChatResult(
output: message,
messages: [message],
metadata: {
'thinking': accumulatedThinkingText, // Full reasoning
},
);

Anthropic's Messages API supports extended thinking through the thinking parameter with explicit token budget control.
import 'package:dartantic_ai/dartantic_ai.dart';
// Simple configuration - enable thinking with default budget (4096 tokens)
final agent = Agent(
'anthropic:claude-sonnet-4-5',
enableThinking: true,
);
// Advanced configuration - customize token budget
final agentCustomBudget = Agent(
'anthropic:claude-sonnet-4-5',
enableThinking: true,
chatModelOptions: const AnthropicChatOptions(
thinkingBudgetTokens: 8192, // Optional: override default budget
),
);

SDK Support: anthropic_sdk_dart v0.3.0+ includes full thinking support
Content Blocks: Anthropic includes thinking as Block.thinking() in message content:
// Anthropic's native format includes thinking in content
Block.thinking(
type: ThinkingBlockType.thinking,
thinking: "Let me think through this step by step...",
signature: "optional_cryptographic_signature",
cacheControl: null,
)

Streaming Events: BlockDelta.thinking()
// In MessageStreamEventTransformer
BlockDelta.thinking(
thinking: "Step 1: Analyze the problem...",
type: ThinkingBlockDeltaType.thinkingDelta,
)

Dartantic Mapping Strategy:
Despite Anthropic including thinking in message content, Dartantic follows the established pattern with one important exception:
- During streaming: Extract thinking deltas and emit as ThinkingPart instances
- After completion: Accumulate full thinking in result metadata
- In message history: Thinking blocks are preserved in metadata when tool calls are present
Implementation Flow:
The Anthropic message mapper handles thinking through the following high-level flow:
- Streaming Phase:
  - Accumulate thinking deltas in a buffer as ThinkingBlockDelta events arrive
  - Emit each delta as ThinkingPart for real-time display
  - Capture the cryptographic signature from the ThinkingBlock.start event
- Completion Phase:
  - Store accumulated thinking text and signature in message metadata
  - When tool calls are present, preserve the complete thinking block data
  - Filter thinking blocks from message parts for regular (non-tool) responses
- History Reconstruction:
  - When sending messages back to Anthropic, check for thinking block metadata
  - If present and tool calls exist, reconstruct the Block.thinking() with original signature
  - Place thinking block before tool_use blocks (required by Anthropic's API)
See anthropic_message_mappers.dart for the complete implementation.
Token Accounting:
- Thinking tokens count toward max_tokens limit
- User charged for full thinking tokens generated (not the summary)
- Explicit budget control via budgetTokens parameter
- Minimum budget: 1,024 tokens
Signature: Anthropic provides optional cryptographic signature for authenticity verification. Stored in message metadata via AnthropicThinkingMetadata with key _anthropic_thinking_signature (a string)
Token Budget Configuration:
- Default: 4096 tokens
- Optional override via AnthropicChatOptions.thinkingBudgetTokens
- Minimum: 1,024 tokens
- Maximum: Less than maxTokens
- Anthropic recommends 4k-10k for most tasks, scaling up for complex reasoning
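The budget rules above (4096 default, 1,024 minimum, strictly less than maxTokens) could be checked with a helper along these lines. This is a sketch under stated assumptions: resolveAnthropicBudget and the two constants are hypothetical names, not part of Dartantic or the Anthropic SDK.

```dart
// Values taken from the budget rules listed above.
const defaultThinkingBudget = 4096;
const minThinkingBudget = 1024;

/// Hypothetical helper: returns the effective thinking budget, or throws an
/// ArgumentError if the requested budget violates the documented limits.
int resolveAnthropicBudget({
  int? thinkingBudgetTokens,
  required int maxTokens,
}) {
  final budget = thinkingBudgetTokens ?? defaultThinkingBudget;
  if (budget < minThinkingBudget) {
    throw ArgumentError.value(
      budget,
      'thinkingBudgetTokens',
      'must be at least $minThinkingBudget',
    );
  }
  if (budget >= maxTokens) {
    throw ArgumentError.value(
      budget,
      'thinkingBudgetTokens',
      'must be less than maxTokens ($maxTokens)',
    );
  }
  return budget;
}

void main() {
  // No override: falls back to the default budget.
  print(resolveAnthropicBudget(maxTokens: 8192));
  // Explicit override within limits.
  print(resolveAnthropicBudget(thinkingBudgetTokens: 2048, maxTokens: 8192));
}
```

Validating before the request is built gives a clearer error than a rejected API call.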
Google's Gemini API supports extended thinking through the thinkingConfig parameter with explicit token budget control and dynamic thinking modes. Gemini 3 and later also support thinking levels (GoogleThinkingLevel: minimal, low, medium, high) via GoogleChatModelOptions.thinkingLevel. The API does not allow thinkingBudgetTokens and thinkingLevel in the same request; choose one. Set thinkingLevel when you need depth presets; keep using enableThinking: true if you also want thought summaries as ThinkingPart in addition to level control.
import 'package:dartantic_ai/dartantic_ai.dart';
// Simple configuration - enable thinking with dynamic budget (model decides)
final agent = Agent(
'google:gemini-2.5-flash',
enableThinking: true,
);
// Advanced configuration - customize token budget
final agentCustomBudget = Agent(
'google:gemini-2.5-flash',
enableThinking: true,
chatModelOptions: const GoogleChatModelOptions(
thinkingBudgetTokens: 8192, // Optional: override dynamic default
),
);
// Explicit dynamic thinking (model decides optimal budget)
final agentDynamic = Agent(
'google:gemini-2.5-flash',
enableThinking: true,
chatModelOptions: const GoogleChatModelOptions(
thinkingBudgetTokens: -1, // Explicit dynamic mode
),
);

SDK Support: googleai_dart v5.0.0+ includes full thinking support
Content Parts: Google includes thinking as regular text parts with a thought boolean flag:
// Google's native format marks thinking with flag
Part(
text: "Let me think through this step by step...",
thought: true,
)

Dartantic Mapping Strategy:
Despite Google including thinking in content parts, Dartantic follows the established pattern:
- During streaming: Extract thinking text from parts where thought=true and emit as ThinkingPart
- After completion: Accumulate full thinking in result metadata
- In message history: Thinking is NOT included in message parts (filtered during mapping)
Implementation Flow:
The Google message mapper handles thinking through the following flow:
- Response Processing:
  - Check each part's thought flag
  - If thought=true, accumulate text in thinking buffer
  - If thought=false, add text as normal TextPart
- Metadata Addition:
  - Store accumulated thinking as ThinkingPart in message parts
  - Thinking never appears in ChatMessage.parts
See google_message_mappers.dart for the complete implementation.
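As a rough illustration of the response-processing step, the sketch below splits parts by their thought flag. GooglePart is a simplified stand-in for the SDK's part type (the real field may be nullable), and splitThoughts is a hypothetical helper rather than the mapper's actual code.

```dart
// Simplified stand-in for a Gemini content part (assumption for illustration).
class GooglePart {
  const GooglePart(this.text, {this.thought = false});
  final String text;
  final bool thought;
}

/// Routes each part's text into either the thinking buffer or the regular
/// text buffer, depending on the thought flag.
({String thinking, String text}) splitThoughts(Iterable<GooglePart> parts) {
  final thinkingBuffer = StringBuffer();
  final textBuffer = StringBuffer();
  for (final part in parts) {
    (part.thought ? thinkingBuffer : textBuffer).write(part.text);
  }
  return (
    thinking: thinkingBuffer.toString(),
    text: textBuffer.toString(),
  );
}

void main() {
  final result = splitThoughts(const [
    GooglePart('Compare the two options. ', thought: true),
    GooglePart('Option B is cheaper.'),
  ]);
  print('thinking: ${result.thinking}');
  print('text: ${result.text}');
}
```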
Token Accounting:
- Thinking tokens count toward maxOutputTokens limit
- User charged for full thinking tokens generated
- Explicit budget control via thinkingBudgetTokens parameter
- Budget ranges vary by model (see configuration section)
Token Budget Options:
- Gemini 2.5 Pro: 128-32768 tokens (default: dynamic)
- Gemini 2.5 Flash: 0-24576 tokens (default: dynamic)
- Gemini 2.5 Flash-Lite: 512-24576 tokens (no default)
- Dynamic Mode (-1): Model determines optimal budget based on task complexity
Thought Signatures: Google provides optional encrypted signatures for thinking blocks to maintain context across multi-turn conversations with function calling. These are stored in message metadata via GoogleThinkingMetadata with key _google_thought_signatures (a map of tool call ID → byte array as List<int>).
Token Budget Configuration:
- Default: Dynamic (-1, model decides optimal budget)
- Optional override via GoogleChatModelOptions.thinkingBudgetTokens
- Ranges vary by model:
  - Gemini 2.5 Pro: 128-32768 tokens
  - Gemini 2.5 Flash: 0-24576 tokens
  - Gemini 2.5 Flash-Lite: 512-24576 tokens
- Set to -1 for dynamic thinking (recommended)
- Set to 0 to disable thinking
Thinking levels (Gemini 3+):
- Optional GoogleChatModelOptions.thinkingLevel when the model uses levels instead of (or in addition to) token budget
- Mutually exclusive with thinkingBudgetTokens in a single request
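The budget ranges and the budget/level exclusivity rule could be validated before a request is built. The sketch below is an assumption-heavy illustration: ThinkingLevel mirrors the GoogleThinkingLevel values named above, while budgetRanges, its model-name keys, and validateGoogleThinking are hypothetical names chosen for this example.

```dart
// Hypothetical mirror of the thinking levels described above.
enum ThinkingLevel { minimal, low, medium, high }

// Inclusive budget ranges from the list above, keyed by assumed model names.
const budgetRanges = <String, (int, int)>{
  'gemini-2.5-pro': (128, 32768),
  'gemini-2.5-flash': (0, 24576),
  'gemini-2.5-flash-lite': (512, 24576),
};

/// Returns an error message, or null when the configuration looks valid.
String? validateGoogleThinking(
  String model, {
  int? thinkingBudgetTokens,
  ThinkingLevel? thinkingLevel,
}) {
  if (thinkingBudgetTokens != null && thinkingLevel != null) {
    return 'Set thinkingBudgetTokens or thinkingLevel, not both.';
  }
  final budget = thinkingBudgetTokens;
  // No budget, or -1 (dynamic mode), needs no range check.
  if (budget == null || budget == -1) return null;

  final range = budgetRanges[model];
  // Unknown model: leave range enforcement to the API.
  if (range == null) return null;

  final (lo, hi) = range;
  if (budget < lo || budget > hi) {
    return 'thinkingBudgetTokens for $model must be -1 or within $lo-$hi.';
  }
  return null;
}

void main() {
  print(validateGoogleThinking('gemini-2.5-flash', thinkingBudgetTokens: 8192));
  print(validateGoogleThinking('gemini-2.5-pro', thinkingBudgetTokens: 0));
  print(validateGoogleThinking(
    'gemini-2.5-flash',
    thinkingBudgetTokens: 1024,
    thinkingLevel: ThinkingLevel.high,
  ));
}
```

Note that under these ranges a budget of 0 (thinking disabled) is accepted for Flash but rejected for Pro, whose minimum is 128.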
| Feature | OpenAI Responses | Anthropic | Google |
|---|---|---|---|
| Enable Method | Agent enableThinking parameter | Agent enableThinking parameter | Agent enableThinking parameter |
| ChatModel Storage | No (merged into options at Provider level) | Yes (stored in ChatModel field) | Yes (stored in ChatModel field) |
| Fine-Tuning Options | reasoningSummary, reasoningEffort | thinkingBudgetTokens | thinkingBudgetTokens or thinkingLevel (Gemini 3+), not both |
| Default Behavior | reasoningSummary: detailed when enabled | 4096 token budget | Dynamic (-1, model decides) |
| Token Budget Control | No (model-controlled) | Yes (4096 default, min 1024) | Yes (dynamic default, model-specific ranges) |
| Dynamic Budget | No | No | Yes (-1 for model-determined) |
| Token Accounting | Separate reasoning budget | Counts toward max_tokens | Counts toward maxOutputTokens |
| Streaming Event | ResponseReasoningSummaryTextDelta | BlockDelta.thinking() | Text parts with thought=true |
| Content Block | No (metadata only) | Yes (Block.thinking()) | Yes (Part with thought flag) |
| Signature | No | Yes (optional cryptographic) | Yes (optional encrypted) |
| Dartantic Representation | ThinkingPart in message parts | ThinkingPart + metadata for tool replay | ThinkingPart in message parts |
| Message History | Never included | Preserved when tool calls present | Never included |
| Tool Use Compatibility | Full support | Full support (thinking auto-preserved) | Full support |
| Temperature Constraints | None | Cannot use with modified temperature | None |
| Top-K Constraints | None | Cannot use | None |
| Top-P Constraints | None | Limited to 0.95-1.0 range | None |
In v3.0.0+, streaming thinking is available via the dedicated chunk.thinking field on ChatResult<String>. This provides symmetric access to thinking during streaming, matching how chunk.output provides streaming text:
await for (final chunk in agent.sendStream(prompt)) {
// Real-time thinking display
if (chunk.thinking != null) {
stdout.write(chunk.thinking);
}
// Real-time text display
stdout.write(chunk.output);
// Consolidated messages for history
history.addAll(chunk.messages);
}

The final consolidated message contains a single ThinkingPart with the complete thinking text for storage in conversation history.
In v3.0.0+, thinking is accessible via result.thinking for non-streaming operations:
final result = await agent.send(prompt);
// Access thinking via result.thinking
if (result.thinking != null) {
print('[[${result.thinking}]]');
}
print(result.output);

Thinking is also stored as ThinkingPart in the consolidated message for history storage.
All providers that support thinking (OpenAI Responses, Anthropic, Google) use the same API: result.thinking for both streaming and non-streaming access. Thinking content is also stored as ThinkingPart in consolidated messages for history.
Provider-specific configuration:
- OpenAI Responses: Enable with enableThinking: true, uses reasoningSummary: detailed by default
- Anthropic: Enable with enableThinking: true, uses 4096 token budget by default
- Google: Enable with enableThinking: true, uses dynamic budget (-1) by default
To compare thinking across providers, create agents with enableThinking: true and filter streamed message parts for ThinkingPart instances.
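Filtering message parts for ThinkingPart instances might look like the sketch below. The Part, TextPart, ThinkingPart, and ChatMessage classes are simplified stand-ins so the example runs without the package, and collectThinking is a hypothetical helper name.

```dart
// Simplified stand-ins for the real message types (assumptions).
abstract class Part {
  const Part();
}

class TextPart extends Part {
  const TextPart(this.text);
  final String text;
}

class ThinkingPart extends Part {
  const ThinkingPart(this.text);
  final String text;
}

class ChatMessage {
  const ChatMessage(this.parts);
  final List<Part> parts;
}

/// Collects the text of every ThinkingPart across the given messages.
String collectThinking(Iterable<ChatMessage> messages) => messages
    .expand((message) => message.parts)
    .whereType<ThinkingPart>()
    .map((part) => part.text)
    .join();

void main() {
  const messages = [
    ChatMessage([TextPart('Answer.'), ThinkingPart('First, ')]),
    ChatMessage([ThinkingPart('then conclude.')]),
  ];
  print(collectThinking(messages));
}
```

Because every supporting provider surfaces the same part type, the same helper can be run against each agent's messages to compare their reasoning side by side.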
Tests should cover the following functional areas across all thinking-enabled providers:
Unit Tests:
- Provider createChatModel() accepts enableThinking parameter
- Provider passes enableThinking flag to ChatModel constructor (Anthropic, Google)
- Provider merges thinking config into options (OpenAI Responses)
- Options classes properly handle thinkingBudgetTokens parameter
- Message mappers build provider-specific thinking config from enableThinking + options
- Streaming transformers extract thinking deltas from provider-specific events
- Thinking accumulation in buffers during streaming
- Thinking blocks filtered from ChatMessage.parts
- Final message parts contain complete accumulated thinking as ThinkingPart instances
Integration Tests:
- End-to-end streaming with thinking enabled
- Non-streaming thinking in result metadata
- Thinking NOT included in conversation history (except Anthropic with tools)
- Thinking with tool calls
- Provider thinking support (test infrastructure uses ProviderTestCaps.thinking)
- Unsupported providers throw UnsupportedError when enableThinking=true
Test Organization:
- Use capability-based test filtering to run tests only against supporting providers
- Test both simple configuration (default budgets) and advanced configuration (custom budgets/options)
- Verify provider-specific behavior (e.g., Anthropic's thinking block preservation with tools)
When implementing thinking support for a new provider:
- Declare Capability for Tests
  - Add ProviderTestCaps.thinking to the provider's entry in providerTestCaps
  - This enables capability-based test filtering in the test infrastructure
- Accept enableThinking Parameter
  - Add bool enableThinking = false parameter to Provider.createChatModel()
  - This is the standard interface all providers must implement
- Choose Implementation Pattern
  - Option A: Store in ChatModel (Anthropic/Google pattern)
    - Pass enableThinking to ChatModel constructor
    - Store as private field (e.g., _enableThinking)
    - Use when building provider-specific API requests
  - Option B: Merge into Options (OpenAI Responses pattern)
    - Transform enableThinking into provider-specific options at Provider level
    - Pass merged options to ChatModel
    - Use when provider has complex thinking configuration
- Add Provider-Specific Options
  - Add optional fine-tuning parameters to ChatOptions class
  - Examples: thinkingBudgetTokens, reasoningEffort, reasoningSummary
  - Document that these only apply when enableThinking=true
- Extract Thinking from Streaming Events
  - Map provider-specific thinking events to ThinkingPart instances in message parts
  - Emit each delta immediately during streaming
  - Accumulate deltas in a buffer for final result
- Filter Thinking from Message Parts
  - Thinking must NOT appear in ChatMessage.parts
  - Exception: Provider-specific requirements (e.g., Anthropic with tools)
  - Store any required replay data in message metadata
- Handle Tool Call Integration
  - Verify thinking works correctly with tool calls
  - Preserve thinking context if required by provider API
  - Document any special behavior
⚠️ CRITICAL: ThinkingPart MUST NEVER Be Sent to LLMs

Thinking content is model-generated output only. It flows FROM the model, NEVER TO it. Every message mapper includes an assertion that fails if ThinkingPart appears in outbound messages. This is a fundamental architectural constraint:
- ThinkingPart is received from the model's reasoning process
- It is stored for display/debugging purposes
- It is NEVER sent back in conversation history
- If ThinkingPart appears in outbound messages, it indicates a bug in the message pipeline
The only exception is Anthropic's signature-preserved thinking blocks for tool call continuity, which are handled via special metadata (_anthropic_thinking_signature), NOT via ThinkingPart.
⚠️ CRITICAL: AgentResponseAccumulator Filtering Rule

The AgentResponseAccumulator filters ONLY streaming-only ThinkingPart messages (messages where ALL parts are ThinkingPart and nothing else). These are:
- Emitted during streaming for real-time display
- Duplicated in the consolidated model message
- NOT needed for mappers (they use the consolidated message)
The consolidated message (ThinkingPart + TextPart/ToolPart, with signature metadata) MUST be preserved because provider mappers (e.g., Anthropic) need it for multi-turn tool calling. Empty model messages (no parts) must also pass through.
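The filtering rule can be expressed as a small predicate. The sketch below uses simplified stand-in types, and isStreamingOnlyThinking and filterForHistory are hypothetical names; it illustrates the rule rather than reproducing the AgentResponseAccumulator implementation.

```dart
// Simplified stand-ins for the real message types (assumptions).
abstract class Part {
  const Part();
}

class TextPart extends Part {
  const TextPart(this.text);
  final String text;
}

class ThinkingPart extends Part {
  const ThinkingPart(this.text);
  final String text;
}

class ChatMessage {
  const ChatMessage(this.parts);
  final List<Part> parts;
}

/// True only when the message has at least one part and every part is a
/// ThinkingPart. The isNotEmpty check matters: every() is true for an empty
/// list, and empty messages must pass through.
bool isStreamingOnlyThinking(ChatMessage message) =>
    message.parts.isNotEmpty &&
    message.parts.every((part) => part is ThinkingPart);

/// Drops streaming-only thinking messages and keeps everything else.
List<ChatMessage> filterForHistory(Iterable<ChatMessage> messages) =>
    messages.where((m) => !isStreamingOnlyThinking(m)).toList();

void main() {
  const messages = [
    ChatMessage([ThinkingPart('delta 1')]), // dropped
    ChatMessage([]), // kept: empty message
    ChatMessage([TextPart('Answer'), ThinkingPart('full thinking')]), // kept
  ];
  print(filterForHistory(messages).length);
}
```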
- Agent-Level Configuration: Thinking is enabled via Agent(enableThinking: true), not in provider-specific options
- Never send thinking back to model: ThinkingPart MUST NEVER appear in outbound messages (enforced by assertions in all mappers)
- Always emit thinking via ThinkingPart: Use ThinkingPart instances in message parts, not custom metadata keys
- Always accumulate thinking: Provide full thinking text in final result
- Single-item events during streaming: Each thinking delta is a separate chunk
- Provider-agnostic surface: Same ThinkingPart type in message parts across all providers
Provider implementations should validate thinking configuration:
- Unsupported provider: Throw UnsupportedError in createChatModel() when enableThinking=true for providers that don't support thinking
- Budget constraints: Validate token budgets against provider-specific minimums and maximums
- Invalid parameter combinations: Enforce provider-specific restrictions (e.g., Anthropic's temperature constraints)
- Clear error messages: Include which providers DO support thinking in error messages
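A minimal sketch of the unsupported-provider check, with an error message that names the providers that do support thinking. The thinkingProviders list and checkThinkingSupported are hypothetical; the provider names are taken from the model strings used earlier in this document and the list is illustrative, not exhaustive.

```dart
// Illustrative list only (assumption); a real implementation would derive
// this from provider capabilities.
const thinkingProviders = ['openai-responses', 'anthropic', 'google'];

/// Hypothetical guard: throws UnsupportedError when thinking is requested
/// from a provider that is not in the supported list.
void checkThinkingSupported(
  String providerName, {
  required bool enableThinking,
}) {
  if (!enableThinking || thinkingProviders.contains(providerName)) return;
  final supported = thinkingProviders.join(', ');
  throw UnsupportedError(
    'Provider "$providerName" does not support thinking. '
    'Providers that do: $supported.',
  );
}

void main() {
  checkThinkingSupported('anthropic', enableThinking: true); // passes
  try {
    checkThinkingSupported('example-provider', enableThinking: true);
  } on UnsupportedError catch (e) {
    print(e.message);
  }
}
```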
- Thinking increases latency: Models take longer to respond when thinking
- Thinking consumes tokens: Budget for additional token costs
- Streaming benefits: Users see progress as thinking streams
- Buffer management: Use StringBuffer for efficient accumulation
- Server-Side-Tools-Tech-Design - Similar metadata pattern for server-side tools
- Message-Handling-Architecture - Message part design and filtering
- Orchestration-Layer-Architecture - Streaming state management
- Provider-Implementation-Guide - General provider implementation patterns