92 changes: 92 additions & 0 deletions Documentation/OpenAI/RealtimeSchemaMatrix.md
@@ -0,0 +1,92 @@
# Realtime API Schema Matrix

This matrix maps the current OpenAI Realtime `session.update.session` and `response.create.response`
fields to AIProxySwift types and wire encoding behavior.

Reference: https://developers.openai.com/api/reference/resources/realtime

## Shared Realtime Session

These fields are used by Performance Realtime models, such as `gpt-realtime-1.5`, and are also the
base session shape composed by Realtime Reasoning models.

| Wire field | AIProxySwift API | Wire shape emitted |
| --- | --- | --- |
| `type` | `OpenAIRealtimeSessionConfiguration.type` | string |
| `include` | `OpenAIRealtimeSessionConfiguration.include` | string array |
| `model` | `OpenAIRealtimeSessionConfiguration.model` | string |
| `instructions` | `OpenAIRealtimeSessionConfiguration.instructions` | string |
| `max_output_tokens` | `OpenAIRealtimeSessionConfiguration.maxOutputTokens` | int or `"inf"` |
| `output_modalities` | `OpenAIRealtimeSessionConfiguration.outputModalities` | enum string array |
| `prompt` | `OpenAIRealtimeSessionConfiguration.prompt` | object (`id`, optional `variables`, optional `version`) |
| `tracing` | `OpenAIRealtimeSessionConfiguration.tracing` | string `"auto"` or object (`group_id`, `metadata`, `workflow_name`) |
| `truncation` | `OpenAIRealtimeSessionConfiguration.truncation` | string (`"auto"`/`"disabled"`) or retention-ratio object |
| `tools` | `OpenAIRealtimeSessionConfiguration.tools` | union array (`function`, `mcp`, `web_search`) |
| `tool_choice` | `OpenAIRealtimeSessionConfiguration.toolChoice` | string (`auto`/`none`/`required`) or typed selector object |
| `audio.input.format` | `OpenAIRealtimeSessionConfiguration.inputAudioFormat` | object (`type`, optional `rate`) |
| `audio.input.noise_reduction` | `OpenAIRealtimeSessionConfiguration.inputAudioNoiseReduction` | object (`type`) |
| `audio.input.transcription` | `OpenAIRealtimeSessionConfiguration.inputAudioTranscription` | object (`language`, `model`, `prompt`) |
| `audio.input.turn_detection` | `OpenAIRealtimeSessionConfiguration.turnDetection` | typed object union (`server_vad` / `semantic_vad`) |
| `audio.output.format` | `OpenAIRealtimeSessionConfiguration.outputAudioFormat` | object (`type`, optional `rate`) |
| `audio.output.speed` | `OpenAIRealtimeSessionConfiguration.speed` | number (range 0.25...1.5) |
| `audio.output.voice` | `OpenAIRealtimeSessionConfiguration.voice` | string or object (`id`) |
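
The `max_output_tokens` row above is a union on the wire: either a JSON number or the string `"inf"`. A minimal sketch of how such a union can be encoded follows. `MaxOutputTokensSketch`, its `infinite` case, and `SessionSketch` are hypothetical names for illustration, not the AIProxySwift types.

```swift
import Foundation

// Hypothetical stand-in for an "int or inf" union; not the library's actual type.
enum MaxOutputTokensSketch: Encodable {
    case int(Int)
    case infinite

    func encode(to encoder: Encoder) throws {
        var container = encoder.singleValueContainer()
        switch self {
        case .int(let value):
            try container.encode(value)   // emitted as a JSON number
        case .infinite:
            try container.encode("inf")   // emitted as the JSON string "inf"
        }
    }
}

// Hypothetical, trimmed-down session body carrying two of the fields from the table.
struct SessionSketch: Encodable {
    let instructions: String?
    let maxOutputTokens: MaxOutputTokensSketch?

    private enum CodingKeys: String, CodingKey {
        case instructions
        case maxOutputTokens = "max_output_tokens"
    }
}

// Encodes with sorted keys so the output is stable across runs.
func encodeJSON<T: Encodable>(_ value: T) throws -> String {
    let encoder = JSONEncoder()
    encoder.outputFormatting = [.sortedKeys]
    return String(decoding: try encoder.encode(value), as: UTF8.self)
}

let capped = try encodeJSON(
    SessionSketch(instructions: "Be brief", maxOutputTokens: .int(4096))
)
let uncapped = try encodeJSON(
    SessionSketch(instructions: nil, maxOutputTokens: .infinite)
)
print(capped)    // {"instructions":"Be brief","max_output_tokens":4096}
print(uncapped)  // {"max_output_tokens":"inf"}
```

Because the synthesized `Encodable` conformance omits `nil` optionals, unset fields never reach the wire in this sketch.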

## Realtime Reasoning Session

Realtime Reasoning models, such as `gpt-realtime-2`, compose the shared session fields above and add
Reasoning-only fields to the same `session.update.session` object.

| Wire field | AIProxySwift API | Wire shape emitted |
| --- | --- | --- |
| `reasoning` | `OpenAIRealtimeReasoningSessionConfiguration.reasoning` | object |
| `reasoning.effort` | `OpenAIRealtimeReasoningConfiguration.effort` | `minimal`, `low`, `medium`, `high`, or `xhigh` |
| `parallel_tool_calls` | `OpenAIRealtimeReasoningSessionConfiguration.parallelToolCalls` | boolean |
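
One way a wrapper type can add Reasoning-only keys to the same JSON object as the shared session fields is to encode the base session into the encoder first and then write the extra keys through a keyed container. The sketch below illustrates that flattening technique only; every type name is hypothetical, the `"realtime"` value for `type` is an assumption, and the PR's actual encoding code is not shown in this diff.

```swift
import Foundation

// Hypothetical stand-ins; the real session type carries many more fields.
struct BaseSessionSketch: Encodable {
    let type: String
    let instructions: String?
}

struct ReasoningSketch: Encodable {
    let effort: String?
}

struct ReasoningSessionSketch: Encodable {
    let session: BaseSessionSketch
    let reasoning: ReasoningSketch?
    let parallelToolCalls: Bool?

    private enum CodingKeys: String, CodingKey {
        case reasoning
        case parallelToolCalls = "parallel_tool_calls"
    }

    func encode(to encoder: Encoder) throws {
        // Write the shared fields directly into this encoder, so no
        // nested "session" key is produced.
        try session.encode(to: encoder)
        // Then add the Reasoning-only keys to the same JSON object.
        var container = encoder.container(keyedBy: CodingKeys.self)
        try container.encodeIfPresent(reasoning, forKey: .reasoning)
        try container.encodeIfPresent(parallelToolCalls, forKey: .parallelToolCalls)
    }
}

let reasoningSession = ReasoningSessionSketch(
    session: BaseSessionSketch(type: "realtime", instructions: "Be brief"), // assumed value
    reasoning: ReasoningSketch(effort: "low"),
    parallelToolCalls: true
)
let flattenedData = try JSONEncoder().encode(reasoningSession)
let flattened = try JSONSerialization.jsonObject(with: flattenedData) as? [String: Any] ?? [:]
print(flattened.keys.sorted())
```

The shared and Reasoning-only keys end up as siblings, which matches the "same `session.update.session` object" wording above.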

## Shared `response.create`

| Wire field | AIProxySwift API | Wire shape emitted |
| --- | --- | --- |
| `type` | `OpenAIRealtimeResponseCreate.type` | `"response.create"` |
| `event_id` | `OpenAIRealtimeResponseCreate.eventID` | optional string |
| `response.instructions` | `OpenAIRealtimeResponseCreate.Response.instructions` | optional string |
| `response.output_modalities` | `OpenAIRealtimeResponseCreate.Response.outputModalities` | optional enum string array |
| `response.tools` | `OpenAIRealtimeResponseCreate.Response.tools` | optional tool union array (`function`, `mcp`, `web_search`) |
| `response.tool_choice` | `OpenAIRealtimeResponseCreate.Response.toolChoice` | optional string/object union |

## Realtime Reasoning `response.create`

| Wire field | AIProxySwift API | Wire shape emitted |
| --- | --- | --- |
| `type` | `OpenAIRealtimeReasoningResponseCreate.type` | `"response.create"` |
| `event_id` | `OpenAIRealtimeReasoningResponseCreate.eventID` | optional string |
| `response.reasoning` | `OpenAIRealtimeReasoningResponseCreate.Response.reasoning` | object |
| `response.reasoning.effort` | `OpenAIRealtimeReasoningConfiguration.effort` | `minimal`, `low`, `medium`, `high`, or `xhigh` |
| `response.parallel_tool_calls` | `OpenAIRealtimeReasoningResponseCreate.Response.parallelToolCalls` | boolean |

## Realtime Reasoning Output Phases

Realtime Reasoning output can be split into commentary and final answer phases.

| Wire field | AIProxySwift API | Wire shape decoded |
| --- | --- | --- |
| `response.output[].phase` | `OpenAIRealtimeResponseOutputItem.phase` | `commentary` or `final_answer` |
| `response.output_item.*.item.phase` | `OpenAIRealtimeResponseOutputItemAddedEvent.phase` / `OpenAIRealtimeResponseOutputItemDoneEvent.phase` | `commentary` or `final_answer` |
| `conversation.item.*.item.phase` | `OpenAIRealtimeConversationItemCreatedEvent.phase` | `commentary` or `final_answer` |
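
As a rough illustration of the decoding described in the table, the sketch below decodes an optional `phase` with a raw-value enum. The type names are hypothetical and the JSON payload is hand-written for the example, not captured from the API.

```swift
import Foundation

// Mirrors the two raw values listed in the table above.
enum PhaseSketch: String, Decodable {
    case commentary
    case finalAnswer = "final_answer"
}

// Hypothetical, trimmed-down output item.
struct OutputItemSketch: Decodable {
    let id: String?
    let phase: PhaseSketch?
}

// Hand-written payload for illustration only.
let phasePayload = Data("""
[
  {"id": "item_1", "phase": "commentary"},
  {"id": "item_2", "phase": "final_answer"},
  {"id": "item_3"}
]
""".utf8)

let phasedItems = try JSONDecoder().decode([OutputItemSketch].self, from: phasePayload)

// Items without a phase decode to nil rather than failing.
let finalIDs = phasedItems.filter { $0.phase == .finalAnswer }.compactMap(\.id)
let unphasedIDs = phasedItems.filter { $0.phase == nil }.compactMap(\.id)
print(finalIDs, unphasedIDs)
```

One design consequence worth noting: with a plain raw-value enum like this, a `phase` string outside the two known values would make decoding of that item throw, so a more lenient representation may be preferable if new phases are expected.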

## `conversation.item.create`

Reference: https://platform.openai.com/docs/api-reference/realtime-client-events/conversation/item/create

| Wire field | AIProxySwift API | Wire shape emitted |
| --- | --- | --- |
| `type` | `OpenAIRealtimeConversationItemCreate.type` | `"conversation.item.create"` |
| `item.type` | `OpenAIRealtimeConversationItemCreate.Item` | `"message"`, `"function_call"`, `"function_call_output"` |
| `item.role` | `OpenAIRealtimeConversationItemCreate.Item.role` | optional string for message items |
| `item.content[].type` | `OpenAIRealtimeConversationItemCreate.Item.Content.type` | `input_text`, `output_text`, `input_audio`, `item_reference`, `input_image` |
| `item.content[].text` | `OpenAIRealtimeConversationItemCreate.Item.Content.text` | optional string |
| `item.content[].audio` | `OpenAIRealtimeConversationItemCreate.Item.Content.audio` | optional string |
| `item.content[].item_id` | `OpenAIRealtimeConversationItemCreate.Item.Content.itemID` | optional string |
| `item.call_id` | `OpenAIRealtimeConversationItemCreate.Item.callID` | optional string |
| `item.name` | `OpenAIRealtimeConversationItemCreate.Item.name` | optional string |
| `item.arguments` | `OpenAIRealtimeConversationItemCreate.Item.arguments` | optional string |
| `item.output` | `OpenAIRealtimeConversationItemCreate.Item.output` | optional string |
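
To make the wire shape concrete, here is a minimal sketch that encodes a user text message as a `conversation.item.create` event. The `ConversationItemCreateSketch` type is a hypothetical, trimmed-down stand-in covering only a few of the fields in the table.

```swift
import Foundation

// Hypothetical stand-in covering a subset of the fields listed above.
struct ConversationItemCreateSketch: Encodable {
    struct Item: Encodable {
        struct Content: Encodable {
            let type: String
            let text: String?
        }
        let type: String
        let role: String?
        let content: [Content]?
    }

    let type = "conversation.item.create"
    let item: Item
}

let userMessage = ConversationItemCreateSketch(
    item: .init(
        type: "message",
        role: "user",
        content: [.init(type: "input_text", text: "Hello")]
    )
)

let itemEncoder = JSONEncoder()
itemEncoder.outputFormatting = [.sortedKeys]  // stable key order for display
let userMessageJSON = String(decoding: try itemEncoder.encode(userMessage), as: UTF8.self)
print(userMessageJSON)
```

Fields that apply only to function calls (`call_id`, `name`, `arguments`, `output`) are left out here; per the table they are optional and would simply be absent for message items.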
74 changes: 63 additions & 11 deletions README.md
@@ -1384,13 +1384,10 @@ final class RealtimeManager {
inputAudioFormat: .pcm16,
inputAudioTranscription: .init(model: "whisper-1"),
instructions: "You are a tour guide of Yosemite national park",
- maxResponseOutputTokens: .int(4096),
- modalities: [.audio],
+ maxOutputTokens: .int(4096),
+ outputModalities: [.audio],
  outputAudioFormat: .pcm16,
- temperature: 0.7,
- turnDetection: .init(
- type: .semanticVAD(eagerness: .medium)
- ),
+ turnDetection: .semanticVAD(.init(eagerness: .medium)),
voice: "shimmer"
)

@@ -1449,14 +1446,15 @@ final class RealtimeManager {
}
```

- #### General Availability (GA) Realtime migration notes
+ #### Current Realtime API notes

- - OpenAI has announced Realtime beta (`OpenAI-Beta: realtime=v1`) deprecation and shutdown on 2026-05-07.
- - For `response.create`, GA uses `output_modalities` (not `modalities`).
- - The new `output_modalities` for OpenAI realtime GA (general availability) is as follows:
+ - For a field-by-field mapping of the Realtime wire shape to AIProxySwift types, see
+ [Realtime schema matrix](Documentation/OpenAI/RealtimeSchemaMatrix.md).
+ - For `response.create`, the current Realtime API uses `output_modalities` (not `modalities`).
+ - `output_modalities` is as follows:
  - `["audio"]` returns audio with transcript.
  - `["text"]` returns text only.
- - For voice mode with built-in web search, use GA tool (`.webSearch`) and specify `.auto` for toolChoice to let the model decide when to use it.
+ - For voice mode with built-in web search, use the `.webSearch` tool and specify `.auto` for `toolChoice` to let the model decide when to use it.

```swift
let configuration = OpenAIRealtimeSessionConfiguration(
@@ -1473,6 +1471,60 @@ let session = try await openAIService.realtimeSession(
)
```

#### Realtime Reasoning models

OpenAI's Realtime Reasoning models, such as `gpt-realtime-2`, use the same Realtime WebSocket
transport and shared session fields as Performance models like `gpt-realtime-1.5`, plus
Reasoning-only configuration for effort and parallel tool calls.

```swift
let configuration = OpenAIRealtimeReasoningSessionConfiguration(
session: OpenAIRealtimeSessionConfiguration(
outputModalities: [.audio],
voice: .builtin("alloy"),
tools: [.webSearch(.init(searchContextSize: .medium))],
toolChoice: .auto
),
reasoning: .init(effort: .low),
parallelToolCalls: true
)

let session = try await openAIService.realtimeSession(
model: "gpt-realtime-2",
configuration: configuration,
logLevel: .info
)
```

You can also override Reasoning settings for a single response:

```swift
await session.sendMessage(
OpenAIRealtimeReasoningResponseCreate(
response: .init(
base: .init(
instructions: "Use the lowest sufficient reasoning effort.",
outputModalities: [.audio]
),
reasoning: .init(effort: .minimal),
parallelToolCalls: false
)
)
)
```

Realtime Reasoning responses can include phased output. Use `phase` to separate short commentary
from the final answer when the model emits both in a turn:

```swift
for await message in session.receiver {
if case .responseDone(let event) = message {
let commentary = event.output?.filter { $0.phase == .commentary }
let finalAnswer = event.output?.filter { $0.phase == .finalAnswer }
}
}
```

### How to make a basic request using OpenAI's Responses API
Note: there is also a streaming version of this snippet below.

42 changes: 42 additions & 0 deletions Sources/AIProxy/OpenAI/OpenAIRealtimeMessage.swift
@@ -277,15 +277,46 @@ public struct OpenAIRealtimeInputAudioBufferDTMFEventReceivedEvent: Decodable, S
}
}

public enum OpenAIRealtimeResponsePhase: String, Decodable, Sendable {
case commentary
case finalAnswer = "final_answer"
}

public struct OpenAIRealtimeResponseOutputItem: Decodable, Sendable {
public let id: String?
public let phase: OpenAIRealtimeResponsePhase?
public let content: [Content]?

public var transcript: String? {
content?.first(where: { ($0.transcript?.isEmpty == false) })?.transcript
}

private enum CodingKeys: String, CodingKey {
case id
case phase
case content
}
}

extension OpenAIRealtimeResponseOutputItem {
public struct Content: Decodable, Sendable {
public let type: String?
public let text: String?
public let transcript: String?
}
}

public struct OpenAIRealtimeConversationItemCreatedEvent: Decodable, Sendable {
public let itemID: String?
public let previousItemID: String?
public let role: String?
public let phase: OpenAIRealtimeResponsePhase?
public let eventID: String?

private struct ItemBody: Decodable {
let id: String?
let role: String?
let phase: OpenAIRealtimeResponsePhase?
}

private enum CodingKeys: String, CodingKey {
@@ -302,6 +333,7 @@ public struct OpenAIRealtimeConversationItemCreatedEvent: Decodable, Sendable {
self.itemID = item?.id ?? fallbackItemID
self.previousItemID = try container.decodeIfPresent(String.self, forKey: .previousItemID)
self.role = item?.role
self.phase = item?.phase
self.eventID = try container.decodeIfPresent(String.self, forKey: .eventID)
}
}
@@ -325,10 +357,12 @@ public struct OpenAIRealtimeResponseOutputItemAddedEvent: Decodable, Sendable {
public let responseID: String?
public let itemID: String?
public let outputIndex: Int?
public let phase: OpenAIRealtimeResponsePhase?
public let eventID: String?

private struct ItemBody: Decodable {
let id: String?
let phase: OpenAIRealtimeResponsePhase?
}

private enum CodingKeys: String, CodingKey {
@@ -346,6 +380,7 @@ public struct OpenAIRealtimeResponseOutputItemAddedEvent: Decodable, Sendable {
let fallbackItemID = try container.decodeIfPresent(String.self, forKey: .itemID)
self.itemID = item?.id ?? fallbackItemID
self.outputIndex = container.decodeFlexibleIntIfPresent(forKey: .outputIndex)
self.phase = item?.phase
self.eventID = try container.decodeIfPresent(String.self, forKey: .eventID)
}
}
@@ -354,6 +389,7 @@ public struct OpenAIRealtimeResponseOutputItemDoneEvent: Decodable, Sendable {
public let responseID: String?
public let itemID: String?
public let outputIndex: Int?
public let phase: OpenAIRealtimeResponsePhase?
public let transcript: String?
public let eventID: String?

@@ -362,6 +398,7 @@ public struct OpenAIRealtimeResponseOutputItemDoneEvent: Decodable, Sendable {
let transcript: String?
}
let id: String?
let phase: OpenAIRealtimeResponsePhase?
let content: [ContentBody]?
}

@@ -380,6 +417,7 @@ public struct OpenAIRealtimeResponseOutputItemDoneEvent: Decodable, Sendable {
let fallbackItemID = try container.decodeIfPresent(String.self, forKey: .itemID)
self.itemID = item?.id ?? fallbackItemID
self.outputIndex = container.decodeFlexibleIntIfPresent(forKey: .outputIndex)
self.phase = item?.phase
self.transcript = item?.content?.first(where: { ($0.transcript?.isEmpty == false) })?.transcript
self.eventID = try container.decodeIfPresent(String.self, forKey: .eventID)
}
@@ -473,19 +511,22 @@ public struct OpenAIRealtimeResponseDoneEvent: Decodable, Sendable {
public let responseID: String?
public let conversationID: String?
public let status: String?
public let output: [OpenAIRealtimeResponseOutputItem]?
public let usage: OpenAIRealtimeResponseUsage?
public let eventID: String?

private struct ResponseBody: Decodable {
let id: String?
let conversationID: String?
let status: String?
let output: [OpenAIRealtimeResponseOutputItem]?
let usage: OpenAIRealtimeResponseUsage?

private enum CodingKeys: String, CodingKey {
case id
case conversationID = "conversation_id"
case status
case output
case usage
}
}
@@ -503,6 +544,7 @@ public struct OpenAIRealtimeResponseDoneEvent: Decodable, Sendable {
self.responseID = response?.id ?? fallbackResponseID
self.conversationID = response?.conversationID
self.status = response?.status
self.output = response?.output
self.usage = response?.usage
self.eventID = try container.decodeIfPresent(String.self, forKey: .eventID)
}
24 changes: 24 additions & 0 deletions Sources/AIProxy/OpenAI/OpenAIRealtimeReasoningConfiguration.swift
@@ -0,0 +1,24 @@
//
// OpenAIRealtimeReasoningConfiguration.swift
// AIProxy
//

/// Configuration for OpenAI Realtime Reasoning models such as `gpt-realtime-2`.
nonisolated public struct OpenAIRealtimeReasoningConfiguration: Encodable, Sendable {

Collaborator

I think the main thing I'd like to understand before merging is whether we need this separate ReasoningConfiguration and the separate initializer in OpenAIRealtimeSession. IIUC, a more surgical change would be to modify OpenAIRealtimeSessionConfiguration by adding a member: `let reasoning: OpenAIRealtimeReasoning?`.

The OpenAIRealtimeReasoning type would have a single member, effort, much like your current type OpenAIRealtimeReasoningConfiguration.

I don't see any real control flow or network sequencing differences between reasoning and non-reasoning versions right now, so I think this would be a simpler change. Let me know if I'm missing something @richarddas

Collaborator

And a nit: for any new types that you do create, can you use one file per public type and pull them into a new folder, OpenAI/Realtime? (You can see the existing example of OpenAI/Conversations.) I want to start organizing the realtime files for our eventual split of this repo into several single-purpose clients. That will make the work down the road a bit easier.

Contributor Author

My original thinking was around keeping Performance and Reasoning explicit at the callsite, but you make a solid point that the rest of the sequencing collapses the two anyway. Since models are provided as strings, the wrapper also doesn’t actually enforce that gpt-realtime-2 uses the Reasoning config. So the wrapper is probably overkill.

I’ll fold reasoning and parallelToolCalls into the existing session and response-create types, while keeping reasoning as a grouped value so the Reasoning intent is still explicit at the callsite.

/// Constrains effort on Realtime Reasoning models.
public let effort: Effort?

public init(effort: Effort? = nil) {
self.effort = effort
}
}

extension OpenAIRealtimeReasoningConfiguration {
nonisolated public enum Effort: String, Encodable, Sendable {
case minimal
case low
case medium
case high
case xhigh
}
}
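
The review thread in this file proposes folding `reasoning` and `parallelToolCalls` into the existing session type instead of using a wrapper. A minimal sketch of that shape is below, assuming optional members that are omitted from the wire when unset. `FoldedSessionSketch` and `RealtimeReasoningSketch` are hypothetical names; the final types in the PR may differ.

```swift
import Foundation

// Hypothetical grouped value, so Reasoning intent stays explicit at the callsite.
struct RealtimeReasoningSketch: Encodable {
    enum Effort: String, Encodable {
        case minimal, low, medium, high, xhigh
    }
    var effort: Effort?
}

// Hypothetical single session type serving both Performance and Reasoning models.
struct FoldedSessionSketch: Encodable {
    var instructions: String?
    var reasoning: RealtimeReasoningSketch?
    var parallelToolCalls: Bool?

    private enum CodingKeys: String, CodingKey {
        case instructions
        case reasoning
        case parallelToolCalls = "parallel_tool_calls"
    }
}

func foldedJSON(_ session: FoldedSessionSketch) throws -> String {
    let encoder = JSONEncoder()
    encoder.outputFormatting = [.sortedKeys]
    return String(decoding: try encoder.encode(session), as: UTF8.self)
}

// Performance-style session: the Reasoning-only fields are simply left unset.
let performanceJSON = try foldedJSON(FoldedSessionSketch(instructions: "Be brief"))

// Reasoning-style session: same type, with the grouped value filled in.
let reasoningJSON = try foldedJSON(
    FoldedSessionSketch(
        instructions: "Be brief",
        reasoning: .init(effort: .low),
        parallelToolCalls: true
    )
)
print(performanceJSON)
print(reasoningJSON)
```

Since the model is passed as a string either way, this shape does not enforce which models receive `reasoning`; it only keeps the two cases on one code path, which is the trade-off discussed in the thread.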