From a679914727d8e97a4f6ba482578b1c1d6b822c2f Mon Sep 17 00:00:00 2001 From: Guruprasad Bangalore Hiriyannaiah Date: Thu, 5 Feb 2026 11:50:40 +0530 Subject: [PATCH 1/7] Extend Activity Schema to Support Multimodal Interactions with Streaming This PR implements the approved proposal from issue #416 to extend the Activity Protocol schema for multimodal interactions with streaming support for voice/audio. Changes: - Added Reserved Events for Media Streaming (Media.Start, Media.Chunk, Media.End, Voice.Message) - Extended streaminfo entity to support media streaming with streamState property - Added Session Lifecycle Commands (session.init, session.update, session.end) for multimodal interactions - Bumped version to Provisional 3.4 Key design decisions (per AP Core Committee): - No new activity types - uses existing event, command, commandResult - No new schema fields - uses existing value, valueType, entities - 100% backward compatible - Uses streamInfo entity for stream metadata and sequencing - Uses Media.* prefix for media streaming events Related: #416 --- specs/activity/protocol-activity.md | 359 +++++++++++++++++++++++++++- 1 file changed, 356 insertions(+), 3 deletions(-) diff --git a/specs/activity/protocol-activity.md b/specs/activity/protocol-activity.md index 2ed2e701..de75bba9 100644 --- a/specs/activity/protocol-activity.md +++ b/specs/activity/protocol-activity.md @@ -1,6 +1,6 @@ # Activity Protocol -- Activity -Version: Provisional 3.3 +Version: Provisional 3.4 ## Abstract @@ -638,6 +638,185 @@ Possible values for `contentType` are audio, video, text, screen, all or any oth } ``` +### Reserved Events for Media Streaming + +Media streaming events are used to facilitate real-time multimodal interactions, particularly for voice and audio streaming. These events use the `Media.*` prefix and work in conjunction with the [`streamInfo`](#streaminfo) entity for stream metadata and sequencing. 
+ +`A5210`: Media streaming events MUST use the `Media.*` prefix for their `name` field. + +`A5211`: Media streaming events SHOULD include a [`streamInfo`](#streaminfo) entity to convey stream metadata. + +`A5212`: Media streaming events MAY use the `value` and `valueType` fields to carry modality-specific content. + +#### Media.Start + +The `Media.Start` event initiates a media streaming session. It establishes the stream context and media type that will be transmitted. + +| Field | Type | Required | Description | +|-------------|--------|----------|--------------------------------------------------| +| `type` | string | Yes | Must be `"event"` | +| `name` | string | Yes | Must be `"Media.Start"` | +| `valueType` | string | No | Identifies the schema of the `value` object, e.g., `"application/vnd.microsoft.activity.mediastart+json"` | +| `value` | object | No | Contains media type and content type information | +| `entities` | array | Yes | Must include a [`streamInfo`](#streaminfo) entity with `streamState` of `"streaming"` | + +Example: +```json +{ + "type": "event", + "name": "Media.Start", + "valueType": "application/vnd.microsoft.activity.mediastart+json", + "value": { + "mediaType": "audio", + "contentType": "audio/webm" + }, + "entities": [ + { + "type": "streamInfo", + "streamId": "abc123", + "streamType": "audio", + "streamState": "streaming", + "streamSequence": 1 + } + ] +} +``` + +`A5220`: Senders MUST include a [`streamInfo`](#streaminfo) entity in `Media.Start` events with a valid `streamId`. + +`A5221`: The `streamSequence` in `Media.Start` SHOULD be `1` as it initiates the stream. + +#### Media.Chunk + +The `Media.Chunk` event sends a chunk of media data during an active streaming session. Chunks are sequenced using the `streamSequence` field in the [`streamInfo`](#streaminfo) entity. 
+ +| Field | Type | Required | Description | +|-------------|--------|----------|--------------------------------------------------| +| `type` | string | Yes | Must be `"event"` | +| `name` | string | Yes | Must be `"Media.Chunk"` | +| `valueType` | string | No | Identifies the schema of the `value` object, e.g., `"application/vnd.microsoft.activity.audiochunk+json"` | +| `value` | object | Yes | Contains the media chunk data | +| `entities` | array | Yes | Must include a [`streamInfo`](#streaminfo) entity | + +The `value` object for audio chunks typically includes: + +| Property | Type | Required | Description | +|----------------|---------|----------|------------------------------------------------| +| `contentType` | string | Yes | MIME type of the media, e.g., `"audio/webm"` | +| `contentUrl` | string | Yes | Data URI containing Base64-encoded media data | +| `durationMs` | integer | No | Duration of the chunk in milliseconds | +| `timestamp` | string | No | ISO 8601 timestamp of the chunk | +| `transcription`| string | No | Optional real-time transcription of audio | + +Example: +```json +{ + "type": "event", + "name": "Media.Chunk", + "valueType": "application/vnd.microsoft.activity.audiochunk+json", + "value": { + "contentType": "audio/webm", + "contentUrl": "data:audio/webm;base64,...", + "durationMs": 2500, + "timestamp": "2025-10-07T10:30:05Z", + "transcription": "Your destination?" + }, + "entities": [ + { + "type": "streamInfo", + "streamId": "abc123", + "streamType": "audio", + "streamState": "streaming", + "streamSequence": 2 + } + ] +} +``` + +`A5230`: Senders MUST include a [`streamInfo`](#streaminfo) entity in `Media.Chunk` events with the same `streamId` as the corresponding `Media.Start`. + +`A5231`: The `streamSequence` MUST be incrementing for each chunk within the same stream. + +`A5232`: Receivers SHOULD use `streamSequence` to order chunks and detect missing chunks. 
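As a non-normative illustration of A5230–A5232, the receiver-side bookkeeping can be sketched as follows. The `ChunkCollector` class and its method names are invented for this sketch and are not part of the protocol; it assumes the `streamInfo` entity shape shown in the examples above.

```python
class ChunkCollector:
    """Orders Media.Chunk events by streamSequence and reports gaps (A5231/A5232 sketch)."""

    def __init__(self, stream_id):
        self.stream_id = stream_id
        self.chunks = {}  # streamSequence -> value object

    def accept(self, event):
        """Record a Media.Chunk; reject chunks for a different streamId (A5230)."""
        info = next(e for e in event["entities"] if e["type"] == "streamInfo")
        if info["streamId"] != self.stream_id:
            return False
        self.chunks[info["streamSequence"]] = event.get("value")
        return True

    def missing_sequences(self):
        """Sequence numbers absent between the lowest and highest seen so far (A5232)."""
        if not self.chunks:
            return []
        lo, hi = min(self.chunks), max(self.chunks)
        return [n for n in range(lo, hi + 1) if n not in self.chunks]

    def ordered_values(self):
        """Chunk payloads in streamSequence order, regardless of arrival order."""
        return [self.chunks[n] for n in sorted(self.chunks)]
```

A receiver that buffers with this shape can tolerate out-of-order delivery while still surfacing gaps before playback.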
+ +#### Media.End + +The `Media.End` event signals the end of a media streaming session. + +| Field | Type | Required | Description | +|-------------|--------|----------|--------------------------------------------------| +| `type` | string | Yes | Must be `"event"` | +| `name` | string | Yes | Must be `"Media.End"` | +| `valueType` | string | No | Identifies the schema, e.g., `"application/vnd.microsoft.activity.mediaend+json"` | +| `entities` | array | Yes | Must include a [`streamInfo`](#streaminfo) entity with `streamState` of `"final"` | + +Example: +```json +{ + "type": "event", + "name": "Media.End", + "valueType": "application/vnd.microsoft.activity.mediaend+json", + "entities": [ + { + "type": "streamInfo", + "streamId": "abc123", + "streamType": "audio", + "streamState": "final", + "streamSequence": 3 + } + ] +} +``` + +`A5240`: Senders MUST include a [`streamInfo`](#streaminfo) entity in `Media.End` events with `streamState` set to `"final"`. + +`A5241`: Receivers SHOULD clean up stream resources upon receiving `Media.End`. + +#### Voice.Message + +The `Voice.Message` event delivers a complete voice message, either as a final response after streaming or as a standalone voice message. 
+ +| Field | Type | Required | Description | +|-------------|--------|----------|--------------------------------------------------| +| `type` | string | Yes | Must be `"event"` | +| `name` | string | Yes | Must be `"Voice.Message"` | +| `valueType` | string | Yes | Must be `"application/vnd.microsoft.activity.voice+json"` | +| `value` | object | Yes | Contains the voice message content | + +The `value` object for voice messages includes: + +| Property | Type | Required | Description | +|----------------|---------|----------|------------------------------------------------| +| `contentType` | string | Yes | MIME type of the audio, e.g., `"audio/webm"` | +| `contentUrl` | string | Yes | Data URI or URL containing the audio data | +| `transcription`| string | No | Text transcription of the audio | +| `durationMs` | integer | No | Duration in milliseconds | +| `timestamp` | string | No | ISO 8601 timestamp | +| `locale` | string | No | Language/locale of the audio, e.g., `"en-US"` | + +Example: +```json +{ + "type": "event", + "name": "Voice.Message", + "valueType": "application/vnd.microsoft.activity.voice+json", + "value": { + "contentType": "audio/webm", + "contentUrl": "data:audio/webm;base64,...", + "transcription": "Book a flight to Paris", + "durationMs": 3400, + "timestamp": "2025-10-07T10:30:00Z", + "locale": "en-US" + } +} +``` + +`A5250`: `Voice.Message` events MUST include a `valueType` of `"application/vnd.microsoft.activity.voice+json"`. + +`A5251`: The `value` object MUST include `contentType` and `contentUrl` fields. + +`A5252`: Senders SHOULD include a `transcription` field to support accessibility and text-based processing. 
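A non-normative sketch of receiver-side validation for A5250–A5252; the `validate_voice_message` helper is illustrative only and returns human-readable problem strings rather than any protocol-defined error shape.

```python
def validate_voice_message(activity):
    """Checks a Voice.Message event against A5250-A5252; returns a list of problems."""
    problems = []
    if activity.get("valueType") != "application/vnd.microsoft.activity.voice+json":
        problems.append("A5250: valueType must be application/vnd.microsoft.activity.voice+json")
    value = activity.get("value") or {}
    for field in ("contentType", "contentUrl"):
        if field not in value:  # A5251: both fields are required
            problems.append(f"A5251: value.{field} is required")
    if "transcription" not in value:  # A5252 is SHOULD-level, so flag but do not reject
        problems.append("A5252 (SHOULD): transcription missing; accessibility is degraded")
    return problems
```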
+ ## Invoke activity @@ -1594,6 +1773,14 @@ The `error` field contains the reason the original [command activity](#command-a # Appendix I - Changes +# 2025-02-05 - guhiriya@microsoft.com +* Added Reserved Events for Media Streaming (`Media.Start`, `Media.Chunk`, `Media.End`, `Voice.Message`) +* Extended `streaminfo` entity to support media streaming with `streamState` property +* Added Session Lifecycle Commands (`session.init`, `session.update`, `session.end`) for multimodal interactions +* Added normative requirements A5210-A5252 for media streaming events +* Added normative requirements A9260-A9263 for media streaming in streaminfo +* Added normative requirements A9400-A9442 for session lifecycle commands + # 2025-09-30 - mattb-msft * Updated Channel Account definition to reflect current rules and usages. @@ -1764,16 +1951,21 @@ Note that on channels with a persistent chat feed, `platform` is typically usefu ### streaminfo -The `streaminfo` entity conveys metadata supporting chunked streaming of text messages, typically sent as a sequence of `typing` Activities, followed by a final `message` Activity containing the complete text. +The `streaminfo` entity conveys metadata supporting chunked streaming of messages. It is used for: +- **Text streaming**: Sent as a sequence of `typing` Activities, followed by a final `message` Activity containing the complete text. +- **Media streaming**: Used with [Media.* events](#reserved-events-for-media-streaming) (`Media.Start`, `Media.Chunk`, `Media.End`) for real-time voice/audio streaming. 
| Property | Type | Required | Description | |------------------|---------|----------|---------------------------------------------------------------------------------| | `type` | string | Yes | Must be `"streaminfo"` | | `streamId` | string | Yes | Unique identifier for the streaming session | | `streamSequence` | integer | Yes | Incrementing sequence number for each chunk for non-final messages | -| `streamType` | string | No | One of `"informative"`, `"streaming"`, or `"final"`. Defaults to `"streaming"`` | +| `streamType` | string | No | For text: `"informative"`, `"streaming"`, or `"final"`. For media: `"audio"` or `"video"`. Defaults to `"streaming"` | +| `streamState` | string | No | State of the stream: `"streaming"`, `"informative"`, or `"final"`. Used primarily for media streaming | | `streamResult` | string | No | Present only on final message; one of `"success"`, `"timeout"`, or `"error"` | +#### Text Streaming + `A9240`: Streaming text is sent via a sequence of `typing` Activities containing `streaminfo` entities. `A9241`: The final message is sent as a `message` Activity with `streamType` set to `"final"`. @@ -1790,6 +1982,18 @@ The `streaminfo` entity conveys metadata supporting chunked streaming of text me `A9247`: Channels that do not support streaming SHOULD buffer all chunks and deliver a single `message` when complete. +#### Media Streaming + +When used with [Media.* events](#reserved-events-for-media-streaming), the `streaminfo` entity serves as the single place for stream identification, sequencing, and state, independent of the activity type. + +`A9260`: For media streaming, the `streamType` field SHOULD be set to the media type (e.g., `"audio"`, `"video"`). + +`A9261`: For media streaming, the `streamState` field indicates stream lifecycle: `"streaming"` for active chunks, `"final"` for stream end. + +`A9262`: The `streamId` MUST be consistent across all activities in a streaming session (`Media.Start`, `Media.Chunk`, `Media.End`). 
+ +`A9263`: Receivers SHOULD use `streamSequence` to detect out-of-order or missing chunks in media streams. + --- Example: @@ -1923,6 +2127,155 @@ The authenticity of a call from an Agent can be established by inspecting its JS The Microsoft Telephony channel defines channel command activities in the namespace `channel/vnd.microsoft.telephony.`. +## Session Lifecycle Commands + +Session lifecycle commands are used to manage multimodal streaming sessions, particularly for voice interactions. These commands follow request/response semantics with acknowledgments via `commandResult` activities. + +### session.init + +The `session.init` command initializes a new streaming session. It establishes the session context and is acknowledged with a `commandResult` containing the session state. + +**Request:** +```json +{ + "type": "command", + "id": "cmd1", + "name": "session.init", + "value": { + "sessionId": "sess_123" + } +} +``` + +**Response (commandResult):** +```json +{ + "type": "commandResult", + "replyToId": "cmd1", + "value": { + "status": "success", + "sessionId": "sess_123", + "state": "listening" + } +} +``` + +`A9400`: The `session.init` command MUST include a `sessionId` in the `value` object. + +`A9401`: Receivers MUST respond with a `commandResult` activity indicating success or failure. + +`A9402`: A successful `session.init` response MAY include an initial `state` (e.g., `"listening"`), eliminating the need for a separate `session.update`. + +### session.update + +The `session.update` command updates the state of an active session. It is used to signal state transitions during multimodal interactions. 
+ +**Request:** +```json +{ + "type": "command", + "id": "cmd2", + "name": "session.update", + "value": { + "state": "speaking" + } +} +``` + +**Response (commandResult):** +```json +{ + "type": "commandResult", + "replyToId": "cmd2", + "value": { + "status": "acknowledged" + } +} +``` + +Defined session states: + +| State | Description | +|-------------|------------------------------------------------------------| +| `listening` | Bot is awaiting user input (input.expected) | +| `thinking` | Bot is processing the input | +| `speaking` | Bot is generating or delivering output (output.generating) | +| `idle` | Bot is not currently in an active state | +| `error` | An error has occurred during the interaction | + +`A9410`: The `session.update` command SHOULD include a `state` field in the `value` object. + +`A9411`: Receivers SHOULD respond with a `commandResult` activity acknowledging the state change. + +`A9412`: Session state updates are optional and threshold-based; clients may safely ignore them. + +### session.update (Barge-In) + +The `session.update` command can also signal a barge-in event, where the user or system interrupts the current output. + +```json +{ + "type": "command", + "name": "session.update", + "value": { + "signal": "bargeIn", + "origin": "user" + } +} +``` + +`A9420`: A barge-in signal SHOULD include `origin` indicating whether it was triggered by `"user"` or `"system"`. + +`A9421`: Upon receiving a barge-in, the server SHOULD return to the `"listening"` state. + +### session.end + +The `session.end` command terminates an active session. 
+ +```json +{ + "type": "command", + "name": "session.end", + "value": { + "reason": "completed" + } +} +``` + +Defined end reasons: + +| Reason | Description | +|-------------|------------------------------------------| +| `completed` | Session ended normally | +| `cancelled` | Session was cancelled | +| `error` | Session ended due to an error | +| `timeout` | Session ended due to inactivity timeout | + +`A9430`: The `session.end` command SHOULD include a `reason` field in the `value` object. + +`A9431`: Receivers SHOULD clean up session resources upon receiving `session.end`. + +### Multimodal Interaction Flow + +The typical flow for a voice streaming session: + +``` +Client -> Server: + session.init → commandResult (listening) → Media.Start → Media.Chunk x N → Media.End → bargeIn (optional) + +Server -> Client: + Optional session.update (thinking) → Optional session.update (speaking) → Voice.Message + +Barge-In: + Client sends bargeIn → Server returns to listening +``` + +`A9440`: Session lifecycle commands follow request/response semantics; receivers SHOULD send acknowledgments via `commandResult`. + +`A9441`: Session lifecycle commands are required only for real-time streaming modalities (voice, video). + +`A9442`: The `listening` state MAY be embedded in the `session.init` response, making a separate `session.update(listening)` optional. 
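The session lifecycle above can be summarized as a small state machine. The following is a non-normative sketch, assuming the command/commandResult shapes shown in the examples; the `Session` class and its `handle` method are invented names, not part of the protocol.

```python
SESSION_STATES = {"listening", "thinking", "speaking", "idle", "error"}

class Session:
    """Sketch of the session.init / session.update / session.end lifecycle (A9400-A9431)."""

    def __init__(self):
        self.session_id = None
        self.state = None
        self.ended = False

    def handle(self, command):
        name, value = command["name"], command.get("value", {})
        if name == "session.init":
            self.session_id = value["sessionId"]  # A9400: sessionId is required
            self.state = "listening"              # A9402: readiness embedded in the ack
            return {"type": "commandResult", "replyToId": command.get("id"),
                    "value": {"status": "success", "sessionId": self.session_id,
                              "state": self.state}}
        if name == "session.update":
            if value.get("signal") == "bargeIn":
                self.state = "listening"          # A9421: barge-in returns to listening
            elif value.get("state") in SESSION_STATES:
                self.state = value["state"]
            return {"type": "commandResult", "replyToId": command.get("id"),
                    "value": {"status": "acknowledged"}}  # A9411
        if name == "session.end":
            self.ended = True                     # A9431: clean up session resources
            return {"type": "commandResult", "replyToId": command.get("id"),
                    "value": {"status": "success", "reason": value.get("reason", "completed")}}
        raise ValueError(f"unknown session command: {name}")
```

Because `session.init` answers with `"state": "listening"`, the sketch never needs a separate readiness update, matching A9442.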
+ ## Patterns for rejecting commands ### General pattern for rejecting commands From bee45ec30f61dc389f2c541c047a4834669df836 Mon Sep 17 00:00:00 2001 From: Guruprasad Bangalore Hiriyannaiah Date: Thu, 5 Feb 2026 12:07:25 +0530 Subject: [PATCH 2/7] Update: Use existing streamType values for media streaming (no schema changes) Per discussion on #416, the existing streaminfo entity properties are sufficient for media streaming: - streamType uses existing values: 'streaming', 'final' (not new 'audio'/'video') - valueType on the event activity identifies the media type - No need for new streamState property This ensures zero schema changes to streaminfo entity while supporting multimodal media streaming. --- specs/activity/protocol-activity.md | 32 ++++++++++++----------------- 1 file changed, 13 insertions(+), 19 deletions(-) diff --git a/specs/activity/protocol-activity.md b/specs/activity/protocol-activity.md index de75bba9..a7308459 100644 --- a/specs/activity/protocol-activity.md +++ b/specs/activity/protocol-activity.md @@ -658,7 +658,7 @@ The `Media.Start` event initiates a media streaming session. 
It establishes the | `name` | string | Yes | Must be `"Media.Start"` | | `valueType` | string | No | Identifies the schema of the `value` object, e.g., `"application/vnd.microsoft.activity.mediastart+json"` | | `value` | object | No | Contains media type and content type information | -| `entities` | array | Yes | Must include a [`streamInfo`](#streaminfo) entity with `streamState` of `"streaming"` | +| `entities` | array | Yes | Must include a [`streamInfo`](#streaminfo) entity with `streamType` of `"streaming"` | Example: ```json @@ -674,8 +674,7 @@ Example: { "type": "streamInfo", "streamId": "abc123", - "streamType": "audio", - "streamState": "streaming", + "streamType": "streaming", "streamSequence": 1 } ] @@ -725,8 +724,7 @@ Example: { "type": "streamInfo", "streamId": "abc123", - "streamType": "audio", - "streamState": "streaming", + "streamType": "streaming", "streamSequence": 2 } ] @@ -748,7 +746,7 @@ The `Media.End` event signals the end of a media streaming session. | `type` | string | Yes | Must be `"event"` | | `name` | string | Yes | Must be `"Media.End"` | | `valueType` | string | No | Identifies the schema, e.g., `"application/vnd.microsoft.activity.mediaend+json"` | -| `entities` | array | Yes | Must include a [`streamInfo`](#streaminfo) entity with `streamState` of `"final"` | +| `entities` | array | Yes | Must include a [`streamInfo`](#streaminfo) entity with `streamType` of `"final"` | Example: ```json @@ -760,15 +758,14 @@ Example: { "type": "streamInfo", "streamId": "abc123", - "streamType": "audio", - "streamState": "final", + "streamType": "final", "streamSequence": 3 } ] } ``` -`A5240`: Senders MUST include a [`streamInfo`](#streaminfo) entity in `Media.End` events with `streamState` set to `"final"`. +`A5240`: Senders MUST include a [`streamInfo`](#streaminfo) entity in `Media.End` events with `streamType` set to `"final"`. `A5241`: Receivers SHOULD clean up stream resources upon receiving `Media.End`. 
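To make the "existing values only" design concrete: a receiver can track the full stream lifecycle using nothing beyond `streamId` and the existing `streamType` values, with the media type carried by `valueType` on the activity. This is a non-normative sketch; the `stream_lifecycle` helper is invented here.

```python
def stream_lifecycle(events):
    """Tracks per-stream state across Media.* events using only the existing
    streamType values ("streaming" / "final"). Returns final states by streamId."""
    states = {}
    for event in events:
        info = next(e for e in event["entities"] if e["type"] == "streamInfo")
        sid = info["streamId"]
        if event["name"] == "Media.Start":
            states[sid] = "streaming"
        elif event["name"] == "Media.Chunk":
            # Tolerate chunks for streams not yet seen; receivers MAY still
            # process them when the streamId is recognizable.
            states.setdefault(sid, "streaming")
        elif event["name"] == "Media.End":
            if info.get("streamType") != "final":
                raise ValueError("A5240: Media.End must carry streamType 'final'")
            states[sid] = "final"
    return states
```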
@@ -1775,10 +1772,10 @@ The `error` field contains the reason the original [command activity](#command-a # 2025-02-05 - guhiriya@microsoft.com * Added Reserved Events for Media Streaming (`Media.Start`, `Media.Chunk`, `Media.End`, `Voice.Message`) -* Extended `streaminfo` entity to support media streaming with `streamState` property +* Documented usage of existing `streaminfo` entity for media streaming (no schema changes) * Added Session Lifecycle Commands (`session.init`, `session.update`, `session.end`) for multimodal interactions * Added normative requirements A5210-A5252 for media streaming events -* Added normative requirements A9260-A9263 for media streaming in streaminfo +* Added normative requirements A9260-A9262 for media streaming in streaminfo * Added normative requirements A9400-A9442 for session lifecycle commands # 2025-09-30 - mattb-msft @@ -1960,8 +1957,7 @@ The `streaminfo` entity conveys metadata supporting chunked streaming of message | `type` | string | Yes | Must be `"streaminfo"` | | `streamId` | string | Yes | Unique identifier for the streaming session | | `streamSequence` | integer | Yes | Incrementing sequence number for each chunk for non-final messages | -| `streamType` | string | No | For text: `"informative"`, `"streaming"`, or `"final"`. For media: `"audio"` or `"video"`. Defaults to `"streaming"` | -| `streamState` | string | No | State of the stream: `"streaming"`, `"informative"`, or `"final"`. Used primarily for media streaming | +| `streamType` | string | No | One of `"informative"`, `"streaming"`, or `"final"`. 
Defaults to `"streaming"` | | `streamResult` | string | No | Present only on final message; one of `"success"`, `"timeout"`, or `"error"` | #### Text Streaming @@ -1984,15 +1980,13 @@ The `streaminfo` entity conveys metadata supporting chunked streaming of message #### Media Streaming -When used with [Media.* events](#reserved-events-for-media-streaming), the `streaminfo` entity serves as the single place for stream identification, sequencing, and state, independent of the activity type. - -`A9260`: For media streaming, the `streamType` field SHOULD be set to the media type (e.g., `"audio"`, `"video"`). +When used with [Media.* events](#reserved-events-for-media-streaming), the `streaminfo` entity serves as the single place for stream identification and sequencing, independent of the activity type. The existing `streamType` values (`"streaming"`, `"final"`) are used to indicate stream lifecycle, while the `valueType` field on the event activity identifies the media type. -`A9261`: For media streaming, the `streamState` field indicates stream lifecycle: `"streaming"` for active chunks, `"final"` for stream end. +`A9260`: For media streaming, the `streamType` field uses existing values: `"streaming"` for active chunks, `"final"` for stream end. -`A9262`: The `streamId` MUST be consistent across all activities in a streaming session (`Media.Start`, `Media.Chunk`, `Media.End`). +`A9261`: The `streamId` MUST be consistent across all activities in a streaming session (`Media.Start`, `Media.Chunk`, `Media.End`). -`A9263`: Receivers SHOULD use `streamSequence` to detect out-of-order or missing chunks in media streams. +`A9262`: Receivers SHOULD use `streamSequence` to detect out-of-order or missing chunks in media streams. 
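The sequencing requirements above can be exercised with a small non-normative reassembly sketch. `reassemble_audio` is an invented helper, and it assumes chunk payloads are `data:` URIs with Base64 content, as in the examples in this spec.

```python
import base64

def reassemble_audio(chunks):
    """Concatenates Media.Chunk payloads in streamSequence order.
    Each chunk's value carries a data: URI with Base64-encoded media."""
    ordered = sorted(chunks, key=lambda c: next(
        e["streamSequence"] for e in c["entities"] if e["type"] == "streaminfo"))
    data = b""
    for chunk in ordered:
        url = chunk["value"]["contentUrl"]
        header, _, payload = url.partition(",")
        if not (header.startswith("data:") and header.endswith(";base64")):
            raise ValueError("expected a base64 data: URI")
        data += base64.b64decode(payload)
    return data
```

Sorting by `streamSequence` (rather than arrival order) is what makes out-of-order delivery harmless to the final payload.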
--- From 8a0179f07d5522dc8438e4fb336dc9c794640b91 Mon Sep 17 00:00:00 2001 From: Guruprasad Bangalore Hiriyannaiah Date: Thu, 5 Feb 2026 12:14:09 +0530 Subject: [PATCH 3/7] Add voice/media streaming example alongside text streaming example Added separate examples in streaminfo section: - Text Streaming: Existing example using typing/message activities - Voice/Media Streaming: New example showing Media.Start, Media.Chunk, Media.End, and Voice.Message events with streaminfo entities Both examples demonstrate consistent use of streamType values (streaming, final) while different activity types and valueType distinguish the modality. --- specs/activity/protocol-activity.md | 104 +++++++++++++++++++++++++++- 1 file changed, 103 insertions(+), 1 deletion(-) diff --git a/specs/activity/protocol-activity.md b/specs/activity/protocol-activity.md index a7308459..a42a4e30 100644 --- a/specs/activity/protocol-activity.md +++ b/specs/activity/protocol-activity.md @@ -1990,7 +1990,10 @@ When used with [Media.* events](#reserved-events-for-media-streaming), the `stre --- -Example: +#### Example: Text Streaming + +Text streaming uses `typing` activities for incremental chunks, followed by a final `message` activity: + ```json // Sending an informative message chunk { @@ -2036,6 +2039,105 @@ Example: } ``` +#### Example: Voice/Media Streaming + +Voice streaming uses `event` activities with [Media.* events](#reserved-events-for-media-streaming). 
The `valueType` identifies the media type, while `streaminfo` handles sequencing: + +```json +// Media.Start - Initiate audio streaming session +{ + "type": "event", + "name": "Media.Start", + "valueType": "application/vnd.microsoft.activity.mediastart+json", + "value": { + "mediaType": "audio", + "contentType": "audio/webm" + }, + "entities": [ + { + "type": "streaminfo", + "streamId": "v-00001", + "streamType": "streaming", + "streamSequence": 1 + } + ] +} + +// Media.Chunk - Send audio chunk with optional transcription +{ + "type": "event", + "name": "Media.Chunk", + "valueType": "application/vnd.microsoft.activity.audiochunk+json", + "value": { + "contentType": "audio/webm", + "contentUrl": "data:audio/webm;base64,GkXfo59ChoEBQveBAU...", + "durationMs": 2500, + "timestamp": "2025-10-07T10:30:05Z", + "transcription": "Book a flight to" + }, + "entities": [ + { + "type": "streaminfo", + "streamId": "v-00001", + "streamType": "streaming", + "streamSequence": 2 + } + ] +} + +// Media.Chunk - Continue streaming +{ + "type": "event", + "name": "Media.Chunk", + "valueType": "application/vnd.microsoft.activity.audiochunk+json", + "value": { + "contentType": "audio/webm", + "contentUrl": "data:audio/webm;base64,R0lGODlhAQABAIAA...", + "durationMs": 1800, + "timestamp": "2025-10-07T10:30:07Z", + "transcription": "Paris please" + }, + "entities": [ + { + "type": "streaminfo", + "streamId": "v-00001", + "streamType": "streaming", + "streamSequence": 3 + } + ] +} + +// Media.End - Signal end of audio stream +{ + "type": "event", + "name": "Media.End", + "valueType": "application/vnd.microsoft.activity.mediaend+json", + "entities": [ + { + "type": "streaminfo", + "streamId": "v-00001", + "streamType": "final", + "streamSequence": 4 + } + ] +} + +// Voice.Message - Final complete voice response (Server to Client) +{ + "type": "event", + "name": "Voice.Message", + "valueType": "application/vnd.microsoft.activity.voice+json", + "value": { + "contentType": "audio/webm", + 
"contentUrl": "data:audio/webm;base64,UklGRiQAAABXQVZF...", + "transcription": "I found flights to Paris. The next available is tomorrow at 8:05am.", + "durationMs": 4200, + "timestamp": "2025-10-07T10:30:12Z", + "locale": "en-US" + } +} +``` + # Appendix III - Protocols using the Invoke activity The [invoke activity](#invoke-activity) is designed for use only within protocols supported by Activity Protocol channels (i.e., it is not a generic extensibility mechanism). This appendix contains a list of all protocols using this activity. From a14bb5a497bf9748408910a78de7d85ad323a6a5 Mon Sep 17 00:00:00 2001 From: Guruprasad Bangalore Hiriyannaiah Date: Thu, 5 Feb 2026 12:26:59 +0530 Subject: [PATCH 4/7] Add implementation notes, error handling, and clarifications Based on comprehensive review of proposal #416: 1. Added Implementation Note for Voice.Message explaining: - Why event is used instead of message (SDK validation limitation) - Protocol does allow value/valueType on message (A2005) - Reference to future APv4 vision (#377) 2. Added Error Handling section (A5260-A5262): - Handling Media.Chunk without Media.Start - Stream error signaling via streamResult - Resilience requirements for missing chunks 3. Added Note clarifying session.* commands are reserved protocol commands (not subject to application/* namespace requirement per A6301) These additions address gaps identified during comprehensive review and capture the open discussion points from the proposal. --- specs/activity/protocol-activity.md | 16 ++++++++++++++++ 1 file changed, 16 insertions(+) diff --git a/specs/activity/protocol-activity.md b/specs/activity/protocol-activity.md index a42a4e30..8cf68da8 100644 --- a/specs/activity/protocol-activity.md +++ b/specs/activity/protocol-activity.md @@ -773,6 +773,12 @@ Example: The `Voice.Message` event delivers a complete voice message, either as a final response after streaming or as a standalone voice message. 
+> **Implementation Note:** +> +> The Activity Protocol schema permits `value` and `valueType` on `message` activities (per A2005). However, current SDK implementations may not fully support this combination for validation purposes. For GA compatibility, `Voice.Message` is defined as an `event` activity. This ensures consistent behavior across all existing Bot Framework, Azure Bot Service, and Teams clients. +> +> Future versions (APv4+) may unify voice messages under the `message` activity type for consistency with text messages. See [#377](https://github.com/microsoft/Agents/issues/377) for the longer-term vision. + | Field | Type | Required | Description | |-------------|--------|----------|--------------------------------------------------| | `type` | string | Yes | Must be `"event"` | @@ -814,6 +820,14 @@ Example: `A5252`: Senders SHOULD include a `transcription` field to support accessibility and text-based processing. +#### Error Handling + +`A5260`: If a `Media.Chunk` event is received without a corresponding `Media.Start`, receivers MAY ignore it or MAY process it if the `streamId` is known from a prior session. + +`A5261`: If a stream error occurs, senders SHOULD send a `Media.End` event with `streamResult` set to `"error"` in the `streaminfo` entity. + +`A5262`: Receivers SHOULD be resilient to missing chunks and SHOULD use `streamSequence` to detect gaps. + ## Invoke activity @@ -2227,6 +2241,8 @@ The Microsoft Telephony channel defines channel command activities in the namesp Session lifecycle commands are used to manage multimodal streaming sessions, particularly for voice interactions. These commands follow request/response semantics with acknowledgments via `commandResult` activities. +> **Note:** The `session.*` command names are reserved Activity Protocol commands for multimodal session management. 
Unlike application-defined commands (which must use the `application/*` namespace per A6301), these are protocol-level commands similar to other reserved event names. + ### session.init The `session.init` command initializes a new streaming session. It establishes the session context and is acknowledged with a `commandResult` containing the session state. From 068060dafb85a2cab63a6d3931d0d9126f00eef7 Mon Sep 17 00:00:00 2001 From: Guruprasad Bangalore Hiriyannaiah Date: Thu, 5 Feb 2026 12:35:33 +0530 Subject: [PATCH 5/7] Add detailed round-trip flow example for multimodal interaction Added the detailed client-server interaction example from proposal #416: - Session handshake with embedded readiness state - Media streaming events (start, chunk, end) - Optional state updates (thinking, speaking) with threshold notes - Final Voice.Message delivery - Explanatory notes about optional steps This provides a complete reference for implementers to understand the end-to-end flow of a voice streaming session. --- specs/activity/protocol-activity.md | 39 +++++++++++++++++++++++++++++ 1 file changed, 39 insertions(+) diff --git a/specs/activity/protocol-activity.md b/specs/activity/protocol-activity.md index 8cf68da8..b0d89897 100644 --- a/specs/activity/protocol-activity.md +++ b/specs/activity/protocol-activity.md @@ -2382,6 +2382,45 @@ Barge-In: Client sends bargeIn → Server returns to listening ``` +#### Round-Trip Flow Example: Client and Server Interaction + +``` +// Client → Server: establish session (handshake) + client → command: session.init + server → commandResult: ack { "status": "success", "sessionId": "SESS-123", "state": "listening" } + // Because readiness is embedded in the response above, a separate + // session.update(state="listening") call is NOT required. + +// This step is required only if the channel or runtime explicitly requires a +// readiness signal, depending on implementation details. 
+// Server → Client (readiness) + server → command: session.update(state="listening", sessionId:"SESS-123") + client → commandResult: ack + +// Client → Server: stream media (fire-and-forget events) + client → event: Media.Start (streamId=STR-1, contentType=audio/webm) + client → event: Media.Chunk (streamId=STR-1, seq=1, ...) + ... (more Media.Chunk) + client → event: Media.End (streamId=STR-1) + +// These updates are optional and rate-limited. Clients may safely ignore them. +// They fire only when thresholds are crossed (e.g., >200ms of "thinking"), +// depending on implementation details. +// Server → Client: (optional, per threshold/config) processing + speak phases + server → command: session.update(state=thinking, sessionId=SESS-123) + client → commandResult: ack + server → command: session.update(state=speaking, sessionId=SESS-123) + client → commandResult: ack + +// Server → Client: final user-visible content + server → event: Voice.Message valueType=application/vnd.microsoft.activity.voice+json + value={ "contentType":"audio/webm", "contentUrl":"...", ... } +``` + +> **Note:** +> - `listening` is NOT needed as a separate step if included in the `session.init` commandResult. +> - `thinking` and `speaking` session.update messages are optional and threshold-based. + `A9440`: Session lifecycle commands follow request/response semantics; receivers SHOULD send acknowledgments via `commandResult`. `A9441`: Session lifecycle commands are required only for real-time streaming modalities (voice, video). 
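+
+As a non-normative sketch of the request/response pattern in `A9440`, a `session.update` command and its acknowledgment might look like the following. The `value` payloads shown are illustrative assumptions, not schema requirements:
+
+**session.update** - Server notifies a state change:
+```json
+{
+  "type": "command",
+  "name": "session.update",
+  "value": {
+    "sessionId": "SESS-123",
+    "state": "thinking"
+  }
+}
+```
+
+**commandResult** - Client acknowledges:
+```json
+{
+  "type": "commandResult",
+  "name": "session.update",
+  "value": {
+    "status": "acknowledged",
+    "sessionId": "SESS-123"
+  }
+}
+```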
From 558014b4958676f9aa07cd5d3fd85c2593157609 Mon Sep 17 00:00:00 2001 From: Guruprasad Bangalore Hiriyannaiah Date: Thu, 5 Feb 2026 12:41:14 +0530 Subject: [PATCH 6/7] Improve formatting of examples for better markdown rendering - Split large JSON code blocks with comments into separate blocks - Added descriptive headers before each example - Used proper code block language specifiers (json, text) - Reorganized multimodal interaction flow into numbered steps - Added blockquotes for explanatory notes - Removed invalid JSON comments (JSON doesn't support //) This improves readability when the spec is rendered in GitHub, documentation sites, and other markdown viewers. --- specs/activity/protocol-activity.md | 110 +++++++++++++++++----------- 1 file changed, 68 insertions(+), 42 deletions(-) diff --git a/specs/activity/protocol-activity.md b/specs/activity/protocol-activity.md index b0d89897..3d87fefd 100644 --- a/specs/activity/protocol-activity.md +++ b/specs/activity/protocol-activity.md @@ -2008,8 +2008,8 @@ When used with [Media.* events](#reserved-events-for-media-streaming), the `stre Text streaming uses `typing` activities for incremental chunks, followed by a final `message` activity: +**Informative message** - Show processing status: ```json -// Sending an informative message chunk { "type": "typing", "text": "Getting the answer...", @@ -2023,8 +2023,10 @@ Text streaming uses `typing` activities for incremental chunks, followed by a fi } ] } +``` -// Sending a streaming text chunk +**Streaming text chunk** - Incremental content: +```json { "type": "typing", "text": "A quick brown fox jumped over the", @@ -2037,8 +2039,10 @@ Text streaming uses `typing` activities for incremental chunks, followed by a fi } ] } +``` -// Sending the final complete message +**Final complete message** - Full response: +```json { "type": "message", "text": "A quick brown fox jumped over the lazy dog.", @@ -2057,8 +2061,8 @@ Text streaming uses `typing` activities for incremental 
chunks, followed by a fi Voice streaming uses `event` activities with [Media.* events](#reserved-events-for-media-streaming). The `valueType` identifies the media type, while `streaminfo` handles sequencing: +**Media.Start** - Initiate audio streaming session: ```json -// Media.Start - Initiate audio streaming session { "type": "event", "name": "Media.Start", @@ -2076,8 +2080,10 @@ Voice streaming uses `event` activities with [Media.* events](#reserved-events-f } ] } +``` -// Media.Chunk - Send audio chunk with optional transcription +**Media.Chunk** - Send audio chunk with optional transcription: +```json { "type": "event", "name": "Media.Chunk", @@ -2098,8 +2104,10 @@ Voice streaming uses `event` activities with [Media.* events](#reserved-events-f } ] } +``` -// Media.Chunk - Continue streaming +**Media.Chunk** - Continue streaming (additional chunks): +```json { "type": "event", "name": "Media.Chunk", @@ -2120,8 +2128,10 @@ Voice streaming uses `event` activities with [Media.* events](#reserved-events-f } ] } +``` -// Media.End - Signal end of audio stream +**Media.End** - Signal end of audio stream: +```json { "type": "event", "name": "Media.End", @@ -2135,8 +2145,10 @@ Voice streaming uses `event` activities with [Media.* events](#reserved-events-f } ] } +``` -// Voice.Message - Final complete voice response (Server to Client) +**Voice.Message** - Final complete voice response (Server to Client): +```json { "type": "event", "name": "Voice.Message", @@ -2371,11 +2383,11 @@ Defined end reasons: The typical flow for a voice streaming session: -``` -Client -> Server: +```text +Client → Server: session.init → commandResult (listening) → Media.Start → Media.Chunk x N → Media.End → bargeIn (optional) -Server -> Client: +Server → Client: Optional session.update (thinking) → Optional session.update (speaking) → Voice.Message Barge-In: @@ -2384,40 +2396,54 @@ Barge-In: #### Round-Trip Flow Example: Client and Server Interaction +The following example illustrates a 
complete voice streaming interaction: + +**Step 1: Session Handshake** +```text +client → command: session.init +server → commandResult: { "status": "success", "sessionId": "SESS-123", "state": "listening" } ``` -// Client → Server: establish session (handshake) - client → command: session.init - server → commandResult: ack { "status": "success", "sessionId": "SESS-123", "state": "listening" } - // Because readiness is embedded in the response above, a separate - // session.update(state="listening") call is NOT required. - -// This step is required only if the channel or runtime explicitly requires a -// readiness signal, depending on implementation details. -// Server → Client (readiness) - server → command: session.update(state="listening", sessionId:"SESS-123") - client → commandResult: ack - -// Client → Server: stream media (fire-and-forget events) - client → event: Media.Start (streamId=STR-1, contentType=audio/webm) - client → event: Media.Chunk (streamId=STR-1, seq=1, ...) - ... (more Media.Chunk) - client → event: Media.End (streamId=STR-1) - -// These updates are optional and rate-limited. Clients may safely ignore them. -// They fire only when thresholds are crossed (e.g., >200ms of "thinking"), -// depending on implementation details. -// Server → Client: (optional, per threshold/config) processing + speak phases - server → command: session.update(state=thinking, sessionId=SESS-123) - client → commandResult: ack - server → command: session.update(state=speaking, sessionId=SESS-123) - client → commandResult: ack - -// Server → Client: final user-visible content - server → event: Voice.Message valueType=application/vnd.microsoft.activity.voice+json - value={ "contentType":"audio/webm", "contentUrl":"...", ... } +> Because readiness (`listening`) is embedded in the response above, a separate `session.update(state="listening")` call is NOT required. 
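+
+For concreteness, the Step 1 handshake can be expressed as full activities. This is a sketch: the `session.init` command is shown with no `value` payload, and the `commandResult` `value` mirrors the fields above; neither payload shape is normative:
+
+```json
+{
+  "type": "command",
+  "name": "session.init"
+}
+```
+
+```json
+{
+  "type": "commandResult",
+  "name": "session.init",
+  "value": {
+    "status": "success",
+    "sessionId": "SESS-123",
+    "state": "listening"
+  }
+}
+```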
+ +**Step 2: Readiness Signal (Optional)** + +This step is required only if the channel or runtime explicitly requires a readiness signal: +```text +server → command: session.update { "state": "listening", "sessionId": "SESS-123" } +client → commandResult: { "status": "acknowledged" } ``` -> **Note:** +**Step 3: Stream Media (Fire-and-Forget Events)** +```text +client → event: Media.Start { streamId: "STR-1", contentType: "audio/webm" } +client → event: Media.Chunk { streamId: "STR-1", seq: 1, ... } +client → event: Media.Chunk { streamId: "STR-1", seq: 2, ... } + ... (more Media.Chunk events) +client → event: Media.End { streamId: "STR-1" } +``` + +**Step 4: Processing State Updates (Optional)** + +These updates are optional and rate-limited. Clients may safely ignore them. They fire only when thresholds are crossed (e.g., >200ms of "thinking"): +```text +server → command: session.update { "state": "thinking", "sessionId": "SESS-123" } +client → commandResult: { "status": "acknowledged" } + +server → command: session.update { "state": "speaking", "sessionId": "SESS-123" } +client → commandResult: { "status": "acknowledged" } +``` + +**Step 5: Final Voice Response** +```text +server → event: Voice.Message + valueType: "application/vnd.microsoft.activity.voice+json" + value: { "contentType": "audio/webm", "contentUrl": "...", "transcription": "..." } +``` + +> **Notes:** +> - `listening` is NOT needed as a separate step if included in the `session.init` commandResult. +> - `thinking` and `speaking` session.update messages are optional and threshold-based. +> - Media streaming events are fire-and-forget (no acknowledgment required). > - `listening` is NOT needed as a separate step if included in the `session.init` commandResult. > - `thinking` and `speaking` session.update messages are optional and threshold-based. 
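+
+Error termination (per `A5261`) can be sketched as a `Media.End` event whose `streamInfo` entity carries `streamResult` set to `"error"`. Aside from `streamResult`, the field values below are illustrative:
+
+```json
+{
+  "type": "event",
+  "name": "Media.End",
+  "entities": [
+    {
+      "type": "streamInfo",
+      "streamId": "STR-1",
+      "streamType": "audio",
+      "streamResult": "error"
+    }
+  ]
+}
+```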
From b22e16372da143877d9d239bc27d3f0ffb2302c3 Mon Sep 17 00:00:00 2001 From: Guruprasad Bangalore Hiriyannaiah Date: Thu, 5 Feb 2026 12:48:47 +0530 Subject: [PATCH 7/7] fix: remove duplicate notes in Multimodal Interaction Flow --- specs/activity/protocol-activity.md | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/specs/activity/protocol-activity.md b/specs/activity/protocol-activity.md index 3d87fefd..c665009e 100644 --- a/specs/activity/protocol-activity.md +++ b/specs/activity/protocol-activity.md @@ -2443,9 +2443,7 @@ server → event: Voice.Message > **Notes:** > - `listening` is NOT needed as a separate step if included in the `session.init` commandResult. > - `thinking` and `speaking` session.update messages are optional and threshold-based. -> - Media streaming events are fire-and-forget (no acknowledgment required). -> - `listening` is NOT needed as a separate step if included in the `session.init` commandResult. -> - `thinking` and `speaking` session.update messages are optional and threshold-based. +> - Media streaming events are fire-and-forget (no acknowledgment required). `A9440`: Session lifecycle commands follow request/response semantics; receivers SHOULD send acknowledgments via `commandResult`.