Extend Activity Schema to Support Multimodal Interactions with Streaming #423

Draft
gurubhg wants to merge 7 commits into main from users/guhiriya/extend-activity-schema-multimodal-streaming
gurubhg (Contributor) commented Feb 5, 2026

This PR implements the approved proposal from issue #416 to extend the Activity Protocol schema for multimodal interactions with streaming support for voice/audio.

Changes:

  • Added Reserved Events for Media Streaming (Media.Start, Media.Chunk, Media.End, Voice.Message)
  • Extended streaminfo entity to support media streaming with streamState property
  • Added Session Lifecycle Commands (session.init, session.update, session.end) for multimodal interactions
  • Bumped version to Provisional 3.4

Key design decisions (per AP Core Committee):

  • No new activity types - uses existing event, command, commandResult
  • No new schema fields - uses existing value, valueType, entities
  • 100% backward compatible
  • Uses streaminfo entity for stream metadata and sequencing
  • Uses Media.* prefix for media streaming events

Related: #416
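
To illustrate the "no new activity types, no new schema fields" decision, here is a minimal sketch of what a Media.Chunk activity could look like when built only from `event`, `value`, `valueType`, and `entities`. The `streaminfo` property names (`streamId`, `streamSequence`, `streamType`), the `audio/webm` value type, and the base64 payload encoding are assumptions for illustration; the spec text in this PR is the authority on the actual shape.

```python
import base64
import json


def media_event(name, stream_id, sequence, stream_type, value=None, value_type=None):
    """Build a media streaming activity from existing Activity fields only."""
    activity = {
        "type": "event",   # existing activity type; no new type is introduced
        "name": name,      # reserved event name, e.g. "Media.Chunk"
        "entities": [
            {
                "type": "streaminfo",
                "streamId": stream_id,        # assumed property name
                "streamSequence": sequence,   # assumed property name
                "streamType": stream_type,    # assumed property name
            }
        ],
    }
    if value is not None:
        activity["value"] = value
        activity["valueType"] = value_type    # identifies the media type
    return activity


# Hypothetical payload: three raw bytes, base64-encoded so they fit in JSON.
chunk = media_event(
    "Media.Chunk",
    stream_id="stream-1",
    sequence=2,
    stream_type="streaming",
    value=base64.b64encode(b"\x00\x01\x02").decode("ascii"),
    value_type="audio/webm",
)
print(json.dumps(chunk, indent=2))
```

Every top-level key in the result (`type`, `name`, `entities`, `value`, `valueType`) is one the PR description names as already existing, which is the backward-compatibility argument in concrete form.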

github-actions bot added the Specs label (This is related to Activity Protocol Specification) on Feb 5, 2026
… changes)

Per discussion on #416, the existing streaminfo entity properties are sufficient
for media streaming:
- streamType uses existing values: 'streaming', 'final' (not new 'audio'/'video')
- valueType on the event activity identifies the media type
- No need for new streamState property

This ensures zero schema changes to the streaminfo entity while supporting
multimodal media streaming.

Added separate examples in the streaminfo section:
- Text Streaming: Existing example using typing/message activities
- Voice/Media Streaming: New example showing Media.Start, Media.Chunk,
  Media.End, and Voice.Message events with streaminfo entities

Both examples demonstrate consistent use of streamType values (streaming, final)
while different activity types and valueType distinguish the modality.
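
A rough sketch of that distinction: a receiver reads `streamType` the same way for both modalities and looks at the activity type and `valueType` to tell text from media. The sample activities below are hypothetical stand-ins, not the spec's examples, and the `audio/webm` value is an assumed media type.

```python
def stream_info(activity):
    """Return the first streaminfo entity on an activity, or None."""
    for entity in activity.get("entities", []):
        if entity.get("type") == "streaminfo":
            return entity
    return None


def describe(activity):
    """Return (modality, streamType) for a streamed activity, or None."""
    info = stream_info(activity)
    if info is None:
        return None
    kind = activity.get("type")
    if kind in ("typing", "message"):
        modality = "text"                       # text streaming path
    elif kind == "event":
        # Media path: valueType on the event identifies the media type.
        modality = activity.get("valueType", "unspecified media")
    else:
        modality = "unknown"
    return (modality, info.get("streamType"))


# Hypothetical activities for illustration only.
text_chunk = {
    "type": "typing",
    "text": "Hel",
    "entities": [{"type": "streaminfo", "streamType": "streaming"}],
}
audio_chunk = {
    "type": "event",
    "name": "Media.Chunk",
    "valueType": "audio/webm",
    "value": "AAEC",
    "entities": [{"type": "streaminfo", "streamType": "streaming"}],
}

print(describe(text_chunk))
print(describe(audio_chunk))
```

Both calls report the same `streamType`, so nothing in the streaminfo entity has to change to carry audio.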

Based on a comprehensive review of proposal #416:

1. Added Implementation Note for Voice.Message explaining:
   - Why event is used instead of message (SDK validation limitation)
   - Protocol does allow value/valueType on message (A2005)
   - Reference to future APv4 vision (#377)

2. Added Error Handling section (A5260-A5262):
   - Handling Media.Chunk without Media.Start
   - Stream error signaling via streamResult
   - Resilience requirements for missing chunks

3. Added Note clarifying session.* commands are reserved protocol commands
   (not subject to application/* namespace requirement per A6301)

These additions address gaps identified during comprehensive review and
capture the open discussion points from the proposal.
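
The normative text of A5260-A5262 is not reproduced in this conversation, so the following is only one plausible reading of the error-handling items: reject a Media.Chunk whose stream was never opened by Media.Start, and tolerate gaps in the chunk sequence while reporting them at Media.End. The `streamId` and `streamSequence` property names and the returned status strings are assumptions, and `streamResult` signaling is not modeled.

```python
def stream_info(activity):
    """Return the first streaminfo entity on an activity, or None."""
    for entity in activity.get("entities", []):
        if entity.get("type") == "streaminfo":
            return entity
    return None


class MediaStreamReceiver:
    """Tracks open media streams and buffers chunks by sequence number."""

    def __init__(self):
        self.open_streams = {}  # stream id -> {sequence number: value}

    def handle(self, activity):
        info = stream_info(activity)
        if info is None:
            return ("ignored", None)
        stream_id = info["streamId"]
        name = activity.get("name")
        if name == "Media.Start":
            self.open_streams[stream_id] = {}
            return ("started", None)
        if stream_id not in self.open_streams:
            # Chunk or end arrived for a stream that was never started.
            return ("rejected", "no Media.Start for stream")
        if name == "Media.Chunk":
            self.open_streams[stream_id][info["streamSequence"]] = activity.get("value")
            return ("buffered", None)
        if name == "Media.End":
            chunks = self.open_streams.pop(stream_id)
            # Missing chunks do not fail the stream; they are only reported.
            missing = []
            if chunks:
                missing = [n for n in range(min(chunks), max(chunks) + 1) if n not in chunks]
            return ("ended", missing)
        return ("ignored", None)


def event(name, stream_id, sequence=None, value=None):
    """Hypothetical helper that builds a test activity."""
    info = {"type": "streaminfo", "streamId": stream_id}
    if sequence is not None:
        info["streamSequence"] = sequence
    return {"type": "event", "name": name, "value": value, "entities": [info]}


receiver = MediaStreamReceiver()
orphan = receiver.handle(event("Media.Chunk", "s2", 1, "AAEC"))
receiver.handle(event("Media.Start", "s1"))
receiver.handle(event("Media.Chunk", "s1", 1, "AAEC"))
receiver.handle(event("Media.Chunk", "s1", 3, "AAEC"))  # chunk 2 never arrives
result = receiver.handle(event("Media.End", "s1"))
print(orphan, result)
```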

Added the detailed client-server interaction example from proposal #416:
- Session handshake with embedded readiness state
- Media streaming events (start, chunk, end)
- Optional state updates (thinking, speaking) with threshold notes
- Final Voice.Message delivery
- Explanatory notes about optional steps

This provides a complete reference for implementers to understand
the end-to-end flow of a voice streaming session.
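
As a rough outline of the ordering described above, the sketch below lists the steps of a voice session as (activity type, name, note) tuples. Which side sends each step, the payload shapes, and the use of `session.update` to carry the optional thinking/speaking states are assumptions here; the example in the spec is the reference.

```python
def voice_session_flow(num_chunks, include_state_updates=False):
    """Return the ordered steps of a hypothetical voice streaming session."""
    steps = [
        ("command", "session.init", "handshake request"),
        ("commandResult", "session.init", "readiness state embedded in the result"),
        ("event", "Media.Start", "stream opens"),
    ]
    steps += [("event", "Media.Chunk", f"chunk {i}") for i in range(1, num_chunks + 1)]
    steps.append(("event", "Media.End", "stream closes"))
    if include_state_updates:
        # Optional steps; assumed here to travel as session.update commands.
        steps.append(("command", "session.update", "thinking"))
        steps.append(("command", "session.update", "speaking"))
    steps.append(("event", "Voice.Message", "final voice message"))
    return steps


flow = voice_session_flow(num_chunks=2)
for activity_type, name, note in flow:
    print(f"{activity_type:14} {name:14} {note}")
```

Passing `include_state_updates=True` adds the two optional steps between Media.End and Voice.Message; everything else stays in the same order.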

- Split large JSON code blocks with comments into separate blocks
- Added descriptive headers before each example
- Used proper code block language specifiers (json, text)
- Reorganized multimodal interaction flow into numbered steps
- Added blockquotes for explanatory notes
- Removed invalid JSON comments (JSON doesn't support //)

This improves readability when the spec is rendered in GitHub,
documentation sites, and other markdown viewers.