[SPARK-55795][SS] Add automatic V1 to V2 offset log upgrade for streaming queries with named sources by ericm-db · Pull Request #54577 · apache/spark

ericm-db · 2026-03-02T18:00:48Z

What changes were proposed in this pull request?

This PR introduces an automatic offset log upgrade mechanism that allows streaming queries to migrate from V1 (positional) offset tracking to V2 (named) offset tracking when users add .name() to their streaming sources.

Key components:

1. OffsetSeq.toOffsetMap() - Converts V1 positional offsets to V2 named offsets using the provided source names
   - Validates that the source count matches between the offset log and the current plan
   - Detects and prevents duplicate source names that would cause data loss
   - Migrates V1 metadata to V2 format
2. MicroBatchExecution upgrade logic - Orchestrates the automatic offset log upgrade
   - Only upgrades when ALL conditions are met:
     - Current offset log is V1
     - User explicitly requests V2 via spark.sql.streaming.offsetLog.formatVersion=2
     - All sources are named (not Unassigned)
     - User sets spark.sql.streaming.offsetLog.v1ToV2.autoUpgrade.enabled=true
   - Creates an "upgrade batch" that converts and commits the new offset log format
   - Fails loudly with a clear error message if the upgrade config is not set
   - Skips the upgrade if an uncommitted batch exists (requires clean state)
3. Safety validations:
   - Source count mismatch detection
   - Duplicate source name detection (two layers)
   - Concurrent modification detection
   - Clean state requirement (no uncommitted batches)
4. Comprehensive test suite - OffsetLogV1ToV2UpgradeSuite with tests for:
   - Happy path upgrade with multiple sources
   - No upgrade when sources unnamed
   - V2 offset log stability
   - Multi-source offset mapping correctness
   - Source count mismatch error handling
   - Missing upgrade config error handling
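
To make the positional-to-named conversion concrete, here is a minimal Python sketch of what a function like OffsetSeq.toOffsetMap() is described as doing. It is not the Scala implementation from this PR: the function name `to_offset_map`, the use of plain strings as offsets, and the `ValueError` exceptions are assumptions for illustration, and the metadata migration step is not modeled.

```python
from typing import Dict, List, Optional


def to_offset_map(
    offsets: List[Optional[str]], source_names: List[str]
) -> Dict[str, Optional[str]]:
    """Convert V1 positional offsets into a V2-style map keyed by source name.

    Hypothetical stand-in for the conversion described in the PR.
    """
    # The offset log and the current plan must agree on the number of sources.
    if len(offsets) != len(source_names):
        raise ValueError(
            f"Source count mismatch: offset log has {len(offsets)} sources, "
            f"current plan has {len(source_names)}"
        )
    # Duplicate names would collapse two offsets into one map entry,
    # silently dropping one of them.
    duplicates = sorted({n for n in source_names if source_names.count(n) > 1})
    if duplicates:
        raise ValueError(f"Duplicate source names: {duplicates}")
    # Position i in the V1 sequence belongs to the i-th source in the plan.
    return dict(zip(source_names, offsets))


# Placeholder offsets; real offsets are source-specific serialized values.
print(to_offset_map(["offset-A", "offset-B"], ["payments", "refunds"]))
```

Both checks run before any mapping is built, so a bad input fails without producing a partial result.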

Why are the changes needed?

Currently, when users want to migrate from V1 (index-based) to V2 (name-based) offset tracking, they must:

1. Delete their checkpoint directory (losing all state)
2. Start fresh

This is problematic because:

- State loss: All stateful operators (aggregations, joins, deduplication) lose their state
- Data reprocessing: Query must reprocess all historical data from the beginning
- Downtime: Requires stopping the query and careful coordination

With this change, users can safely migrate existing V1 offset logs to V2 format by:

1. Adding .name() to all streaming sources
2. Setting spark.sql.streaming.offsetLog.formatVersion=2
3. Setting spark.sql.streaming.offsetLog.v1ToV2.autoUpgrade.enabled=true
4. Restarting the query

The upgrade preserves all state and offset positions, enabling seamless transition to the more flexible V2 format that supports source evolution (adding/removing sources by name).
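
As a small illustration of the two configuration steps, the sketch below collects the config keys named in this description and renders them as `--conf key=value` arguments of the kind accepted by spark-submit. The helper name `to_submit_args` is hypothetical, and naming the sources with .name() still has to happen in the query code itself.

```python
from typing import Dict, List

# Config keys and values taken from the PR description.
UPGRADE_CONFS: Dict[str, str] = {
    "spark.sql.streaming.offsetLog.formatVersion": "2",
    "spark.sql.streaming.offsetLog.v1ToV2.autoUpgrade.enabled": "true",
}


def to_submit_args(confs: Dict[str, str]) -> List[str]:
    """Render configs as a flat list of --conf key=value arguments."""
    args: List[str] = []
    for key, value in confs.items():
        args += ["--conf", f"{key}={value}"]
    return args


print(" ".join(to_submit_args(UPGRADE_CONFS)))
```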

Does this PR introduce any user-facing change?

Yes. This PR introduces two new behaviors:

1. New config (default: false)

spark.sql.streaming.offsetLog.v1ToV2.autoUpgrade.enabled=false

When set to true, enables automatic V1 to V2 offset log upgrade when conditions are met.

2. New error message when upgrade needed but not enabled:

Previous behavior: Query would continue with V1 format (or fail with unclear error)

New behavior: Clear error message when V1 offset log exists, V2 requested, and upgrade config not set:

IllegalStateException: Offset log is in V1 format but V2 format was requested via 
spark.sql.streaming.offsetLog.formatVersion=2. To migrate the offset log, set 
spark.sql.streaming.offsetLog.v1ToV2.autoUpgrade.enabled=true. 
Important: This is a one-way migration that cannot be rolled back. 
Ensure all batches are committed before enabling. See documentation for details.

This is a backwards compatible change - existing V1 queries continue working unchanged unless users explicitly opt into the upgrade.
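
The opt-in behavior can be summarized as a small decision function. The following Python sketch models the conditions listed in this description; it is not the MicroBatchExecution code. The names `Decision`, `IllegalStateError`, and `decide_upgrade` are hypothetical, and the ordering of the checks (config check before the uncommitted-batch check) is an assumption.

```python
from enum import Enum


class Decision(Enum):
    KEEP_CURRENT = "keep current format"
    UPGRADE = "upgrade to V2"


class IllegalStateError(RuntimeError):
    """Hypothetical Python analog of the IllegalStateException in the PR."""


def decide_upgrade(
    log_version: int,
    requested_version: int,
    all_sources_named: bool,
    auto_upgrade_enabled: bool,
    has_uncommitted_batch: bool,
) -> Decision:
    # No upgrade is considered unless a V1 log meets a V2 request
    # and every source is named (per the conditions in this description).
    if log_version != 1 or requested_version != 2 or not all_sources_named:
        return Decision.KEEP_CURRENT
    # The upgrade is one-way, so the user must opt in explicitly.
    if not auto_upgrade_enabled:
        raise IllegalStateError(
            "Offset log is in V1 format but V2 format was requested via "
            "spark.sql.streaming.offsetLog.formatVersion=2. To migrate the "
            "offset log, set "
            "spark.sql.streaming.offsetLog.v1ToV2.autoUpgrade.enabled=true."
        )
    # An uncommitted batch means the state is not clean; skip for now.
    if has_uncommitted_batch:
        return Decision.KEEP_CURRENT
    return Decision.UPGRADE


print(decide_upgrade(1, 2, True, True, False))
```

Existing V1 queries that never request V2 always take the first branch, which is why the change is backwards compatible.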

How was this patch tested?

Added comprehensive test suite OffsetLogV1ToV2UpgradeSuite with the following test cases:

1. V1 offset log + all sources named auto-upgrades to V2
   - Creates V1 offset log with unnamed sources
   - Restarts with named sources + V2 config + upgrade config
   - Verifies upgrade occurs and offsets are keyed by name
2. V1 offset log + no sources named continues with V1
   - Verifies no upgrade when sources remain unnamed
3. Already V2 offset log + named sources continues with V2
   - Verifies stability (no regression) for existing V2 offset logs
4. Multi-source upgrade preserves all offsets correctly
   - Tests 3-source upgrade with names "payments", "refunds", "adjustments"
   - Verifies all offsets correctly mapped by name
5. Source count mismatch throws clear error
   - Creates V1 offset log with 2 sources
   - Attempts upgrade with 3 sources
   - Verifies clear error message about source count mismatch
6. V1 offset log + V2 requested without upgrade config throws clear error
   - Verifies the new error message when upgrade config not set
   - Ensures users get clear guidance on what config to set

All tests use real file-based streaming sources to ensure end-to-end correctness.

Was this patch authored or co-authored using generative AI tooling?

No.

- Add validation in OffsetSeq.toOffsetMap to detect duplicate source names that would cause silent data loss
- Add validation in MicroBatchExecution when building V2 sourceIdMap to prevent duplicate names
- Remove unused test parameter (outputDir)
- Fix unused variables in tests (v1BatchId, v2BatchId)

These changes improve robustness by failing fast with clear error messages when duplicate source names are detected, preventing silent data loss during V1-to-V2 checkpoint upgrade or new V2 query initialization.

Add spark.sql.streaming.checkpoint.v1ToV2.autoUpgrade.enabled config to require explicit user opt-in for V1 to V2 checkpoint migration. Previously, the upgrade would happen automatically when users set offsetLog.formatVersion=2 and added names to sources, which could be surprising for production systems.

Changes:
- Add STREAMING_CHECKPOINT_V1_TO_V2_AUTO_UPGRADE_ENABLED config (default: false)
- Update MicroBatchExecution to check config and throw clear error if not enabled
- Error message explains the migration requirement and consequences
- Update all existing upgrade tests to set the new config
- Add test to verify error message when config is not set

This ensures users explicitly acknowledge the one-way, irreversible nature of the checkpoint migration before it occurs.

…inology

Update all references from "checkpoint upgrade" to "offset log upgrade" to avoid confusion with checkpoint v2 (state store checkpointing). This change clarifies that we're specifically upgrading the offset log format, not the entire checkpoint structure.

Changes:
- Rename config: checkpoint.v1ToV2.autoUpgrade.enabled → offsetLog.v1ToV2.autoUpgrade.enabled
- Rename function: maybeUpgradeCheckpointToV2 → maybeUpgradeOffsetLogToV2
- Rename test suite: CheckpointV1ToV2UpgradeSuite → OffsetLogV1ToV2UpgradeSuite
- Update all error messages, comments, and test names to use "offset log"
- Update PR title and description

This is a terminology-only change with no functional modifications.

Previously, when a user requested V2 offset log format via spark.sql.streaming.offsetLog.formatVersion=2 but didn't name their sources with .name(), the system would silently ignore the V2 request and continue using V1 format. This commit adds a clear error message when V2 is requested but sources aren't named, explaining that V2 format requires all sources to have names.

Changes:
- Add validation in MicroBatchExecution to check if sources are named when V2 format is requested
- Throw IllegalStateException with clear guidance when unnamed sources are detected with V2 request
- Add test case to verify the error is thrown with correct message

Error message guides users to either:
1. Add .name() to all sources, or
2. Set offsetLog.formatVersion=1 to continue with V1 format

Enable auto-upgrade from V1 to V2 offset log format even when sources are not named, using positional indices ("0", "1", "2") as keys.

Rationale: V1 format already assumes plan stability between batches (index 0 must always map to the same source). The V1→V2 upgrade makes the exact same assumption, so requiring names was unnecessarily strict.

Changes:
- Add generateSourceIds() to produce either named or positional IDs
- Remove requirement for allSourcesNamed in upgrade path
- Use named keys when available, fall back to positional indices
- Update maybeUpgradeOffsetLogToV2 to handle both cases
- Rebuild sourceIdMap for V2 regardless of naming
- Log whether upgrade used "named" or "positional" keys
- Update test to verify positional key upgrade works

Benefits:
- Users can upgrade to V2 without naming all sources
- Named keys still preferred when available (better debugging)
- No stronger assumptions than V1 already makes
- Easier migration path for existing queries

Example upgrade with unnamed sources:
V1: [paymentOffset, refundOffset] (positional)
V2: {"0": paymentOffset, "1": refundOffset} (positional in map)
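
A minimal Python sketch of the key-generation idea in this commit message follows. It is not the Scala generateSourceIds() from the PR: the function names are hypothetical, `None` stands in for an unnamed source, and treating the choice as all-or-nothing (named keys only when every source is named) is an assumption based on the "either named or positional" wording.

```python
from typing import Dict, List, Optional


def generate_source_ids(names: List[Optional[str]]) -> List[str]:
    """Return one key per source: names if all are named, else positions."""
    if all(name is not None for name in names):
        return [name for name in names if name is not None]
    # Fall back to the positional indices "0", "1", "2", ...
    return [str(index) for index in range(len(names))]


def upgrade_offsets(
    offsets: List[str], names: List[Optional[str]]
) -> Dict[str, str]:
    """Key V1 positional offsets by the generated source IDs."""
    return dict(zip(generate_source_ids(names), offsets))


# Unnamed sources keep their positions as string keys.
print(upgrade_offsets(["paymentOffset", "refundOffset"], [None, None]))
# Named sources are keyed by name.
print(upgrade_offsets(["paymentOffset", "refundOffset"], ["payments", "refunds"]))
```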

Fix bug where V2 checkpoints could have their key scheme inadvertently changed on restart if users added/removed source names.

Problem:
1. Upgrade V1→V2 with source evolution disabled → Creates {"0", "1"}
2. User later adds .name("my_source") to source
3. Query restart regenerates sourceIdMap with key "my_source"
4. Offset log still has key "0" → mismatch! 💥

Solution: Read the persisted ENABLE_STREAMING_SOURCE_EVOLUTION config from the existing V2 checkpoint metadata to determine the original key scheme:
- If source evolution was ENABLED → use named keys (allows evolution)
- If source evolution was DISABLED → use positional keys forever

This ensures the key scheme is immutable once a V2 checkpoint is created, preventing accidental breakage.

Changes:
- Check existing V2 checkpoint's source evolution config from metadata
- Use that to determine whether to generate positional or named keys
- Add comment explaining protection against re-upgrading V2 checkpoints
- Add test verifying positional keys persist even when names added later

Example:
V2 created with source evolution disabled + unnamed sources:
- Keys: {"0", "1"} → forever positional
- Adding .name() later → still uses {"0", "1"}

V2 created with source evolution enabled + named sources:
- Keys: {"payments", "refunds"} → can add/remove sources
- Renaming sources → uses new names (evolution allowed)
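
The fix can be pictured as choosing the key scheme from what the checkpoint recorded rather than from the current plan. The Python sketch below models that rule for an existing V2 checkpoint only. The function name `resolve_v2_keys` and its boolean parameter are hypothetical, and raising an error when evolution is enabled but a source is unnamed is an assumption, since the commit message does not describe that case.

```python
from typing import List, Optional


def resolve_v2_keys(
    current_names: List[Optional[str]], persisted_evolution_enabled: bool
) -> List[str]:
    """Pick offset-map keys for a restart against an existing V2 checkpoint."""
    if persisted_evolution_enabled:
        # Named scheme: keys follow the current names, so sources may evolve.
        unnamed = [i for i, n in enumerate(current_names) if n is None]
        if unnamed:
            raise ValueError(f"Sources at positions {unnamed} have no name")
        return [n for n in current_names if n is not None]
    # Positional scheme: names added later are ignored for keying.
    return [str(i) for i in range(len(current_names))]


# Checkpoint created with evolution disabled; a name was added afterwards.
print(resolve_v2_keys(["my_source", None], persisted_evolution_enabled=False))
```

Because the decision depends only on the persisted flag, adding or removing .name() calls cannot flip a checkpoint from positional to named keys.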

changes

05401a9

ericm-db force-pushed the streaming-offset-v1-to-v2-migration branch from fb20d1d to 6f02737 Compare March 2, 2026 18:02

ericm-db force-pushed the streaming-offset-v1-to-v2-migration branch from 6f02737 to 9768076 Compare March 2, 2026 18:04

ericm-db changed the title ~~Streaming offset v1 to v2 migration~~ [SPARK-XXXXX][SS] Add automatic V1 to V2 checkpoint upgrade for streaming queries with named sources Mar 2, 2026

ericm-db changed the title ~~[SPARK-XXXXX][SS] Add automatic V1 to V2 checkpoint upgrade for streaming queries with named sources~~ [SPARK-XXXXX][SS] Add automatic V1 to V2 offset log upgrade for streaming queries with named sources Mar 2, 2026

ericm-db changed the title ~~[SPARK-XXXXX][SS] Add automatic V1 to V2 offset log upgrade for streaming queries with named sources~~ [SPARK-55795][SS] Add automatic V1 to V2 offset log upgrade for streaming queries with named sources Mar 2, 2026

ericm-db added 8 commits March 2, 2026 11:57

Fix line length violations (max 100 characters)

cbf9fb4

Fix compilation errors: remove MDC wrapper for keyType and replace em-dash

2a1ab60

Use plain string interpolation for upgrade log message

e51ae07

Replace Unicode arrow with ASCII arrow in comment

0ba30a5

Fix test: enable source evolution when adding names to test positional key persistence

ade713c

ericm-db wants to merge 12 commits into apache:master from ericm-db:streaming-offset-v1-to-v2-migration
