[SPARK-55795][SS] Add automatic V1 to V2 offset log upgrade for streaming queries with named sources #54577

Open
ericm-db wants to merge 12 commits into apache:master from ericm-db:streaming-offset-v1-to-v2-migration

@ericm-db commented Mar 2, 2026

What changes were proposed in this pull request?

This PR introduces an automatic offset log upgrade mechanism that allows streaming queries to migrate from V1 (positional) offset tracking to V2 (named) offset tracking when users add .name() to their streaming sources.

Key components:

  1. OffsetSeq.toOffsetMap() - Converts V1 positional offsets to V2 named offsets using provided source names

    • Validates source count matches between offset log and current plan
    • Detects and prevents duplicate source names that would cause data loss
    • Migrates V1 metadata to V2 format
  2. MicroBatchExecution upgrade logic - Orchestrates the automatic offset log upgrade

    • Only upgrades when ALL conditions are met:
      • Current offset log is V1
      • User explicitly requests V2 via spark.sql.streaming.offsetLog.formatVersion=2
      • All sources are named (not Unassigned)
      • User sets spark.sql.streaming.offsetLog.v1ToV2.autoUpgrade.enabled=true
    • Creates an "upgrade batch" that converts and commits the new offset log format
    • Fails with a clear error message if the upgrade config is not set
    • Skips the upgrade if an uncommitted batch exists (a clean state is required)
  3. Safety validations:

    • Source count mismatch detection
    • Duplicate source name detection (two layers)
    • Concurrent modification detection
    • Clean state requirement (no uncommitted batches)
  4. Comprehensive test suite - OffsetLogV1ToV2UpgradeSuite with tests for:

    • Happy path upgrade with multiple sources
    • No upgrade when sources unnamed
    • V2 offset log stability
    • Multi-source offset mapping correctness
    • Source count mismatch error handling
    • Missing upgrade config error handling
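
As a rough illustration of the conversion and validations listed for OffsetSeq.toOffsetMap(), the sketch below models the logic in plain Python, outside Spark. The function name `to_offset_map`, the list/dict representation of offsets, and the error wording are assumptions made for illustration; the actual change lives in Spark's Scala code and may differ in detail.

```python
from typing import Dict, List, Optional


def to_offset_map(
    v1_offsets: List[Optional[str]],
    source_names: List[str],
) -> Dict[str, Optional[str]]:
    """Convert positional (V1-style) offsets into a name-keyed (V2-style) map.

    v1_offsets[i] is the serialized offset of the i-th source in the plan,
    or None if that source has no offset recorded in the batch.
    """
    # The offset log and the current plan must describe the same number of sources.
    if len(v1_offsets) != len(source_names):
        raise ValueError(
            f"Source count mismatch: offset log has {len(v1_offsets)} source(s), "
            f"current plan has {len(source_names)}"
        )

    # Two sources sharing a name would collapse into one key and drop an offset.
    duplicates = sorted({n for n in source_names if source_names.count(n) > 1})
    if duplicates:
        raise ValueError(f"Duplicate source names: {', '.join(duplicates)}")

    # Position i in the V1 sequence becomes the entry keyed by the i-th name.
    return dict(zip(source_names, v1_offsets))


# Hypothetical three-source batch; the offset strings are made up.
example = to_offset_map(
    ['{"logOffset": 3}', '{"logOffset": 7}', None],
    ["payments", "refunds", "adjustments"],
)
```

In this model, both validations run before any key is produced, so a bad input never yields a partially converted map.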

Why are the changes needed?

Currently, when users want to migrate from V1 (index-based) to V2 (name-based) offset tracking, they must:

  1. Delete their checkpoint directory (losing all state)
  2. Start fresh

This is problematic because:

  • State loss: All stateful operators (aggregations, joins, deduplication) lose their state
  • Data reprocessing: Query must reprocess all historical data from the beginning
  • Downtime: Requires stopping the query and careful coordination

With this change, users can safely migrate existing V1 offset logs to V2 format by:

  1. Adding .name() to all streaming sources
  2. Setting spark.sql.streaming.offsetLog.formatVersion=2
  3. Setting spark.sql.streaming.offsetLog.v1ToV2.autoUpgrade.enabled=true
  4. Restarting the query

The upgrade preserves all state and offset positions, enabling seamless transition to the more flexible V2 format that supports source evolution (adding/removing sources by name).

Does this PR introduce any user-facing change?

Yes. This PR introduces two new behaviors:

1. New config (default: false)

spark.sql.streaming.offsetLog.v1ToV2.autoUpgrade.enabled=false

When set to true, enables automatic V1 to V2 offset log upgrade when conditions are met.

2. New error message when upgrade needed but not enabled:

Previous behavior: The query would continue with the V1 format (or fail with an unclear error).

New behavior: A clear error is raised when a V1 offset log exists, V2 is requested, and the upgrade config is not set:

IllegalStateException: Offset log is in V1 format but V2 format was requested via 
spark.sql.streaming.offsetLog.formatVersion=2. To migrate the offset log, set 
spark.sql.streaming.offsetLog.v1ToV2.autoUpgrade.enabled=true. 
Important: This is a one-way migration that cannot be rolled back. 
Ensure all batches are committed before enabling. See documentation for details.

This is a backward-compatible change: existing V1 queries continue to work unchanged unless users explicitly opt into the upgrade.
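
The gating conditions described in this PR description can be summarized as a small decision function. The Python sketch below is a model only: the names `QueryState`, `plan_upgrade`, and `OffsetLogUpgradeRequired` are hypothetical, the actual error type named above is IllegalStateException, and the order in which the checks run is an assumption rather than something the description specifies.

```python
from dataclasses import dataclass


class OffsetLogUpgradeRequired(Exception):
    """Hypothetical stand-in for the IllegalStateException quoted above."""


@dataclass(frozen=True)
class QueryState:
    offset_log_version: int      # format version found in the existing offset log
    requested_version: int       # spark.sql.streaming.offsetLog.formatVersion
    all_sources_named: bool      # every source in the plan carries a name
    auto_upgrade_enabled: bool   # spark.sql.streaming.offsetLog.v1ToV2.autoUpgrade.enabled
    has_uncommitted_batch: bool  # latest offset batch has no matching commit


def plan_upgrade(state: QueryState) -> str:
    """Return "keep-current-format", "skip-until-clean", or "upgrade"."""
    # Nothing to do unless a V1 log meets an explicit V2 request.
    if state.offset_log_version != 1 or state.requested_version != 2:
        return "keep-current-format"
    # Per the description, unnamed sources mean the query continues with V1.
    if not state.all_sources_named:
        return "keep-current-format"
    # The migration is one-way, so it requires an explicit opt-in.
    if not state.auto_upgrade_enabled:
        raise OffsetLogUpgradeRequired(
            "Offset log is in V1 format but V2 format was requested; set "
            "spark.sql.streaming.offsetLog.v1ToV2.autoUpgrade.enabled=true to migrate."
        )
    # Assumed ordering: the clean-state check runs after the opt-in check.
    if state.has_uncommitted_batch:
        return "skip-until-clean"
    return "upgrade"
```

For example, `plan_upgrade(QueryState(1, 2, True, True, False))` yields "upgrade", while the same state with `auto_upgrade_enabled=False` raises the opt-in error.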

How was this patch tested?

Added comprehensive test suite OffsetLogV1ToV2UpgradeSuite with the following test cases:

  1. V1 offset log + all sources named auto-upgrades to V2

    • Creates V1 offset log with unnamed sources
    • Restarts with named sources + V2 config + upgrade config
    • Verifies upgrade occurs and offsets are keyed by name
  2. V1 offset log + no sources named continues with V1

    • Verifies no upgrade when sources remain unnamed
  3. Already V2 offset log + named sources continues with V2

    • Verifies stability (no regression) for existing V2 offset logs
  4. Multi-source upgrade preserves all offsets correctly

    • Tests 3-source upgrade with names "payments", "refunds", "adjustments"
    • Verifies all offsets correctly mapped by name
  5. Source count mismatch throws clear error

    • Creates V1 offset log with 2 sources
    • Attempts upgrade with 3 sources
    • Verifies clear error message about source count mismatch
  6. V1 offset log + V2 requested without upgrade config throws clear error

    • Verifies the new error message when upgrade config not set
    • Ensures users get clear guidance on what config to set

All tests use real file-based streaming sources to ensure end-to-end correctness.

Was this patch authored or co-authored using generative AI tooling?

No.

@ericm-db force-pushed the streaming-offset-v1-to-v2-migration branch from fb20d1d to 6f02737 on March 2, 2026 18:02

- Add validation in OffsetSeq.toOffsetMap to detect duplicate source names that would cause silent data loss
- Add validation in MicroBatchExecution when building V2 sourceIdMap to prevent duplicate names
- Remove unused test parameter (outputDir)
- Fix unused variables in tests (v1BatchId, v2BatchId)

These changes improve robustness by failing fast with clear error messages when duplicate source names are detected, preventing silent data loss during V1-to-V2 checkpoint upgrade or new V2 query initialization.

@ericm-db force-pushed the streaming-offset-v1-to-v2-migration branch from 6f02737 to 9768076 on March 2, 2026 18:04

Add spark.sql.streaming.checkpoint.v1ToV2.autoUpgrade.enabled config
to require explicit user opt-in for V1 to V2 checkpoint migration.

Previously, the upgrade would happen automatically when users set
offsetLog.formatVersion=2 and added names to sources, which could
be surprising for production systems.

Changes:
- Add STREAMING_CHECKPOINT_V1_TO_V2_AUTO_UPGRADE_ENABLED config (default: false)
- Update MicroBatchExecution to check config and throw clear error if not enabled
- Error message explains the migration requirement and consequences
- Update all existing upgrade tests to set the new config
- Add test to verify error message when config is not set

This ensures users explicitly acknowledge the one-way, irreversible
nature of the checkpoint migration before it occurs.

@ericm-db changed the title from "Streaming offset v1 to v2 migration" to "[SPARK-XXXXX][SS] Add automatic V1 to V2 checkpoint upgrade for streaming queries with named sources" on Mar 2, 2026
@ericm-db changed the title from "[SPARK-XXXXX][SS] Add automatic V1 to V2 checkpoint upgrade for streaming queries with named sources" to "[SPARK-XXXXX][SS] Add automatic V1 to V2 offset log upgrade for streaming queries with named sources" on Mar 2, 2026

…inology

Update all references from "checkpoint upgrade" to "offset log upgrade"
to avoid confusion with checkpoint v2 (state store checkpointing).
This change clarifies that we're specifically upgrading the offset log
format, not the entire checkpoint structure.

Changes:
- Rename config: checkpoint.v1ToV2.autoUpgrade.enabled → offsetLog.v1ToV2.autoUpgrade.enabled
- Rename function: maybeUpgradeCheckpointToV2 → maybeUpgradeOffsetLogToV2
- Rename test suite: CheckpointV1ToV2UpgradeSuite → OffsetLogV1ToV2UpgradeSuite
- Update all error messages, comments, and test names to use "offset log"
- Update PR title and description

This is a terminology-only change with no functional modifications.

@ericm-db changed the title from "[SPARK-XXXXX][SS] Add automatic V1 to V2 offset log upgrade for streaming queries with named sources" to "[SPARK-55795][SS] Add automatic V1 to V2 offset log upgrade for streaming queries with named sources" on Mar 2, 2026
ericm-db added 8 commits on March 2, 2026 11:57

Previously, when a user requested V2 offset log format via
spark.sql.streaming.offsetLog.formatVersion=2 but didn't name
their sources with .name(), the system would silently ignore
the V2 request and continue using V1 format.

This commit adds a clear error message when V2 is requested
but sources aren't named, explaining that V2 format requires
all sources to have names.

Changes:
- Add validation in MicroBatchExecution to check if sources are
  named when V2 format is requested
- Throw IllegalStateException with clear guidance when unnamed
  sources are detected with V2 request
- Add test case to verify the error is thrown with correct message

Error message guides users to either:
1. Add .name() to all sources, or
2. Set offsetLog.formatVersion=1 to continue with V1 format

Enable auto-upgrade from V1 to V2 offset log format even when sources
are not named, using positional indices ("0", "1", "2") as keys.

Rationale: V1 format already assumes plan stability between batches
(index 0 must always map to the same source). The V1→V2 upgrade makes
the exact same assumption, so requiring names was unnecessarily strict.

Changes:
- Add generateSourceIds() to produce either named or positional IDs
- Remove requirement for allSourcesNamed in upgrade path
- Use named keys when available, fall back to positional indices
- Update maybeUpgradeOffsetLogToV2 to handle both cases
- Rebuild sourceIdMap for V2 regardless of naming
- Log whether upgrade used "named" or "positional" keys
- Update test to verify positional key upgrade works

Benefits:
- Users can upgrade to V2 without naming all sources
- Named keys still preferred when available (better debugging)
- No stronger assumptions than V1 already makes
- Easier migration path for existing queries

Example upgrade with unnamed sources:
V1: [paymentOffset, refundOffset]  (positional)
V2: {"0": paymentOffset, "1": refundOffset}  (positional in map)
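
The named-or-positional key choice described for generateSourceIds() can be sketched as follows. This is a plain-Python model, not the Scala implementation; the name `generate_source_ids` and the all-or-nothing rule (names are used only when every source has one, otherwise every key is positional) are assumptions, since the commit message does not say how a mix of named and unnamed sources is handled.

```python
from typing import Dict, List, Optional


def generate_source_ids(source_names: List[Optional[str]]) -> List[str]:
    """Return one key per source: its name, or its position as a string."""
    # Assumed rule: use names only if every source has one.
    if all(name is not None for name in source_names):
        return [name for name in source_names if name is not None]
    # Otherwise fall back to positional indices "0", "1", "2", ...
    return [str(index) for index in range(len(source_names))]


def upgrade_offsets(
    v1_offsets: List[str], source_names: List[Optional[str]]
) -> Dict[str, str]:
    """Key each V1 positional offset by the generated source id."""
    return dict(zip(generate_source_ids(source_names), v1_offsets))


# Mirrors the unnamed-source example in the commit message.
unnamed = upgrade_offsets(["paymentOffset", "refundOffset"], [None, None])
named = upgrade_offsets(["paymentOffset", "refundOffset"], ["payments", "refunds"])
```

Here `unnamed` is keyed by "0" and "1", and `named` is keyed by "payments" and "refunds"; in both cases the offsets keep their original order.
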

Fix bug where V2 checkpoints could have their key scheme inadvertently
changed on restart if users added/removed source names.

Problem:
1. Upgrade V1→V2 with source evolution disabled → Creates {"0", "1"}
2. User later adds .name("my_source") to source
3. Query restart regenerates sourceIdMap with key "my_source"
4. Offset log still has key "0" → mismatch! 💥

Solution:
Read the persisted ENABLE_STREAMING_SOURCE_EVOLUTION config from the
existing V2 checkpoint metadata to determine the original key scheme:
- If source evolution was ENABLED → use named keys (allows evolution)
- If source evolution was DISABLED → use positional keys forever

This ensures the key scheme is immutable once a V2 checkpoint is created,
preventing accidental breakage.

Changes:
- Check existing V2 checkpoint's source evolution config from metadata
- Use that to determine whether to generate positional or named keys
- Add comment explaining protection against re-upgrading V2 checkpoints
- Add test verifying positional keys persist even when names added later

Example:
V2 created with source evolution disabled + unnamed sources:
  - Keys: {"0", "1"} → forever positional
  - Adding .name() later → still uses {"0", "1"}

V2 created with source evolution enabled + named sources:
  - Keys: {"payments", "refunds"} → can add/remove sources
  - Renaming sources → uses new names (evolution allowed)
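
The rule from this commit, that the key scheme of an existing V2 offset log follows the persisted source evolution setting rather than the current plan, can be modeled as below. The function name `source_ids_for_existing_v2` is hypothetical, and raising an error when evolution is enabled but a source lacks a name is an assumption; the commit message does not describe that case.

```python
from typing import List, Optional


def source_ids_for_existing_v2(
    current_names: List[Optional[str]],
    persisted_source_evolution_enabled: bool,
) -> List[str]:
    """Pick source keys for a restart against an existing V2 offset log."""
    # Evolution was disabled when the log was created: keys stay positional,
    # even if names have been added to the sources since then.
    if not persisted_source_evolution_enabled:
        return [str(index) for index in range(len(current_names))]

    # Evolution was enabled: keys are the current names.
    unnamed = [i for i, name in enumerate(current_names) if name is None]
    if unnamed:
        # Assumed behavior for illustration only.
        raise ValueError(f"Sources at positions {unnamed} have no name")
    return [name for name in current_names if name is not None]


# Names added after a positional upgrade do not change the keys.
still_positional = source_ids_for_existing_v2(["my_source", None], False)
# With evolution enabled, renamed sources are keyed by their new names.
renamed = source_ids_for_existing_v2(["payments_v2", "refunds"], True)
```

Reading the scheme from persisted metadata instead of the current plan is what keeps the keys stable across restarts in this model.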