feat: store state and combined snapshot+state in S3 with configurable retention #121

Merged
marcus-girolneto merged 1 commit into develop from feat/s3-state-and-combined-storage on Apr 23, 2026
@marcus-girolneto

Member

Summary

Snapshot-streaming currently uploads only the raw incremental snapshot to S3. That is not sufficient to support node rollback: both the GlobalSnapshotInfo (state) and the combined snapshot+state bundle are required to validate and restore. Recovery from two production incidents (nodes missing snapshots, and the streaming DB getting stuck on inserts) would have been easier with the last N states reliably retrievable from S3.

  • Uploads state (GlobalSnapshotInfo) and combined (SnapshotWithState) to S3 alongside the existing snapshot upload. Each upload is independently toggled.
  • Keeps the last retentionCount of each (default 1000) via exact-prefix pruning on the expired ordinal — O(1) per upload, no full bucket scan.
  • Key scheme states/{ordinal}-{hash} and combined/{ordinal}-{hash}: ordinal enables cheap retention, hash prevents fork collisions. S3 object user-metadata carries both ordinal and hash for tooling.
  • Payloads are gzipped JSON (state is ~9.5 MB uncompressed).
  • downloadState(ordinal, hash) / downloadCombined(ordinal, hash) exposed on S3DAO for future rollback verification callers.
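
As a rough illustration of the key scheme and the per-upload pruning described above, here is a minimal sketch. The object and method names (`S3Keys`, `stateKey`, `combinedKey`, `expiredPrefixes`) are hypothetical and are not the PR's actual S3DAO API. The trailing `-` on the prune prefix is also an assumption about what "exact-prefix" means here; it keeps ordinal 500 from matching keys for ordinal 5000.

```scala
// Hypothetical helper for illustration only; not the actual S3DAO code.
object S3Keys {
  // Key scheme from the summary: {prefix}/{ordinal}-{hash}
  def stateKey(ordinal: Long, hash: String): String = s"states/$ordinal-$hash"
  def combinedKey(ordinal: Long, hash: String): String = s"combined/$ordinal-$hash"

  // After uploading `ordinal`, exactly one ordinal falls out of the retention
  // window, so pruning only needs to list these prefixes (no bucket scan).
  // Assumption: the prefix ends with '-' so 500 does not match 5000.
  def expiredPrefixes(ordinal: Long, retentionCount: Long): List[String] = {
    val expired = ordinal - retentionCount
    if (expired < 0) Nil
    else List(s"states/$expired-", s"combined/$expired-")
  }
}
```

With `retentionCount = 1000`, uploading ordinal 1500 would prune everything under `states/500-` and `combined/500-`, which covers every fork hash stored at that ordinal.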

Config

New fields under snapshotStreaming.s3 (all default to off / safe values):

s3 {
  uploadEnabled = false          # existing — raw snapshot
  uploadStateEnabled = false     # new — GlobalSnapshotInfo
  uploadCombinedEnabled = false  # new — SnapshotWithState (rollback bundle)
  retentionCount = 1000          # new — lastN kept for state + combined
}
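
For readers mapping these fields to code, a minimal sketch of how they might be modelled follows. `S3ConfigSketch` and `enabledUploads` are hypothetical names; the real `S3Config` in Configuration.scala may have a different shape and additional fields.

```scala
// Hypothetical mirror of the config block above; the defaults match the PR text.
final case class S3ConfigSketch(
  uploadEnabled: Boolean = false,         // existing: raw snapshot
  uploadStateEnabled: Boolean = false,    // new: GlobalSnapshotInfo
  uploadCombinedEnabled: Boolean = false, // new: SnapshotWithState
  retentionCount: Int = 1000              // new: last N kept for state + combined
) {
  // Names of the upload kinds that are switched on (illustrative helper).
  def enabledUploads: List[String] =
    List(
      "snapshot" -> uploadEnabled,
      "state"    -> uploadStateEnabled,
      "combined" -> uploadCombinedEnabled
    ).collect { case (name, true) => name }
}
```

With the defaults nothing is uploaded, so `S3ConfigSketch().enabledUploads` is empty; each flag adds its own upload independently of the others.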

Files changed

  • src/main/scala/org/constellation/snapshotstreaming/Configuration.scala
  • src/main/scala/org/constellation/snapshotstreaming/s3/S3DAO.scala
  • src/main/scala/org/constellation/snapshotstreaming/SnapshotProcessor.scala
  • src/main/resources/reference.conf

Out of scope / follow-ups

  • Rollback verification call site. downloadState / downloadCombined are exposed but no consumer wires them in. The ticket's "ensure via S3 download that state/ordinal/hash match" is presumably handled by the network-monitoring service that triggers the rollback.
  • Backfill of pre-existing ordinals. Uploads and retention only apply to new snapshots, so flipping the flags on a running stream at ordinal N populates S3 forward only; nothing below N is ever written. This matches the ticket's "store lastN going forward." Happy to add a one-shot backfill if needed.

Test plan

  • CI passes (develop pulls a snapshot tessellation artifact that is not locally cached; CI resolves from the internal repo)
  • Deploy to a non-production env with uploadStateEnabled = true and uploadCombinedEnabled = true, retentionCount = 100 for quick verification
  • Confirm states/{ordinal}-{hash} and combined/{ordinal}-{hash} objects appear with correct user-metadata
  • After streaming >100 snapshots, confirm the oldest ordinals are pruned from both prefixes
  • Download a combined object, gunzip, confirm JSON round-trips into SnapshotWithState and that ordinal/hash match the object key
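
The last check can be scripted. The sketch below uses only the JDK's gzip streams and a small key parser; `GzipJson`, `gunzip`, and `parseKey` are hypothetical helper names, and decoding the JSON into SnapshotWithState is left to whatever codec the project already uses. `readAllBytes` assumes Java 9 or later.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.nio.charset.StandardCharsets.UTF_8
import java.util.zip.{GZIPInputStream, GZIPOutputStream}
import scala.util.Try

// Hypothetical verification helpers; not part of the PR.
object GzipJson {
  // Compress a JSON string the way the uploads are described (gzipped JSON).
  def gzip(json: String): Array[Byte] = {
    val buffer = new ByteArrayOutputStream()
    val out = new GZIPOutputStream(buffer)
    try out.write(json.getBytes(UTF_8)) finally out.close()
    buffer.toByteArray
  }

  // Decompress a downloaded object body back into its JSON text.
  def gunzip(bytes: Array[Byte]): String = {
    val in = new GZIPInputStream(new ByteArrayInputStream(bytes))
    try new String(in.readAllBytes(), UTF_8) finally in.close()
  }

  // Split "combined/{ordinal}-{hash}" into (ordinal, hash), if well formed.
  def parseKey(key: String): Option[(Long, String)] = {
    val name = key.substring(key.lastIndexOf('/') + 1)
    name.split("-", 2) match {
      case Array(ord, hash) if hash.nonEmpty => Try(ord.toLong).toOption.map(_ -> hash)
      case _                                 => None
    }
  }
}
```

A verification script could compare the pair returned by `parseKey` against the ordinal and hash found in the decoded JSON and in the object's user-metadata.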

… retention

Currently only the raw incremental snapshot is uploaded to S3. That is not enough
to support rollback of nodes back to snapshot-streaming: both the GlobalSnapshotInfo
(state) and the combined snapshot+state bundle are required to validate and restore.

Changes:
- S3Config gains uploadStateEnabled, uploadCombinedEnabled and retentionCount.
- S3DAO: new uploadState / uploadCombined (gzipped JSON, user metadata carries
  ordinal and hash) and matching download + prune operations.
- Key scheme states/{ordinal}-{hash} and combined/{ordinal}-{hash} preserves the
  hash so forked snapshots at the same ordinal do not overwrite each other,
  while the ordinal prefix keeps retention pruning O(1) per upload.
- SnapshotProcessor.storeInS3 runs the three uploads in parallel (each gated by
  its own flag) and prunes all variants at ordinal - retentionCount after upload.
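
The last item can be sketched as follows. This is an illustration with standard-library Futures standing in for whatever effect type the project actually uses; `StoreInS3Sketch`, `Flags`, `storeAll`, and the function parameters are hypothetical, and skipping the prune when no state or combined upload is enabled is an assumption.

```scala
import scala.concurrent.{ExecutionContext, Future}

// Hypothetical sketch; not the actual SnapshotProcessor.storeInS3.
object StoreInS3Sketch {
  final case class Flags(snapshot: Boolean, state: Boolean, combined: Boolean)

  def storeAll(flags: Flags, ordinal: Long, retentionCount: Long)(
    uploadSnapshot: () => Future[Unit],
    uploadState: () => Future[Unit],
    uploadCombined: () => Future[Unit],
    prune: Long => Future[Unit]
  )(implicit ec: ExecutionContext): Future[Unit] = {
    // A disabled upload becomes a no-op instead of being started.
    def gated(enabled: Boolean)(run: () => Future[Unit]): Future[Unit] =
      if (enabled) run() else Future.unit

    // All enabled uploads are started here, before any is awaited,
    // so they run concurrently.
    val uploads = List(
      gated(flags.snapshot)(uploadSnapshot),
      gated(flags.state)(uploadState),
      gated(flags.combined)(uploadCombined)
    )

    val expired = ordinal - retentionCount
    for {
      _ <- Future.sequence(uploads) // prune only after every upload succeeded
      _ <- if (expired >= 0 && (flags.state || flags.combined)) prune(expired)
           else Future.unit
    } yield ()
  }
}
```

Pruning after the uploads means a failed upload never shrinks the retained window in this sketch.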
@marcus-girolneto merged commit b25e33d into develop on Apr 23, 2026
1 check failed
@marcus-girolneto deleted the feat/s3-state-and-combined-storage branch on April 23, 2026, 13:31
@gclaramunt

Contributor

LGTM

marcus-girolneto added a commit that referenced this pull request Apr 23, 2026
… retention (#121)