Skip to content

fix(sqs,kinesis,server): data-safety cluster — durable DDB-streams checkpoints, SQS redrive/visibility validation + move-task resume, Kinesis retention trim + shard pagination#2042

Open
vieiralucas wants to merge 1 commit into
mainfrom
wt-bug-datasafety

Conversation

@vieiralucas

@vieiralucas vieiralucas commented Jun 29, 2026

Copy link
Copy Markdown
Member

Fixes a cluster of confirmed data-safety / persistence / pagination bugs.

Fixes

  1. HIGH - DynamoDB Streams -> Lambda checkpoints are now durable. They live in DynamoDbState (persisted via the snapshot, #[serde(default)] for back-compat) instead of an in-memory RwLock, so a restart resumes from the last delivered sequence number rather than re-seeding TRIM_HORIZON and re-invoking the target Lambda with the whole retained backlog (duplicate side effects). Mirrors KinesisState.lambda_checkpoints.
  2. MEDIUM - That poller now iterates ALL Lambda accounts and resolves the table from the stream ARN's own account, so non-default / cross-account ESMs fire (was default-account-only).
  3. MEDIUM - SQS ReceiveMessage range-validates the request-level VisibilityTimeout (0..=43200) and returns InvalidParameterValue. A near-i64::MAX value previously overflowed now + Duration::seconds(v) and panicked the worker thread; a negative value made the message immediately visible.
  4. MEDIUM - SQS SetQueueAttributes validates the redrive DLQ target the same way CreateQueue does (Terraform's aws_sqs_redrive_policy uses SetQueueAttributes). A bad/nonexistent DLQ ARN no longer churns messages on the source forever.
  5. MEDIUM - SQS resumes in-progress message-move tasks left RUNNING/CANCELLING by a previous process at startup, so ListMessageMoveTasks no longer hangs at RUNNING forever and the DLQ drain continues. A task whose source queue is gone is finalized FAILED.
  6. MEDIUM - Kinesis shard records past retention are now trimmed on write (shifting index-based Lambda checkpoints and shard-iterator offsets so sequence/iterator correctness is preserved), instead of living in memory + every snapshot forever.
  7. MEDIUM - Kinesis DescribeStream honors Limit (capped at 100) / ExclusiveStartShardId and reports a real HasMoreShards; ListShards honors ShardFilter (AT_LATEST / AT_TRIM_HORIZON / AT_TIMESTAMP / AFTER_SHARD_ID / FROM_*).
  8. LOW - The /_fakecloud/dynamodb/ttl-processor/tick admin route saves the snapshot after expiring items, so the deletions survive a restart (the normal mutating path already persists).

Tests

  • DDB poller: deliver, "restart" (rebuild from the serde-round-tripped state), assert NO re-replay; a cross-account ESM fires.
  • SQS: huge + negative VisibilityTimeout rejected (no panic); SetQueueAttributes with a bad redrive ARN rejected; an orphaned RUNNING move task resumes and drains. Updated the prior unresolvable-DLQ test to delete the DLQ after config (config-time validation now applies; the runtime safety-net still holds).
  • Kinesis: records past retention trimmed with offsets shifted; DescribeStream paginates; ListShards honors a ShardFilter (incl. missing-ShardId rejection).

Validation

  • cargo build (all touched crates + server)
  • cargo fmt --all -- --check: clean
  • cargo clippy -p fakecloud-sqs -p fakecloud-kinesis -p fakecloud -p fakecloud-dynamodb --all-targets -- -D warnings: clean
  • cargo test -p fakecloud-sqs (170) / fakecloud-kinesis (123) / fakecloud-dynamodb (328): pass
  • cargo test -p fakecloud-e2e --no-run: compiles

Notes

Checkpoint durability mirrors the Kinesis poller: the checkpoint lives in the persisted state and is flushed by the next snapshot save, so there remains a small window (advance not yet snapshotted before a crash) identical to the accepted Kinesis behavior. This eliminates the catastrophic full-backlog re-replay.


Summary by cubic

Fixes multiple data-safety and pagination gaps across DynamoDB Streams, SQS, and Kinesis to prevent replays, panics, and unbounded growth. Restarts now resume correctly, SQS params are validated, Kinesis data is trimmed, and APIs align better with AWS behavior.

  • Bug Fixes
    • DynamoDB Streams → Lambda: Persist per-mapping checkpoints in DynamoDbState (snapshot-backed) so restarts resume instead of replaying TRIM_HORIZON; iterate all Lambda accounts and use the stream ARN’s account so cross-account mappings fire.
    • SQS: ReceiveMessage validates VisibilityTimeout (0..=43200) to avoid panics/early visibility; SetQueueAttributes validates redrive DLQ targets like CreateQueue; resume orphaned RUNNING/CANCELLING message-move tasks at startup and finalize FAILED when the source queue is gone.
    • Kinesis: Trim records past retention on write and shift Lambda checkpoints and shard-iterator offsets to keep them correct; DescribeStream paginates with Limit (capped at 100) and ExclusiveStartShardId; ListShards honors ShardFilter (AT_LATEST/TRIM_HORIZON/TIMESTAMP/AFTER_SHARD_ID/FROM_*).
    • Server: /_fakecloud/dynamodb/ttl-processor/tick now saves a snapshot after expirations so deletions persist across restarts.

Written for commit e193edb. Summary will update on new commits.

Review in cubic

…ts, SQS redrive/visibility validation + move-task resume, Kinesis retention trim + shard pagination

DynamoDB Streams -> Lambda poller (crates/fakecloud-server):
- Persist per-mapping checkpoints in DynamoDbState (rides the snapshot) so a
  restart resumes from the last delivered sequence instead of re-seeding
  TRIM_HORIZON and re-invoking the target Lambda with the whole backlog.
- Iterate all Lambda accounts and resolve the table from the stream ARN's
  account, so non-default / cross-account ESMs fire.

SQS:
- ReceiveMessage now range-validates the request-level VisibilityTimeout
  (0..=43200), returning InvalidParameterValue instead of panicking on an
  i64 overflow / making a message immediately visible on a negative value.
- SetQueueAttributes validates the redrive DLQ target the same way
  CreateQueue does (Terraform's aws_sqs_redrive_policy path).
- Resume in-progress message-move tasks left RUNNING/CANCELLING by a previous
  process at startup so ListMessageMoveTasks doesn't hang and the DLQ drain
  finishes (Failed when the source queue is gone).

Kinesis:
- Trim shard records past retention on write (shifting Lambda checkpoints and
  shard-iterator offsets) so they don't accumulate in memory + every snapshot.
- DescribeStream honors Limit (cap 100) / ExclusiveStartShardId and reports a
  real HasMoreShards; ListShards honors ShardFilter
  (AT_LATEST/AT_TRIM_HORIZON/AT_TIMESTAMP/AFTER_SHARD_ID/FROM_*).

Server:
- /_fakecloud/dynamodb/ttl-processor/tick saves the snapshot after expiring
  items so deletions survive a restart.

Tests: DDB poller no-replay-after-restart + cross-account fire; SQS huge/negative
VisibilityTimeout rejected, bad redrive ARN rejected, orphaned move-task resume;
Kinesis retention trim, DescribeStream pagination, ListShards ShardFilter.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant