fix(sqs,kinesis,server): data-safety cluster — durable DDB-streams checkpoints, SQS redrive/visibility validation + move-task resume, Kinesis retention trim + shard pagination by vieiralucas · Pull Request #2042 · faiscadev/fakecloud

vieiralucas · 2026-06-29T01:26:08Z

Fixes a cluster of confirmed data-safety / persistence / pagination bugs.

Fixes

HIGH - DynamoDB Streams -> Lambda checkpoints are now durable. They live in DynamoDbState (persisted via the snapshot, #[serde(default)] for back-compat) instead of an in-memory RwLock, so a restart resumes from the last delivered sequence number rather than re-seeding TRIM_HORIZON and re-invoking the target Lambda with the whole retained backlog (duplicate side effects). Mirrors KinesisState.lambda_checkpoints.
MEDIUM - That poller now iterates ALL Lambda accounts and resolves the table from the stream ARN's own account, so non-default / cross-account ESMs fire (was default-account-only).
MEDIUM - SQS ReceiveMessage range-validates the request-level VisibilityTimeout (0..=43200) and returns InvalidParameterValue. A near-i64::MAX value previously overflowed now + Duration::seconds(v) and panicked the worker thread; a negative value made the message immediately visible.
MEDIUM - SQS SetQueueAttributes validates the redrive DLQ target the same way CreateQueue does (Terraform's aws_sqs_redrive_policy uses SetQueueAttributes). A bad/nonexistent DLQ ARN no longer churns messages on the source forever.
MEDIUM - SQS resumes in-progress message-move tasks left RUNNING/CANCELLING by a previous process at startup, so ListMessageMoveTasks no longer hangs at RUNNING forever and the DLQ drain continues. A task whose source queue is gone is finalized FAILED.
MEDIUM - Kinesis shard records past retention are now trimmed on write (shifting index-based Lambda checkpoints and shard-iterator offsets so sequence/iterator correctness is preserved), instead of living in memory + every snapshot forever.
MEDIUM - Kinesis DescribeStream honors Limit (capped at 100) / ExclusiveStartShardId and reports a real HasMoreShards; ListShards honors ShardFilter (AT_LATEST / AT_TRIM_HORIZON / AT_TIMESTAMP / AFTER_SHARD_ID / FROM_*).
LOW - The /_fakecloud/dynamodb/ttl-processor/tick admin route saves the snapshot after expiring items, so the deletions survive a restart (the normal mutating path already persists).

Tests

DDB poller: deliver, "restart" (rebuild from the serde-round-tripped state), assert NO re-replay; a cross-account ESM fires.
SQS: huge + negative VisibilityTimeout rejected (no panic); SetQueueAttributes with a bad redrive ARN rejected; an orphaned RUNNING move task resumes and drains. Updated the prior unresolvable-DLQ test to delete the DLQ after config (config-time validation now applies; the runtime safety-net still holds).
Kinesis: records past retention trimmed with offsets shifted; DescribeStream paginates; ListShards honors a ShardFilter (incl. missing-ShardId rejection).

Validation

cargo build (all touched crates + server)
cargo fmt --all -- --check: clean
cargo clippy -p fakecloud-sqs -p fakecloud-kinesis -p fakecloud -p fakecloud-dynamodb --all-targets -- -D warnings: clean
cargo test -p fakecloud-sqs (170) / fakecloud-kinesis (123) / fakecloud-dynamodb (328): pass
cargo test -p fakecloud-e2e --no-run: compiles

Notes

Checkpoint durability mirrors the Kinesis poller: the checkpoint lives in the persisted state and is flushed by the next snapshot save, so there remains a small window (advance not yet snapshotted before a crash) identical to the accepted Kinesis behavior. This eliminates the catastrophic full-backlog re-replay.

Summary by cubic

Fixes multiple data-safety and pagination gaps across DynamoDB Streams, SQS, and Kinesis to prevent replays, panics, and unbounded growth. Restarts now resume correctly, SQS params are validated, Kinesis data is trimmed, and APIs align better with AWS behavior.

Bug Fixes
- DynamoDB Streams → Lambda: Persist per-mapping checkpoints in DynamoDbState (snapshot-backed) so restarts resume instead of replaying TRIM_HORIZON; iterate all Lambda accounts and use the stream ARN’s account so cross-account mappings fire.
- SQS: ReceiveMessage validates VisibilityTimeout (0..=43200) to avoid panics/early visibility; SetQueueAttributes validates redrive DLQ targets like CreateQueue; resume orphaned RUNNING/CANCELLING message-move tasks at startup and finalize FAILED when the source queue is gone.
- Kinesis: Trim records past retention on write and shift Lambda checkpoints and shard-iterator offsets to keep them correct; DescribeStream paginates with Limit (capped at 100) and ExclusiveStartShardId; ListShards honors ShardFilter (AT_LATEST/TRIM_HORIZON/TIMESTAMP/AFTER_SHARD_ID/FROM_*).
- Server: /_fakecloud/dynamodb/ttl-processor/tick now saves a snapshot after expirations so deletions persist across restarts.

^{Written for commit e193edb. Summary will update on new commits.}

…ts, SQS redrive/visibility validation + move-task resume, Kinesis retention trim + shard pagination DynamoDB Streams -> Lambda poller (crates/fakecloud-server): - Persist per-mapping checkpoints in DynamoDbState (rides the snapshot) so a restart resumes from the last delivered sequence instead of re-seeding TRIM_HORIZON and re-invoking the target Lambda with the whole backlog. - Iterate all Lambda accounts and resolve the table from the stream ARN's account, so non-default / cross-account ESMs fire. SQS: - ReceiveMessage now range-validates the request-level VisibilityTimeout (0..=43200), returning InvalidParameterValue instead of panicking on an i64 overflow / making a message immediately visible on a negative value. - SetQueueAttributes validates the redrive DLQ target the same way CreateQueue does (Terraform's aws_sqs_redrive_policy path). - Resume in-progress message-move tasks left RUNNING/CANCELLING by a previous process at startup so ListMessageMoveTasks doesn't hang and the DLQ drain finishes (Failed when the source queue is gone). Kinesis: - Trim shard records past retention on write (shifting Lambda checkpoints and shard-iterator offsets) so they don't accumulate in memory + every snapshot. - DescribeStream honors Limit (cap 100) / ExclusiveStartShardId and reports a real HasMoreShards; ListShards honors ShardFilter (AT_LATEST/AT_TRIM_HORIZON/AT_TIMESTAMP/AFTER_SHARD_ID/FROM_*). Server: - /_fakecloud/dynamodb/ttl-processor/tick saves the snapshot after expiring items so deletions survive a restart. Tests: DDB poller no-replay-after-restart + cross-account fire; SQS huge/negative VisibilityTimeout rejected, bad redrive ARN rejected, orphaned move-task resume; Kinesis retention trim, DescribeStream pagination, ListShards ShardFilter.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(sqs,kinesis,server): data-safety cluster — durable DDB-streams checkpoints, SQS redrive/visibility validation + move-task resume, Kinesis retention trim + shard pagination#2042

fix(sqs,kinesis,server): data-safety cluster — durable DDB-streams checkpoints, SQS redrive/visibility validation + move-task resume, Kinesis retention trim + shard pagination#2042
vieiralucas wants to merge 1 commit into
mainfrom
wt-bug-datasafety

vieiralucas commented Jun 29, 2026 •

edited by cubic-dev-ai Bot

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

vieiralucas commented Jun 29, 2026 • edited by cubic-dev-ai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Fixes

Tests

Validation

Notes

Summary by cubic

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

vieiralucas commented Jun 29, 2026 •

edited by cubic-dev-ai Bot

Loading