fix(sqs,kinesis,server): data-safety cluster: durable DDB-streams checkpoints, SQS redrive/visibility validation + move-task resume, Kinesis retention trim + shard pagination #2042
Open
vieiralucas wants to merge 1 commit into
…ts, SQS redrive/visibility validation + move-task resume, Kinesis retention trim + shard pagination

DynamoDB Streams -> Lambda poller (crates/fakecloud-server):
- Persist per-mapping checkpoints in DynamoDbState (rides the snapshot) so a restart resumes from the last delivered sequence instead of re-seeding TRIM_HORIZON and re-invoking the target Lambda with the whole backlog.
- Iterate all Lambda accounts and resolve the table from the stream ARN's account, so non-default / cross-account ESMs fire.

SQS:
- ReceiveMessage now range-validates the request-level VisibilityTimeout (0..=43200), returning InvalidParameterValue instead of panicking on an i64 overflow / making a message immediately visible on a negative value.
- SetQueueAttributes validates the redrive DLQ target the same way CreateQueue does (Terraform's aws_sqs_redrive_policy path).
- Resume in-progress message-move tasks left RUNNING/CANCELLING by a previous process at startup so ListMessageMoveTasks doesn't hang and the DLQ drain finishes (Failed when the source queue is gone).

Kinesis:
- Trim shard records past retention on write (shifting Lambda checkpoints and shard-iterator offsets) so they don't accumulate in memory + every snapshot.
- DescribeStream honors Limit (cap 100) / ExclusiveStartShardId and reports a real HasMoreShards; ListShards honors ShardFilter (AT_LATEST/AT_TRIM_HORIZON/AT_TIMESTAMP/AFTER_SHARD_ID/FROM_*).

Server:
- /_fakecloud/dynamodb/ttl-processor/tick saves the snapshot after expiring items so deletions survive a restart.

Tests: DDB poller no-replay-after-restart + cross-account fire; SQS huge/negative VisibilityTimeout rejected, bad redrive ARN rejected, orphaned move-task resume; Kinesis retention trim, DescribeStream pagination, ListShards ShardFilter.
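The `VisibilityTimeout` check described in the SQS notes above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: it uses `std::time` rather than the `Duration::seconds` call the original handler used, and `SqsError`, `validate_visibility_timeout`, and `visible_at` are hypothetical names chosen for the example.

```rust
use std::time::{Duration, SystemTime};

/// Upper bound in seconds (12 hours), matching the 0..=43200 range quoted above.
const MAX_VISIBILITY_TIMEOUT_SECS: i64 = 43_200;

/// Hypothetical error type standing in for the service's InvalidParameterValue error.
#[derive(Debug, PartialEq)]
pub enum SqsError {
    InvalidParameterValue(String),
}

/// Range-check the request-level value before it reaches any time arithmetic.
pub fn validate_visibility_timeout(raw: i64) -> Result<Duration, SqsError> {
    if !(0..=MAX_VISIBILITY_TIMEOUT_SECS).contains(&raw) {
        return Err(SqsError::InvalidParameterValue(format!(
            "VisibilityTimeout must be between 0 and {}, got {}",
            MAX_VISIBILITY_TIMEOUT_SECS, raw
        )));
    }
    // Safe cast: the range check above guarantees a non-negative value.
    Ok(Duration::from_secs(raw as u64))
}

/// Compute when the message becomes visible again.
/// `checked_add` returns None instead of panicking if the sum is not representable.
pub fn visible_at(now: SystemTime, raw: i64) -> Result<SystemTime, SqsError> {
    let timeout = validate_visibility_timeout(raw)?;
    now.checked_add(timeout).ok_or_else(|| {
        SqsError::InvalidParameterValue("VisibilityTimeout overflows the clock".to_string())
    })
}
```

Validating first means neither a negative value nor a near-`i64::MAX` value ever reaches the addition; the checked addition is a second line of defense rather than the primary guard.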
Fixes a cluster of confirmed data-safety / persistence / pagination bugs.
Fixes
- Per-mapping poller checkpoints now live in `DynamoDbState` (persisted via the snapshot, `#[serde(default)]` for back-compat) instead of an in-memory `RwLock`, so a restart resumes from the last delivered sequence number rather than re-seeding TRIM_HORIZON and re-invoking the target Lambda with the whole retained backlog (duplicate side effects). Mirrors `KinesisState.lambda_checkpoints`.
- `ReceiveMessage` range-validates the request-level `VisibilityTimeout` (0..=43200) and returns `InvalidParameterValue`. A near-`i64::MAX` value previously overflowed `now + Duration::seconds(v)` and panicked the worker thread; a negative value made the message immediately visible.
- `SetQueueAttributes` validates the redrive DLQ target the same way `CreateQueue` does (Terraform's `aws_sqs_redrive_policy` uses `SetQueueAttributes`). A bad/nonexistent DLQ ARN no longer churns messages on the source forever.
- Message-move tasks left RUNNING/CANCELLING by a previous process are resumed at startup, so `ListMessageMoveTasks` no longer hangs at RUNNING forever and the DLQ drain continues. A task whose source queue is gone is finalized FAILED.
- `DescribeStream` honors `Limit` (capped at 100) / `ExclusiveStartShardId` and reports a real `HasMoreShards`; `ListShards` honors `ShardFilter` (AT_LATEST / AT_TRIM_HORIZON / AT_TIMESTAMP / AFTER_SHARD_ID / FROM_*).
- The `/_fakecloud/dynamodb/ttl-processor/tick` admin route saves the snapshot after expiring items, so the deletions survive a restart (the normal mutating path already persists).

Tests
- Huge/negative `VisibilityTimeout` rejected (no panic); `SetQueueAttributes` with a bad redrive ARN rejected; an orphaned RUNNING move task resumes and drains. Updated the prior unresolvable-DLQ test to delete the DLQ after config (config-time validation now applies; the runtime safety net still holds).
- `DescribeStream` paginates; `ListShards` honors a `ShardFilter` (incl. missing-ShardId rejection).

Validation
Notes
Checkpoint durability mirrors the Kinesis poller: the checkpoint lives in the persisted state and is flushed by the next snapshot save, so there remains a small window (advance not yet snapshotted before a crash) identical to the accepted Kinesis behavior. This eliminates the catastrophic full-backlog re-replay.
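To make the durability window concrete, here is a small self-contained sketch of checkpoint-based resume. It is illustrative only: `PersistedState`, `StartPosition`, `start_position`, `pending`, and `record_delivery` are hypothetical names, plain `u64` values stand in for stream sequence numbers, and a `clone()` stands in for a snapshot save.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the persisted state (the real `DynamoDbState` holds much more).
#[derive(Debug, Default, Clone)]
pub struct PersistedState {
    /// Last delivered sequence number per event-source-mapping id.
    pub checkpoints: HashMap<String, u64>,
}

/// Where a poller starts reading for a given mapping.
#[derive(Debug, PartialEq)]
pub enum StartPosition {
    TrimHorizon,
    AfterSequence(u64),
}

/// With a stored checkpoint, resume after it; otherwise fall back to TRIM_HORIZON.
pub fn start_position(state: &PersistedState, mapping_id: &str) -> StartPosition {
    match state.checkpoints.get(mapping_id) {
        Some(&seq) => StartPosition::AfterSequence(seq),
        None => StartPosition::TrimHorizon,
    }
}

/// Sequence numbers from the backlog that still need to be delivered.
pub fn pending(backlog: &[u64], pos: &StartPosition) -> Vec<u64> {
    backlog
        .iter()
        .copied()
        .filter(|&seq| match pos {
            StartPosition::TrimHorizon => true,
            StartPosition::AfterSequence(last) => seq > *last,
        })
        .collect()
}

/// Advance the checkpoint after a successful delivery; it never moves backwards.
pub fn record_delivery(state: &mut PersistedState, mapping_id: &str, seq: u64) {
    let entry = state.checkpoints.entry(mapping_id.to_string()).or_insert(seq);
    if seq > *entry {
        *entry = seq;
    }
}
```

If the process delivers up to sequence 5 but the last snapshot only captured sequence 3, a restart from that snapshot redelivers 4 and 5. That is the small window described above, as opposed to redelivering the whole backlog when no checkpoint is persisted at all.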
Summary by cubic
Fixes multiple data-safety and pagination gaps across DynamoDB Streams, SQS, and Kinesis to prevent replays, panics, and unbounded growth. Restarts now resume correctly, SQS params are validated, Kinesis data is trimmed, and APIs align better with AWS behavior.
- Checkpoints are kept in `DynamoDbState` (snapshot-backed) so restarts resume instead of replaying TRIM_HORIZON; iterate all Lambda accounts and use the stream ARN’s account so cross-account mappings fire.
- `ReceiveMessage` validates `VisibilityTimeout` (0..=43200) to avoid panics/early visibility; `SetQueueAttributes` validates redrive DLQ targets like `CreateQueue`; resume orphaned RUNNING/CANCELLING message-move tasks at startup and finalize FAILED when the source queue is gone.
- `DescribeStream` paginates with `Limit` (capped at 100) and `ExclusiveStartShardId`; `ListShards` honors `ShardFilter` (AT_LATEST/TRIM_HORIZON/TIMESTAMP/AFTER_SHARD_ID/FROM_*).
- `/_fakecloud/dynamodb/ttl-processor/tick` now saves a snapshot after expirations so deletions persist across restarts.

Written for commit e193edb. Summary will update on new commits.
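The `DescribeStream` pagination behavior summarized above (`Limit` capped at 100, `ExclusiveStartShardId`, a computed `HasMoreShards`) could look roughly like the sketch below. It is an assumption-laden illustration, not the PR's code: `ShardPage` and `page_shards` are hypothetical names, shard ids are assumed to be sorted and compared lexicographically, and clamping a too-small limit up to 1 is a simplification made here.

```rust
/// Cap on the page size, per the "capped at 100" note above.
const MAX_LIMIT: usize = 100;

/// Hypothetical page result: the shard ids returned plus the HasMoreShards flag.
#[derive(Debug, PartialEq)]
pub struct ShardPage {
    pub shard_ids: Vec<String>,
    pub has_more_shards: bool,
}

/// Return one page of shard ids.
/// Assumes `all` is sorted, so "after the exclusive start id" is a lexicographic comparison.
pub fn page_shards(
    all: &[String],
    limit: Option<usize>,
    exclusive_start: Option<&str>,
) -> ShardPage {
    // Assumption: a missing limit uses the cap, and out-of-range values are clamped.
    let limit = limit.unwrap_or(MAX_LIMIT).clamp(1, MAX_LIMIT);

    // Keep only shards strictly after the exclusive start id, if one was given.
    let remaining: Vec<&String> = all
        .iter()
        .filter(|s| exclusive_start.map_or(true, |id| s.as_str() > id))
        .collect();

    // There are more shards only if something is left over after this page.
    let has_more_shards = remaining.len() > limit;
    let shard_ids = remaining.into_iter().take(limit).cloned().collect();

    ShardPage { shard_ids, has_more_shards }
}
```

A caller pages by passing the last shard id of one page as `exclusive_start` for the next, stopping once `has_more_shards` is false.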