Add streaming merge-join for remote sync sources #3461

Open
pbanakar-microsoft wants to merge 3 commits into mover/c2c-stage from users/pbanakar/streaming-merge-join

@pbanakar-microsoft
Collaborator

  • New syncMergeJoin.go: O(1)-memory streaming merge-join for Blob/S3/BlobFS sources that guarantee lexicographic listing order
  • Disable memory/file/goroutine throttling on the merge-join path to avoid the ReadMemStats stop-the-world (STW) bottleneck
  • Set inner EnumerationParallelism=1 (the outer crawl provides the parallelism)
  • Default merge-join parallelism: 500 (configurable via AZCOPY_MERGE_JOIN_PARALLELISM)
  • CrawlWithStats: expose live ActiveWorkers/QueuedDirs counters
  • Fix syncComparator: when both change times are zero, metadata is no longer flagged as changed
  • Add diagnostic [STEP]/[SLOW-STEP] logging for performance analysis
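
For reviewers unfamiliar with the technique, here is a minimal sketch of a streaming merge-join over two listings that are already sorted by path. It is not the code in syncMergeJoin.go: the `object`, `action`, `mergeJoin`, `stream`, and `plan` names, the integer timestamp, and the copy/delete/skip outcomes are all assumptions made for illustration. The point it shows is that only one entry from each side is held at a time, which is where the O(1) memory bound comes from.

```go
package main

import "fmt"

// object is a hypothetical, simplified listing entry (not AzCopy's type).
type object struct {
	relativePath string
	lastModified int64 // Unix seconds; assumed representation
}

// action is a hypothetical outcome of comparing the two listings.
type action struct {
	op   string // "copy", "delete", or "skip"
	path string
}

// mergeJoin walks two streams that are both sorted by relativePath in
// byte-wise order and emits one action per distinct path. It keeps only
// the current head of each stream in memory.
func mergeJoin(src, dst <-chan object, emit func(action)) {
	s, sOK := <-src
	d, dOK := <-dst
	for sOK || dOK {
		switch {
		case !dOK || (sOK && s.relativePath < d.relativePath):
			// Path exists only at the source.
			emit(action{op: "copy", path: s.relativePath})
			s, sOK = <-src
		case !sOK || d.relativePath < s.relativePath:
			// Path exists only at the destination.
			emit(action{op: "delete", path: d.relativePath})
			d, dOK = <-dst
		default:
			// Same path on both sides: compare timestamps.
			if s.lastModified > d.lastModified {
				emit(action{op: "copy", path: s.relativePath})
			} else {
				emit(action{op: "skip", path: s.relativePath})
			}
			s, sOK = <-src
			d, dOK = <-dst
		}
	}
}

// stream feeds a slice into a channel, standing in for a paged listing call.
func stream(objs []object) <-chan object {
	ch := make(chan object, 1)
	go func() {
		defer close(ch)
		for _, o := range objs {
			ch <- o
		}
	}()
	return ch
}

// plan runs mergeJoin over two in-memory listings and collects the actions.
func plan(src, dst []object) []string {
	var out []string
	mergeJoin(stream(src), stream(dst), func(a action) {
		out = append(out, a.op+" "+a.path)
	})
	return out
}

func main() {
	src := []object{{"a", 1}, {"b", 5}, {"d", 1}}
	dst := []object{{"b", 2}, {"c", 1}, {"d", 1}}
	for _, line := range plan(src, dst) {
		fmt.Println(line)
	}
}
```

The sketch is only correct if both sides list in the same byte-wise order, which is why the description restricts the path to sources that guarantee lexicographic listing order.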

Description

  • Feature / Bug Fix: (Brief description of the feature or issue being addressed)

  • Related Links:
      • Issues
      • Team thread
      • Documents
      • [Email Subject]

Type of Change

  • Bug fix
  • New feature
  • Documentation update required
  • Code quality improvement
  • Other (describe):

How Has This Been Tested?

Thank you for your contribution to AzCopy!

- Add isSelfReferentialDirSentinel() to detect BlobFS/GCP directory
  sentinels with an empty relativePath
- Schedule ACL copy transfers for sentinels instead of silently skipping them
- Prevent re-enqueueing of sentinel dirs (avoids infinite loops)
- Reduce mergeJoinChannelBufferSize from 10K to 1K (430KB vs 8MB per channel)
- Track originalRelativePath before buildChildPath rewrites it
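
A rough sketch of how the sentinel handling described above might be structured, assuming a sentinel is a directory entry whose path relative to the listed directory is empty (the directory describing itself). The `entry` type and the `looksLikeSelfSentinel` and `classify` functions are hypothetical and are not the PR's isSelfReferentialDirSentinel(); the real check may look at more than these two fields.

```go
package main

import "fmt"

// entry is a hypothetical, simplified listing result (not AzCopy's type).
type entry struct {
	relativePath string // path relative to the directory being listed
	isDir        bool
}

// looksLikeSelfSentinel reports whether e is a directory entry with an
// empty relative path, i.e. the listed directory returned as its own child.
// Assumption: this is the condition the PR calls a self-referential sentinel.
func looksLikeSelfSentinel(e entry) bool {
	return e.isDir && e.relativePath == ""
}

// classify returns what a crawler could do with an entry:
//   "acl-only": schedule an ACL copy but do not recurse (avoids re-listing
//               the same directory forever)
//   "enqueue":  a real subdirectory, queue it for crawling
//   "compare":  a regular object, hand it to the comparator
func classify(e entry) string {
	switch {
	case looksLikeSelfSentinel(e):
		return "acl-only"
	case e.isDir:
		return "enqueue"
	default:
		return "compare"
	}
}

func main() {
	entries := []entry{{"", true}, {"sub", true}, {"file.txt", false}}
	for _, e := range entries {
		fmt.Printf("%q -> %s\n", e.relativePath, classify(e))
	}
}
```

Checking for the sentinel before the generic directory case is what keeps it out of the crawl queue while still letting its ACLs be transferred.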