
Add filtering of large record before combine/sorting #4

Merged
xinyu-liu-glean merged 4 commits into timmy-2.59 from large-record on Jun 12, 2025

xinyu-liu-glean commented Jun 11, 2025

This patch adds a filter that drops records larger than 5MB. For customers like adobe, we saw that some doc records can be around 100MB, which will not work in the Flink in-memory sorter. With this step we can skip these records and keep processing.

Tested in adobe evalsets and verified working.
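
For readers without the diff at hand, here is a minimal standalone sketch of the kind of size filter described above. It is not the code in this patch: the names `drop_large_records`, `encode`, and `on_drop` are hypothetical, the threshold is assumed to mean 5 * 1024 * 1024 encoded bytes, and it uses a plain Python generator rather than any Beam or Flink API.

```python
from typing import Callable, Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")

# Assumption: "5MB" is taken as 5 * 1024 * 1024 bytes of the encoded record.
MAX_RECORD_BYTES = 5 * 1024 * 1024


def drop_large_records(
    records: Iterable[T],
    encode: Callable[[T], bytes],
    max_bytes: int = MAX_RECORD_BYTES,
    on_drop: Optional[Callable[[T, int], None]] = None,
) -> Iterator[T]:
    """Yield only the records whose encoded size is at most max_bytes.

    Records over the limit are skipped instead of raising, so the
    downstream combine/sort step never sees them.
    """
    for record in records:
        size = len(encode(record))
        if size > max_bytes:
            # Hypothetical hook for counting or logging skipped records.
            if on_drop is not None:
                on_drop(record, size)
            continue
        yield record


if __name__ == "__main__":
    docs = [b"a" * 10, b"b" * (MAX_RECORD_BYTES + 1), b"c" * MAX_RECORD_BYTES]
    dropped_sizes = []
    kept = list(
        drop_large_records(
            docs, encode=bytes, on_drop=lambda rec, size: dropped_sizes.append(size)
        )
    )
    print(len(kept), dropped_sizes)
```

In this sketch a record of exactly 5MB is kept and only strictly larger ones are dropped, matching the "greater than 5MB" wording. Reporting dropped records through a callback is one possible design choice for keeping skipped data visible.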



xinyu-liu-glean changed the title from "Add filtering of large record before combine (sorting)" to "[Don't Merge] Add filtering of large record before combine (sorting)" on Jun 11, 2025
xinyu-liu-glean changed the title from "[Don't Merge] Add filtering of large record before combine (sorting)" to "Add filtering of large record before combine/sorting" on Jun 12, 2025

ranjithkumar-glean left a comment


any way to test this before trying on adobe?


ranjithkumar-glean commented Jun 12, 2025

Xinyu's 2w progress has more than our 2y progress 🚀

xinyu-liu-glean (Author) commented

> any way to test this before trying on adobe?

Oh, it's already tested on adobe. All the records greater than 5MB have been filtered out in the latest run :)


steve-scio left a comment


🐐

xinyu-liu-glean merged commit 734f12a into timmy-2.59 on Jun 12, 2025
1 of 2 checks passed