Add filtering of large record before combine/sorting #4

Merged

xinyu-liu-glean merged 4 commits into timmy-2.59 on Jun 12, 2025
Conversation
ranjithkumar-glean approved these changes Jun 12, 2025
ranjithkumar-glean left a comment:
any way to test this before trying on adobe?
Xinyu's 2w progress has more than our 2y progress 🚀
Author (xinyu-liu-glean):

Oh, it's already tested on adobe. All the records greater than 5MB have been filtered out in the latest run :)
steve-scio approved these changes Jun 12, 2025
Review comment on .../flink/src/main/java/org/apache/beam/runners/flink/FlinkBatchPortablePipelineTranslator.java (outdated, resolved)
xinyu-liu-glean added a commit that referenced this pull request on Jun 18, 2025
This patch adds a filter that drops records larger than 5 MB before combining/sorting. For customers like adobe, we saw that some doc records can be around 100 MB, which does not work with the Flink in-memory sorter. This step lets us skip those records and keep processing.

Tested on adobe evalsets and verified working.
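The actual change lives in FlinkBatchPortablePipelineTranslator, but the core idea can be sketched independently. The snippet below is a minimal, hypothetical illustration, not the patch itself: it measures each record's serialized size and drops anything over a 5 MB threshold so oversized records never reach the in-memory sorter. The `serializedSize` helper (plain Java serialization standing in for the record's Beam coder) and the threshold constant are assumptions for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.List;
import java.util.stream.Collectors;

public class LargeRecordFilter {
    // Hypothetical constant matching the 5 MB limit described in the PR.
    static final long MAX_RECORD_BYTES = 5L * 1024 * 1024;

    // Measure serialized size with Java serialization; the real code would
    // use the record's coder. This is only an illustrative stand-in.
    static long serializedSize(Serializable record) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(record);
        }
        return bytes.size();
    }

    // Keep only records at or under the threshold; oversized records are
    // skipped so downstream sorting can proceed instead of failing.
    static <T extends Serializable> List<T> filterOversized(List<T> records) {
        return records.stream().filter(r -> {
            try {
                return serializedSize(r) <= MAX_RECORD_BYTES;
            } catch (IOException e) {
                return false; // drop records that cannot be serialized
            }
        }).collect(Collectors.toList());
    }
}
```

In a real pipeline this check would sit just before the sort step, and a production version would likely also log or count the dropped records for observability.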
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

- Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
- Update CHANGES.md with noteworthy changes.
- See the Contributor Guide for more tips on how to make the review process smoother.
To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.