Add filtering of large record before combine/sorting by xinyu-liu-glean · Pull Request #4 · askscio/beam

xinyu-liu-glean · 2025-06-11T17:36:50Z

This patch adds a filter that drops records larger than 5MB. For customers like Adobe, we saw that some doc records can be around 100MB, which does not work in the Flink in-memory sorter. This step lets us skip those records and keep processing.

Tested on Adobe evalsets and verified that it works.
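
For readers without access to the diff, the idea can be sketched in plain Java. This is not the code from this PR: the real change lives in FlinkBatchPortablePipelineTranslator.java and works on the Flink batch pipeline, whereas the sketch below uses ordinary lists. The class name `LargeRecordFilter`, the method `dropLargeRecords`, and the reading of 5MB as 5 * 1024 * 1024 bytes are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the size filter described above; not the PR's code. */
final class LargeRecordFilter {
    // Assumed cap: "5MB" interpreted as 5 * 1024 * 1024 bytes.
    static final long MAX_RECORD_BYTES = 5L * 1024 * 1024;

    private LargeRecordFilter() {}

    /** Returns only the encoded records whose size is at most maxBytes. */
    static List<byte[]> dropLargeRecords(List<byte[]> encodedRecords, long maxBytes) {
        List<byte[]> kept = new ArrayList<>();
        for (byte[] record : encodedRecords) {
            // Records larger than the cap are skipped so the rest can still
            // reach the combine/sort step.
            if (record.length <= maxBytes) {
                kept.add(record);
            }
        }
        return kept;
    }
}
```

Calling `LargeRecordFilter.dropLargeRecords(records, LargeRecordFilter.MAX_RECORD_BYTES)` before the sort would pass through everything at or below the cap and silently drop the rest, which matches the "skip these records and keep processing" behavior described above.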

ranjithkumar-glean

any way to test this before trying on adobe?

ranjithkumar-glean · 2025-06-12T22:54:50Z

Xinyu's 2w progress has more than our 2y progress 🚀

xinyu-liu-glean · 2025-06-12T23:01:39Z

any way to test this before trying on adobe?

Oh, it's already tested on adobe. All the records greater than 5MB have been filtered out in the latest run :)

steve-scio

🐐

.../flink/src/main/java/org/apache/beam/runners/flink/FlinkBatchPortablePipelineTranslator.java

Add filtering of large record before combine (sorting)

de03f40

xinyu-liu-glean changed the title ~~Add filtering of large record before combine (sorting)~~ [Don't Merge] Add filtering of large record before combine (sorting) Jun 11, 2025

xinyu-liu-glean added 2 commits June 12, 2025 10:02

Update the cap to be 5MB

8f0b2a6

Unwanted changes

e3550a7

xinyu-liu-glean changed the title ~~[Don't Merge] Add filtering of large record before combine (sorting)~~ Add filtering of large record before combine/sorting Jun 12, 2025

xinyu-liu-glean requested review from ranjithkumar-glean and steve-scio June 12, 2025 22:47

ranjithkumar-glean approved these changes Jun 12, 2025

steve-scio approved these changes Jun 12, 2025

.../flink/src/main/java/org/apache/beam/runners/flink/FlinkBatchPortablePipelineTranslator.java (review thread, outdated and resolved)

Add more glean specific comments

611ef2f

xinyu-liu-glean merged commit 734f12a into timmy-2.59 Jun 12, 2025
1 of 2 checks passed

xinyu-liu-glean added a commit that referenced this pull request Jun 18, 2025

Add filtering of large record before combine/sorting (#4)

3afc5cf

xinyu-liu-glean merged 4 commits into timmy-2.59 from large-record