
Conversation

@ATHARVA262005

Description

This PR addresses a scalability bottleneck in SparkReceiverIO where data reading was previously restricted to a single worker.

Previously, the Read transform used Impulse.create(), which limited the downstream Splittable DoFn execution to a single initial restriction. This change introduces proper parallelization logic, allowing users to configure multiple concurrent readers.

Key Changes:

  • Added withNumReaders(int): A new configuration option to the SparkReceiverIO builder.
  • Parallel Work Distribution: Implemented the Create.of(shards) + Reshuffle pattern in expand(). When numReaders > 1, reading tasks are distributed across the available Beam workers (a sketch follows this list).
  • DoFn Refactoring: Updated ReadFromSparkReceiverWithOffsetDoFn to accept explicit Integer shard identifiers instead of generic byte[].
  • Backward Compatibility: Strict backward compatibility is preserved. If numReaders is not set or is set to 1, the IO behaves exactly as before.
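
For reviewers, here is a minimal sketch of the intended expand() shape when numReaders > 1. Only Create, Reshuffle, and ReadFromSparkReceiverWithOffsetDoFn are taken from this PR; the getter name and step labels are illustrative and may differ from the final code:

```java
// Sketch only; not the exact code in this PR.
@Override
public PCollection<V> expand(PBegin input) {
  int numReaders = getNumReaders() == null ? 1 : getNumReaders();

  // One Integer element per shard. The subsequent Reshuffle breaks fusion so
  // the shard elements can be picked up by different workers before the
  // splittable read DoFn starts polling the receiver.
  List<Integer> shards = new ArrayList<>();
  for (int i = 0; i < numReaders; i++) {
    shards.add(i);
  }

  return input
      .apply("CreateShards", Create.of(shards))
      .apply("DistributeShards", Reshuffle.viaRandomKey())
      .apply("ReadFromReceiver", ParDo.of(new ReadFromSparkReceiverWithOffsetDoFn<>(this)));
}
```

The Reshuffle is the important piece: without it a runner may fuse the Create step with the read DoFn and still execute all shards on a single worker.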

Issue Links

Fixes #37410

Type of change

  • New feature

Tests

I have added a new unit test, testReadFromCustomReceiverWithParallelism, to SparkReceiverIOTest.java (a rough configuration sketch follows the list below). It verifies that:

  1. The IO correctly initializes with multiple readers (configured to 3).
  2. All records are read without duplication or loss (verified 60 total records from 3 simulated sources).
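
A rough sketch of the configuration the test exercises. CustomReceiverWithOffset and withNumReaders come from this PR and the existing test suite; the other builder calls and the assertion are assumptions and may not match the actual test code:

```java
// Rough sketch; see testReadFromCustomReceiverWithParallelism for the real test.
ReceiverBuilder<String, CustomReceiverWithOffset> receiverBuilder =
    new ReceiverBuilder<>(CustomReceiverWithOffset.class).withConstructorArgs();

SparkReceiverIO.Read<String> read =
    SparkReceiverIO.<String>read()
        .withSparkReceiverBuilder(receiverBuilder)
        .withGetOffsetFn(Long::valueOf) // test records are numeric strings
        .withNumReaders(3);             // the new option under test

PCollection<String> output = pipeline.apply(read);
PAssert.that(output).containsInAnyOrder(expected); // expected list built from RECORDS_COUNT
pipeline.run().waitUntilFinish();
```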

Running the tests:

./gradlew :sdks:java:io:sparkreceiver:3:test
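
To run only the new test, Gradle's standard test filtering should work (the filter pattern here is just an example):

./gradlew :sdks:java:io:sparkreceiver:3:test --tests "*SparkReceiverIOTest*"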

------------------------

Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

 - [ ] Mention the appropriate issue in your description (for example: `addresses #123`), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment `fixes #<ISSUE NUMBER>` instead.
 - [ ] Update `CHANGES.md` with noteworthy changes.
 - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).

See the [Contributor Guide](https://beam.apache.org/contribute) for more tips on [how to make review process smoother](https://github.com/apache/beam/blob/master/CONTRIBUTING.md#make-the-reviewers-job-easier).

To check the build health, please visit [https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md](https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md)

GitHub Actions Tests Status (on master branch)
------------------------------------------------------------------------------------------------
[![Build python source distribution and wheels](https://github.com/apache/beam/actions/workflows/build_wheels.yml/badge.svg?event=schedule&&?branch=master)](https://github.com/apache/beam/actions?query=workflow%3A%22Build+python+source+distribution+and+wheels%22+branch%3Amaster+event%3Aschedule)
[![Python tests](https://github.com/apache/beam/actions/workflows/python_tests.yml/badge.svg?event=schedule&&?branch=master)](https://github.com/apache/beam/actions?query=workflow%3A%22Python+Tests%22+branch%3Amaster+event%3Aschedule)
[![Java tests](https://github.com/apache/beam/actions/workflows/java_tests.yml/badge.svg?event=schedule&&?branch=master)](https://github.com/apache/beam/actions?query=workflow%3A%22Java+Tests%22+branch%3Amaster+event%3Aschedule)
[![Go tests](https://github.com/apache/beam/actions/workflows/go_tests.yml/badge.svg?event=schedule&&?branch=master)](https://github.com/apache/beam/actions?query=workflow%3A%22Go+tests%22+branch%3Amaster+event%3Aschedule)

See [CI.md](https://github.com/apache/beam/blob/master/CI.md) for more information about GitHub Actions CI or the [workflows README](https://github.com/apache/beam/blob/master/.github/workflows/README.md) to see a list of phrases to trigger workflows.

…Readers

Adds withNumReaders() to SparkReceiverIO and implements parallel execution using Create.of(shards) + Reshuffle. This addresses scalability limitations by allowing work distribution across multiple workers.
Copilot AI review requested due to automatic review settings January 24, 2026 20:34
@gemini-code-assist
Contributor

Summary of Changes

Hello @ATHARVA262005, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the scalability of SparkReceiverIO by introducing a mechanism for parallel data reading. It moves away from a single-worker limitation by allowing users to configure the number of concurrent readers, effectively distributing the data ingestion workload across multiple Beam workers. This change improves throughput and resource utilization for Spark-based data sources within Beam pipelines, while maintaining compatibility with existing configurations.

Highlights

  • Parallel Reading Implementation: Introduces parallel reading capability to SparkReceiverIO to address a single-worker bottleneck, significantly improving scalability.
  • New Configuration Option: Adds a new withNumReaders(int) configuration option to the SparkReceiverIO builder, allowing users to specify the number of concurrent readers.
  • Parallel Work Distribution: Implements a Create.of(shards) and Reshuffle pattern in the expand() method to distribute reading tasks across multiple Beam workers when numReaders is greater than 1.
  • DoFn Refactoring: Refactors ReadFromSparkReceiverWithOffsetDoFn to accept Integer shard identifiers instead of byte[], enhancing type safety and integration with the new parallelization logic.
  • Backward Compatibility: Ensures strict backward compatibility, reverting to single-worker behavior if numReaders is not set or explicitly set to 1.



Copilot AI left a comment


Pull request overview

This PR attempts to add parallel reading capability to SparkReceiverIO to address a scalability bottleneck where data reading was previously restricted to a single worker. The implementation adds a withNumReaders(int) configuration option and changes the internal architecture from using Impulse.create() to using Create.of(shards) + Reshuffle for distributing work across multiple workers.

Changes:

  • Added withNumReaders(int) builder method to configure the number of concurrent readers
  • Modified ReadFromSparkReceiverWithOffsetDoFn to accept Integer shard identifiers instead of byte[]
  • Implemented parallel work distribution using Create.of(shards) + Reshuffle pattern when numReaders > 1
  • Added test case testReadFromCustomReceiverWithParallelism to verify parallel reading behavior
  • Updated CHANGES.md to document the new feature

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Summary per file:

  • SparkReceiverIO.java: Added numReaders configuration, modified the expand() method to support parallel reading with sharding, changed the DoFn input type from byte[] to Integer
  • ReadFromSparkReceiverWithOffsetDoFn.java: Updated the DoFn signature to accept Integer elements instead of byte[], reformatted some comments
  • SparkReceiverIOTest.java: Added a new test case for parallel reading with 3 readers
  • ReadFromSparkReceiverWithOffsetDoFnTest.java: Updated the test element type from byte[] to Integer
  • CHANGES.md: Added an entry documenting parallel reading support


Comment on lines +171 to +173
* A number of workers to read from Spark {@link Receiver}.
*
* <p>If this value is not set, or set to 1, the reading will be performed on a single worker.

Copilot AI Jan 24, 2026


The documentation for withNumReaders claims it will allow reading from multiple workers, but it doesn't explain that each reader will independently read all data, resulting in duplicates. The documentation should clearly state the expected behavior - whether it's intended to duplicate data for redundancy or to distribute work without duplication. Currently, the behavior doesn't match the stated goal of addressing a "scalability bottleneck" since duplicating data doesn't improve scalability.

Suggested change
* A number of workers to read from Spark {@link Receiver}.
*
* <p>If this value is not set, or set to 1, the reading will be performed on a single worker.
* Configures how many independent workers (readers) will read from the same Spark
* {@link Receiver}.
*
* <p>Each configured reader connects to the underlying source independently and will
* typically observe the full stream of data. As a result, records may be duplicated
* across readers; this option does <b>not</b> shard or partition the input among workers.
*
* <p>This setting is intended for use cases where redundant consumption of the same data
* is acceptable (for example, to increase robustness when dealing with flaky sources),
* and should not be used as a mechanism for load-balancing or avoiding scalability
* bottlenecks via input partitioning. If you require a single logical read without
* duplicates, leave {@code numReaders} at its default of {@code 1} and apply your own
* partitioning or deduplication to the resulting {@link PCollection}.
*
* <p>If this value is not set, or set to {@code 1}, the reading will be performed on a
* single worker.

Author


Good catch. I have refactored the implementation to support proper sharding.

  1. I added setShard(int shardId, int numShards) to the HasOffset interface (sketched below).
  2. The DoFn now passes the unique shard ID to the Receiver via setShard.
  3. I updated the documentation to clarify that the receiver is expected to handle partitioning based on these parameters.
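
For reference, a minimal sketch of what the HasOffset change could look like. The setStartOffset/getEndOffset methods paraphrase the existing interface, and making setShard a no-op default method is an assumption made here to keep existing receivers source-compatible; the actual diff may differ:

```java
// Sketch of the described HasOffset change; not necessarily the exact diff.
public interface HasOffset {

  /** Inclusive start offset from which the reading should be started. */
  void setStartOffset(Long startOffset);

  /** Offset up to which the receiver is expected to read. */
  Long getEndOffset();

  /**
   * Tells the receiver which shard it serves out of how many, so the receiver
   * itself can partition the source (for example by taking the record index
   * modulo numShards).
   */
  default void setShard(int shardId, int numShards) {
    // No-op by default so receivers written before sharding keep compiling.
  }
}
```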

checkArgument(startPollTimeoutSec != null, "Start poll timeout can not be null");
return toBuilder().setStartPollTimeoutSec(startPollTimeoutSec).build();
}


Copilot AI Jan 24, 2026


The Javadoc comment for withStartOffset was removed, creating inconsistency with the other configuration methods (withPullFrequencySec, withStartPollTimeoutSec, etc.) which all have Javadoc comments. The comment should be restored to maintain documentation consistency across the API.

Suggested change
/** Inclusive start offset from which the reading should be started. */

Author


Fixed. I have restored the Javadoc for withStartOffset to match the consistency of the other methods.

Comment on lines 267 to +270
public ProcessContinuation processElement(
@Element byte[] element,
@Element Integer element,

Copilot AI Jan 24, 2026


The element parameter (representing the shard ID) is never used in the processElement method. This means that when multiple readers are configured (via withNumReaders), each DoFn instance will independently create a SparkReceiver starting from the same startOffset, resulting in duplicate data being read. For example, with 3 readers, the same 20 records will be read 3 times, producing 60 total records with duplicates.

This defeats the purpose of parallel reading for scalability. The shard ID should be used to either:

  1. Partition the offset range among readers (e.g., reader 0 reads offsets 0-6, reader 1 reads 7-13, reader 2 reads 14-19), or
  2. Use the shard ID to configure the Spark Receiver to read from different partitions/sources

Without this coordination, the feature creates duplicate data rather than distributing work.

Author


Fixed. The processElement method now uses the element (Shard ID) and passes it to the receiver:

if (sparkReceiver instanceof HasOffset) {
  ((HasOffset) sparkReceiver).setShard(element, numReaders);
}

Comment on lines 261 to 264
for (int j = 0; j < 3; j++) {
for (int i = 0; i < CustomReceiverWithOffset.RECORDS_COUNT; i++) {
expected.add(String.valueOf(i));
}

Copilot AI Jan 24, 2026


The test expects 60 records (3 readers * 20 records each), which means it's testing for duplicate data rather than proper parallel reading. This test validates the buggy behavior where each reader independently reads all data from the source. The test should instead verify that 20 unique records are read once, distributed across the 3 readers without duplication.

Suggested change
for (int j = 0; j < 3; j++) {
for (int i = 0; i < CustomReceiverWithOffset.RECORDS_COUNT; i++) {
expected.add(String.valueOf(i));
}
for (int i = 0; i < CustomReceiverWithOffset.RECORDS_COUNT; i++) {
expected.add(String.valueOf(i));

Author


Agreed. I have updated CustomReceiverWithOffset to implement the setShard logic (using simple modulo partitioning).
I also updated the test assertion to expect 20 unique records in total (the strings '0' to '19', each exactly once), verifying that the work is correctly distributed without duplication. A sketch of the partitioning follows.
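
For context, a minimal sketch of that modulo partitioning inside the test receiver. Only the setShard signature, RECORDS_COUNT, and the modulo idea come from the discussion above; the fields and the emit loop are illustrative:

```java
// Sketch only: the receiver keeps record i when i % numShards == shardId.
private int shardId = 0;
private int numShards = 1;

@Override
public void setShard(int shardId, int numShards) {
  this.shardId = shardId;
  this.numShards = numShards;
}

private void emitRecords() {
  for (int i = 0; i < RECORDS_COUNT; i++) {
    if (i % numShards == shardId) {
      store(String.valueOf(i)); // Receiver#store hands the record to the pipeline
    }
  }
}
```

With 3 readers this yields the record sets {0, 3, 6, ...}, {1, 4, 7, ...} and {2, 5, 8, ...}, whose union is exactly the 20 unique records the updated assertion expects.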

Comment on lines 217 to 226
.apply(Impulse.create())
.apply(
MapElements.into(TypeDescriptors.integers())
.via(
new SerializableFunction<byte[], Integer>() {
@Override
public Integer apply(byte[] input) {
return 0;
}
}))

Copilot AI Jan 24, 2026


The backward compatibility path introduces unnecessary complexity by converting the Impulse output (byte[]) to Integer via MapElements. This conversion is not needed - you could directly use Impulse.create() followed by MapElements to produce an Integer shard ID of 0, or better yet, use Create.of(0) directly for consistency with the multi-reader path. The current approach adds an extra transformation step without clear benefit.

Suggested change
.apply(Impulse.create())
.apply(
MapElements.into(TypeDescriptors.integers())
.via(
new SerializableFunction<byte[], Integer>() {
@Override
public Integer apply(byte[] input) {
return 0;
}
}))
.apply(Create.of(0))

Author


Fixed. I removed the Impulse + MapElements chain and replaced it with Create.of(0) as suggested. This is cleaner and consistent with the multi-reader path.

1. Implemented setShard() in HasOffset to allow receivers to handle partitioning (prevents data duplication).
2. Updated CustomReceiverWithOffset to filter records based on shardId.
3. Updated DoFn to pass shardId/numShards to the receiver.
4. Restored Javadoc for withStartOffset.
5. Simplified backward compatibility logic in expand() using Create.of().
6. Updated tests to verify parallel reading produces correct, unique record set.
@github-actions
Contributor

Assigning reviewers:

R: @Abacn for label java.

Note: If you would like to opt out of this review, comment assign to next reviewer.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).


Development

Successfully merging this pull request may close these issues.

[Feature Request]: Parallel reading support in SparkReceiverIO
