
Conversation

@ATHARVA262005

Description

This PR addresses a scalability bottleneck in SparkReceiverIO where data reading was previously restricted to a single worker.

Previously, the Read transform used Impulse.create(), which limited the downstream Splittable DoFn execution to a single initial restriction. This change introduces proper parallelization logic, allowing users to configure multiple concurrent readers.

Key Changes:

  • Added withNumReaders(int): A new configuration option to the SparkReceiverIO builder.
  • Parallel Work Distribution: Implemented the Create.of(shards) + Reshuffle pattern in expand(). When numReaders > 1, reading tasks are distributed across the available Beam workers (a sketch follows this list).
  • DoFn Refactoring: Updated ReadFromSparkReceiverWithOffsetDoFn to accept explicit Integer shard identifiers instead of generic byte[].
  • Backward Compatibility: Strict backward compatibility is preserved. If numReaders is not set or is set to 1, the IO behaves exactly as before.
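
For reviewers, here is a minimal sketch of the intended expand() shape when numReaders > 1. Only Create, Reshuffle, and ReadFromSparkReceiverWithOffsetDoFn are taken from this PR; the getter name and step labels are illustrative and may differ from the final code:

```java
// Sketch only; not the exact code in this PR.
@Override
public PCollection<V> expand(PBegin input) {
  int numReaders = getNumReaders() == null ? 1 : getNumReaders();

  // One Integer element per shard. The subsequent Reshuffle breaks fusion so
  // the shard elements can be picked up by different workers before the
  // splittable read DoFn starts polling the receiver.
  List<Integer> shards = new ArrayList<>();
  for (int i = 0; i < numReaders; i++) {
    shards.add(i);
  }

  return input
      .apply("CreateShards", Create.of(shards))
      .apply("DistributeShards", Reshuffle.viaRandomKey())
      .apply("ReadFromReceiver", ParDo.of(new ReadFromSparkReceiverWithOffsetDoFn<>(this)));
}
```

The Reshuffle is the important piece: without it a runner may fuse the Create step with the read DoFn and still execute all shards on a single worker.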

Issue Links

Fixes #37410

Type of change

  • New feature

Tests

I have added a new unit test, testReadFromCustomReceiverWithParallelism, to SparkReceiverIOTest.java (a rough configuration sketch follows the list below). It verifies that:

  1. The IO correctly initializes with multiple readers (configured to 3).
  2. All records are read without duplication or loss (verified 60 total records from 3 simulated sources).
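
A rough sketch of the configuration the test exercises. CustomReceiverWithOffset and withNumReaders come from this PR and the existing test suite; the other builder calls and the assertion are assumptions and may not match the actual test code:

```java
// Rough sketch; see testReadFromCustomReceiverWithParallelism for the real test.
ReceiverBuilder<String, CustomReceiverWithOffset> receiverBuilder =
    new ReceiverBuilder<>(CustomReceiverWithOffset.class).withConstructorArgs();

SparkReceiverIO.Read<String> read =
    SparkReceiverIO.<String>read()
        .withSparkReceiverBuilder(receiverBuilder)
        .withGetOffsetFn(Long::valueOf) // test records are numeric strings
        .withNumReaders(3);             // the new option under test

PCollection<String> output = pipeline.apply(read);
PAssert.that(output).containsInAnyOrder(expected); // expected list built from RECORDS_COUNT
pipeline.run().waitUntilFinish();
```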

Running the tests:

./gradlew :sdks:java:io:sparkreceiver:3:test
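
To run only the new test, Gradle's standard test filtering should work (the filter pattern here is just an example):

./gradlew :sdks:java:io:sparkreceiver:3:test --tests "*SparkReceiverIOTest*"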

------------------------

Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

 - [ ] Mention the appropriate issue in your description (for example: `addresses #123`), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment `fixes #<ISSUE NUMBER>` instead.
 - [ ] Update `CHANGES.md` with noteworthy changes.
 - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).

See the [Contributor Guide](https://beam.apache.org/contribute) for more tips on [how to make review process smoother](https://github.com/apache/beam/blob/master/CONTRIBUTING.md#make-the-reviewers-job-easier).

To check the build health, please visit [https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md](https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md)

GitHub Actions Tests Status (on master branch)
------------------------------------------------------------------------------------------------
[![Build python source distribution and wheels](https://github.com/apache/beam/actions/workflows/build_wheels.yml/badge.svg?event=schedule&&?branch=master)](https://github.com/apache/beam/actions?query=workflow%3A%22Build+python+source+distribution+and+wheels%22+branch%3Amaster+event%3Aschedule)
[![Python tests](https://github.com/apache/beam/actions/workflows/python_tests.yml/badge.svg?event=schedule&&?branch=master)](https://github.com/apache/beam/actions?query=workflow%3A%22Python+Tests%22+branch%3Amaster+event%3Aschedule)
[![Java tests](https://github.com/apache/beam/actions/workflows/java_tests.yml/badge.svg?event=schedule&&?branch=master)](https://github.com/apache/beam/actions?query=workflow%3A%22Java+Tests%22+branch%3Amaster+event%3Aschedule)
[![Go tests](https://github.com/apache/beam/actions/workflows/go_tests.yml/badge.svg?event=schedule&&?branch=master)](https://github.com/apache/beam/actions?query=workflow%3A%22Go+tests%22+branch%3Amaster+event%3Aschedule)

See [CI.md](https://github.com/apache/beam/blob/master/CI.md) for more information about GitHub Actions CI or the [workflows README](https://github.com/apache/beam/blob/master/.github/workflows/README.md) to see a list of phrases to trigger workflows.

…Readers

Adds withNumReaders() to SparkReceiverIO and implements parallel execution using Create.of(shards) + Reshuffle. This addresses scalability limitations by allowing work distribution across multiple workers.
Copilot AI review requested due to automatic review settings January 24, 2026 20:34
@gemini-code-assist
Contributor

Summary of Changes

Hello @ATHARVA262005, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the scalability of SparkReceiverIO by introducing a mechanism for parallel data reading. It moves away from a single-worker limitation by allowing users to configure the number of concurrent readers, effectively distributing the data ingestion workload across multiple Beam workers. This change improves throughput and resource utilization for Spark-based data sources within Beam pipelines, while maintaining compatibility with existing configurations.

Highlights

  • Parallel Reading Implementation: Introduces parallel reading capability to SparkReceiverIO to address a single-worker bottleneck, significantly improving scalability.
  • New Configuration Option: Adds a new withNumReaders(int) configuration option to the SparkReceiverIO builder, allowing users to specify the number of concurrent readers.
  • Parallel Work Distribution: Implements a Create.of(shards) and Reshuffle pattern in the expand() method to distribute reading tasks across multiple Beam workers when numReaders is greater than 1.
  • DoFn Refactoring: Refactors ReadFromSparkReceiverWithOffsetDoFn to accept Integer shard identifiers instead of byte[], enhancing type safety and integration with the new parallelization logic.
  • Backward Compatibility: Ensures strict backward compatibility, reverting to single-worker behavior if numReaders is not set or explicitly set to 1.



Copilot AI left a comment


Pull request overview

This PR attempts to add parallel reading capability to SparkReceiverIO to address a scalability bottleneck where data reading was previously restricted to a single worker. The implementation adds a withNumReaders(int) configuration option and changes the internal architecture from using Impulse.create() to using Create.of(shards) + Reshuffle for distributing work across multiple workers.

Changes:

  • Added withNumReaders(int) builder method to configure the number of concurrent readers
  • Modified ReadFromSparkReceiverWithOffsetDoFn to accept Integer shard identifiers instead of byte[]
  • Implemented parallel work distribution using Create.of(shards) + Reshuffle pattern when numReaders > 1
  • Added test case testReadFromCustomReceiverWithParallelism to verify parallel reading behavior
  • Updated CHANGES.md to document the new feature

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Summary per file:

  • SparkReceiverIO.java: Added numReaders configuration, modified the expand() method to support parallel reading with sharding, changed the DoFn input type from byte[] to Integer
  • ReadFromSparkReceiverWithOffsetDoFn.java: Updated the DoFn signature to accept Integer elements instead of byte[], reformatted some comments
  • SparkReceiverIOTest.java: Added a new test case for parallel reading with 3 readers
  • ReadFromSparkReceiverWithOffsetDoFnTest.java: Updated the test element type from byte[] to Integer
  • CHANGES.md: Added an entry documenting parallel reading support


Comment on lines +171 to +173
* A number of workers to read from Spark {@link Receiver}.
*
* <p>If this value is not set, or set to 1, the reading will be performed on a single worker.

Copilot AI Jan 24, 2026


The documentation for withNumReaders claims it will allow reading from multiple workers, but it doesn't explain that each reader will independently read all data, resulting in duplicates. The documentation should clearly state the expected behavior - whether it's intended to duplicate data for redundancy or to distribute work without duplication. Currently, the behavior doesn't match the stated goal of addressing a "scalability bottleneck" since duplicating data doesn't improve scalability.

Suggested change
* A number of workers to read from Spark {@link Receiver}.
*
* <p>If this value is not set, or set to 1, the reading will be performed on a single worker.
* Configures how many independent workers (readers) will read from the same Spark
* {@link Receiver}.
*
* <p>Each configured reader connects to the underlying source independently and will
* typically observe the full stream of data. As a result, records may be duplicated
* across readers; this option does <b>not</b> shard or partition the input among workers.
*
* <p>This setting is intended for use cases where redundant consumption of the same data
* is acceptable (for example, to increase robustness when dealing with flaky sources),
* and should not be used as a mechanism for load-balancing or avoiding scalability
* bottlenecks via input partitioning. If you require a single logical read without
* duplicates, leave {@code numReaders} at its default of {@code 1} and apply your own
* partitioning or deduplication to the resulting {@link PCollection}.
*
* <p>If this value is not set, or set to {@code 1}, the reading will be performed on a
* single worker.

Author


Good catch. I have refactored the implementation to support proper sharding.

  1. I added setShard(int shardId, int numShards) to the HasOffset interface (sketched below).
  2. The DoFn now passes the unique shard ID to the Receiver via setShard.
  3. I updated the documentation to clarify that the receiver is expected to handle partitioning based on these parameters.
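
For reference, a minimal sketch of what the HasOffset change could look like. The setStartOffset/getEndOffset methods paraphrase the existing interface, and making setShard a no-op default method is an assumption made here to keep existing receivers source-compatible; the actual diff may differ:

```java
// Sketch of the described HasOffset change; not necessarily the exact diff.
public interface HasOffset {

  /** Inclusive start offset from which the reading should be started. */
  void setStartOffset(Long startOffset);

  /** Offset up to which the receiver is expected to read. */
  Long getEndOffset();

  /**
   * Tells the receiver which shard it serves out of how many, so the receiver
   * itself can partition the source (for example by taking the record index
   * modulo numShards).
   */
  default void setShard(int shardId, int numShards) {
    // No-op by default so receivers written before sharding keep compiling.
  }
}
```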

checkArgument(startPollTimeoutSec != null, "Start poll timeout can not be null");
return toBuilder().setStartPollTimeoutSec(startPollTimeoutSec).build();
}


Copilot AI Jan 24, 2026


The Javadoc comment for withStartOffset was removed, creating inconsistency with the other configuration methods (withPullFrequencySec, withStartPollTimeoutSec, etc.) which all have Javadoc comments. The comment should be restored to maintain documentation consistency across the API.

Suggested change
/** Inclusive start offset from which the reading should be started. */

Author


Fixed. I have restored the Javadoc for withStartOffset to match the consistency of the other methods.

Comment on lines 267 to +270
public ProcessContinuation processElement(
@Element byte[] element,
@Element Integer element,

Copilot AI Jan 24, 2026


The element parameter (representing the shard ID) is never used in the processElement method. This means that when multiple readers are configured (via withNumReaders), each DoFn instance will independently create a SparkReceiver starting from the same startOffset, resulting in duplicate data being read. For example, with 3 readers, the same 20 records will be read 3 times, producing 60 total records with duplicates.

This defeats the purpose of parallel reading for scalability. The shard ID should be used to either:

  1. Partition the offset range among readers (e.g., reader 0 reads offsets 0-6, reader 1 reads 7-13, reader 2 reads 14-19), or
  2. Use the shard ID to configure the Spark Receiver to read from different partitions/sources

Without this coordination, the feature creates duplicate data rather than distributing work.

Author


Fixed. The processElement method now uses the element (Shard ID) and passes it to the receiver:

if (sparkReceiver instanceof HasOffset) {
  ((HasOffset) sparkReceiver).setShard(element, numReaders);
}

Comment on lines 261 to 264
for (int j = 0; j < 3; j++) {
for (int i = 0; i < CustomReceiverWithOffset.RECORDS_COUNT; i++) {
expected.add(String.valueOf(i));
}

Copilot AI Jan 24, 2026


The test expects 60 records (3 readers * 20 records each), which means it's testing for duplicate data rather than proper parallel reading. This test validates the buggy behavior where each reader independently reads all data from the source. The test should instead verify that 20 unique records are read once, distributed across the 3 readers without duplication.

Suggested change
for (int j = 0; j < 3; j++) {
for (int i = 0; i < CustomReceiverWithOffset.RECORDS_COUNT; i++) {
expected.add(String.valueOf(i));
}
for (int i = 0; i < CustomReceiverWithOffset.RECORDS_COUNT; i++) {
expected.add(String.valueOf(i));

Author


Agreed. I have updated CustomReceiverWithOffset to implement the setShard logic (using simple modulo partitioning).
I also updated the test assertion to expect 20 unique records in total (the strings '0' to '19', each exactly once), verifying that the work is correctly distributed without duplication. A sketch of the partitioning follows.
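
For context, a minimal sketch of that modulo partitioning inside the test receiver. Only the setShard signature, RECORDS_COUNT, and the modulo idea come from the discussion above; the fields and the emit loop are illustrative:

```java
// Sketch only: the receiver keeps record i when i % numShards == shardId.
private int shardId = 0;
private int numShards = 1;

@Override
public void setShard(int shardId, int numShards) {
  this.shardId = shardId;
  this.numShards = numShards;
}

private void emitRecords() {
  for (int i = 0; i < RECORDS_COUNT; i++) {
    if (i % numShards == shardId) {
      store(String.valueOf(i)); // Receiver#store hands the record to the pipeline
    }
  }
}
```

With 3 readers this yields the record sets {0, 3, 6, ...}, {1, 4, 7, ...} and {2, 5, 8, ...}, whose union is exactly the 20 unique records the updated assertion expects.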

Comment on lines 217 to 226
.apply(Impulse.create())
.apply(
MapElements.into(TypeDescriptors.integers())
.via(
new SerializableFunction<byte[], Integer>() {
@Override
public Integer apply(byte[] input) {
return 0;
}
}))

Copilot AI Jan 24, 2026


The backward compatibility path introduces unnecessary complexity by converting the Impulse output (byte[]) to Integer via MapElements. This conversion is not needed - you could directly use Impulse.create() followed by MapElements to produce an Integer shard ID of 0, or better yet, use Create.of(0) directly for consistency with the multi-reader path. The current approach adds an extra transformation step without clear benefit.

Suggested change
.apply(Impulse.create())
.apply(
MapElements.into(TypeDescriptors.integers())
.via(
new SerializableFunction<byte[], Integer>() {
@Override
public Integer apply(byte[] input) {
return 0;
}
}))
.apply(Create.of(0))

Author


Fixed. I removed the Impulse + MapElements chain and replaced it with Create.of(0) as suggested. This is cleaner and consistent with the multi-reader path.

1. Implemented setShard() in HasOffset to allow receivers to handle partitioning (prevents data duplication).
2. Updated CustomReceiverWithOffset to filter records based on shardId.
3. Updated DoFn to pass shardId/numShards to the receiver.
4. Restored Javadoc for withStartOffset.
5. Simplified backward compatibility logic in expand() using Create.of().
6. Updated tests to verify parallel reading produces correct, unique record set.
@github-actions
Contributor

Assigning reviewers:

R: @Abacn for label java.

Note: If you would like to opt out of this review, comment assign to next reviewer.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).


Development

Successfully merging this pull request may close these issues.

[Feature Request]: Parallel reading support in SparkReceiverIO
