feat(io): Implement parallel reading in SparkReceiverIO (Fixes #37410) #37411
Changes from all commits
```diff
@@ -21,9 +21,13 @@
 import static org.apache.beam.vendor.guava.v32_1_2_jre.com.google.common.base.Preconditions.checkArgument;

 import com.google.auto.value.AutoValue;
-import org.apache.beam.sdk.transforms.Impulse;
+import java.util.List;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+import org.apache.beam.sdk.transforms.Create;
 import org.apache.beam.sdk.transforms.PTransform;
 import org.apache.beam.sdk.transforms.ParDo;
+import org.apache.beam.sdk.transforms.Reshuffle;
 import org.apache.beam.sdk.transforms.SerializableFunction;
 import org.apache.beam.sdk.values.PBegin;
 import org.apache.beam.sdk.values.PCollection;
```
```diff
@@ -99,6 +103,8 @@ public abstract static class Read<V> extends PTransform<PBegin, PCollection<V>>

     abstract @Nullable Long getStartOffset();

+    abstract @Nullable Integer getNumReaders();
+
     abstract Builder<V> toBuilder();

     @AutoValue.Builder
```
```diff
@@ -117,6 +123,8 @@ abstract Builder<V> setSparkReceiverBuilder(

       abstract Builder<V> setStartOffset(Long startOffset);

+      abstract Builder<V> setNumReaders(Integer numReaders);
+
       abstract Read<V> build();
     }
```
```diff
@@ -157,6 +165,16 @@ public Read<V> withStartOffset(Long startOffset) {
       return toBuilder().setStartOffset(startOffset).build();
     }

+    /**
+     * A number of workers to read from Spark {@link Receiver}.
+     *
+     * <p>If this value is not set, or set to 1, the reading will be performed on a single worker.
```
Comment on lines +169 to +171:
Suggested change:

```diff
-     * A number of workers to read from Spark {@link Receiver}.
-     *
-     * <p>If this value is not set, or set to 1, the reading will be performed on a single worker.
+     * Configures how many independent workers (readers) will read from the same Spark
+     * {@link Receiver}.
+     *
+     * <p>Each configured reader connects to the underlying source independently and will
+     * typically observe the full stream of data. As a result, records may be duplicated
+     * across readers; this option does <b>not</b> shard or partition the input among workers.
+     *
+     * <p>This setting is intended for use cases where redundant consumption of the same data
+     * is acceptable (for example, to increase robustness when dealing with flaky sources),
+     * and should not be used as a mechanism for load-balancing or avoiding scalability
+     * bottlenecks via input partitioning. If you require a single logical read without
+     * duplicates, leave {@code numReaders} at its default of {@code 1} and apply your own
+     * partitioning or deduplication to the resulting {@link PCollection}.
+     *
+     * <p>If this value is not set, or set to {@code 1}, the reading will be performed on a
+     * single worker.
```
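For orientation, here is a minimal usage sketch of the new option, assuming it is exposed as `withNumReaders` (the name referenced in the review below); the receiver class and offset-parsing function are placeholders, not code from this PR:

```java
// Hypothetical usage; MyReceiver is a placeholder Spark Receiver implementing HasOffset.
SparkReceiverIO.Read<String> read =
    SparkReceiverIO.<String>read()
        .withSparkReceiverBuilder(new ReceiverBuilder<>(MyReceiver.class).withConstructorArgs())
        .withGetOffsetFn(record -> Long.parseLong(record.split(":")[0])) // placeholder offset parsing
        .withStartOffset(0L)
        .withNumReaders(3); // the option added in this PR

PCollection<String> records = pipeline.apply("ReadFromSparkReceiver", read);
```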
Good catch. I have refactored the implementation to support proper sharding (a sketch follows this list):

- Added `setShard(int shardId, int numShards)` to the `HasOffset` interface.
- The `DoFn` now passes the unique shard ID to the `Receiver` via `setShard`.
- Updated the documentation to clarify that the receiver is expected to handle partitioning based on these parameters.
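The diff for the interface change is not shown on this page; a minimal sketch of what it might look like, where the `default` body is an assumption made for backward compatibility:

```java
// Sketch of the described change; the methods besides setShard mirror the
// existing HasOffset interface, and the default body is an assumption.
public interface HasOffset {

  /** Sets the offset from which the receiver should start reading. */
  void setStartOffset(Long startOffset);

  /** Returns the offset at which the receiver should stop reading. */
  Long getEndOffset();

  /**
   * Assigns this receiver shard {@code shardId} out of {@code numShards}. The receiver
   * is expected to read only the records belonging to its shard.
   */
  default void setShard(int shardId, int numShards) {
    // Default: behave as a single shard so existing receivers keep working unchanged.
  }
}
```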
The `element` parameter (representing the shard ID) is never used in the `processElement` method. This means that when multiple readers are configured (via `withNumReaders`), each `DoFn` instance will independently create a `SparkReceiver` starting from the same `startOffset`, resulting in duplicate data being read. For example, with 3 readers, the same 20 records will be read 3 times, producing 60 total records with duplicates.

This defeats the purpose of parallel reading for scalability. The shard ID should be used to coordinate the readers, for example by partitioning the offset range so that each reader consumes a disjoint slice, or by passing the shard ID to the receiver so that it filters out everything outside its share (a sketch of the first option follows). Without this coordination, the feature creates duplicate data rather than distributing work.
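To make the first option concrete, here is a hedged sketch of contiguous offset-range partitioning; `shardRange` is a hypothetical helper, not code from the PR:

```java
// Hypothetical helper: splits [startOffset, endOffset) into numShards contiguous,
// disjoint ranges so that each reader consumes a distinct slice of the stream.
static long[] shardRange(long startOffset, long endOffset, int shardId, int numShards) {
  long total = endOffset - startOffset;
  long shardStart = startOffset + total * shardId / numShards;
  long shardEnd = startOffset + total * (shardId + 1) / numShards;
  return new long[] {shardStart, shardEnd};
}

// With startOffset = 0, endOffset = 20 and 3 readers this yields
// [0, 6), [6, 13) and [13, 20): 20 records read exactly once in total,
// instead of 60 records with duplicates.
```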
Fixed. The `processElement` method now uses the `element` (shard ID) and passes it to the receiver:
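The updated code itself is not quoted on this page; a minimal sketch of the described behavior, with hypothetical class and field names:

```java
// Sketch of a DoFn whose input element is the shard ID (0 .. numReaders - 1).
private static class ReadFromReceiverDoFn<V> extends DoFn<Integer, V> {

  private final ReceiverBuilder<V, ? extends Receiver<V>> sparkReceiverBuilder;
  private final Long startOffset;
  private final int numReaders;

  ReadFromReceiverDoFn(
      ReceiverBuilder<V, ? extends Receiver<V>> sparkReceiverBuilder,
      Long startOffset,
      int numReaders) {
    this.sparkReceiverBuilder = sparkReceiverBuilder;
    this.startOffset = startOffset;
    this.numReaders = numReaders;
  }

  @ProcessElement
  public void processElement(@Element Integer element, OutputReceiver<V> out) throws Exception {
    Receiver<V> sparkReceiver = sparkReceiverBuilder.build();
    ((HasOffset) sparkReceiver).setStartOffset(startOffset);
    // The element carries this reader's shard ID; hand it to the receiver together
    // with the total number of shards so the receiver can partition the input itself.
    ((HasOffset) sparkReceiver).setShard(element, numReaders);
    // ... start the receiver and emit its records via out.output(...) as before.
  }
}
```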