Introduce morsel-driven Parquet scan #20481
This PR implements morsel-driven execution for Parquet files in DataFusion, enabling row-group level work sharing across partitions to mitigate data skew.
Key changes:
- Introduced `WorkQueue` in `datafusion/datasource/src/file_stream.rs` for shared pool of work.
- Added `morselize` method to `FileOpener` trait to allow dynamic splitting of files into morsels.
- Implemented `morselize` for `ParquetOpener` to split files into individual row groups.
- Cached `ParquetMetaData` in `ParquetMorsel` extensions to avoid redundant I/O.
- Modified `FileStream` to support work stealing from the shared queue.
- Implemented `Weak` pointer pattern for `WorkQueue` in `FileScanConfig` to support plan re-executability.
- Added `MorselizingGuard` to ensure shared state consistency on cancellation.
- Added `allow_morsel_driven` configuration option (enabled by default for Parquet).
- Implemented row-group pruning during the morselization phase for better efficiency.
Tests:
- Added `parquet_morsel_driven_execution` test to verify work distribution and re-executability.
- Added `parquet_morsel_driven_enabled_by_default` to verify the default configuration.
Co-authored-by: Dandandan <163737+Dandandan@users.noreply.github.com>
@alamb this is now mostly ready. I will update the issue with my latest benchmark results (slightly better than in the issue description, especially for filter pushdown) and also run Q6 of clickbench_extended, as a ~160x improvement seems a bit too good to be true.
Which issue does this PR close?
Rationale for this change
Current parallelization of the Parquet scan is bounded by the thread that has the most data or is the slowest to execute. In the case of data skew (driven by larger partitions, less selective filters during pruning / filter pushdown, or variable object store latency), parallelism is significantly limited.
We can change the strategy to morsel-driven parallelism, as described in https://db.in.tum.de/~leis/papers/morsels.pdf.
Doing so is faster for many queries when there is some skew (such as in clickbench) and we have enough row filters to spread out the work.
For clickbench_partitioned / clickbench_pushdown it seems to be up to ~2x as fast for some queries on a 10-core machine.
It seems to have almost no regressions; the few that appear are perhaps due to a different file scanning order(?), which changes the statistics that can be used to prune and thus causes some minor variation.
Morsel-Driven Execution Architecture (partly claude-generated)
This branch implements a morsel-driven execution model for Parquet scans, based on the concept
from the Morsel-Driven Parallelism paper (Leis et al.). The core idea: instead of statically
assigning files to partitions, all work is pooled in a shared queue that all partition streams pull
from dynamically.
The Problem It Solves
In the traditional model, partition 0 might get a 1 GB file while partition 1 gets nothing --
partition 1 idles while 0 is busy. Currently we already try to statically spread out work to n partitions / threads based on stats. This works very well on perfectly distributed scans on SSDs (e.g. TPCH running locally), but it doesn't work well when there is data skew caused by any of these:
- filters being more selective on part of the data
- high variation in object store response times
Morsel-driven execution prevents this by sharing work dynamically.
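To make the idea concrete, here is a minimal, std-only sketch of dynamic work sharing. It is not the code from this PR: the `Morsel` alias, `run_shared_queue`, and the use of threads with a `Mutex<VecDeque>` are assumptions made for illustration (the real implementation works on async partition streams). It only shows that every worker pulls from one shared pool, so no worker idles while unclaimed work remains.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical unit of work: (file index, row group index).
type Morsel = (usize, usize);

/// Each of `n_workers` threads pulls morsels from one shared queue until it
/// is empty. Returns the morsels each worker ended up processing.
fn run_shared_queue(morsels: Vec<Morsel>, n_workers: usize) -> Vec<Vec<Morsel>> {
    let queue = Arc::new(Mutex::new(VecDeque::from(morsels)));
    let handles: Vec<_> = (0..n_workers)
        .map(|_| {
            let queue = Arc::clone(&queue);
            thread::spawn(move || {
                let mut done = Vec::new();
                loop {
                    // Hold the lock only while popping, not while "scanning".
                    let next = queue.lock().unwrap().pop_front();
                    match next {
                        Some(m) => done.push(m), // a real scan would decode the row group here
                        None => break,
                    }
                }
                done
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

fn main() {
    // One large file with 8 row groups and one small file with 1 row group.
    let morsels: Vec<Morsel> = (0..8).map(|rg| (0, rg)).chain([(1, 0)]).collect();
    let per_worker = run_shared_queue(morsels, 4);
    let total: usize = per_worker.iter().map(Vec::len).sum();
    println!("processed {total} morsels across {} workers", per_worker.len());
}
```

How the morsels split across workers varies from run to run; the only guarantee is that every morsel is processed exactly once.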
Key Types
`ParquetMorsel` -- `datafusion/datasource-parquet/src/opener.rs:129`
A morsel = one row group of a Parquet file. Stored as an extension on `PartitionedFile`.
`MorselizingGuard` -- `datafusion/datasource/src/file_stream.rs:49`
RAII wrapper that atomically decrements `morselizing_count` when a worker finishes -- enabling `WorkStatus::Wait` vs `Done` decisions.
`FileOpener` Trait Extension -- `datafusion/datasource/src/file_stream.rs:498`
A new `morselize()` method is added to `FileOpener`. The default implementation is a no-op (returns the file as-is). `ParquetOpener` overrides it to split files by row group.
`ParquetOpener::morselize()` at `opener.rs:232`:
- … `Arc` across all resulting morsels)
- … `PartitionedFile` per surviving row group, each carrying a `ParquetMorsel` extension
`FileStream` State Machine -- `datafusion/datasource/src/file_stream.rs:141`
The morsel-driven path adds two new states (`Morselizing` and `Waiting`):
Configuration
datafusion.execution.parquet.allow_morsel_driven -- datafusion/common/src/config.rs:748
Default: true. Can be disabled per-session.
FileScanConfig::morsel_driven -- datafusion/datasource/src/file_scan_config.rs:211
Automatically disabled when:
- `partitioned_by_file_group = true` (breaks hash-partitioning guarantees)
- `preserve_order = true` (breaks SortPreservingMerge guarantees)
Benchmark results
What changes are included in this PR?
Changes the Parquet scan to create morsels from row groups when "opening" files. This strategy could be changed later, e.g. by splitting large row groups based on size / page index; this is TBD, as current benchmarks probably don't benefit. The current design allows for this.
The morsels are shared in a work queue. Each partition takes an item from the queue and either "morselizes" it (i.e. splits it into individual morsels) or just processes the data.
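The take / morselize / process loop can be sketched as follows. This is a simplified, std-only illustration under stated assumptions, not the code in this PR: the names `WorkQueue`, `MorselizingGuard`, `WorkStatus` and `morselizing_count` are borrowed from the description above, but the `Work` enum, the field layout, the `pull`, `worker` and `run` functions, and the thread-based execution are all hypothetical. The part worth noting is the guard: the count is incremented while the queue lock is held and only decremented after the new morsels are pushed, so a worker that sees an empty queue can tell "wait, more is coming" apart from "done".

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;

#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
enum Work {
    /// A whole file still to be split; `row_groups` stands in for its metadata.
    File { id: usize, row_groups: usize },
    /// One row group, ready to scan.
    Morsel { file: usize, row_group: usize },
}

enum WorkStatus {
    Work(Work),
    /// Queue is empty, but another worker is still splitting a file.
    Wait,
    /// Queue is empty and nobody can add more work.
    Done,
}

struct WorkQueue {
    queue: Mutex<VecDeque<Work>>,
    morselizing_count: AtomicUsize,
}

/// Decrements `morselizing_count` on drop, including on early exit or panic.
struct MorselizingGuard<'a>(&'a WorkQueue);

impl Drop for MorselizingGuard<'_> {
    fn drop(&mut self) {
        self.0.morselizing_count.fetch_sub(1, Ordering::SeqCst);
    }
}

impl WorkQueue {
    fn new(files: Vec<Work>) -> Self {
        Self { queue: Mutex::new(files.into()), morselizing_count: AtomicUsize::new(0) }
    }

    fn pull(&self) -> WorkStatus {
        let mut q = self.queue.lock().unwrap();
        match q.pop_front() {
            Some(w) => {
                if matches!(w, Work::File { .. }) {
                    // Register the splitter while the lock is still held.
                    self.morselizing_count.fetch_add(1, Ordering::SeqCst);
                }
                WorkStatus::Work(w)
            }
            None if self.morselizing_count.load(Ordering::SeqCst) > 0 => WorkStatus::Wait,
            None => WorkStatus::Done,
        }
    }
}

fn worker(q: &WorkQueue) -> Vec<Work> {
    let mut scanned = Vec::new();
    loop {
        match q.pull() {
            WorkStatus::Work(Work::File { id, row_groups }) => {
                let _guard = MorselizingGuard(q);
                let morsels = (0..row_groups).map(|rg| Work::Morsel { file: id, row_group: rg });
                q.queue.lock().unwrap().extend(morsels);
                // `_guard` drops here, after the morsels are visible in the queue.
            }
            WorkStatus::Work(m) => scanned.push(m), // a real scan would read the row group
            WorkStatus::Wait => thread::yield_now(),
            WorkStatus::Done => return scanned,
        }
    }
}

/// Runs `n_workers` threads over `files` and returns all scanned morsels, sorted.
fn run(files: Vec<Work>, n_workers: usize) -> Vec<Work> {
    let q = Arc::new(WorkQueue::new(files));
    let handles: Vec<_> = (0..n_workers)
        .map(|_| {
            let q = Arc::clone(&q);
            thread::spawn(move || worker(&q))
        })
        .collect();
    let mut all: Vec<Work> = handles.into_iter().flat_map(|h| h.join().unwrap()).collect();
    all.sort();
    all
}

fn main() {
    let files = vec![Work::File { id: 0, row_groups: 6 }, Work::File { id: 1, row_groups: 1 }];
    println!("{:?}", run(files, 4));
}
```

In this sketch a waiting worker simply yields and retries; an async implementation would presumably register a waker instead of spinning.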
Are these changes tested?
Are there any user-facing changes?
Yes, `FileStream::new` got a new `shared_queue` parameter.