Add egg-based query optimizer using equality saturation#12
Conversation
This implements a query optimizer for zarr-datafusion using the egg library for e-graph-based equality saturation optimization. Key components: - Language definition (ZarrPlan): Represents DataFusion logical plans in egg's e-graph format with support for relational operators, expressions, aggregates, joins, and Zarr-specific operations - Analysis (ZarrAnalysis): Tracks metadata including constant values for constant folding, cardinality estimates, and table statistics - Rewrite rules: Expression simplification (arithmetic, boolean), comparison simplification, relational optimizations (filter merge, limit, sort), and aggregate rules - Cost function: Multi-factor cost model considering I/O, computation, memory, and remote storage penalties with tunable parameters - Conversion functions: Bidirectional conversion between DataFusion LogicalPlan and egg RecExpr (LogicalPlan -> egg complete, egg -> LogicalPlan returns original for now) - EggOptimizerRule: DataFusion OptimizerRule implementation that integrates the egg optimizer into the query processing pipeline The implementation follows the design document in docs/design/query-opt.md and draws on lessons from the Tokomak project.
alxmrs
left a comment
There was a problem hiding this comment.
Found an initial issues. Will continue with the review.
| @@ -0,0 +1,458 @@ | |||
| //! Conversion between DataFusion LogicalPlan and egg RecExpr | |||
| //! | |||
| //! This module handles bidirectional conversion: | |||
There was a problem hiding this comment.
We can make round trip property based tests with this in a future PR!
| // 2. Look up table sources from the original plan | ||
| // 3. Reconstruct schema information | ||
| // | ||
| // This is a significant undertaking and would require careful handling |
There was a problem hiding this comment.
TODO: need to implement this.
alxmrs
left a comment
There was a problem hiding this comment.
A few notes about the cost optimizer.
| self.params.io_weight * bytes | ||
| } | ||
|
|
||
| ZarrPlan::ZarrScan([_path, _coords, _vars]) => { |
There was a problem hiding this comment.
Tunable param: we should be able to set when IO is to a remote source or local disk.
There was a problem hiding this comment.
Further, we should have this be data driven.
| | ZarrPlan::Empty => 0.0, | ||
|
|
||
| // === Zarr-specific === | ||
| ZarrPlan::Resample([input, _dim, _freq]) => { |
There was a problem hiding this comment.
Look into this, I'm not sure what it is referring to.
alxmrs
left a comment
There was a problem hiding this comment.
Significant issue.
| } | ||
|
|
||
| #[cfg(test)] | ||
| mod tests { |
| //! - **Expressions**: arithmetic (+, -, *, /, %), comparison (=, <>, <, <=, >, >=) | ||
| //! - **Logical**: and, or, not, is_null, is_not_null | ||
| //! - **Aggregates**: count, sum, avg, min, max | ||
| //! - **Zarr-specific**: zarr_scan, coord_filter, resample |
There was a problem hiding this comment.
Requires closer attention.
| ); | ||
|
|
||
| // Convert back to LogicalPlan | ||
| let optimized = match egg_to_logical_plan(&best_expr, &plan) { |
- Implement complete bidirectional conversion between DataFusion LogicalPlan and egg RecExpr - Add ConversionContext to preserve table scans and schemas during round-trip - Handle all major plan types: Scan, Filter, Project, Aggregate, Sort, Limit, Distinct, Union, Joins - Handle all major expression types: literals, columns, arithmetic, comparison, logical, aggregates - Add property-based tests using proptest for scalar and expression round-trips - Add Extreme Weather Bench inspired integration tests covering: - Hot days/threshold queries - Temporal and spatial aggregation - Compound filter conditions - Expression simplification (constant folding, boolean tautology/contradiction) - Sort and limit operations - ERA5 climate data patterns
This implements a query optimizer for zarr-datafusion using the egg library
for e-graph-based equality saturation optimization. Key components:
Language definition (ZarrPlan): Represents DataFusion logical plans in
egg's e-graph format with support for relational operators, expressions,
aggregates, joins, and Zarr-specific operations
Analysis (ZarrAnalysis): Tracks metadata including constant values for
constant folding, cardinality estimates, and table statistics
Rewrite rules: Expression simplification (arithmetic, boolean), comparison
simplification, relational optimizations (filter merge, limit, sort), and
aggregate rules
Cost function: Multi-factor cost model considering I/O, computation, memory,
and remote storage penalties with tunable parameters
Conversion functions: Bidirectional conversion between DataFusion
LogicalPlan and egg RecExpr (LogicalPlan -> egg complete, egg -> LogicalPlan
returns original for now)
EggOptimizerRule: DataFusion OptimizerRule implementation that integrates
the egg optimizer into the query processing pipeline
The implementation follows the design document in docs/design/query-opt.md
and draws on lessons from the Tokomak project.