Add egg-based query optimizer using equality saturation #12

Closed
alxmrs wants to merge 2 commits into main from claude/implement-query-optimization-M5lmj

@alxmrs (Collaborator) commented Jan 21, 2026

This implements a query optimizer for zarr-datafusion using the egg library
for e-graph-based equality saturation optimization. Key components:

  • Language definition (ZarrPlan): Represents DataFusion logical plans in
    egg's e-graph format with support for relational operators, expressions,
    aggregates, joins, and Zarr-specific operations

  • Analysis (ZarrAnalysis): Tracks metadata including constant values for
    constant folding, cardinality estimates, and table statistics

  • Rewrite rules: Expression simplification (arithmetic, boolean), comparison
    simplification, relational optimizations (filter merge, limit, sort), and
    aggregate rules

  • Cost function: Multi-factor cost model considering I/O, computation, memory,
    and remote storage penalties with tunable parameters

  • Conversion functions: Conversion between DataFusion LogicalPlan and egg
    RecExpr, intended to be bidirectional. LogicalPlan -> egg is complete;
    egg -> LogicalPlan currently returns the original plan unchanged

  • EggOptimizerRule: DataFusion OptimizerRule implementation that integrates
    the egg optimizer into the query processing pipeline

The implementation follows the design document in docs/design/query-opt.md
and draws on lessons from the Tokomak project.
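
The expression-simplification rules listed above (constant folding, arithmetic identities, boolean simplification) can be illustrated with a minimal, dependency-free sketch. It is an assumption-laden toy, not the PR's code: `Expr` and `simplify` are hypothetical names, and the rules are applied destructively in a single bottom-up pass. Equality saturation with egg would instead keep every equivalent form in an e-graph and let the cost function pick one.

```rust
// Toy expression type. `Expr` and `simplify` are hypothetical names for
// illustration; they are not the PR's `ZarrPlan` or part of the egg API.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Int(i64),
    Bool(bool),
    Col(String),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
    And(Box<Expr>, Box<Expr>),
}

// Simplify the children first, then apply local rules at the current node.
fn simplify(e: Expr) -> Expr {
    use Expr::*;
    match e {
        Add(a, b) => match (simplify(*a), simplify(*b)) {
            // Constant folding; keep the node unchanged if the sum overflows.
            (Int(x), Int(y)) => x
                .checked_add(y)
                .map(Int)
                .unwrap_or_else(|| Add(Box::new(Int(x)), Box::new(Int(y)))),
            // Additive identity: x + 0 => x.
            (Int(0), x) | (x, Int(0)) => x,
            (a, b) => Add(Box::new(a), Box::new(b)),
        },
        Mul(a, b) => match (simplify(*a), simplify(*b)) {
            // Constant folding; keep the node unchanged if the product overflows.
            (Int(x), Int(y)) => x
                .checked_mul(y)
                .map(Int)
                .unwrap_or_else(|| Mul(Box::new(Int(x)), Box::new(Int(y)))),
            // Multiplicative identity: x * 1 => x. The rule x * 0 => 0 is
            // omitted on purpose: it is unsound when x may be NULL.
            (Int(1), x) | (x, Int(1)) => x,
            (a, b) => Mul(Box::new(a), Box::new(b)),
        },
        And(a, b) => match (simplify(*a), simplify(*b)) {
            // Contradiction: anything AND false => false.
            (Bool(false), _) | (_, Bool(false)) => Bool(false),
            // Identity: x AND true => x.
            (Bool(true), x) | (x, Bool(true)) => x,
            (a, b) => And(Box::new(a), Box::new(b)),
        },
        // Literals and column references are already in simplest form.
        leaf => leaf,
    }
}
```

For example, `simplify` turns `t + (2 * 3)` into `t + 6` and `true AND hot` into `hot`. A single-pass rewriter like this depends on rule order, which is the limitation an e-graph is meant to remove.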

@alxmrs alxmrs left a comment

Found some initial issues. Will continue with the review.

@@ -0,0 +1,458 @@
//! Conversion between DataFusion LogicalPlan and egg RecExpr
//!
//! This module handles bidirectional conversion:

We can add round-trip property-based tests for this in a future PR!
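
A rough sketch of what such a round-trip property could look like, under stated assumptions: the `Expr` type, `to_sexp`, and `from_sexp` are hypothetical stand-ins for the PR's LogicalPlan <-> RecExpr conversion, and a small hand-rolled generator replaces proptest so the example has no dependencies. The property checked is `from_sexp(to_sexp(e)) == e` for randomly generated expressions.

```rust
// Hypothetical toy expression type standing in for a logical plan.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Int(i64),
    Col(String),
    Add(Box<Expr>, Box<Expr>),
    Gt(Box<Expr>, Box<Expr>),
}

// Forward direction: render as an s-expression string.
fn to_sexp(e: &Expr) -> String {
    match e {
        Expr::Int(n) => n.to_string(),
        Expr::Col(c) => c.clone(),
        Expr::Add(a, b) => format!("(+ {} {})", to_sexp(a), to_sexp(b)),
        Expr::Gt(a, b) => format!("(> {} {})", to_sexp(a), to_sexp(b)),
    }
}

// Recursive-descent parser over a token stream.
fn parse(tokens: &mut std::vec::IntoIter<String>) -> Option<Expr> {
    let tok = tokens.next()?;
    if tok == ")" {
        return None;
    }
    if tok == "(" {
        let op = tokens.next()?;
        let a = parse(tokens)?;
        let b = parse(tokens)?;
        if tokens.next()? != ")" {
            return None;
        }
        return match op.as_str() {
            "+" => Some(Expr::Add(Box::new(a), Box::new(b))),
            ">" => Some(Expr::Gt(Box::new(a), Box::new(b))),
            _ => None,
        };
    }
    match tok.parse::<i64>() {
        Ok(n) => Some(Expr::Int(n)),
        Err(_) => Some(Expr::Col(tok)),
    }
}

// Backward direction: parse the string; reject trailing tokens.
fn from_sexp(s: &str) -> Option<Expr> {
    let spaced = s.replace('(', " ( ").replace(')', " ) ");
    let toks: Vec<String> = spaced.split_whitespace().map(String::from).collect();
    let mut it = toks.into_iter();
    let e = parse(&mut it)?;
    if it.next().is_some() { None } else { Some(e) }
}

// Tiny linear congruential generator, used here in place of proptest.
struct Lcg(u64);
impl Lcg {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
        self.0 >> 33
    }
}

// Generate a random expression; only leaves are produced at depth 0.
fn gen_expr(rng: &mut Lcg, depth: u32) -> Expr {
    let choice = if depth == 0 { rng.next_u64() % 2 } else { rng.next_u64() % 4 };
    match choice {
        0 => Expr::Int((rng.next_u64() % 201) as i64 - 100),
        1 => Expr::Col(format!("c{}", rng.next_u64() % 3)),
        2 => Expr::Add(Box::new(gen_expr(rng, depth - 1)), Box::new(gen_expr(rng, depth - 1))),
        _ => Expr::Gt(Box::new(gen_expr(rng, depth - 1)), Box::new(gen_expr(rng, depth - 1))),
    }
}

// The property: converting out and back yields the original expression.
fn round_trips(seed: u64, cases: usize) -> bool {
    let mut rng = Lcg(seed);
    (0..cases).all(|_| {
        let e = gen_expr(&mut rng, 4);
        from_sexp(&to_sexp(&e)) == Some(e)
    })
}
```

With proptest, the generator and loop would be replaced by a strategy and a `proptest!` block, which also adds shrinking of failing cases.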

// 2. Look up table sources from the original plan
// 3. Reconstruct schema information
//
// This is a significant undertaking and would require careful handling

TODO: need to implement this.

@alxmrs alxmrs left a comment

A few notes about the cost optimizer.

self.params.io_weight * bytes
}

ZarrPlan::ZarrScan([_path, _coords, _vars]) => {

Tunable param: we should be able to set whether I/O goes to a remote source or to local disk.

Further, this should be data-driven.
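
One way to read these two notes, sketched under assumptions: only `io_weight` comes from the PR's cost function. `StorageKind`, `CostParams`, `remote_penalty`, `scan_cost`, and `fit_io_weight` are hypothetical names. The scan cost takes the storage location as an explicit input, and the weight is fitted from measured (bytes, seconds) samples by least squares through the origin instead of being hand-set.

```rust
// Where the scanned bytes live. Hypothetical enum for this sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
enum StorageKind {
    LocalDisk,
    Remote,
}

// `io_weight` appears in the PR's cost function; `remote_penalty` is an
// assumed name for the remote storage penalty the description mentions.
#[derive(Debug, Clone, PartialEq)]
struct CostParams {
    io_weight: f64,
    remote_penalty: f64,
}

// I/O cost of scanning `bytes`, scaled up when the source is remote.
fn scan_cost(params: &CostParams, storage: StorageKind, bytes: f64) -> f64 {
    let base = params.io_weight * bytes;
    match storage {
        StorageKind::LocalDisk => base,
        StorageKind::Remote => base * params.remote_penalty,
    }
}

// Data-driven calibration: fit `seconds ~ io_weight * bytes` by least
// squares through the origin, i.e. w = sum(b * t) / sum(b * b).
// Returns None when there is no usable sample.
fn fit_io_weight(samples: &[(f64, f64)]) -> Option<f64> {
    let (num, den) = samples
        .iter()
        .fold((0.0, 0.0), |(n, d), &(bytes, secs)| (n + bytes * secs, d + bytes * bytes));
    if den > 0.0 { Some(num / den) } else { None }
}
```

Fitting `io_weight` separately on local and remote samples would also give a measured `remote_penalty` as the ratio of the two weights.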

| ZarrPlan::Empty => 0.0,

// === Zarr-specific ===
ZarrPlan::Resample([input, _dim, _freq]) => {

Look into this; I'm not sure what it refers to.

@alxmrs alxmrs left a comment

Significant issue.

}

#[cfg(test)]
mod tests {

Needs more tests.

//! - **Expressions**: arithmetic (+, -, *, /, %), comparison (=, <>, <, <=, >, >=)
//! - **Logical**: and, or, not, is_null, is_not_null
//! - **Aggregates**: count, sum, avg, min, max
//! - **Zarr-specific**: zarr_scan, coord_filter, resample

Requires closer attention.

);

// Convert back to LogicalPlan
let optimized = match egg_to_logical_plan(&best_expr, &plan) {

This is broken.

- Implement complete bidirectional conversion between DataFusion LogicalPlan and egg RecExpr
- Add ConversionContext to preserve table scans and schemas during round-trip
- Handle all major plan types: Scan, Filter, Project, Aggregate, Sort, Limit, Distinct, Union, Joins
- Handle all major expression types: literals, columns, arithmetic, comparison, logical, aggregates
- Add property-based tests using proptest for scalar and expression round-trips
- Add Extreme Weather Bench inspired integration tests covering:
  - Hot days/threshold queries
  - Temporal and spatial aggregation
  - Compound filter conditions
  - Expression simplification (constant folding, boolean tautology/contradiction)
  - Sort and limit operations
  - ERA5 climate data patterns
@alxmrs alxmrs closed this Feb 4, 2026