Cross-dataset arithmetic fails — alias syntax not parsed, multi-source queries broken for datasets #2

@jayendra13

Summary

Two related issues prevent computing expressions across two datasets (e.g., forecast.temp - target.temp):

  1. Alias syntax doesn't parse for datasets: FROM dataset1 a, dataset2 b fails with a parse error.
  2. Multi-variable SELECT from datasets is not supported: even within a single dataset, SELECT u_wind - v_wind FROM climate fails.

Reproduction

Issue 1: Alias syntax parse failure

REGISTER DATASET climate FROM 'zarr:///path/to/tutorial_climate.zarr';

-- Expected: compute self-difference (should be all zeros)
-- Actual: parse error
SELECT a.temperature - b.temperature FROM climate a, climate b;
-- Error: query error: unexpected trailing tokens after statement: Ident("a")

The parser doesn't recognize FROM <dataset_name> <alias>: it treats the statement as complete after the dataset name and rejects the alias as a trailing token.
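A minimal sketch of the intended FROM-item grammar, name [alias] (, name [alias])*, may help pin down the expected behavior. It is written in Python purely for illustration; it is not the code in stmt.rs, and Source, CLAUSE_KEYWORDS, and parse_from_items are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Source:
    name: str
    alias: Optional[str] = None


# Hypothetical keyword list: tokens that end the FROM clause rather than
# naming an alias. The real grammar may differ.
CLAUSE_KEYWORDS = {"WHERE", "GROUP", "ORDER", "LIMIT"}


def parse_from_items(tokens: List[str]) -> Tuple[List[Source], List[str]]:
    """Parse `name [alias] (, name [alias])*`; return sources and leftover tokens."""
    sources: List[Source] = []
    i = 0
    while True:
        if i >= len(tokens) or tokens[i] in (",", ";"):
            raise ValueError("expected a source name")
        source = Source(name=tokens[i])
        i += 1
        # An identifier directly after the name is an alias, not a trailing token.
        if (
            i < len(tokens)
            and tokens[i] not in (",", ";")
            and tokens[i].upper() not in CLAUSE_KEYWORDS
        ):
            source.alias = tokens[i]
            i += 1
        sources.append(source)
        # A comma introduces another source; anything else ends the clause.
        if i < len(tokens) and tokens[i] == ",":
            i += 1
            continue
        return sources, tokens[i:]


# Tokens following FROM in the failing query above.
sources, rest = parse_from_items(["climate", "a", ",", "climate", "b", ";"])
```

Under this sketch the self-join yields two sources that share the dataset name climate but carry distinct aliases, which is what lets a.temperature and b.temperature be told apart later.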

Issue 2: Multi-variable expressions from datasets

-- Expected: element-wise difference between u_wind and v_wind arrays
-- Actual: error
SELECT u_wind - v_wind FROM climate;
-- Error: not implemented: multi-variable SELECT from datasets is not yet supported (found: u_wind, v_wind)

When a dataset has multiple data variables, the query planner can't resolve expressions that reference more than one variable.

Expected Behavior

Cross-dataset queries

REGISTER DATASET era5 FROM 'zarr:///path/to/era5.zarr';
REGISTER DATASET hres FROM 'zarr:///path/to/hres.zarr';

-- Should align on shared dimensions (lat, lon, time) and compute element-wise difference
SELECT a.temperature - b.temperature FROM era5 a, hres b;

Multi-variable expressions

-- Should compute element-wise difference between two variables sharing the same dimensions
SELECT u_wind - v_wind FROM climate;

-- Wind speed from components
SELECT sqrt(u_wind * u_wind + v_wind * v_wind) FROM climate;
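For reference, the element-wise semantics expected from the two queries above can be sketched as follows. This uses flat Python lists as a stand-in for arrays that share the same dimensions; the elementwise helper and the sample values are made up for illustration and are not part of the project.

```python
import math


def elementwise(fn, *arrays):
    """Apply fn cell by cell across arrays that must have the same length."""
    if len({len(a) for a in arrays}) != 1:
        raise ValueError("variables must share the same dimensions")
    return [fn(*cells) for cells in zip(*arrays)]


# Made-up sample values for the two wind components.
u_wind = [3.0, 0.0, -6.0]
v_wind = [4.0, 2.0, 8.0]

# SELECT u_wind - v_wind FROM climate;
difference = elementwise(lambda u, v: u - v, u_wind, v_wind)

# SELECT sqrt(u_wind * u_wind + v_wind * v_wind) FROM climate;
speed = elementwise(lambda u, v: math.sqrt(u * u + v * v), u_wind, v_wind)

print(difference)  # [-1.0, -2.0, -14.0]
print(speed)       # [5.0, 2.0, 10.0]
```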

Context

Cross-dataset arithmetic is essential for forecast evaluation metrics:

  • MAE = avg_cells(abs(forecast.temp - target.temp))
  • RMSE = sqrt(avg_cells((forecast.temp - target.temp)^2))
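As a worked example of the two metrics, here is a small Python sketch. It assumes avg_cells means the arithmetic mean over all cells (the issue does not define it) and that both arrays are already aligned cell for cell; the sample values are made up.

```python
import math


def avg_cells(values):
    """Assumed semantics: arithmetic mean over every cell."""
    return sum(values) / len(values)


def mae(forecast, target):
    if len(forecast) != len(target):
        raise ValueError("arrays must be aligned to the same shape")
    return avg_cells([abs(f - t) for f, t in zip(forecast, target)])


def rmse(forecast, target):
    if len(forecast) != len(target):
        raise ValueError("arrays must be aligned to the same shape")
    return math.sqrt(avg_cells([(f - t) ** 2 for f, t in zip(forecast, target)]))


forecast = [1.0, 2.0, 3.0, 4.0]
target = [1.0, 0.0, 3.0, 0.0]

# Differences are 0, 2, 0, 4, so MAE = 6 / 4 and RMSE = sqrt(20 / 4).
print(mae(forecast, target))   # 1.5
print(rmse(forecast, target))  # sqrt(5), about 2.236
```

Both metrics need the subtraction to run across two sources before the aggregation, which is why the alias failure blocks them.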

The array-level query system (REGISTER ARRAY) does support multi-source queries with aliases (FROM arr1 a, arr2 b) via the join logic in crates/arrdb-query/src/planner/logical.rs:209-240. The gap is in the dataset-to-array resolution layer — dataset queries need to expand variable references into the underlying array sources.
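One possible shape for that resolution step, sketched in Python rather than the project's Rust: each variable reference is mapped to an (alias, array source) pair, so the existing array join logic could take over from there. The resolve_variable function, the catalog layout, and the dataset/variable array keys are all hypothetical, not the crate's actual API.

```python
def resolve_variable(ref, sources, datasets):
    """Map a variable reference to (alias, array key).

    sources:  alias -> dataset name, taken from the FROM clause
    datasets: dataset name -> {variable name -> array key}
    """
    if "." in ref:
        # Qualified reference such as a.temperature.
        alias, var = ref.split(".", 1)
        if alias not in sources:
            raise KeyError(f"unknown alias: {alias}")
        variables = datasets[sources[alias]]
        if var not in variables:
            raise KeyError(f"no variable {var} in dataset {sources[alias]}")
        return alias, variables[var]
    # Unqualified reference: it must match exactly one source.
    matches = [a for a, ds in sources.items() if ref in datasets[ds]]
    if len(matches) != 1:
        raise ValueError(f"{ref} matches {len(matches)} sources")
    alias = matches[0]
    return alias, datasets[sources[alias]][ref]


# Hypothetical catalog built by REGISTER DATASET.
datasets = {
    "era5": {"temperature": "era5/temperature"},
    "hres": {"temperature": "hres/temperature"},
    "climate": {"u_wind": "climate/u_wind", "v_wind": "climate/v_wind"},
}

# SELECT a.temperature - b.temperature FROM era5 a, hres b;
cross = [
    resolve_variable(r, {"a": "era5", "b": "hres"}, datasets)
    for r in ("a.temperature", "b.temperature")
]

# SELECT u_wind - v_wind FROM climate;
single = [
    resolve_variable(r, {"climate": "climate"}, datasets)
    for r in ("u_wind", "v_wind")
]
```

In this sketch a multi-variable query over one dataset and a cross-dataset query both reduce to the same thing: a list of array sources, which is the case the array-level planner already handles.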

Files

  • crates/arrdb-query/src/parser/stmt.rs — FROM clause parsing for datasets may not handle aliases
  • crates/arrdb-query/src/planner/logical.rs:90-106 — multi-source plan building (works for arrays)
  • crates/arrdb-exec/src/session.rs — dataset query path needs multi-variable expression support
