No-polars in ggsql #350

Open
thomasp85 wants to merge 10 commits into main from no-polar

Conversation

Collaborator

@thomasp85 thomasp85 commented Apr 22, 2026

Replace polars with arrow-rs

Why?

polars is the single largest dependency in ggsql — 328 transitive crates — yet it's used almost entirely as a passive data container. The real work happens in SQL (via DuckDB/SQLite), and DuckDB already requires arrow (92 crates). Dropping polars eliminates ~236 transitive crates with no loss of functionality.

Verified dep count: ggsql drops from 418 → 182 transitive crates.

Approach: thin DataFrame wrapper around arrow::RecordBatch

Rather than using RecordBatch directly (immutable, no column-by-name lookup, missing constructors), the PR introduces a thin wrapper that provides the ~12 methods the codebase actually uses. This was the lowest-churn path across ~50 affected files.

Three new modules form the migration foundation:

  • src/dataframe.rs: The DataFrame wrapper + df! test macro. Wraps RecordBatch, exposes height/width/column/with_column/rename/drop/replace/slice/…
  • src/array_util.rs: Replaces polars' series.f64() / series.str() with as_f64(array) / as_str(array) downcasts; plus constructors, cast_array, fill_null_f64, value_to_string
  • src/compute.rs: Grouped window ops for position adjustments: sort_dataframe, compute_group_ids, grouped_cumsum, grouped_cumsum_lag, grouped_sum_broadcast
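To make the wrapper idea concrete, here is a hypothetical miniature of it. This is not the actual src/dataframe.rs API: the real struct wraps arrow::RecordBatch, while this sketch models columns as plain vectors so it compiles standalone. The point is the API shape RecordBatch lacks: column-by-name lookup and easy column replacement/addition.

```rust
// Hypothetical miniature of the DataFrame wrapper (NOT the real
// src/dataframe.rs). Columns are modeled as Vec<Option<f64>> instead of
// arrow arrays so the sketch is self-contained.
#[derive(Clone, Debug)]
struct DataFrame {
    names: Vec<String>,
    columns: Vec<Vec<Option<f64>>>,
}

impl DataFrame {
    fn height(&self) -> usize {
        self.columns.first().map_or(0, |c| c.len())
    }

    fn width(&self) -> usize {
        self.columns.len()
    }

    // Column-by-name lookup, which RecordBatch only offers via the schema.
    fn column(&self, name: &str) -> Option<&[Option<f64>]> {
        let idx = self.names.iter().position(|n| n == name)?;
        self.columns.get(idx).map(Vec::as_slice)
    }

    // Replace an existing column or append a new one, consuming self,
    // which is the ergonomic gap an immutable RecordBatch leaves open.
    fn with_column(mut self, name: &str, values: Vec<Option<f64>>) -> Self {
        match self.names.iter().position(|n| n == name) {
            Some(idx) => self.columns[idx] = values, // replace in place
            None => {
                self.names.push(name.to_string());
                self.columns.push(values);
            }
        }
        self
    }
}
```

The real wrapper does the same bookkeeping against Arc'd arrow arrays and a shared schema, which keeps the ~50 affected files on a familiar polars-like surface.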

The hard part: position adjustments

stack.rs was the only place using polars' lazy API with grouped window functions (cum_sum().over(), shift(), fill_null()). We considered pushing position adjustments into SQL, but scale-type inference happens after query execution, so we'd hit a chicken-and-egg problem. Instead, ~50 lines of polars lazy expressions became ~120 lines of arrow compute calls in stack.rs, using the primitives in compute.rs. dodge.rs and jitter.rs followed the same pattern.
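The core primitive can be sketched as follows. This is illustrative, not the actual compute.rs code: columns are modeled as slices of Option<f64> and groups as precomputed group ids (mirroring what compute_group_ids would produce), and the null handling shown (nulls stay null and do not advance the running sum) is one plausible choice matching polars' cum_sum semantics.

```rust
// Illustrative grouped cumulative sum: for each row, add its value to the
// running sum of its group and emit the new total; null slots pass through
// as null without touching the group's running sum.
fn grouped_cumsum(values: &[Option<f64>], group_ids: &[usize]) -> Vec<Option<f64>> {
    use std::collections::HashMap;
    let mut running: HashMap<usize, f64> = HashMap::new();
    values
        .iter()
        .zip(group_ids)
        .map(|(v, &g)| {
            v.map(|x| {
                let sum = running.entry(g).or_insert(0.0);
                *sum += x;
                *sum
            })
        })
        .collect()
}
```

A lagged variant (the analogue of cum_sum().over(...) combined with shift()) would emit each group's running sum *before* adding the current value, which is what stacking needs to compute each bar segment's baseline.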

The position-adjustment tests are the primary acceptance criteria here — they encode a lot of tricky numeric behavior (fill/center modes, grouped cumsums, null handling).

Migrations across the codebase

  • Readers: duckdb.rs, sqlite.rs, odbc.rs — replaced polars::Series builders with arrow array builders. In DuckDB, dataframe_to_arrow_params simplifies to df.inner().clone() since our DataFrame is a RecordBatch.
  • Parquet reading (reader/data.rs): polars::ParquetReader → parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder. Added parquet as a direct dep.
  • DataType references (~25 files): mechanical rename — DataType::String → Utf8, Date → Date32, Datetime(µs, tz) → Timestamp(Microsecond, tz), Time → Time64(Nanosecond), Categorical → Utf8.
  • Writers (vegalite/data.rs, encoding.rs, layer.rs): series downcasts → arrow downcasts with explicit null checks (arrow doesn't auto-skip nulls the way polars' iterators do).
  • ggsql-wasm: same pattern — polars Series construction → Arc etc. WASM-specific getrandom/uuid feature overrides added for wasm32-unknown-unknown.
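The null-handling point in the writers bullet is the one behavioral difference worth internalizing. A sketch of the pattern, modeled without the arrow crate by pairing raw values with a validity mask (which is how arrow physically stores nullable arrays):

```rust
// Illustrative only: arrow stores a nullable column as raw values plus a
// validity bitmap, and reading a null slot yields an arbitrary raw value
// rather than skipping it the way polars' iterators can. Every migrated
// writer loop therefore gains an explicit is-valid branch like this one.
fn render_values(raw: &[f64], validity: &[bool]) -> Vec<String> {
    raw.iter()
        .zip(validity)
        .map(|(v, &is_valid)| {
            if is_valid {
                v.to_string()
            } else {
                "null".to_string() // the explicit branch arrow forces you to write
            }
        })
        .collect()
}
```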

Gotchas worth reviewer attention

  1. Parquet file regeneration. The bundled penguins.parquet and airquality.parquet were originally written by R's nanoparquet package, which produced an ARROW:schema blob that fails flatbuffers alignment in arrow-rs. They were regenerated with arrow-rs itself. Documentation at the top of reader/data.rs calls out which writers are known compatible (pyarrow/arrow-rs/DuckDB) vs. incompatible (nanoparquet). A new test all_builtin_parquets_load iterates KNOWN_DATASETS so CI catches any future incompatible additions.
  2. Temporal ↔ floating casts. Arrow's compute::cast can't cross the temporal/floating boundary directly — Date32 → Float64 fails, you have to go via Int32. Rather than special-case every call site (there are ~15 of them), array_util::cast_array was extended to bridge these conversions transparently via the integer backing type. This was discovered by two user-reported bugs during review (histogram on an Int64 column; boxplot with a Date x-axis + SCALE BINNED).
  3. ggsql-python removed from the monorepo. It now lives in its own repository, so this PR doesn't touch it.
  4. Versioning. All workspace crates now inherit version.workspace = true. pyproject.toml in the (external) Python package is not auto-synced — noted for a future release-script task.
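Gotcha 2 is easiest to picture as a two-step cast. The sketch below assumes Date32 is i32 days-since-epoch (which matches arrow's physical layout) and models the bridge without the arrow crate; the real cast_array does the equivalent with compute::cast calls.

```rust
// Sketch of the temporal -> floating bridge from gotcha 2: since a direct
// Date32 -> Float64 cast is rejected, reinterpret Date32 as its i32
// backing values (days since epoch), then do the supported numeric cast
// i32 -> f64, preserving nulls along the way.
fn date32_to_f64(days: &[Option<i32>]) -> Vec<Option<f64>> {
    days.iter()
        .map(|slot| slot.map(|days_since_epoch| f64::from(days_since_epoch)))
        .collect()
}
```

Centralizing this in cast_array means the ~15 call sites keep calling one function and the integer detour stays invisible to them.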

Test coverage

  • 1343 unit tests pass, 0 failed, 1 ignored.
  • All existing position-adjustment tests were preserved and pass unchanged, which was the strictest signal that the arrow rewrite of stack/dodge/jitter is behavior-equivalent.
  • New tests added for the cases that surfaced during review: cast_array temporal↔floating bridging in array_util, apply_oob_to_column_numeric with Date32 in execute/scale.rs, and the histogram null-error path.

What to look at first as a reviewer

  • src/dataframe.rs — the API surface everything else depends on. If this is right, the rest of the churn is mechanical.
  • src/plot/layer/position/stack.rs — the only genuinely non-mechanical rewrite; worth reading against the polars version in main to convince yourself the arrow compute chain is equivalent.
  • src/array_util.rs::cast_array — the temporal/floating bridge. Subtle but high-leverage because many call sites rely on it.
  • src/reader/data.rs — parquet compatibility docs + the iterating test that keeps future datasets honest.

PR summary written by Claude

@thomasp85
Collaborator Author

/format

@github-actions

/format failed. If this is a fork PR, make sure "Allow edits from maintainers" is enabled.

@thomasp85
Collaborator Author

/format

@github-actions

✨ Formatted and pushed.

@thomasp85 thomasp85 requested a review from georgestagg April 22, 2026 11:08
@teunbrand
Collaborator

I think a few CLAUDE.md lines still mention polars:

ggsql/CLAUDE.md

Line 127 in d053660

│ (Polars) │ │

ggsql/CLAUDE.md

Line 537 in d053660

- SQL execution → Polars DataFrame conversion

ggsql/CLAUDE.md

Line 1336 in d053660

ResultSetDataFrame (Polars)

Collaborator

@georgestagg georgestagg left a comment


I'm 28/57 files in, but posting an initial set of comments now to get the ball rolling.

Also consider the following suggestions, from codex (with a grain of salt and my apologies for the direct LLM copy/paste):

Finding 2 — DataFrame::drop_by_index loses row count
src/dataframe.rs:287-288 — When dropping the last column, it returns Self::empty(), which is 0×0. Annotation layers that have only literal columns can collapse to zero rows, causing marks to disappear silently.

Finding 3 — drop_many swallows errors
src/dataframe.rs:210-217 — Returns Self::empty() both when all columns are dropped (same row-count issue) and when RecordBatch::try_new fails (silent data loss).

Comment thread src/array_util.rs
Comment on lines +187 to +188
DataType::Boolean => as_bool(array).unwrap().value(idx).to_string(),
_ => format!("{:?}", array.data_type()),
Collaborator


Some extra types, suggested by Codex:

Suggested change

DataType::Boolean => as_bool(array).unwrap().value(idx).to_string(),
_ => format!("{:?}", array.data_type()),

becomes:

DataType::LargeUtf8 => array
    .as_any()
    .downcast_ref::<LargeStringArray>()
    .unwrap()
    .value(idx)
    .to_string(),
DataType::Boolean => as_bool(array).unwrap().value(idx).to_string(),
DataType::Date32 => {
    let days = as_date32(array).unwrap().value(idx);
    format!("{}", days)
}
DataType::Date64 => {
    let ms = array
        .as_any()
        .downcast_ref::<arrow::array::Date64Array>()
        .unwrap()
        .value(idx);
    format!("{}", ms)
}
_ => arrow::util::display::ArrayFormatter::try_new(array.as_ref(), &Default::default())
    .map(|f| f.value(idx).to_string())
    .unwrap_or_else(|_| format!("{:?}", array.data_type())),

This does seem to fix the area geom bug discussed off-GitHub.

Comment thread src/reader/data.rs
let mut tmp_path = env::temp_dir();
tmp_path.push(format!("{}.parquet", name));
if !tmp_path.exists() {
fs::write(&tmp_path, parquet_bytes).expect("Failed to write dataset");
Collaborator


Suggested change

fs::write(&tmp_path, parquet_bytes).map_err(|e| {
    GgsqlError::ReaderError(format!(
        "Failed to write builtin dataset '{}' to {}: {}",
        name,
        tmp_path.display(),
        e
    ))
})?;

I don't know why this was flagged on this PR, but it seems a reasonable change to raise the error rather than panic.

Comment thread src/reader/data.rs
Comment on lines +21 to +28
// Known-compatible writers:
// - Python `pyarrow` (`pq.write_table(...)`)
// - Rust `arrow-rs` + `parquet` (`ArrowWriter`)
// - DuckDB (`COPY ... TO 'file.parquet'`)
//
// Known-incompatible writers:
// - R `nanoparquet` — writes ARROW:schema with a different flatbuffers
// alignment that arrow-rs's strict reader rejects.
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a shame :(

Collaborator Author


Yeah - though I don't think it is that bad. It only affects the internal datasets, and we just have to format them correctly. There was a workaround for it, but I figured it was better not to keep weird stuff in the code base just because we couldn't be bothered to rewrite the included data.
