No-polars in ggsql #350

Open
thomasp85 wants to merge 10 commits into main from no-polar

Conversation

Collaborator

@thomasp85 thomasp85 commented Apr 22, 2026

Replace polars with arrow-rs

Why?

polars is the single largest dependency in ggsql — 328 transitive crates — yet it's used almost entirely as a passive data container. The real work happens in SQL (via DuckDB/SQLite), and DuckDB already requires arrow (92 crates). Dropping polars eliminates ~236 transitive crates with no loss of functionality.

Verified dep count: ggsql drops from 418 → 182 transitive crates.

Approach: thin DataFrame wrapper around arrow::RecordBatch

Rather than using RecordBatch directly (immutable, no column-by-name lookup, missing constructors), the PR introduces a thin wrapper that provides the ~12 methods the codebase actually uses. This was the lowest-churn path across ~50 affected files.

Three new modules form the migration foundation:

  • src/dataframe.rs: The DataFrame wrapper + df! test macro. Wraps RecordBatch, exposes height/width/column/with_column/rename/drop/replace/slice/…
  • src/array_util.rs: Replaces polars' series.f64() / series.str() with as_f64(array) / as_str(array) downcasts; plus constructors, cast_array, fill_null_f64, value_to_string
  • src/compute.rs: Grouped window ops for position adjustments: sort_dataframe, compute_group_ids, grouped_cumsum, grouped_cumsum_lag, grouped_sum_broadcast
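To make the wrapper idea concrete, here is a hypothetical miniature of it. This is not the actual src/dataframe.rs API: the real struct wraps arrow::RecordBatch, while this sketch models columns as plain vectors so it compiles standalone. The point is the API shape RecordBatch lacks: column-by-name lookup and easy column replacement/addition.

```rust
// Hypothetical miniature of the DataFrame wrapper (NOT the real
// src/dataframe.rs). Columns are modeled as Vec<Option<f64>> instead of
// arrow arrays so the sketch is self-contained.
#[derive(Clone, Debug)]
struct DataFrame {
    names: Vec<String>,
    columns: Vec<Vec<Option<f64>>>,
}

impl DataFrame {
    fn height(&self) -> usize {
        self.columns.first().map_or(0, |c| c.len())
    }

    fn width(&self) -> usize {
        self.columns.len()
    }

    // Column-by-name lookup, which RecordBatch only offers via the schema.
    fn column(&self, name: &str) -> Option<&[Option<f64>]> {
        let idx = self.names.iter().position(|n| n == name)?;
        self.columns.get(idx).map(Vec::as_slice)
    }

    // Replace an existing column or append a new one, consuming self,
    // which is the ergonomic gap an immutable RecordBatch leaves open.
    fn with_column(mut self, name: &str, values: Vec<Option<f64>>) -> Self {
        match self.names.iter().position(|n| n == name) {
            Some(idx) => self.columns[idx] = values, // replace in place
            None => {
                self.names.push(name.to_string());
                self.columns.push(values);
            }
        }
        self
    }
}
```

The real wrapper does the same bookkeeping against Arc'd arrow arrays and a shared schema, which keeps the ~50 affected files on a familiar polars-like surface.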

The hard part: position adjustments

stack.rs was the only place using polars' lazy API with grouped window functions (cum_sum().over(), shift(), fill_null()). We considered pushing position adjustments into SQL, but scale-type inference happens after query execution, so we'd hit a chicken-and-egg problem. Instead, ~50 lines of polars lazy expressions became ~120 lines of arrow compute calls in stack.rs, using the primitives in compute.rs. dodge.rs and jitter.rs followed the same pattern.
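The core primitive can be sketched as follows. This is illustrative, not the actual compute.rs code: columns are modeled as slices of Option<f64> and groups as precomputed group ids (mirroring what compute_group_ids would produce), and the null handling shown (nulls stay null and do not advance the running sum) is one plausible choice matching polars' cum_sum semantics.

```rust
// Illustrative grouped cumulative sum: for each row, add its value to the
// running sum of its group and emit the new total; null slots pass through
// as null without touching the group's running sum.
fn grouped_cumsum(values: &[Option<f64>], group_ids: &[usize]) -> Vec<Option<f64>> {
    use std::collections::HashMap;
    let mut running: HashMap<usize, f64> = HashMap::new();
    values
        .iter()
        .zip(group_ids)
        .map(|(v, &g)| {
            v.map(|x| {
                let sum = running.entry(g).or_insert(0.0);
                *sum += x;
                *sum
            })
        })
        .collect()
}
```

A lagged variant (the analogue of cum_sum().over(...) combined with shift()) would emit each group's running sum *before* adding the current value, which is what stacking needs to compute each bar segment's baseline.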

The position-adjustment tests are the primary acceptance criteria here — they encode a lot of tricky numeric behavior (fill/center modes, grouped cumsums, null handling).

Migrations across the codebase

  • Readers: duckdb.rs, sqlite.rs, odbc.rs — replaced polars::Series builders with arrow array builders. In DuckDB, dataframe_to_arrow_params simplifies to df.inner().clone() since our DataFrame is a RecordBatch.
  • Parquet reading (reader/data.rs): polars::ParquetReader → parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder. Added parquet as a direct dep.
  • DataType references (~25 files): mechanical rename — DataType::String → Utf8, Date → Date32, Datetime(µs, tz) → Timestamp(Microsecond, tz), Time → Time64(Nanosecond), Categorical → Utf8.
  • Writers (vegalite/data.rs, encoding.rs, layer.rs): series downcasts → arrow downcasts with explicit null checks (arrow doesn't auto-skip nulls the way polars' iterators do).
  • ggsql-wasm: same pattern — polars Series construction → Arc etc. WASM-specific getrandom/uuid feature overrides added for wasm32-unknown-unknown.
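The null-handling point in the writers bullet is the one behavioral difference worth internalizing. A sketch of the pattern, modeled without the arrow crate by pairing raw values with a validity mask (which is how arrow physically stores nullable arrays):

```rust
// Illustrative only: arrow stores a nullable column as raw values plus a
// validity bitmap, and reading a null slot yields an arbitrary raw value
// rather than skipping it the way polars' iterators can. Every migrated
// writer loop therefore gains an explicit is-valid branch like this one.
fn render_values(raw: &[f64], validity: &[bool]) -> Vec<String> {
    raw.iter()
        .zip(validity)
        .map(|(v, &is_valid)| {
            if is_valid {
                v.to_string()
            } else {
                "null".to_string() // the explicit branch arrow forces you to write
            }
        })
        .collect()
}
```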

Gotchas worth reviewer attention

  1. Parquet file regeneration. The bundled penguins.parquet and airquality.parquet were originally written by R's nanoparquet package, which produced an ARROW:schema blob that fails flatbuffers alignment in arrow-rs. They were regenerated with arrow-rs itself. Documentation at the top of reader/data.rs calls out which writers are known compatible (pyarrow/arrow-rs/DuckDB) vs. incompatible (nanoparquet). A new test all_builtin_parquets_load iterates KNOWN_DATASETS so CI catches any future incompatible additions.
  2. Temporal ↔ floating casts. Arrow's compute::cast can't cross the temporal/floating boundary directly — Date32 → Float64 fails, you have to go via Int32. Rather than special-case every call site (there are ~15 of them), array_util::cast_array was extended to bridge these conversions transparently via the integer backing type. This was discovered by two user-reported bugs during review (histogram on an Int64 column; boxplot with a Date x-axis + SCALE BINNED).
  3. ggsql-python removed from the monorepo. It now lives in its own repository, so this PR doesn't touch it.
  4. Versioning. All workspace crates now inherit version.workspace = true. pyproject.toml in the (external) Python package is not auto-synced — noted for a future release-script task.
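Gotcha 2 is easiest to picture as a two-step cast. The sketch below assumes Date32 is i32 days-since-epoch (which matches arrow's physical layout) and models the bridge without the arrow crate; the real cast_array does the equivalent with compute::cast calls.

```rust
// Sketch of the temporal -> floating bridge from gotcha 2: since a direct
// Date32 -> Float64 cast is rejected, reinterpret Date32 as its i32
// backing values (days since epoch), then do the supported numeric cast
// i32 -> f64, preserving nulls along the way.
fn date32_to_f64(days: &[Option<i32>]) -> Vec<Option<f64>> {
    days.iter()
        .map(|slot| slot.map(|days_since_epoch| f64::from(days_since_epoch)))
        .collect()
}
```

Centralizing this in cast_array means the ~15 call sites keep calling one function and the integer detour stays invisible to them.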

Test coverage

  • 1343 unit tests pass, 0 failed, 1 ignored.
  • All existing position-adjustment tests were preserved and pass unchanged, which was the strictest signal that the arrow rewrite of stack/dodge/jitter is behavior-equivalent.
  • New tests added for the cases that surfaced during review: cast_array temporal↔floating bridging in array_util, apply_oob_to_column_numeric with Date32 in execute/scale.rs, and the histogram null-error path.

What to look at first as a reviewer

  • src/dataframe.rs — the API surface everything else depends on. If this is right, the rest of the churn is mechanical.
  • src/plot/layer/position/stack.rs — the only genuinely non-mechanical rewrite; worth reading against the polars version in main to convince yourself the arrow compute chain is equivalent.
  • src/array_util.rs::cast_array — the temporal/floating bridge. Subtle but high-leverage because many call sites rely on it.
  • src/reader/data.rs — parquet compatibility docs + the iterating test that keeps future datasets honest.

PR summary written by Claude

@thomasp85
Collaborator Author

/format

@github-actions

/format failed. If this is a fork PR, make sure "Allow edits from maintainers" is enabled.

@thomasp85
Collaborator Author

/format

@github-actions

✨ Formatted and pushed.

@thomasp85 thomasp85 requested a review from georgestagg April 22, 2026 11:08
@teunbrand
Collaborator

I think a few CLAUDE.md lines still mention polars:

ggsql/CLAUDE.md

Line 127 in d053660

│ (Polars) │ │

ggsql/CLAUDE.md

Line 537 in d053660

- SQL execution → Polars DataFrame conversion

ggsql/CLAUDE.md

Line 1336 in d053660

ResultSetDataFrame (Polars)

Collaborator

@georgestagg georgestagg left a comment


I'm 28/57 files in, but posting an initial set of comments now to get the ball rolling.

Also consider the following suggestions, from codex (with a grain of salt and my apologies for the direct LLM copy/paste):

Finding 2 — DataFrame::drop_by_index loses row count
src/dataframe.rs:287-288 — When dropping the last column, it returns Self::empty(), which is 0×0. Annotation layers that have only literal columns can collapse to zero rows, causing marks to disappear silently.

Finding 3 — drop_many swallows errors
src/dataframe.rs:210-217 — Returns Self::empty() both when all columns are dropped (same row-count issue) and when RecordBatch::try_new fails (silent data loss).

Comment thread src/array_util.rs
Comment on lines +187 to +188
DataType::Boolean => as_bool(array).unwrap().value(idx).to_string(),
_ => format!("{:?}", array.data_type()),
Collaborator


Some extra types, suggested by Codex:

Suggested change

DataType::Boolean => as_bool(array).unwrap().value(idx).to_string(),
_ => format!("{:?}", array.data_type()),

becomes:

DataType::LargeUtf8 => array
    .as_any()
    .downcast_ref::<LargeStringArray>()
    .unwrap()
    .value(idx)
    .to_string(),
DataType::Boolean => as_bool(array).unwrap().value(idx).to_string(),
DataType::Date32 => {
    let days = as_date32(array).unwrap().value(idx);
    format!("{}", days)
}
DataType::Date64 => {
    let ms = array
        .as_any()
        .downcast_ref::<arrow::array::Date64Array>()
        .unwrap()
        .value(idx);
    format!("{}", ms)
}
_ => arrow::util::display::ArrayFormatter::try_new(array.as_ref(), &Default::default())
    .map(|f| f.value(idx).to_string())
    .unwrap_or_else(|_| format!("{:?}", array.data_type())),

This does seem to fix the area geom bug discussed off-GitHub.

Comment thread src/reader/data.rs
let mut tmp_path = env::temp_dir();
tmp_path.push(format!("{}.parquet", name));
if !tmp_path.exists() {
fs::write(&tmp_path, parquet_bytes).expect("Failed to write dataset");
Collaborator


Suggested change

fs::write(&tmp_path, parquet_bytes).map_err(|e| {
    GgsqlError::ReaderError(format!(
        "Failed to write builtin dataset '{}' to {}: {}",
        name,
        tmp_path.display(),
        e
    ))
})?;

I don't know why this was flagged on this PR, but it seems a reasonable change to raise the error rather than panic.

Comment thread src/reader/data.rs
Comment on lines +21 to +28
// Known-compatible writers:
// - Python `pyarrow` (`pq.write_table(...)`)
// - Rust `arrow-rs` + `parquet` (`ArrowWriter`)
// - DuckDB (`COPY ... TO 'file.parquet'`)
//
// Known-incompatible writers:
// - R `nanoparquet` — writes ARROW:schema with a different flatbuffers
// alignment that arrow-rs's strict reader rejects.
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a shame :(

Collaborator Author


Yeah - though I don't think it is that bad. It only affects the internal datasets, and we just have to format them correctly. There was a workaround for it, but I figured it was better not to keep weird stuff in the code base just because we couldn't be bothered to rewrite the included data.
