Conversation
/format

❌

/format

✨ Formatted and pushed.
georgestagg
left a comment
I am 28/57 files in, but posting an initial set of comments now to get the ball rolling.
Also consider the following suggestions from Codex (take them with a grain of salt, and my apologies for the direct LLM copy/paste):
Finding 2 — DataFrame::drop_by_index loses row count
src/dataframe.rs:287-288 — When dropping the last column, it returns Self::empty(), which is 0×0. Annotation layers that have only literal columns can collapse to zero rows, causing marks to disappear silently.
Finding 3 — drop_many swallows errors
src/dataframe.rs:210-217 — Returns Self::empty() both when all columns are dropped (same row-count issue) and when RecordBatch::try_new fails (silent data loss).
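One way both findings could be addressed is to track the row count separately from the columns and to return a Result instead of swallowing failures. A stdlib-only sketch of that idea (the real wrapper holds an arrow RecordBatch; the field names, column representation, and String error type here are hypothetical stand-ins):

```rust
#[derive(Clone, Debug, PartialEq)]
struct DataFrame {
    columns: Vec<(String, Vec<i64>)>,
    // Tracked explicitly so a 0-column frame keeps its height instead of
    // collapsing to 0x0 like Self::empty() does.
    num_rows: usize,
}

impl DataFrame {
    // Dropping the last column yields a 0 x num_rows frame, so annotation
    // layers with only literal columns keep their rows.
    fn drop_by_index(&self, idx: usize) -> DataFrame {
        let mut columns = self.columns.clone();
        columns.remove(idx);
        DataFrame { columns, num_rows: self.num_rows }
    }

    // Propagates failures instead of silently returning an empty frame.
    fn drop_many(&self, names: &[&str]) -> Result<DataFrame, String> {
        let mut out = self.clone();
        for name in names {
            let idx = out
                .columns
                .iter()
                .position(|(n, _)| n == name)
                .ok_or_else(|| format!("no such column: {}", name))?;
            out = out.drop_by_index(idx);
        }
        Ok(out)
    }
}
```

The key design point is that row count becomes independent state rather than being derived from a (possibly empty) column list.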
```rust
DataType::Boolean => as_bool(array).unwrap().value(idx).to_string(),
_ => format!("{:?}", array.data_type()),
```
Some extra types to handle, suggested by Codex:
```rust
DataType::Boolean => as_bool(array).unwrap().value(idx).to_string(),
DataType::LargeUtf8 => array
    .as_any()
    .downcast_ref::<LargeStringArray>()
    .unwrap()
    .value(idx)
    .to_string(),
DataType::Date32 => {
    let days = as_date32(array).unwrap().value(idx);
    format!("{}", days)
}
DataType::Date64 => {
    let ms = array
        .as_any()
        .downcast_ref::<arrow::array::Date64Array>()
        .unwrap()
        .value(idx);
    format!("{}", ms)
}
_ => arrow::util::display::ArrayFormatter::try_new(array.as_ref(), &Default::default())
    .map(|f| f.value(idx).to_string())
    .unwrap_or_else(|_| format!("{:?}", array.data_type())),
```
This does seem to fix the area geom bug discussed off-GitHub.
```rust
let mut tmp_path = env::temp_dir();
tmp_path.push(format!("{}.parquet", name));
if !tmp_path.exists() {
    fs::write(&tmp_path, parquet_bytes).expect("Failed to write dataset");
```
```rust
fs::write(&tmp_path, parquet_bytes).map_err(|e| {
    GgsqlError::ReaderError(format!(
        "Failed to write builtin dataset '{}' to {}: {}",
        name,
        tmp_path.display(),
        e
    ))
})?;
```
I don't know why this was flagged on this PR, but it seems a reasonable change to raise the error rather than panic.
```rust
// Known-compatible writers:
// - Python `pyarrow` (`pq.write_table(...)`)
// - Rust `arrow-rs` + `parquet` (`ArrowWriter`)
// - DuckDB (`COPY ... TO 'file.parquet'`)
//
// Known-incompatible writers:
// - R `nanoparquet` — writes ARROW:schema with a different flatbuffers
//   alignment that arrow-rs's strict reader rejects.
```
Yeah - though I don't think it's that bad. It only affects internal datasets, and we just have to format them correctly. There was a workaround for it, but I figured it was better not to have weird stuff in the codebase just because we couldn't be bothered to rewrite the included data.
Replace polars with arrow-rs
Why?
polars is the single largest dependency in ggsql — 328 transitive crates — yet it's used almost entirely as a passive data container. The real work happens in SQL (via DuckDB/SQLite), and DuckDB already requires arrow (92 crates). Dropping polars eliminates ~236 transitive crates with no loss of functionality.
Verified dep count: ggsql drops from 418 → 182 transitive crates.
Approach: thin DataFrame wrapper around arrow::RecordBatch
Rather than using RecordBatch directly (immutable, no column-by-name lookup, missing constructors), the PR introduces a thin wrapper that provides the ~12 methods the codebase actually uses. This was the lowest-churn path across ~50 affected files.
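As a rough illustration of the wrapper idea, here is a stdlib-only sketch in which a plain struct stands in for arrow's RecordBatch. The method names (`column`, `num_rows`) and the name-to-index map are illustrative assumptions, not the PR's exact API:

```rust
use std::collections::HashMap;

// Hypothetical stand-in for arrow::record_batch::RecordBatch.
struct Batch {
    names: Vec<String>,
    columns: Vec<Vec<f64>>,
}

// Thin wrapper adding the conveniences RecordBatch lacks,
// such as column-by-name lookup.
struct DataFrame {
    batch: Batch,
    by_name: HashMap<String, usize>, // column name -> index, built once
}

impl DataFrame {
    fn new(batch: Batch) -> Self {
        let by_name = batch
            .names
            .iter()
            .cloned()
            .enumerate()
            .map(|(i, n)| (n, i))
            .collect();
        DataFrame { batch, by_name }
    }

    // O(1) lookup by column name instead of a linear scan over the schema.
    fn column(&self, name: &str) -> Option<&[f64]> {
        self.by_name.get(name).map(|&i| self.batch.columns[i].as_slice())
    }

    fn num_rows(&self) -> usize {
        self.batch.columns.first().map_or(0, Vec::len)
    }
}
```

The point of the pattern is that call sites keep a DataFrame-like surface while the storage underneath becomes a plain arrow batch, which is what keeps the churn low across the ~50 affected files.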
Three new modules form the migration foundation:
The hard part: position adjustments
stack.rs was the only place using polars' lazy API with grouped window functions (cum_sum().over(), shift(), fill_null()). We considered pushing position adjustments into SQL, but scale-type inference happens after query execution, so we'd hit a chicken-and-egg problem. Instead, ~50 lines of polars lazy expressions became ~120 lines of arrow compute calls in stack.rs, using the primitives in compute.rs. dodge.rs and jitter.rs followed the same pattern.
The position-adjustment tests are the primary acceptance criteria here — they encode a lot of tricky numeric behavior (fill/center modes, grouped cumsums, null handling).
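To make the tricky part concrete, here is what a grouped cumulative sum looks like over plain vectors, with `None` standing in for an arrow null. This is a sketch of the numeric behavior only, not the PR's code; the PR implements it with arrow compute primitives in compute.rs, and treating nulls as zero here is an assumed fill policy:

```rust
use std::collections::HashMap;

// Grouped running sum: each value is added to its own group's accumulator,
// mirroring polars' cum_sum().over(group). Nulls (None) are treated as 0.0,
// an assumption standing in for a fill_null-style step.
fn grouped_cumsum(groups: &[&str], values: &[Option<f64>]) -> Vec<f64> {
    let mut running: HashMap<&str, f64> = HashMap::new();
    groups
        .iter()
        .zip(values)
        .map(|(g, v)| {
            let acc = running.entry(*g).or_insert(0.0);
            *acc += v.unwrap_or(0.0);
            *acc
        })
        .collect()
}
```

For example, with groups `["a", "a", "b", "a"]` and values `[1, 2, 5, null]`, the running sums are per-group: `[1, 3, 5, 3]`. This is the shape of computation the stack adjustment performs per panel/group when stacking bar or area heights.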
Migrations across the codebase
Gotchas worth reviewer attention
Test coverage
What to look at first as a reviewer