feat(py): DataSourceReader bridge for ggsql database pushdown #221

Draft
cpsievert wants to merge 1 commit into feat/ggsql-integration from feat/datasource-reader-bridge


Motivation

querychat's current ggsql integration splits execution into two phases: run the SQL on the real database, then replay the VISUALISE portion locally in an in-memory DuckDB. This has two problems:

  1. Scaling — the full SQL result must be pulled into Python memory, even when ggsql's stat transforms (histogram, density, boxplot) would reduce it to a small summary. A histogram of 10M rows pulls all 10M rows into memory only to bin them into ~30 buckets.

  2. Multi-source layers — ggsql supports per-layer data sources (e.g., a CTE fed to a different DRAW clause). The two-phase approach loses intermediate tables at the DataSource boundary, so querychat rejects these queries entirely.

Both problems stem from the same root cause: querychat splits the query at the SQL/VISUALISE boundary rather than letting ggsql run the full pipeline against the real database.
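The scaling problem can be illustrated with a small, self-contained sketch. It uses the stdlib `sqlite3` module and a hypothetical `measurements` table; the binning SQL is hand-written for illustration and is not the SQL that ggsql actually generates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (value REAL)")  # hypothetical table
conn.executemany(
    "INSERT INTO measurements VALUES (?)",
    [(i % 100 + 0.5,) for i in range(10_000)],
)

BIN_WIDTH = 10.0

# Two-phase style: every row crosses into Python before any binning happens.
all_rows = conn.execute("SELECT value FROM measurements").fetchall()

# Pushdown style: the database bins and counts; only one row per bin comes back.
binned = conn.execute(
    """
    SELECT CAST(value / ? AS INTEGER) * ? AS bin_start, COUNT(*) AS n
    FROM measurements
    GROUP BY bin_start
    ORDER BY bin_start
    """,
    (BIN_WIDTH, BIN_WIDTH),
).fetchall()

print(len(all_rows), "rows pulled vs", len(binned), "rows pulled")
```

The row count transferred in the pushdown case depends on the number of bins, not on the size of the table.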

Approach

For SQLAlchemySource data sources, this PR implements a DataSourceReader — a Python object that satisfies ggsql's reader protocol (execute_sql(), register(), unregister()) by routing SQL to the real database via SQLAlchemy. ggsql runs its entire pipeline (parsing, CTEs, stat transforms, layer queries) against the real DB.
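As a rough sketch of the shape of such a reader: the class below exposes the three method names from the protocol, but the signatures, return types, and the `SketchReader` name are assumptions, and stdlib `sqlite3` stands in for SQLAlchemy so the example is self-contained. It is not the implementation in `_datasource_reader.py`.

```python
import sqlite3


def _quote(ident: str) -> str:
    """Quote an identifier for SQLite, doubling embedded double quotes."""
    return '"' + ident.replace('"', '""') + '"'


class SketchReader:
    """Hypothetical stand-in for DataSourceReader; signatures are assumed."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn
        self._registered: set[str] = set()

    def execute_sql(self, sql: str) -> list[dict]:
        # Run the SQL on the real connection and return rows as dicts.
        cur = self._conn.execute(sql)
        if cur.description is None:  # statement produced no result set
            return []
        cols = [d[0] for d in cur.description]
        return [dict(zip(cols, row)) for row in cur.fetchall()]

    def register(self, name: str, rows: list[dict]) -> None:
        # Materialize intermediate data as a temp table in the database.
        if not rows:
            raise ValueError("need at least one row to infer columns")
        cols = list(rows[0])
        col_sql = ", ".join(_quote(c) for c in cols)
        self._conn.execute(f"CREATE TEMP TABLE {_quote(name)} ({col_sql})")
        placeholders = ", ".join("?" for _ in cols)
        self._conn.executemany(
            f"INSERT INTO {_quote(name)} VALUES ({placeholders})",
            [tuple(r[c] for c in cols) for r in rows],
        )
        self._registered.add(name)

    def unregister(self, name: str) -> None:
        # Drop only tables this reader created.
        if name in self._registered:
            self._conn.execute(f"DROP TABLE IF EXISTS {_quote(name)}")
            self._registered.discard(name)


conn = sqlite3.connect(":memory:")
reader = SketchReader(conn)
reader.register("layer_data", [{"x": 1, "y": 2}, {"x": 3, "y": 4}])
rows = reader.execute_sql("SELECT SUM(y) AS total FROM layer_data")
reader.unregister("layer_data")
```

Because `register()` creates temp tables, a connection without that permission would fail here, which is one of the cases the fallback path is meant to cover.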

sqlglot transpiles ggsql's ANSI-generated SQL to the target database dialect. The dialect mapping covers 22 SQLAlchemy backends, verified by installing the actual driver packages and checking engine.dialect.name.
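A minimal sketch of the lookup-then-transpile step follows. The mapping is a small illustrative subset, not the PR's 22-entry `SQLGLOT_DIALECTS`, and the helper names (`lookup_dialect`, `register_dialect`, `transpile_for`) are hypothetical. `sqlglot.transpile()` is the library's public entry point; it is imported lazily so the mapping logic can be read and run without sqlglot installed.

```python
from typing import Optional

# Illustrative subset: SQLAlchemy engine.dialect.name -> sqlglot dialect name.
SKETCH_DIALECTS: dict[str, str] = {
    "postgresql": "postgres",
    "mysql": "mysql",
    "sqlite": "sqlite",
    "mssql": "tsql",
    "oracle": "oracle",
}


def lookup_dialect(sqlalchemy_name: str) -> Optional[str]:
    """Return the sqlglot dialect for a SQLAlchemy dialect name, or None."""
    return SKETCH_DIALECTS.get(sqlalchemy_name.lower())


def register_dialect(sqlalchemy_name: str, sqlglot_name: str) -> None:
    """Add or override a mapping entry (hypothetical helper)."""
    SKETCH_DIALECTS[sqlalchemy_name.lower()] = sqlglot_name


def transpile_for(sql: str, sqlalchemy_name: str) -> str:
    """Transpile a single statement to the target database's dialect."""
    target = lookup_dialect(sqlalchemy_name)
    if target is None:
        raise LookupError(f"no sqlglot dialect known for {sqlalchemy_name!r}")
    import sqlglot  # lazy import: only needed when transpiling

    statements = sqlglot.transpile(sql, write=target)
    if len(statements) != 1:
        raise ValueError("expected exactly one SQL statement")
    return statements[0]
```

Returning `None` for an unknown name, rather than guessing a dialect, lets the caller decide whether to warn and fall back.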

querychat falls back to the current two-phase approach when the bridge fails (e.g., temp table permission denied, unsupported dialect, transpilation error) or when the data source is not SQLAlchemy-based.
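The try-then-fall-back control flow can be sketched as below. The function name `run_with_fallback` and the callables passed to it are hypothetical, and catching a broad `Exception` is an assumption; the PR does not state which exception types trigger the fallback.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger(__name__)


def run_with_fallback(
    query: str,
    bridge: Callable[[str], Any],
    two_phase: Callable[[str], Any],
) -> Any:
    """Try the bridge path first; on any failure, use the two-phase path."""
    try:
        return bridge(query)
    except Exception as exc:  # assumed: any bridge error triggers fallback
        logger.warning("Bridge path failed (%s); using two-phase execution", exc)
        return two_phase(query)


def failing_bridge(query: str) -> Any:
    # Simulates one of the failure modes listed above.
    raise PermissionError("temp table permission denied")


result = run_with_fallback(
    "SELECT 1",
    failing_bridge,
    lambda q: ("two_phase", q),
)
```

Logging the cause before falling back keeps bridge failures visible instead of silently degrading to the slower path.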

Why a separate PR

This is split off from feat/ggsql-integration because the DataSourceReader bridge is Python-specific — it depends on SQLAlchemy and sqlglot, neither of which has an R equivalent. The parent branch (feat/ggsql-integration) contains ggsql prompt/syntax updates and other changes that apply to both R and Python. Keeping this separate makes it clear that the R package doesn't need a corresponding change.

Changes

  • New: _datasource_reader.py with the DataSourceReader class, SQLGLOT_DIALECTS mapping (22 dialects), transpile_sql(), and register_sqlglot_dialect() for custom additions
  • Modified: _viz_ggsql.py so that execute_ggsql() tries the bridge path first and falls back to execute_two_phase() (the renamed current logic); logs a warning for unknown dialects
  • Modified: _viz_tools.py, _shiny_module.py — updated callers to pass original query string
  • New dep: sqlglot>=26.0 added to the viz extra (pure Python, zero transitive deps, 6.6 MB)
  • Tests: 19 new tests covering dialect mapping, transpilation, DataSourceReader lifecycle, and end-to-end ggsql integration against real SQLite

Test plan

  • All existing ggsql tests pass with updated 3-arg execute_ggsql() signature
  • New DataSourceReader unit tests pass (SQLite in-memory)
  • End-to-end: ggsql.execute(query, reader) works for scatter, filter, Form B, aggregation
  • Manual test with a real Snowflake/Postgres connection (TODO)

🤖 Generated with Claude Code

Implements a DataSourceReader that routes ggsql's full pipeline through
the real database via SQLAlchemy, using sqlglot for dialect transpilation.
For SQLAlchemySource with a known dialect, ggsql runs CTEs, stat
transforms, and layer queries directly on the real DB — avoiding the
need to pull large result sets into Python memory. Falls back to the
existing two-phase approach (now `execute_two_phase`) for other
DataSource types or on bridge failure.

Includes verified dialect mappings for 22 SQLAlchemy backends and
register_sqlglot_dialect() for custom additions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>