feat(py): DataSourceReader bridge for ggsql database pushdown #221
Draft
cpsievert wants to merge 1 commit into feat/ggsql-integration from
Implements a DataSourceReader that routes ggsql's full pipeline through the real database via SQLAlchemy, using sqlglot for dialect transpilation. For SQLAlchemySource with a known dialect, ggsql runs CTEs, stat transforms, and layer queries directly on the real DB, avoiding the need to pull large result sets into Python memory. Falls back to the existing two-phase approach (now `execute_two_phase`) for other DataSource types or on bridge failure.

Includes verified dialect mappings for 22 SQLAlchemy backends and `register_sqlglot_dialect()` for custom additions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Motivation
querychat's current ggsql integration splits execution into two phases: run the SQL on the real database, then replay the VISUALISE portion locally in an in-memory DuckDB. This has two problems:
Scaling — the full SQL result must be pulled into Python memory, even when ggsql's stat transforms (histogram, density, boxplot) would reduce it to a small summary. A histogram of 10M rows pulls all 10M rows into memory only to bin them into ~30 buckets.
Multi-source layers — ggsql supports per-layer data sources (e.g., a CTE fed to a different DRAW clause). The two-phase approach loses intermediate tables at the DataSource boundary, so querychat rejects these queries entirely.
Both problems stem from the same root cause: querychat splits the query at the SQL/VISUALISE boundary rather than letting ggsql run the full pipeline against the real database.
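The scaling problem can be illustrated with a small sketch. It uses the standard-library `sqlite3` module as a stand-in for the real database, and the table name `measurements`, the synthetic data, and the bin width are all made up for illustration; it is not ggsql's actual histogram SQL. The point is only the difference in how many rows cross into Python.

```python
import sqlite3

# Stand-in database with synthetic data (hypothetical table and values).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (value REAL)")
conn.executemany(
    "INSERT INTO measurements VALUES (?)",
    [(float(i % 100),) for i in range(10_000)],
)

# Two-phase style: every row is fetched into Python before any binning.
all_rows = conn.execute("SELECT value FROM measurements").fetchall()

# Pushdown style: the database bins the values and returns one row per bucket.
bin_width = 10.0
binned = conn.execute(
    """
    SELECT CAST(value / ? AS INTEGER) AS bin, COUNT(*) AS n
    FROM measurements
    GROUP BY bin
    ORDER BY bin
    """,
    (bin_width,),
).fetchall()

print(len(all_rows), "rows fetched without pushdown")
print(len(binned), "rows fetched with pushdown")
```

With pushdown, the result size depends on the number of bins rather than the number of source rows.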
Approach
For `SQLAlchemySource` data sources, this PR implements a `DataSourceReader`: a Python object that satisfies ggsql's reader protocol (`execute_sql()`, `register()`, `unregister()`) by routing SQL to the real database via SQLAlchemy. ggsql runs its entire pipeline (parsing, CTEs, stat transforms, layer queries) against the real DB.

sqlglot transpiles ggsql's ANSI-generated SQL to the target database dialect. The dialect mapping covers 22 SQLAlchemy backends, verified by installing the actual driver packages and checking `engine.dialect.name`.

The bridge falls back to the current two-phase approach when it fails (e.g., temp table permission denied, unsupported dialect, transpilation error) or for non-SQLAlchemy data sources.
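A rough sketch of the shape described above, under several assumptions: it uses the standard-library `sqlite3` module in place of SQLAlchemy so that it stays self-contained, it omits the sqlglot transpilation step, and the method signatures, return types, the class name `SketchReader`, the helper `run_with_fallback`, and the three-entry `DIALECTS` subset are all hypothetical. Only the method names `execute_sql()`, `register()`, and `unregister()` come from the PR description.

```python
import sqlite3

# Illustrative subset only (assumption); the PR's SQLGLOT_DIALECTS covers 22 backends.
# Keys are SQLAlchemy `engine.dialect.name` values, values are sqlglot dialect names.
DIALECTS = {"postgresql": "postgres", "mssql": "tsql", "sqlite": "sqlite"}


class SketchReader:
    """Hypothetical reader exposing the three protocol method names."""

    def __init__(self, conn):
        self.conn = conn

    def execute_sql(self, sql):
        # Run the query on the real connection; return rows as dicts (assumed shape).
        cur = self.conn.execute(sql)
        cols = [d[0] for d in cur.description or []]
        return [dict(zip(cols, row)) for row in cur.fetchall()]

    def register(self, name, rows):
        # Materialise rows as a temp table so later layer queries can read them.
        cols = list(rows[0])
        col_sql = ", ".join(f'"{c}"' for c in cols)
        marks = ", ".join("?" for _ in cols)
        self.conn.execute(f'CREATE TEMP TABLE "{name}" ({col_sql})')
        self.conn.executemany(
            f'INSERT INTO "{name}" VALUES ({marks})',
            [tuple(r[c] for c in cols) for r in rows],
        )

    def unregister(self, name):
        self.conn.execute(f'DROP TABLE IF EXISTS "{name}"')


def run_with_fallback(query, dialect_name, bridge, two_phase):
    """Try the bridge for known dialects; otherwise, or on failure, use two-phase."""
    if dialect_name not in DIALECTS:
        return two_phase(query)
    try:
        return bridge(query)
    except Exception:
        # e.g. temp table permission denied or a transpilation error
        return two_phase(query)
```

Here `bridge` and `two_phase` are plain callables so that the fallback logic can be exercised without a real ggsql installation.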
Why a separate PR
This is split off from `feat/ggsql-integration` because the DataSourceReader bridge is Python-specific: it depends on SQLAlchemy and sqlglot, neither of which has an R equivalent. The parent branch (`feat/ggsql-integration`) contains ggsql prompt/syntax updates and other changes that apply to both R and Python. Keeping this separate makes it clear that the R package doesn't need a corresponding change.

Changes
- `_datasource_reader.py`: `DataSourceReader` class, `SQLGLOT_DIALECTS` mapping (22 dialects), `transpile_sql()`, `register_sqlglot_dialect()` for custom additions
- `_viz_ggsql.py`: `execute_ggsql()` tries bridge path first, falls back to `execute_two_phase()` (renamed current logic); logs a warning for unknown dialects
- `_viz_tools.py`, `_shiny_module.py`: updated callers to pass original query string
- `sqlglot>=26.0` added to the `viz` extra (pure Python, zero transitive deps, 6.6 MB)

Test plan
- `execute_ggsql()` signature
- `ggsql.execute(query, reader)` works for scatter, filter, Form B, aggregation

🤖 Generated with Claude Code