Skip to content
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
129 changes: 129 additions & 0 deletions docs/plans/2026-04-17-datasource-reader-bridge-design.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,129 @@
# DataSourceReader Bridge Design

## Problem

querychat executes ggsql queries in two phases: run the SQL on the real database, then replay the VISUALISE portion locally against the result in an in-memory DuckDB. This has two drawbacks:

1. **Scaling** — the full SQL result must be pulled into Python memory, even when ggsql's stat transforms (histogram, density, boxplot) would reduce it to a small summary. A histogram of 10M rows pulls all 10M rows into memory only to bin them into ~30 buckets.

2. **Multi-source layers** — ggsql supports per-layer data sources (e.g., a CTE fed to a different DRAW clause). The two-phase approach loses intermediate tables at the DataSource boundary, so querychat rejects these queries.

Both problems stem from the same root cause: querychat splits the query at the SQL/VISUALISE boundary and runs each half independently, rather than letting ggsql run the full pipeline against the real database.

## Solution

For `SQLAlchemySource` data sources, implement a `DataSourceReader` — a Python object that satisfies ggsql's reader protocol (`execute_sql()`, `register()`, `unregister()`) by routing SQL to the real database. Pass this reader to `ggsql.execute(query, reader)`, letting ggsql run the entire pipeline (parsing, CTEs, stat transforms, everything) against the real DB.

Use [sqlglot](https://github.com/tobymao/sqlglot) to transpile ggsql's ANSI-generated SQL to the target database dialect. This gives broad database coverage (31 dialects) without waiting for ggsql to add each one.

Fall back to the current two-phase approach when the bridge fails (e.g., temp table permission denied, unsupported dialect, transpilation error) or for non-SQLAlchemy data sources.

## Data flow

### Bridge path (SQLAlchemySource)

```
ggsql.execute(query, DataSourceReader)
├─ CTE materialization
│ execute_sql("SELECT … FROM orders GROUP BY …")
│ → sqlglot transpiles generic → target dialect
│ → runs on real DB
│ → result registered as temp table on real DB
├─ Global SQL
│ execute_sql("SELECT * FROM orders WHERE …")
│ → runs on real DB
│ → result registered as temp table on real DB
├─ Schema queries
│ execute_sql("SELECT … LIMIT 0")
│ → runs on real DB against temp tables
├─ Stat transforms (histograms, density, boxplot, etc.)
│ execute_sql("WITH … SELECT … binning SQL …")
│ → sqlglot transpiles generated ANSI SQL → target dialect
│ → runs on real DB against temp tables
└─ Final layer queries
execute_sql("SELECT …")
→ runs on real DB, small result set returned
```

### Fallback path (current approach, all DataSource types)

```
validated.sql()
→ DataSource.execute_query() on real DB
→ full result pulled into Python memory
→ registered in local DuckDB
→ ggsql replays VISUALISE portion locally
```

## Components

### `DataSourceReader`

Python class implementing ggsql's reader protocol. Lives in `_viz_ggsql.py`.

- **Constructor**: takes a `sqlalchemy.Engine` and a sqlglot dialect string. Opens a single connection from the engine, held for the pipeline's duration.
- **`execute_sql(sql)`**: transpiles from generic SQL to target dialect via `sqlglot.transpile(sql, read="", write=dialect)`, executes on the real DB via SQLAlchemy, returns a polars DataFrame.
- **`register(name, df, replace)`**: creates a `TEMPORARY TABLE` on the real DB with column types derived from polars dtypes (generic SQL types, transpiled by sqlglot). Inserts rows in batches via SQLAlchemy. Tracks registered names for cleanup.
- **`unregister(name)`**: drops the temp table on the real DB.
- **Context manager**: `__exit__` drops all registered temp tables and closes the connection, ensuring cleanup even on error.

### Dialect mapping

```python
SQLGLOT_DIALECTS = {
"postgresql": "postgres",
"snowflake": "snowflake",
"duckdb": "duckdb",
"sqlite": "sqlite",
"mysql": "mysql",
"mssql": "tsql",
"bigquery": "bigquery",
"redshift": "redshift",
}
```

Maps `engine.dialect.name` to sqlglot dialect names. Unknown dialects skip the bridge and use the fallback.

### Entry point

```python
def execute_ggsql(data_source, query, validated):
if isinstance(data_source, SQLAlchemySource):
dialect = SQLGLOT_DIALECTS.get(data_source._engine.dialect.name)
if dialect is not None:
try:
with DataSourceReader(data_source._engine, dialect) as reader:
return ggsql.execute(query, reader)
except Exception:
pass # fall through

# Fallback: current two-phase approach
return _execute_two_phase(data_source, validated)
```

### `_execute_two_phase`

The current `execute_ggsql` body, renamed. Includes the existing regex-based `extract_visualise_table` and `has_layer_level_source` logic. Used for `DataFrameSource`, `PolarsLazySource`, `IbisSource`, and as the fallback for SQLAlchemy sources.

## Dependencies

- `sqlglot` added to the `viz` optional extra in `pyproject.toml`
- No changes to ggsql required for the initial implementation

## Scope boundaries

- **SQLAlchemySource only** — IbisSource could follow later
- **No ggsql changes required** — the `dialect` parameter contribution to `execute()` can come later as an optimization (skipping sqlglot when ggsql natively supports the dialect)
- **No prompt changes** — the LLM already writes SQL for the correct `db_type`

## Testing

- Unit tests for `DataSourceReader`: mock SQLAlchemy connection, verify transpile + execute, register/unregister lifecycle, cleanup on error
- Unit tests for sqlglot transpilation of ggsql's generated SQL patterns (recursive CTEs, NTILE percentiles, CREATE TEMPORARY TABLE) across key dialects
- Integration test for fallback: verify bridge failure triggers two-phase approach
- End-to-end with a test database connection if available
Loading
Loading