Harden ValidateSQLRequest as the single SQL-safety chokepoint (move SHOW/RBAC gating off raw-regex)

## Problem

The query API enforces its SQL-safety invariants — RBAC table extraction, SHOW-command gating, cross-database rejection, dangerous-keyword denylist, multi-statement rejection, `read_parquet`/`arc_partition_agg` blocking — with **regex and string-scanning over the raw SQL**, not a real parser. This is an inherently leaky abstraction: each invariant has to independently re-derive "what is a token / literal / comment / statement boundary," and every divergence from real SQL tokenization is a potential bypass.

PR #489 (a small SHOW-gating change) made this concrete. It took **7 Gemini rounds**, each surfacing a distinct tokenizer edge case in the same gate:

1. Comment hides the command — `/* x */ SHOW DATABASES` (regex on raw SQL).
2. Quoted db name not recognized — `SHOW TABLES FROM "db"` (char class too narrow).
3. Path traversal — `SHOW TABLES FROM ..` (char class too wide; no identifier validation).
4. Multi-statement smuggling — `SHOW DATABASES; SELECT 1` (anchored `$` assumes one statement).
5. Whitespace before paren in TVF detection — `generate_series  (...)`.
6. Endpoint parity — Arrow endpoint had no SHOW gate at all.
7. Comment marker inside a quoted identifier — `SHOW TABLES FROM "my--db"` (stripped comments before masking literals → truncation → wrong db authorized).

Each was individually fixed and is covered by a regression test, and the **defensive posture held throughout** (Arc databases are virtual / never `ATTACH`ed, so a missed SHOW returns DuckDB's empty catalog, not tenant data; the denylist blocks destructive statements). So these were defense-in-depth hardening, not live tenant-data leaks. But the *rate* of new edge cases in one small PR is the signal: the approach doesn't converge.

## Proposed direction

Treat **`ValidateSQLRequest` as the single SQL-safety chokepoint** and make the SQL handling robust rather than regex-by-feature:

1. **Normalize once, centrally.** All endpoints already call `ValidateSQLRequest`. Do the literal-masking + comment-stripping there once, and have it expose the normalized form (and/or a small set of derived facts: statement count, leading keyword, whether it's a SHOW/metadata command) so downstream gates don't each re-scan raw SQL. The `normalizeSQLForShow` helper added in #489 is a step toward this — generalize it.
2. **Move SHOW / metadata detection off raw-regex** onto the normalized form (or a lightweight statement classifier). The anchored-regex approach keeps sprouting edge cases (rounds 1, 2, 4, 7 were all "the regex saw the wrong string").
3. **Evaluate a real parser/lexer for the safety layer.** DuckDB's own parser, a Go SQL tokenizer, or `EXPLAIN`-based statement classification would replace the hand-rolled token scanning that RBAC depends on. Scope/perf to be assessed — the hot path matters (current extraction is 6–30 µs).
4. **Property/fuzz test the chokepoint.** A fuzz target over `ValidateSQLRequest` + the SHOW gate (comments, quotes, mixed quoting, statement separators, unicode, whitespace) would surface these classes before review does.

## Relationship to other work

- Complements the [SQL parser modernization initiative] and #490 (make `duckdb_arrow` the default so default `go test` exercises the shipped path — round 6's Arrow-parity gap was invisible to untagged tests).
- This is the durable fix for the "whack-a-mole with Gemini on tokenizer nits" pattern. Not urgent (no known live bypass), but it's the root cause behind a recurring review cost.

## Acceptance criteria (direction-setting; refine when scheduled)

- [ ] SQL-safety invariants (RBAC extraction, SHOW gating, statement-count, denylist) operate on a single, centrally-produced normalized representation — no gate re-scans raw `req.SQL`.
- [ ] SHOW/metadata-command detection no longer depends on anchored regex over raw SQL.
- [ ] A fuzz/property test covers the normalization + gating against comment/quote/separator/whitespace permutations.
- [ ] No regression in the existing `query_rbac_test.go` / `TestValidateSQLRequest_*` / `TestShowTablesPattern` / `TestNormalizeSQLForShow` suites.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Harden ValidateSQLRequest as the single SQL-safety chokepoint (move SHOW/RBAC gating off raw-regex) #491

Problem

Proposed direction

Relationship to other work

Acceptance criteria (direction-setting; refine when scheduled)

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Harden ValidateSQLRequest as the single SQL-safety chokepoint (move SHOW/RBAC gating off raw-regex) #491

Description

Problem

Proposed direction

Relationship to other work

Acceptance criteria (direction-setting; refine when scheduled)

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions