Skip to content

Harden ValidateSQLRequest as the single SQL-safety chokepoint (move SHOW/RBAC gating off raw-regex) #491

@xe-nvdk

Description

@xe-nvdk

Problem

The query API enforces its SQL-safety invariants — RBAC table extraction, SHOW-command gating, cross-database rejection, dangerous-keyword denylist, multi-statement rejection, read_parquet/arc_partition_agg blocking — with regex and string-scanning over the raw SQL, not a real parser. This is an inherently leaky abstraction: each invariant has to independently re-derive "what is a token / literal / comment / statement boundary," and every divergence from real SQL tokenization is a potential bypass.

PR #489 (a small SHOW-gating change) made this concrete. It took 7 Gemini rounds, each surfacing a distinct tokenizer edge case in the same gate:

  1. Comment hides the command — /* x */ SHOW DATABASES (regex on raw SQL).
  2. Quoted db name not recognized — SHOW TABLES FROM "db" (char class too narrow).
  3. Path traversal — SHOW TABLES FROM .. (char class too wide; no identifier validation).
  4. Multi-statement smuggling — SHOW DATABASES; SELECT 1 (anchored $ assumes one statement).
  5. Whitespace before paren in TVF detection — generate_series (...).
  6. Endpoint parity — Arrow endpoint had no SHOW gate at all.
  7. Comment marker inside a quoted identifier — SHOW TABLES FROM "my--db" (stripped comments before masking literals → truncation → wrong db authorized).

Each was individually fixed and is covered by a regression test, and the defensive posture held throughout (Arc databases are virtual / never ATTACHed, so a missed SHOW returns DuckDB's empty catalog, not tenant data; the denylist blocks destructive statements). So these were defense-in-depth hardening, not live tenant-data leaks. But the rate of new edge cases in one small PR is the signal: the approach doesn't converge.

Proposed direction

Treat ValidateSQLRequest as the single SQL-safety chokepoint and make the SQL handling robust rather than regex-by-feature:

  1. Normalize once, centrally. All endpoints already call ValidateSQLRequest. Do the literal-masking + comment-stripping there once, and have it expose the normalized form (and/or a small set of derived facts: statement count, leading keyword, whether it's a SHOW/metadata command) so downstream gates don't each re-scan raw SQL. The normalizeSQLForShow helper added in fix(api): add SHOW command RBAC-gating to estimateQuery endpoint #489 is a step toward this — generalize it.
  2. Move SHOW / metadata detection off raw-regex onto the normalized form (or a lightweight statement classifier). The anchored-regex approach keeps sprouting edge cases (rounds 1, 2, 4, 7 were all "the regex saw the wrong string").
  3. Evaluate a real parser/lexer for the safety layer. DuckDB's own parser, a Go SQL tokenizer, or EXPLAIN-based statement classification would replace the hand-rolled token scanning that RBAC depends on. Scope/perf to be assessed — the hot path matters (current extraction is 6–30 µs).
  4. Property/fuzz test the chokepoint. A fuzz target over ValidateSQLRequest + the SHOW gate (comments, quotes, mixed quoting, statement separators, unicode, whitespace) would surface these classes before review does.

Relationship to other work

Acceptance criteria (direction-setting; refine when scheduled)

  • SQL-safety invariants (RBAC extraction, SHOW gating, statement-count, denylist) operate on a single, centrally-produced normalized representation — no gate re-scans raw req.SQL.
  • SHOW/metadata-command detection no longer depends on anchored regex over raw SQL.
  • A fuzz/property test covers the normalization + gating against comment/quote/separator/whitespace permutations.
  • No regression in the existing query_rbac_test.go / TestValidateSQLRequest_* / TestShowTablesPattern / TestNormalizeSQLForShow suites.

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or requestsecuritySecurity vulnerability or hardening

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions