Skip to content

SQL Captures: Improve Discovery Scalability and Filter Options #4476

@willdonnelly

Description

@willdonnelly

Some people have absolutely massive SQL databases, such that our discovery logic tends to time out or OOM when trying to process everything.

Our SQL database discovery operations are typically structured as a set of "list all tables; list all columns of all tables; list all keys of all tables" queries followed by an aggregation step which combines all of those results. This is O(1) in the number of tables and thus theoretically attractive, but we're seeing the downsides of that approach here.

In the past we have implemented stopgap solutions to this by pushing down specific filter into the database discovery queries, typically in the form of a whitelist of schema names. However this isn't scalable to more fancy filtering because it requires that any filter options be something which we can push off onto the database and that we be able to combine them all into every discovery query.

Listing Tables and Chunking

Part of the solution here is that we need to stop trying to fetch all info for all tables in parallel and instead we must be more deliberate about what we need and when.

Concretely, we should list all tables in the database (optionally pushing down some or all filter options into that query) and then we should fetch anything else (columns, keys, indices, etc) for just the filtered list of tables which we intend to discover, possibly using chunking if required (e.g. "list columns for the next 100 tables in this list").

The upshot of this is that simply handling a list of table names isn't all that expensive even at typical scales of ~100k tables (we're talking like, 5-10MiB worst case), our filter options can basically all be applied to that list, and then the other queries don't have to care about filtering beyond table identity any more.

Additional Filter Options

Once the fetching of full table details is decoupled from the selection of what tables we care about, we can use that to implement more powerful filtering. Currently "Discover Schemas" lives in "Advanced Options" but it already feels a bit out of place and adding two more discovery-filter settings in there will just make that worse, so the plan is:

  • Add a new "Discovery Filters" section to the endpoint config with options:
    • "Include Schemas" and "Exclude Schemas": Whitelist/blacklist literal strings.
    • "Table Patterns": A list of table-name regexes. Applied with implicit ^$ anchors to the bare table names so that foo_bar will by default match any table named <anyschema>.foo_bar
    • Duplicates of connector-specific options like "Discover Unpublished Tables", where present
  • Leave the existing advanced options where they are for the moment, but union the old Discover Schemas setting with the new Discovery Filters > Include Schemas for compatibility
  • Plan on hiding the old option(s) and doing a bulk-edit to move any custom settings to the new location(s) in the near future.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions