
Druid Modernization Ideas #19039

@jtuglu1


Motivation

I'm keeping a running list of Druid modernization ideas: things that other engines in the space have already adopted or are adopting, and that Druid could potentially implement as well. Please feel free to comment on, add to, or edit the list of ideas below.

1. Query Latency/Throughput

Without hardware-native implementations of physical operators, Druid lags behind engines that implement their query-processing code paths with SIMD/pipelined instructions and other native-code speed-ups. Another factor here is avoiding garbage collection in high-allocation/spilling scenarios. Given JDK 22's support for FFI (the Foreign Function & Memory API), it may make sense to implement these accelerations in something like Rust and plug them into the existing Druid query-processing path. The real-time streaming path could also benefit, since GC spikes can sink p99 ingest throughput and increase query latencies.

This kind of split between parsing/planning and processing/execution is already being adopted by initiatives like https://github.com/facebookincubator/velox and https://github.com/StarRocks/starrocks.
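To make the FFI part concrete, here is a minimal sketch of calling native code from Java via the Foreign Function & Memory API (finalized in JDK 22). It binds libc's `strlen` purely as a stand-in; a real integration would instead bind a hypothetical Rust/C SIMD kernel exposed with a C ABI:

```java
import java.lang.foreign.*;
import java.lang.invoke.MethodHandle;
import java.nio.charset.StandardCharsets;

// Sketch only: libc strlen stands in for a hypothetical native operator kernel.
public class FfiSketch {
    public static long nativeStrlen(String s) throws Throwable {
        Linker linker = Linker.nativeLinker();
        MethodHandle strlen = linker.downcallHandle(
                linker.defaultLookup().find("strlen").orElseThrow(),
                FunctionDescriptor.of(ValueLayout.JAVA_LONG, ValueLayout.ADDRESS));
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        try (Arena arena = Arena.ofConfined()) {
            // Off-heap, NUL-terminated copy; Arena allocations are zero-initialized,
            // so the trailing byte is already the terminator.
            MemorySegment cStr = arena.allocate(bytes.length + 1);
            MemorySegment.copy(bytes, 0, cStr, ValueLayout.JAVA_BYTE, 0, bytes.length);
            return (long) strlen.invoke(cStr);
        }
    }

    public static void main(String[] args) throws Throwable {
        System.out.println(nativeStrlen("druid"));
    }
}
```

The appeal over JNI is that downcalls avoid hand-written glue code, and `Arena`-scoped off-heap memory keeps the hot data out of the reach of the GC entirely.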

2. Data ETL in/out of Druid

The Druid segment format, while hyper-optimized for workloads within Druid, is of little value to external ETL services (Spark, etc.) and to the data-manipulation libraries practitioners are familiar with (pandas, Arrow, etc.). I think it would be a good idea to add Apache Arrow reader/writer support for Druid segments, which would let any third-party system that speaks Arrow integrate with Druid. It would also open a path to switching the internal data-transfer path (peon/historical -> broker -> router -> client) to Arrow instead of JSON, which could speed up queries significantly.
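A rough sketch of the writer half in Arrow's Java API is below: build column vectors, wrap them in a `VectorSchemaRoot`, and stream record batches out in Arrow IPC file format. This assumes `arrow-vector` and `arrow-memory-netty` on the classpath; the column names and the idea of one batch per segment read are illustrative only, not an existing Druid interface:

```java
import java.io.FileOutputStream;

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.BigIntVector;
import org.apache.arrow.vector.VarCharVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowFileWriter;

// Sketch: export hypothetical (dimension, metric) columns as an Arrow IPC file.
public class ArrowSegmentWriterSketch {
    public static void main(String[] args) throws Exception {
        try (BufferAllocator allocator = new RootAllocator()) {
            VarCharVector dim = new VarCharVector("page", allocator);
            BigIntVector metric = new BigIntVector("views", allocator);
            dim.allocateNew(2);
            metric.allocateNew(2);
            dim.setSafe(0, "home".getBytes());
            dim.setSafe(1, "docs".getBytes());
            metric.set(0, 120L);
            metric.set(1, 45L);
            dim.setValueCount(2);
            metric.setValueCount(2);

            try (VectorSchemaRoot root = VectorSchemaRoot.of(dim, metric);
                 FileOutputStream out = new FileOutputStream("segment-export.arrow");
                 ArrowFileWriter writer = new ArrowFileWriter(root, null, out.getChannel())) {
                writer.start();
                writer.writeBatch(); // one record batch per (hypothetical) segment read
                writer.end();
            }
        }
    }
}
```

Anything that reads Arrow IPC (pandas via pyarrow, Spark, DuckDB, etc.) could then consume the export directly, with no Druid-specific decoder.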

3. CBO for MSQE

Currently, querying is split between the native engine and MSQE, and as MSQE matures, I believe the plan is to deprecate the native engine. To be competitive with engines like StarRocks that have CBO/statistics-based query planning, I think it would be a good idea to add this to MSQE. This would involve tracking query- and datasource-level statistics and exposing them to the planner.
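As a toy illustration of what "exposing statistics to the planner" buys, here is about the smallest statistics-driven decision a CBO makes: given per-datasource row counts, build the hash-join table on the smaller input. The class, method, and datasource names are all hypothetical:

```java
import java.util.Map;

// Hypothetical sketch: pick hash-join build/probe sides from row-count statistics.
public class JoinSideChooser {
    /** Returns {buildSide, probeSide}; inputs without stats are pessimistically assumed huge. */
    static String[] choose(String left, String right, Map<String, Long> rowCounts) {
        long l = rowCounts.getOrDefault(left, Long.MAX_VALUE);
        long r = rowCounts.getOrDefault(right, Long.MAX_VALUE);
        // Build on the smaller side so the hash table fits in memory; probe with the larger.
        return l <= r ? new String[] {left, right} : new String[] {right, left};
    }

    public static void main(String[] args) {
        String[] plan = choose("events", "dim_country",
                Map.of("events", 1_000_000_000L, "dim_country", 250L));
        System.out.println("build=" + plan[0] + " probe=" + plan[1]);
    }
}
```

A real CBO would extend the same idea to join ordering, filter selectivity, and shuffle-vs-broadcast decisions, but all of it depends on having the statistics collected and visible at plan time.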

4. Separation of storage/compute

This work has already begun with @clintropolis's work on demand-based segment fetching. Not sure if there's anything else to add here.

5. SQL As First-Class Citizen

Engines like ClickHouse expose most or all of their data and cluster configuration through SQL DDL. This unifies where users go to make changes and offers a configuration interface that is already familiar to them. I think creating SQL DDLs for configs adjacent to "data" is a good place to start: retention rules, supervisor specs, etc.
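For a sense of what this could look like, a hypothetical DDL for retention rules might read as follows. None of this syntax exists in Druid today; the statement shape and keywords are invented, though they mirror Druid's existing load/drop rule and tier concepts:

```sql
-- Hypothetical syntax: retention rules as DDL instead of coordinator JSON.
ALTER DATASOURCE wikipedia SET RETENTION (
  LOAD PERIOD 'P1M' REPLICAS 2 ON TIER 'hot',
  LOAD PERIOD 'P1Y' REPLICAS 1 ON TIER '_default_tier',
  DROP FOREVER
);
```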

6. Internal Metrics Ingestion

Druid should have an option to ingest its own emitted metrics and provide a queryable interface to them through the sys tables. This would make Druid more self-contained. I will note there is merit to decoupling metrics from the system being observed, but I can see value here for things like smarter cleanup/janitoring, CBO statistics for queried columns, auto-scaling, etc.
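To illustrate the idea, a query against such an interface might look like this; the `sys.metrics` table and its columns are hypothetical (it does not exist today), though `query/time` is a metric Druid already emits:

```sql
-- Hypothetical: sys.metrics does not exist in Druid today.
SELECT service, AVG("value") AS avg_query_time_ms
FROM sys.metrics
WHERE metric = 'query/time'
  AND __time > CURRENT_TIMESTAMP - INTERVAL '1' HOUR
GROUP BY service;
```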
