Motivation
I'm keeping a running list of Druid modernization ideas that other engines in the space have adopted or are adopting, and that Druid could potentially implement as well. Please feel free to comment on, add to, or edit the list of ideas below.
1. Query Latency/Throughput
Without hardware-native implementations of physical operators, Druid lags behind engines whose query-processing codepaths use SIMD, pipelining, and other native-code speed-ups. Another factor in this conversation is avoiding garbage collection in high-allocation/spilling scenarios. Given that JDK 22 finalizes the Foreign Function & Memory (FFM) API, I wonder if it makes sense to implement these accelerations in something like Rust and plug them into the existing Druid query-processing path. The realtime streaming path could also benefit from these changes, since GC spikes can sink your p99 ingest throughput and increase query latencies.
This kind of split between parsing/planning and processing/execution is already being adopted by initiatives like https://github.com/facebookincubator/velox and https://github.com/StarRocks/starrocks.
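To make the FFI idea concrete, here is a minimal sketch of a JDK 22 FFM downcall. It binds `strlen` from the C standard library purely as a stand-in for a hypothetical native (e.g. Rust) vectorized kernel; the class name and the idea of binding a columnar kernel this way are illustrative assumptions, not anything Druid does today.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.Linker;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.MethodHandle;

public class NativeKernelSketch {
    public static void main(String[] args) throws Throwable {
        // Bind to a native symbol. A real integration would instead bind a
        // hypothetical Rust kernel (e.g. a SIMD filter/aggregate over a column).
        Linker linker = Linker.nativeLinker();
        MethodHandle strlen = linker.downcallHandle(
                linker.defaultLookup().find("strlen").orElseThrow(),
                FunctionDescriptor.of(ValueLayout.JAVA_LONG, ValueLayout.ADDRESS));

        // Off-heap memory is owned by an Arena, not the GC, so large scratch
        // buffers never contribute to GC pressure on the query path.
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment str = arena.allocateFrom("druid-segment");
            long len = (long) strlen.invokeExact(str);
            System.out.println(len); // 13
        }
    }
}
```

The same downcall mechanism works for any exported native symbol, so the Java side of the query engine would only need to hand off pointers to off-heap column buffers.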
2. Data ETL in/out of Druid
The Druid segment format, while hyper-optimized for workloads within Druid, is of little value to external ETL systems (Spark, etc.) and the data-manipulation libraries practitioners are familiar with (pandas, Arrow, etc.). I think it would be a good idea to add Apache Arrow reader/writer support for Druid segments, which would let any third-party system that speaks Arrow integrate with Druid. It would also open a path to switching the internal data-transfer path (peon/historical -> broker -> router -> client) to Arrow instead of JSON, which could speed up queries significantly.
3. CBO for MSQ
Currently, querying is split between the native engine and the multi-stage query engine (MSQ). As MSQ matures, I believe the plan is to deprecate the native engine. To be competitive with engines like StarRocks that have cost-based, statistics-driven query planning, I think it would be a good idea to add a CBO to MSQ. This would involve tracking query- and datasource-level statistics and exposing them to the planner.
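As a toy illustration of the kind of rule a CBO enables, the sketch below picks a join strategy from estimated table statistics. All names (`TableStats`, `chooseJoin`, the broadcast budget) are invented for illustration and do not correspond to any existing Druid or MSQ API.

```java
public class CboSketch {
    // Hypothetical per-datasource statistics the planner would track.
    record TableStats(long rowCount, long avgRowBytes) {
        long sizeBytes() { return rowCount * avgRowBytes; }
    }

    enum JoinStrategy { BROADCAST, SHUFFLE }

    // Cost rule: broadcast the build side only if it fits under a budget;
    // otherwise fall back to a shuffle (hash-partitioned) join.
    static JoinStrategy chooseJoin(TableStats build, long broadcastBudgetBytes) {
        return build.sizeBytes() <= broadcastBudgetBytes
                ? JoinStrategy.BROADCAST
                : JoinStrategy.SHUFFLE;
    }

    public static void main(String[] args) {
        TableStats dim = new TableStats(10_000, 64);           // small dimension table
        TableStats fact = new TableStats(2_000_000_000L, 128); // large fact table
        long budget = 100L * 1024 * 1024;                      // 100 MiB broadcast budget
        System.out.println(chooseJoin(dim, budget));  // BROADCAST
        System.out.println(chooseJoin(fact, budget)); // SHUFFLE
    }
}
```

Without statistics, a planner must guess at this decision; a wrong guess (broadcasting a fact table) is exactly the kind of mistake a CBO is meant to avoid.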
4. Separation of storage/compute
This work has already begun with @clintropolis's work on on-demand segment fetching. I'm not sure there's anything else to add here.
5. SQL As First-Class Citizen
Engines like ClickHouse expose most or all of their data and cluster configuration through SQL DDL. This unifies the place users go to make changes and presents a configuration interface they already know. I think creating SQL DDLs for configs adjacent to "data" is a good place to start: retention rules, supervisor specs, etc.
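To make the idea concrete, a retention-rule DDL might look something like the following. This syntax is entirely invented for illustration; Druid does not support it today, and retention rules are currently configured via the Coordinator API.

```sql
-- Hypothetical syntax: set load/drop retention rules on a datasource,
-- replacing the equivalent Coordinator rule JSON.
ALTER TABLE wikipedia SET RETENTION (
  LOAD PERIOD 'P3M' REPLICAS 2,
  DROP FOREVER
);
```

The appeal is that the same SQL surface users already query through would also carry cluster configuration, versionable and auditable like any other DDL.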
6. Internal Metrics Ingestion
Druid should have an option to ingest its own emitted metrics and expose them through a queryable interface such as the sys tables. This would make Druid more self-contained. There is merit to decoupling metrics from the system being observed, but I can see value here for things like smarter janitorial/cleanup tasks, CBO statistics on queried columns, auto-scaling, etc.