Description
Add an optional mechanism to automatically sample (drop a configurable fraction of) incoming rows during streaming ingestion when the ingestion rate exceeds a configurable threshold. When the rate is below the threshold, all rows would be ingested as normal; when it exceeds the threshold, the system would apply sampling to reduce load.
This would allow users to maintain ingestion stability during traffic spikes or bursts, trading some data completeness for system stability when scaling out (via the existing task autoscaler) is not sufficient or desirable.
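The core decision could be sketched as follows. This is a minimal illustration, not a proposed API: the names `maxRowsPerSecond` (the rate threshold) and `sampleRatio` (the fraction of rows kept while over the threshold) are hypothetical and are not existing Druid tuning-config properties.

```java
import java.util.Random;

// Illustrative sketch only: maxRowsPerSecond and sampleRatio are hypothetical
// parameter names, not existing Druid configuration properties.
public class ThresholdSampler
{
  private final double maxRowsPerSecond; // rate above which sampling kicks in
  private final double sampleRatio;      // fraction of rows kept while over the threshold
  private final Random random;

  public ThresholdSampler(double maxRowsPerSecond, double sampleRatio, long seed)
  {
    this.maxRowsPerSecond = maxRowsPerSecond;
    this.sampleRatio = sampleRatio;
    this.random = new Random(seed);
  }

  // Returns true if the row should be ingested, given the currently observed rate.
  public boolean accept(double observedRowsPerSecond)
  {
    if (observedRowsPerSecond <= maxRowsPerSecond) {
      return true; // under threshold: ingest every row as normal
    }
    return random.nextDouble() < sampleRatio; // over threshold: keep only a fraction
  }
}
```

With `sampleRatio` set to 1.0 the sampler is a no-op even during overload, so the feature could default to "off" without a separate enable flag.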
Motivation
Use case: Streaming ingestion from Kafka or Kinesis can experience sudden spikes in event volume. When the incoming rate exceeds what the cluster can sustainably process, lag accumulates, tasks may fail, backpressure builds, and the system can become unstable. Today, Druid addresses this primarily by scaling out (adding more ingestion tasks via the autoscaler), but there are scenarios where:
- Additional task capacity is not immediately available (e.g., worker slots constrained)
- Users prefer to sample during bursts rather than risk ingestion failures or growing lag
- The use case tolerates approximate data (e.g., metrics, sampling-friendly analytics)
Rationale: Providing a built-in, configurable way to sample when the rate exceeds a threshold would give operators a predictable, observable fallback when the stream overwhelms available capacity. It would complement (not replace) the existing autoscaling approach: users could configure both and have sampling kick in only when scaling alone is insufficient.
Benefit: Users would have a declarative way to prioritize ingestion stability over data completeness during overload, with sampled events tracked in existing metrics (e.g., ingest/events/thrownAway with a distinct reason) for visibility.
Implementation considerations
Any implementation would likely build on existing building blocks:
- The row-level filtering applied during streaming ingestion (e.g., InputRowFilter, FilteringCloseableInputRowIterator)
- The throughput and thrown-away metrics already exposed per task (RowIngestionMeters, DropwizardRowIngestionMeters with its moving averages)
- The streaming tuning config, as a home for any new parameters (SeekableStreamIndexTaskTuningConfig)

The cost-based and lag-based autoscalers already consume task-level stats for scaling decisions; similar metrics could inform when sampling is warranted.
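As a rough illustration of how these pieces might compose, the sketch below tracks a simple windowed rate (a stand-in for the moving-average rates a real implementation would read from RowIngestionMeters) and exposes the decision as a `java.util.function.Predicate`, the shape that the existing row-filtering path consumes. All class, field, and parameter names here are hypothetical.

```java
import java.util.Random;
import java.util.function.LongSupplier;
import java.util.function.Predicate;

// Hypothetical sketch: a rate-aware predicate that could be composed with the
// existing row-filtering path. The windowed rate below stands in for the
// moving averages a real implementation would read from the ingestion meters.
public class RateAwareSamplingPredicate<T> implements Predicate<T>
{
  private static final long WINDOW_NANOS = 1_000_000_000L; // recompute rate once per second

  private final double maxRowsPerSecond; // hypothetical threshold parameter
  private final double sampleRatio;      // hypothetical keep-fraction while over threshold
  private final LongSupplier nanoClock;  // injectable clock, so the logic is testable
  private final Random random;

  private long windowStartNanos;
  private long windowCount;
  private double observedRate; // rows/sec measured over the last full window

  public RateAwareSamplingPredicate(
      double maxRowsPerSecond,
      double sampleRatio,
      LongSupplier nanoClock,
      long seed
  )
  {
    this.maxRowsPerSecond = maxRowsPerSecond;
    this.sampleRatio = sampleRatio;
    this.nanoClock = nanoClock;
    this.random = new Random(seed);
    this.windowStartNanos = nanoClock.getAsLong();
  }

  @Override
  public boolean test(T row)
  {
    long now = nanoClock.getAsLong();
    windowCount++;
    long elapsed = now - windowStartNanos;
    if (elapsed >= WINDOW_NANOS) {
      observedRate = windowCount * 1e9 / elapsed; // rows per second over the window
      windowStartNanos = now;
      windowCount = 0;
    }
    if (observedRate <= maxRowsPerSecond) {
      return true; // under threshold: keep everything
    }
    return random.nextDouble() < sampleRatio; // over threshold: probabilistic drop
  }
}
```

Rows rejected by such a predicate could then be counted as thrown away with a distinct reason, so the existing metrics pipeline surfaces how much sampling occurred.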