Large number of rows in single segment performance issue


Our service has a time-dependent workload.
For example, the traffic between 1:00 and 2:00 is 10 times the traffic between 3:00 and 4:00. (10x traffic)

It creates a large number of rows(more than 10M rows) in a single time chunk.
When I query this single segment, I found it spends more than 5 seconds.

I’m already using range partitioning.
I tried to use maxRowsPerSegment option to partition multiple segments, but it did not work.
As I know, another option is reducing segment granularity from PT1H to PT30M.
But I think this will create many segments even though there are low traffic time range.

In this case, what is the best choice to optimize query performance?

Is rollup possible in your use case? That might reduce your number of rows.

Sounds like you could use streaming autoscaler.
One thing to notice is that you will need enough worker capacity to address the max number of tasks in the autoscaler. So if you are running on k8s you might also want to setup a Horizontal Pod Autoscaler for MMs.
More info on autoscaler.

It’s not realtime indexing, it’s batch indexing.
I’m already using perfect rollup.
Just there are so many high dimension cardinality events at same time bucket.

What about sketches? Their two main purposes are to improve rollup and reduce memory footprint at query time.

Sadly, I can’t use sketches. I need to support exact dimension and aggregation result.

This is a classic case for compaction. You can have the compaction task ‘compact’ the larger segments into more efficiently sized ones after the fact. This can run in the background.

Documentation for easy reference: Compaction · Apache Druid

1 Like