Multi-column grouping performance

I’m doing a PoC with Druid. I set up a cluster with the same specs as here: https://druid.apache.org/docs/latest/tutorials/cluster.html

I ingested a 2-million-row CSV file with 150 columns (75 string, 75 numeric). It’s 3.5 GB on disk. The file represents one month of data, so I set the segment granularity to MONTH, and 1 segment was created.

I’m running a query to test performance on grouping 8 columns:

SELECT STR_1, STR_2, STR_3, STR_4, STR_5, STR_6, STR_7, STR_8, SUM(NUM_1) AS Cost
FROM bigData
GROUP BY STR_1, STR_2, STR_3, STR_4, STR_5, STR_6, STR_7, STR_8
ORDER BY Cost DESC

It takes 30 seconds to complete. I’m hoping I can improve on that somehow, but I’m not sure how to tune it.

How can I improve performance of this query?

I thought perhaps sharding on other columns to split the data into multiple segments? Would/might this help?
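For context, the kind of sharding I have in mind would be secondary partitioning in the ingestion tuningConfig. A sketch, assuming native batch (index_parallel) ingestion; the partition dimension and row target are placeholders, not values I’ve tested:

```json
"tuningConfig": {
  "type": "index_parallel",
  "partitionsSpec": {
    "type": "hashed",
    "partitionDimensions": ["STR_1"],
    "targetRowsPerSegment": 500000
  }
}
```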

Any thoughts?

Druid uses one processing thread per segment.

Since you have 1 huge segment at MONTH granularity, you should try DAY granularity instead. That will give you 30/31 segments.

When you query, the segments will be processed in parallel.
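A sketch of the granularitySpec change in the ingestion spec, assuming native batch ingestion; the queryGranularity and rollup values here are assumptions, keep whatever you used originally:

```json
"granularitySpec": {
  "type": "uniform",
  "segmentGranularity": "DAY",
  "queryGranularity": "NONE",
  "rollup": false
}
```

Re-ingest the same CSV with this spec and Druid will cut one segment per day of data instead of one for the whole month.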

Regards,
Chari.