I’m doing a POC using Druid. I set up a cluster with the same specs as here: https://druid.apache.org/docs/latest/tutorials/cluster.html
I ingested a 2-million-row CSV file with 150 columns (75 string, 75 numeric). It’s 3.5 GB on disk. This represents one month of data, so I set the segment granularity to MONTH; one segment was created.
I’m running a query to test performance when grouping on 8 columns (datasource name is a placeholder):
SELECT STR_1, STR_2, STR_3, STR_4, STR_5, STR_6, STR_7, STR_8, SUM(NUM_1) AS Cost
FROM my_datasource
GROUP BY STR_1, STR_2, STR_3, STR_4, STR_5, STR_6, STR_7, STR_8
ORDER BY Cost DESC
It takes about 30 seconds to complete. I’m hoping I can improve on that, but I’m not sure how to tune it.
How can I improve performance of this query?
I thought perhaps sharding on other columns, to split the data into multiple segments, might help. Would it?
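In case it helps frame an answer, this is a sketch of the kind of `partitionsSpec` I was imagining adding to the `tuningConfig` when re-ingesting with the native batch (`index_parallel`) task. The choice of `STR_1` as the partition dimension and the row target are just guesses on my part, not something I’ve tested:

```json
{
  "tuningConfig": {
    "type": "index_parallel",
    "partitionsSpec": {
      "type": "single_dim",
      "partitionDimension": "STR_1",
      "targetRowsPerSegment": 500000
    }
  }
}
```

My (possibly wrong) understanding is that range/single-dim partitioning like this would split the month into several segments so the historical can process them in parallel, and could also help pruning if queries filter on the partition dimension. Corrections welcome.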