I'm having an issue with groupBy performance, and I'm not sure whether it's inherent to Druid or because I didn't configure it properly. As the number of output rows increases, groupBy query latency increases a lot. In one experiment, the data (after ingestion) was around 300M. A 2-dimension groupBy with 20 rows of output took less than 0.1s, but a 2-dimension groupBy with 1M rows of output took 50s to finish. I set everything up according to the Druid Production Cluster Configuration: http://druid.io/docs/latest/Production-Cluster-Configuration.html Is this expected behavior for Druid, or should I change the configuration?
Our current use case is to efficiently extract a fact table from a large Hive dataset with a schema like the following:
D_1, D_2, D_3, … D_k, M_1, M_2, M_3, …, M_l
where D_1, …, D_k are the dimensions we want to slice and dice on, and M_1, M_2, …, M_l are the metrics we want to aggregate.
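To make the query shape concrete, here is a minimal Python sketch that builds a 2-dimension native groupBy query body of the kind described above. The datasource name `fact_table`, the interval, the `doubleSum` aggregator, and the `sum_` output names are placeholders chosen for illustration, not our actual setup.

```python
import json


def build_groupby_query(datasource, dimensions, metrics, interval):
    """Build a Druid native groupBy query body as a plain dict.

    All arguments are illustrative placeholders; the aggregator type
    (doubleSum) is an assumption about how the metrics are summed.
    """
    return {
        "queryType": "groupBy",
        "dataSource": datasource,
        "granularity": "all",
        "dimensions": list(dimensions),
        "aggregations": [
            {"type": "doubleSum", "name": f"sum_{m}", "fieldName": m}
            for m in metrics
        ],
        "intervals": [interval],
    }


# Hypothetical example: group by two dimensions and sum one metric.
query = build_groupby_query(
    datasource="fact_table",
    dimensions=["D_1", "D_2"],
    metrics=["M_1"],
    interval="2015-01-01/2015-02-01",
)
print(json.dumps(query, indent=2))
```

The number of output rows is driven by the number of distinct (D_1, D_2) combinations in the interval, which is the quantity that differs between the 20-row and 1M-row cases.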
Any suggestions for performance improvement?