I have 105,867,563 rows of sales data. When I run a query like

```sql
select count(1) from (
  select distinct column1, column2, column3, column4, column5
  from SALES
  where saledate between timestamp '2018-01-01 00:00:00' and timestamp '2018-12-31 00:00:00'
);
```

it takes 60 to 70 seconds to return the result.
where column1…column5 are dimensions and saledate is timestamp.
What can I do to get the result back in milliseconds?
Disclaimer: This is the route we took to tweak ours. A word from contributors would be nice.
How big are your segments? And how many segments do you have? Can you first try to do these -
1. Make sure the size of your segments is between 300 MB and 700 MB.
2. Make sure you have enough memory to hold all the segments.
3. You might have to tweak segment granularity to tune the above two.
The idea is to have as few segments as possible and to make sure the segments you are reading from can be held in memory. If your memory is smaller than the total size of the segments that have to be read, there will be paging, which will hurt performance.
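To check segment count and size, you can query the `sys.segments` metadata table (this assumes Druid SQL is enabled; the datasource name `SALES` is taken from your question):

```sql
SELECT
  "datasource",
  COUNT(*)        AS num_segments,
  SUM("size") / 1048576 AS total_mb,
  AVG("size") / 1048576 AS avg_segment_mb
FROM sys.segments
WHERE "datasource" = 'SALES'
GROUP BY "datasource";
```

If `avg_segment_mb` is far below the 300-700 MB range, that is a sign you should coarsen `segmentGranularity` or compact the datasource.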
What do you mean by point no. 2?
The processing and merge buffers are used to hold segment data in memory. If the segments you frequently query total, say, 10 GB, you want to make sure there is enough direct memory to hold that 10 GB.
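Per the tuning guide linked below, a historical's direct memory requirement is roughly `(druid.processing.numThreads + druid.processing.numMergeBuffers + 1) * druid.processing.buffer.sizeBytes`. An illustrative `runtime.properties` sketch (the values are examples, not recommendations):

```properties
# Example sizing only; tune these to your hardware.
druid.processing.numThreads=7
druid.processing.numMergeBuffers=2
druid.processing.buffer.sizeBytes=500000000

# Required direct memory >= (7 + 2 + 1) * ~500 MB = ~5 GB,
# set via -XX:MaxDirectMemorySize in jvm.config.
```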
Give this a good detailed read - https://druid.apache.org/docs/latest/operations/basic-cluster-tuning.html
In particular, check out historical heap sizing and direct memory sizing.
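As a concrete sketch, a historical's `jvm.config` on an 8-core box might look like this, following the guide's rough rule of about 0.5 GiB of heap per CPU core (illustrative values only; size per the guide and your own buffer settings):

```
-server
-Xms4g
-Xmx4g
-XX:MaxDirectMemorySize=6g
-XX:+ExitOnOutOfMemoryError
-Duser.timezone=UTC
-Dfile.encoding=UTF-8
```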