Acceptable query time for 6 months' worth of data

Hi,

I have another thread on the same topic here: https://groups.google.com/forum/#!topic/druid-user/AHiz7Kl5Y1k. I want to keep this one to the point, so I am creating a separate thread.

Here is a description of my datasource:

  • Number of segments: 1634

  • Total size: 11.4 MB

  • Average size of each segment: a few KB (less than 20 KB)

  • Time range: February 2019 to today

Ingestion:

  • Segment granularity is set to 15 minutes (specified in the Tranquility config)

  • Number of dimensions specified: 20 (around 7-8 are high-cardinality dimensions (UUIDs); the others are dimensions like browserName, osType, locale, etc.)

My current requirement is to query data with a query granularity of 1 day. We are storing data with a segment granularity of 15 minutes, keeping future requirements in mind.

I started noticing that querying from Druid was taking a long time. Some of the queries take a few milliseconds (sometimes more than 200), while others take a couple of seconds. I have described the problem in detail here: https://groups.google.com/forum/#!topic/druid-user/Va7ZLVzax7M

What I wanted to know is: if I have many segments (15 minutes each) sized at just a few KB, and I am querying 6 months of data, how long should the Historical node take to fetch the results for the query? I typically run groupBy queries with 4-5 high-cardinality dimensions.
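For illustration, the queries are roughly of this shape (the datasource name and dimension names below are just placeholders standing in for our UUID and browser/OS dimensions, and the interval is illustrative, roughly February to now):

{
  "queryType": "groupBy",
  "dataSource": "my_datasource",
  "granularity": "day",
  "intervals": ["2019-02-01/2019-07-01"],
  "dimensions": ["userId", "sessionId", "deviceId", "browserName", "osType"],
  "aggregations": [
    { "type": "count", "name": "events" }
  ]
}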

If the Historical process is running on a node with 8 cores and 32 GB RAM, started with the out-of-the-box configuration shipped in the distribution, should a query over 6 months of data take several seconds (5+ seconds)? How can this be sped up?

Thanks,

Prathamesh

Hi,

I would suggest compacting the segments first. It will definitely speed up your queries without affecting the query granularities you can use in the future.

Best,

Jay

Hi Jay,

By compact, do you mean re-indexing with a granularity of 1 day or 1 hour?

It makes sense that it would speed up queries, since Druid would have fewer segments to work with. But does that mean a segment granularity of 15 minutes always leads to slower queries? Is the main reason for slow queries that the segments are smaller than the recommended 300-700 MB?

Also, if I do compact the segments to, say, 1 day, how would I be able to query data at 15-minute granularity? What if I want to plot data for every 15-minute interval?

Thanks,
Prathamesh

Compaction only compacts segments; it does not impact query granularity.

In the Druid console, at http://{host}:8888/unified-console.html#datasources, there is a compaction column where you can edit your own compaction config for each datasource. Your segments are so small that you could probably compact a whole month and still stay within the recommended segment size.
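If you would rather kick off compaction by hand instead of through the console, a manual compaction task is roughly this shape (the datasource name, interval, and row limit here are only examples, and the exact fields can vary a bit between Druid versions):

{
  "type": "compact",
  "dataSource": "my_datasource",
  "interval": "2019-02-01/2019-03-01",
  "tuningConfig": {
    "type": "index",
    "maxRowsPerSegment": 5000000
  }
}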

I didn't get the last part.

So compaction will lead to fewer, larger segments in place of the current smaller ones, while keeping segment granularity intact?

Currently my segment granularity is 15 minutes. If I compact the segments, will I still be able to query at that granularity?

Thanks,
Prathamesh

Yes, just like you can query by the minute even when segments are 15 minutes each. Compacted segments will speed up loading, and speed up queries.

Hi Prathamesh,

At your data size, I would highly recommend using a much larger segment granularity than 15 minutes. This could be done either with re-indexing (depending on your ingestion process) or with compaction as Jay suggested, which is documented at http://druid.apache.org/docs/latest/ingestion/compaction.html. Also, to elaborate on what Jay mentioned: segment granularity has no impact on query granularity, and the two do not need to match. Segment granularity is a mechanism to control how your data is broken down into time chunks, which influences segment size, so you can keep the 15-minute query granularity (or probably use an even finer granularity such as minutes or seconds, if it's present in your data) but use something like week- or month-sized segment granularity. This will produce far fewer segments, which I think should help with the poor performance you are seeing, where the bottleneck is the large number of result merges that need to be done.
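For example, if you re-index, a granularitySpec roughly like the one below would keep your 15-minute roll-up while producing month-sized time chunks (this is just a sketch; adjust segmentGranularity to whatever keeps your segments in the recommended size range):

"granularitySpec": {
  "type": "uniform",
  "segmentGranularity": "month",
  "queryGranularity": "fifteen_minute"
}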

We generally recommend segments have a few million rows each, which with typical datasets usually comes in between 300-700MB per segment. See https://druid.apache.org/docs/latest/design/segments.html for details. With the upcoming 0.15 release, which I expect to be announced in the next day or so, we’ve also improved the docs around tuning Druid. The docs aren’t published at the time I write this, but the ones most relevant for you will be available at these links once the release is finished: https://druid.apache.org/docs/latest/operations/single-server.html and https://druid.apache.org/docs/latest/operations/basic-cluster-tuning.html. You can currently find them in a much less friendly form at https://github.com/apache/incubator-druid/blob/master/docs/content/operations/single-server.md and https://github.com/apache/incubator-druid/blob/master/docs/content/operations/basic-cluster-tuning.md.

Hi Clint,

Thank you for the detailed explanation. I do have some follow-up questions.

The reason I had kept the segment granularity at 15 minutes was that the docs mention that query granularity cannot be smaller than segment granularity. From the Druid docs:

Having a query granularity smaller than the ingestion granularity doesn’t make sense,
because information about that smaller granularity is not present in the indexed data.
So, if the query granularity is smaller than the ingestion granularity, druid produces
results that are equivalent to having set the query granularity to the ingestion granularity.

While we currently only require DAY granularity (for charts), along with aggregated data over a few months, we wanted to be able to query at at least 15-minute granularity if required in the future. I thought that once data is ingested with a segment granularity of HOUR or DAY, we would not be able to run queries with a granularity of 15 minutes and get that breakdown. Are you saying that is incorrect? I may have interpreted the quoted portion of the docs incorrectly. **Are ingestion granularity and segment granularity different things? I thought Druid uses the segment granularity to persist data in the required time window (bucket) in order to allow fast retrieval for specific intervals. I thought that in order to query certain dimensions and get a breakdown at the 15-minute level, the segment granularity HAD to be 15 minutes. My understanding was that if segment granularity was specified as HOUR, Druid would create only one segment for that hour, making it impossible to query at 15-minute granularity. Can you confirm that this understanding is incorrect?**

By the way, I ingest data using Tranquility:

"granularitySpec": {
  "type": "uniform",
  "segmentGranularity": "fifteen_minute",
  "queryGranularity": "fifteen_minute"
}

Thanks,

Prathamesh

Ah, I think the docs are perhaps ambiguous here. That passage means the 'granularity' value at query time cannot produce results finer than the 'queryGranularity' value used at ingestion time; it is not referring to the relationship between the ingestion-time 'queryGranularity' and 'segmentGranularity' parameters. I've raised a PR to clarify the docs: https://github.com/apache/incubator-druid/pull/7977
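Concretely, since your ingestion spec sets queryGranularity to fifteen_minute, a hypothetical query like the one below (placeholder datasource name) works as expected with 'granularity' set to 'hour' or 'day', while setting it to 'minute' would just return results equivalent to fifteen_minute, because nothing finer was stored:

{
  "queryType": "timeseries",
  "dataSource": "my_datasource",
  "intervals": ["2019-06-01/2019-06-02"],
  "granularity": "hour",
  "aggregations": [
    { "type": "count", "name": "events" }
  ]
}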

Hi Clint,

So, to confirm:

At ingestion time: segment granularity = 1 DAY and query granularity = 15 minutes

At query time: queries can use a granularity of 15 minutes and above, i.e. HOUR, DAY, WEEK and so on

At ingestion time: segment granularity = 1 DAY and query granularity = HOUR

At query time: queries can use a granularity of 1 HOUR and above, i.e. DAY, WEEK and so on, but not below HOUR (i.e. not 15 minutes or 1 minute)

Clint, could you please validate this?

Thanks,

Prathamesh

Yep, that’s correct