Broker merges then sorts?!

Hey guys,

After some tests, it appears that the broker always merges the results of all segments, and I’m a bit confused at this point. Let me explain.

We have a use case where the segment granularity is “day”, and we launch a query on a week of data but we need the result for each day.

So technically the broker receives the query, forwards it to each of the 7 segments in the week of data, then merges the results and sorts them per day.

We know that because, when we ask for a weekly result, the query is faster than when we ask for a per-day result.

So here are my questions:

1- Why is querying a week of data (when segment granularity is “day”) faster when we ask for a weekly result than when we ask for a per-day result?

  • Does the broker skip sorting the merged segment results when the interval matches the period? In that case, why does the broker merge the segment results when we ask for a per-day result?
  • Or does the broker sort the results in both cases, but much faster when the period is one week than when it is one day?

2- If the broker merges and sorts every time, could this kind of intelligence be built into Druid in the future (if the query period matches the segment granularity, then don’t merge and sort), or is it way too complex?

Many thanks guys,

Ben

The query is fanned out to all the historical nodes (and realtime nodes if you have them) and they all answer the question individually. So if you have 30 nodes then you’ll have 30 timeseries results that need to be merged together (assuming populateCache is false at the broker level). The broker merges the results together in the process of returning them.
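To make the fan-out concrete, here is a minimal Python sketch of the merge step. It is not Druid's actual implementation: it assumes each node returns a list of (timestamp, count) rows, and the function name `merge_partial_results` is made up for illustration.

```python
from collections import defaultdict


def merge_partial_results(per_node_results):
    """Combine partial (timestamp, count) rows from several nodes.

    Hypothetical helper: rows that share a timestamp are summed, so the
    caller gets one row per timestamp regardless of how many nodes answered.
    """
    totals = defaultdict(int)
    for rows in per_node_results:
        for timestamp, count in rows:
            totals[timestamp] += count
    # ISO-8601 date strings sort chronologically when sorted as text.
    return sorted(totals.items())


if __name__ == "__main__":
    # Three made-up nodes, each holding part of the data.
    node_answers = [
        [("2016-05-01", 10), ("2016-05-02", 4)],
        [("2016-05-01", 5)],
        [("2016-05-02", 6), ("2016-05-03", 1)],
    ]
    for timestamp, total in merge_partial_results(node_answers):
        print(timestamp, total)
```

Note that this version has to see every node's answer before it can return anything, which is the caveat described next.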

One caveat is that the broker can’t merge results until it has them. Even the order in which things are processed is not guaranteed unless you ask for it. Merging results more intelligently, or optimizing how results are streamed back, hasn’t been tackled yet, to my knowledge… mostly because general optimizations to overall query speed have been investigated instead.

Does that at least partly answer the question?

Hi,

Yeah, thanks, it helps a lot in understanding Druid.
But is there any chance of an improvement, or is it too complex?
Let me explain: in our use case we have one segment per day in one datasource, “visits”. Now suppose I want the number of visits for each day in the past 7 days. Why does the broker need to wait for all the results, merge them and then sort per day? Can’t we tell it not to merge the results of each segment? (Just a supposition, I really have no idea how difficult that is.)

Thanks again !

Hey Ben,

What query type are you using? Some query types, like timeseries, already avoid materializing and sorting on the broker. groupBy is the main offender in terms of sorting needlessly; see https://github.com/druid-io/druid/issues/2987 for more details on how groupBy works. The needless sorting will be fixed in the future, but merging will keep happening (although merging without sorting isn’t too expensive).
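As a rough illustration of why merging without sorting is cheap, here is a Python sketch (not Druid code) that contrasts the two approaches. It assumes each node already returns its rows ordered by timestamp; both function names are hypothetical.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def merge_sorted_streams(*streams):
    """Lazily merge time-ordered (timestamp, value) streams.

    Relies on every input already being sorted by timestamp, so nothing
    has to be collected or re-sorted; equal timestamps are summed.
    """
    merged = heapq.merge(*streams, key=itemgetter(0))
    for timestamp, rows in groupby(merged, key=itemgetter(0)):
        yield timestamp, sum(value for _, value in rows)


def materialize_and_sort(*streams):
    """Collect every row first, sort the whole lot, then combine."""
    all_rows = sorted((row for s in streams for row in s), key=itemgetter(0))
    return [
        (timestamp, sum(value for _, value in rows))
        for timestamp, rows in groupby(all_rows, key=itemgetter(0))
    ]


if __name__ == "__main__":
    node_a = [("2016-05-01", 3), ("2016-05-02", 5)]
    node_b = [("2016-05-01", 2), ("2016-05-03", 1)]
    print(list(merge_sorted_streams(node_a, node_b)))
    print(materialize_and_sort(node_a, node_b))
```

Both functions give the same answer, but the first only ever holds the current row of each stream, while the second has to hold and sort everything before it can emit the first row.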

Fwiw this is one of the reasons that the Druid docs say to use timeseries or topN instead of groupBy when possible.

Hi Gian,

Thanks for your answer, it’s clear now. So the broker will always merge results, but when sorting is not needed it can be skipped, as with timeseries and topN.

My tests were as follows:

1- Having 7 days of data with 1 day segmentGranularity, so 7 segments. GroupBy on 3 dimensions, asking for the result for each day.

2- Having 7 days of data with 1 week segmentGranularity, so 1 segment. Same GroupBy on 3 dimensions, asking for the result for each day.
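For readers who want to reproduce this, a query of that kind could look roughly like the sketch below, built as a Python dict and printed as JSON. The field names follow Druid's native groupBy query format and the “visits” datasource comes from earlier in the thread, but the dimension names, the count aggregator and the interval are made up for illustration and are not the ones actually used in the tests.

```python
import json


def build_group_by_query(granularity="day"):
    """Build a native groupBy query body for one week of data.

    Hypothetical example: the dimensions, aggregation and interval
    below are placeholders.
    """
    return {
        "queryType": "groupBy",
        "dataSource": "visits",
        # "day" returns one set of rows per day; "week" returns one per week.
        "granularity": granularity,
        "dimensions": ["country", "browser", "device"],  # placeholder names
        "aggregations": [{"type": "count", "name": "rows"}],
        "intervals": ["2016-05-02/2016-05-09"],  # placeholder 7-day interval
    }


if __name__ == "__main__":
    print(json.dumps(build_group_by_query("day"), indent=2))
```

The same body is sent in both tests; only the segmentGranularity of the stored data differs.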

I was expecting 1 to be faster, because the historicals send the final result and the broker doesn’t need to do anything, but it appears that 2 is faster than 1.

So I supposed that the broker merges and sorts every time, even when it’s not needed.

Ben