Performance issues with queries

Hello everyone,

I set up a cluster and I am facing some performance issues when I query my datasources.

My cluster (following the layout I saw in the Imply docs) is:

Master Server with Coordinator, Overlord, Metadata Storage and ZooKeeper on an r3.2xlarge instance.

Data Server with MiddleManager and Historical on an r3.2xlarge instance.

Query Server with Broker and Pivot on an r3.2xlarge instance.

We use a Hadoop cluster for the indexing tasks.

I created different datasources to see how my cluster handles different queries on each of them.

Basically, one day of events is 8 million rows.

1- Typically, our data is stored in a "Visites" datasource with 96 dimensions and 5 metrics. The segment granularity is "day" and the query granularity is "minute", resulting in segments of 450-700 MB.
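For reference, here is roughly what that granularity setup looks like in the ingestion spec's granularitySpec (a minimal sketch; the interval shown is just an example, not our real data range):

```json
{
  "granularitySpec": {
    "type": "uniform",
    "segmentGranularity": "DAY",
    "queryGranularity": "MINUTE",
    "intervals": ["2016-06-01/2016-06-02"]
  }
}
```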

I tried with 1 Historical node on this datasource:

**1:** topN query over 1 week with day granularity -> 741 ms

**2:** groupBy on 3 dimensions over 1 week with day granularity -> 5 s

**3:** the same groupBy over 1 month -> 26 s

I then tried with 5 Historical nodes:

**4:** the same groupBy over 1 month -> 21 s
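The groupBy I am running looks roughly like this (a sketch; the dimension and metric names are placeholders, not my real schema):

```json
{
  "queryType": "groupBy",
  "dataSource": "Visites",
  "granularity": "day",
  "dimensions": ["dimA", "dimB", "dimC"],
  "aggregations": [
    { "type": "longSum", "name": "events", "fieldName": "count" }
  ],
  "intervals": ["2016-05-01/2016-06-01"]
}
```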

2- I created a second datasource, "VisitesHour", and a third, "VisitesDay", where the input is the same but the queryGranularity is changed to "hour" and "day" respectively. Segments of 300-500 MB.

1: hour: 584 ms, day: 564 ms

2: hour: 4.5 s, day: 3.8 s

3: hour: 22 s, day: 22 s

4: hour: 11-18 s, day: 11-15 s

So my first question is: do you think these query times are in line with expected Druid performance, or is there something suspicious about them?

Next, I created datasources with only the 3 dimensions of my query: VisitesCobros, VisitesCobrosHour and VisitesCobrosDay for minute, hour and day query granularity respectively (segments of 100-130 MB, 8 MB and 8 MB).
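Restricting a datasource to those 3 dimensions is just a matter of listing them in the ingestion spec's dimensionsSpec (again, the names are placeholders):

```json
{
  "dimensionsSpec": {
    "dimensions": ["dimA", "dimB", "dimC"],
    "dimensionExclusions": []
  }
}
```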

I hoped queries would take less than 1 second.

2: minute: 4.5 s, hour: 1.4 s, day: 1.3 s

3: minute: 15 s, hour: 6.6 s, day: 6.3 s

4: minute: 9 s, hour: 7.5 s, day: 7.3 s

This is the point where I totally lost my mind, because the queries should have been a lot faster, and that's not the case at all. So why does creating a datasource with only the 3 dimensions I need in my query (which is not what I want in production anyway) not make it faster?

I really think, and hope, that I have a bad configuration making these times grow, because based on what I have read about Druid I don't expect this kind of performance.

I also wanted to know if someone could explain to me the main difference between topN and groupBy that prevents topN queries from accepting multiple dimensions?

Thanks,

Ben

Hey Ben,

How big are the result sets for those groupBy queries? Currently groupBy has performance issues that are especially noticeable with larger result sets. It is definitely not as optimized as topN and timeseries.

We are working on a new groupBy engine, scheduled for 0.9.2, that we expect to be substantially faster in many cases (see benchmarks in https://github.com/druid-io/druid/pull/2998). If you're feeling brave, you could build master and try that out (set "groupByStrategy": "v2" in your query context).
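For example, reusing the kind of 3-dimension groupBy from above (dataSource, dimensions and metric are placeholders), the context would look like this:

```json
{
  "queryType": "groupBy",
  "dataSource": "Visites",
  "granularity": "day",
  "dimensions": ["dimA", "dimB", "dimC"],
  "aggregations": [
    { "type": "longSum", "name": "events", "fieldName": "count" }
  ],
  "intervals": ["2016-05-01/2016-06-01"],
  "context": { "groupByStrategy": "v2" }
}
```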

Another thing you could try is “cascading topNs”. This is where you do something like a 3-dimensional groupBy by actually doing 1 topN query for the first dim, then another over the second dim for each value of the first dim, etc. This is how Pivot (http://imply.io/post/2015/10/26/hello-pivot.html) generates its Table view. With limits on the topNs this can actually be a better user experience overall. Applying the limiting at each level rather than on the overall table means that at each level, users see the highest ranked values, rather than just seeing the highest ranked overall tuples.
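A sketch of the cascading pattern with two levels (names and values are placeholders): first a topN over the first dimension,

```json
{
  "queryType": "topN",
  "dataSource": "Visites",
  "granularity": "all",
  "dimension": "dimA",
  "metric": "events",
  "threshold": 10,
  "aggregations": [
    { "type": "longSum", "name": "events", "fieldName": "count" }
  ],
  "intervals": ["2016-05-01/2016-06-01"]
}
```

then, for each value it returns, a filtered topN over the second dimension:

```json
{
  "queryType": "topN",
  "dataSource": "Visites",
  "granularity": "all",
  "dimension": "dimB",
  "metric": "events",
  "threshold": 10,
  "filter": { "type": "selector", "dimension": "dimA", "value": "<a value from the first topN>" },
  "aggregations": [
    { "type": "longSum", "name": "events", "fieldName": "count" }
  ],
  "intervals": ["2016-05-01/2016-06-01"]
}
```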

It has been a long time since I asked this question, but many thanks for your reply!
I have investigated a lot since then and applied many different strategies and benchmarks to get good performance.

If I could give any advice to those who want good groupBy performance:

  • topN works very well on large datasources, so test your response times before creating new datasources.

  • If you want good groupBy performance, you will have to play with many parameters: the dimensions of your datasource (watch out for high-cardinality dimensions), queryGranularity, segmentGranularity and shards; see the sketch after this list.
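As a sketch of the knobs I mean, in a Hadoop ingestion spec (values are made up, other required fields omitted; targetPartitionSize controls how many shards each time chunk gets with hashed partitioning):

```json
{
  "spec": {
    "dataSchema": {
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "HOUR"
      }
    },
    "tuningConfig": {
      "type": "hadoop",
      "partitionsSpec": {
        "type": "hashed",
        "targetPartitionSize": 5000000
      }
    }
  }
}
```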

And never stop investigating your metrics! They can be VERY useful for choosing which configuration and which hardware you need.

Ben