Tuning a Prod Cluster

My company is starting to evaluate Druid to see whether its performance holds up, but we don’t want to start with r3.xlarge or larger instances. Instead we have mostly m3.large instances, plus one r3.large:

coordinator: m3.large (2 CPU, 7.5 GB)

historical: r3.large (2 CPU, 15.25 GB)

broker: m3.large (2 CPU, 7.5 GB)

overlord: m3.large (2 CPU, 7.5 GB)

Here are the parameters from my property files and JVM options:

"druid": {
  "version": "0.8.1",
  "zookeepers": "main-zk",
  "coordinator": "coordinator",
  "coordinator_max_mem": "2g",
  "coordinator_new_size": "512m",
  "historical": "historical",
  "historical_threads": "1",
  "historical_buffer_size": "512000000",
  "historical_seg_size": "7000000000",
  "historical_max_mem": "6g",
  "historical_new_size": "2g",
  "historical_direct_mem": "2g",
  "overlord": "overlord",
  "overlord_threads": "1",
  "overlord_buffer_size": "512000000",
  "overlord_runner_mem": "2g",
  "overlord_max_mem": "5g",
  "overlord_new_size": "2g",
  "overlord_direct_mem": "2g",
  "broker": "broker",
  "broker_threads": "1",
  "broker_buffer_size": "512000000",
  "broker_max_mem": "5g",
  "broker_new_size": "2g",
  "broker_direct_mem": "2g"
}

Can someone please help identify any wrongly configured parts of this cluster? I tried queries over only 1 week of data ingested with queryGranularity and segmentGranularity of DAY, with each day holding about 180 MB of data (all in 1 segment), and it takes a few seconds to return for just 3 subsplits (e.g. time, country, region, domain) using Pivot.

Am I setting these parameters approximately correctly?

Is the number of threads being 1 on the broker and historical nodes the main bottleneck here? If so, what would you suggest I do, since most AWS instances with 4+ CPUs are quite expensive (r3 or otherwise)?
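For reference, here is roughly how I believe these template values translate into the historical node's actual settings; the exact mapping depends on our deployment templates, so treat this as an approximation (the segment cache path below is just a placeholder):

JVM options (historical):
-Xmx6g -Xms6g
-XX:NewSize=2g -XX:MaxNewSize=2g
-XX:MaxDirectMemorySize=2g

runtime.properties (historical):
druid.processing.numThreads=1
druid.processing.buffer.sizeBytes=512000000
druid.server.maxSize=7000000000
druid.segmentCache.locations=[{"path":"/mnt/druid/segments","maxSize":7000000000}]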

Geoff

Having 4 splits in Pivot ‘time > country > region > domain’ is going to be hella expensive, as the number of Druid queries issued under the hood is exponential in the number of splits. You should run Pivot with the --verbose flag and see what the times are on the underlying Druid requests.
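To make that blow-up concrete: assuming each split level shows on the order of 50 values (the exact per-level limits are Pivot's, so this is just an illustration), a fully expanded 4-level table needs roughly 1 + 50 + 50^2 + 50^3 ≈ 128,000 topNs, so even modest per-query latency adds up. To see the individual Druid requests and their timings, you can start Pivot with something like the following, assuming your broker listens on Druid's default broker port 8082:

pivot --druid your-broker-host:8082 --verbose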

Thanks Vadim - I will try the --verbose flag.

Note that even when I only have time/country (2 splits), the queries are already slow (3-5 seconds). And if Pivot can’t do 4 splits, what’s the point of interactive/dynamic querying for Druid or Pivot? The typical use case involves multiple splits/groupBys, and from what I understand Pivot internally translates the groupBy into iterated topNs, which are supposed to perform pretty well.

If not, what kind of dynamic queries would perform well in druid/pivot?
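For reference, my understanding is that each nested split level translates into topN queries shaped roughly like this (the datasource, dimension, metric, and interval here are placeholders standing in for my schema):

{
  "queryType": "topN",
  "dataSource": "my_datasource",
  "intervals": ["2015-11-01/2015-11-08"],
  "granularity": "all",
  "dimension": "country",
  "metric": "count",
  "threshold": 50,
  "aggregations": [
    { "type": "longSum", "name": "count", "fieldName": "count" }
  ]
}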

Hi Geoffrey,

Usually the idea behind interactive OLAP with sub-second-ish queries is that you can view certain filters and group-bys, change the filters or group-bys according to the data you see and what you are looking for, and quickly see the new results and iterate further. This makes sense as long as the visualization(s) on the screen can be humanly scanned through relatively quickly.

If your table consists of 4 nested splits, and each of those dimensions has a medium or large number of values, then the result is a very long table that can barely be scrolled through. If a visualization takes hours to scan through, sub-second queries probably won’t make a big difference.

It is true that given the right dimensions and smart visualization selection, 4 splits can sometimes be scanned through quickly (as for Pivot, it is currently still quite limited in the visualizations it offers). But in general, I believe that common use cases for interactive analytics tend to involve between one and a handful of separate topNs and/or 2-3 nested splits.

Another thing to note is that with some table products you can select a large number of splits and have them collapsed by default, and then expand the ones you are interested in. In that case, you only generate topNs when you expand an item. Take a look at this amusing example: http://orteil.dashnet.org/nested . In this case you get exactly the kind of sub-second interactive experience Druid excels at.

So if you are talking about a use case of 4 large-ish splits, perhaps this is what you are referring to. Pivot does not do this currently but might in the future. Off the top of my head I can’t think of an open source UI for Druid that does this.

I hope I am not out of line discussing UIs so much in this forum. Hopefully it is somewhat relevant because it has to do with Druid use cases.

Thanks for your response, Ofir.

As I mentioned, even when I did only 2 splits it was already slow. So maybe the problem isn’t the number of splits, but how I configured the nodes, or simply a limitation of the node hardware. I’m not sure which - that’s why I posted my config.

From what I gathered on the #druid-dev channel from fj, I most likely need to go with instances that have more CPUs. Please let me know if you think other things may help before I increase our monthly instance spend by $1000.

Geoff: How much data do you have right now, segment-wise?

Queries over 1 week of data ingested with queryGranularity and segmentGranularity of DAY, with each day about 180 MB of data (all in 1 segment), and it takes a few seconds to return for just 1 subsplit (e.g. time, country) using Pivot.

For that volume of data and performance, it seems like things are misconfigured for your hardware.

Do you see anything wrong in the config I attached in the first message of this thread?

Hi Geoff, yes, there are major problems with your configs. You appear to have taken configs for a completely different hardware setup and applied them to your own hardware.
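As a rough illustration of the kind of sizing I would aim for on boxes this small (a starting point under the assumption that your template keys map directly to heap, new-gen, direct memory, processing threads, and buffer size; not a tuned recommendation):

Historical (r3.large, 2 CPU / 15 GB):
-Xmx4g -Xms4g -XX:MaxDirectMemorySize=2g
druid.processing.numThreads=1
druid.processing.buffer.sizeBytes=512000000

Broker (m3.large, 2 CPU / 7.5 GB):
-Xmx3g -Xms3g -XX:MaxDirectMemorySize=2g
druid.processing.numThreads=1
druid.processing.buffer.sizeBytes=512000000

Two invariants matter most: -XX:MaxDirectMemorySize must be at least druid.processing.buffer.sizeBytes * (druid.processing.numThreads + 1), and heap plus direct memory must leave enough free RAM for the OS page cache to hold the segments the historical is serving. As posted, your broker's 5g heap plus 2g of direct memory essentially consumes the whole 7.5 GB m3.large, which is the kind of mismatch I mean.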