GroupBy query scalability

Hi guys,

We are considering using Druid for analytics in our company, and one of our use cases is grouping data by up to 3 dimensions over relatively long periods (several months).

I spent some time profiling and debugging and found that GroupBy query execution is parallelized across several Historical nodes, and each node processes its part with multiple threads, so the Historical nodes should scale well. However, I see that the Broker node always processes the query with a single thread.

For now we have a small cluster with 3 Historical nodes and one Broker; grouping data for 3 months by 3 dimensions takes ~3 sec on the Historicals + ~3 sec on the Broker. So as the amount of data grows we can add Historical nodes to the cluster to keep processing time at the same level, but it looks like the Broker will become a bottleneck.

Another issue is that despite the limit in my query, the Historical nodes return the full result set to the Broker. I use limit = 50, but the Broker receives ~500,000 records, builds an index, and only then returns 50 records to the client app.

Should I tune some Druid settings, or use a different query type?

My query is:

```
{
  "dimensions": ["dim1", "dim2", "dim3"],
  "aggregations": [
    {
      "fieldName": "quantity",
      "type": "longSum",
      "name": "quantity"
    }
  ],
  "intervals": "2016-06-01T00:00:00+00:00/2016-08-31T00:00:00+00:00",
  "limitSpec": {
    "limit": 50,
    "type": "default",
    "columns": [
      {
        "direction": "descending",
        "dimension": "quantity"
      }
    ]
  },
  "granularity": "all",
  "postAggregations": [],
  "queryType": "groupBy",
  "dataSource": "events"
}
```

Broker runtime.properties:

```
druid.service=druid/broker
druid.port=8082
druid.broker.http.numConnections=5
druid.server.http.numThreads=25
druid.processing.buffer.sizeBytes=536870912
druid.processing.numThreads=6
druid.broker.cache.useCache=false
druid.broker.cache.populateCache=false
druid.query.groupBy.maxIntermediateRows=1000000
druid.query.groupBy.maxResults=10000000
```

Historical runtime.properties:

```
druid.service=druid/historical
druid.port=8083
druid.server.http.numThreads=25
druid.processing.buffer.sizeBytes=1610612736
druid.processing.numThreads=5
druid.segmentCache.locations=[{"path":"var/druid/segment-cache","maxSize":160000000000}]
druid.server.maxSize=160000000000
druid.query.groupBy.maxIntermediateRows=1000000
druid.query.groupBy.maxResults=10000000
```

Thanks!

Hi Sergii,
You can get some parallelism on the Broker by setting chunkPeriod for your long-interval queries.
See http://druid.io/docs/latest/querying/query-context.html for details on chunkPeriod.
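For example, adding a chunkPeriod to the query context should split the three-month interval into month-long chunks that can be processed concurrently (a sketch; the "P1M" period value is an assumption you would tune for your data):

```
{
  "queryType": "groupBy",
  "dataSource": "events",
  ...
  "context": {
    "chunkPeriod": "P1M"
  }
}
```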

You should also look at Druid 0.9.2, as we have completely rewritten the groupBy engine to handle larger result sets: https://groups.google.com/forum/#!topic/druid-user/7LY8PUqGuAA
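If you try 0.9.2, my understanding is that the new engine can be selected per query through the query context; a hedged sketch (the exact property name and default may differ between releases, so check the 0.9.2 docs):

```
"context": {
  "groupByStrategy": "v2"
}
```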