2-dimension groupBy performance problem

Hi guys,

We are considering using Druid for analytics at our company, and one of our use cases is grouping data by up to 2 dimensions with a relatively large number of combinations. The query takes several minutes and we don't know what the problem is.

Can you advise anything that might help?

  • We have a simple configuration with 1 historical node and 1 broker node.

  • DIM_1 (cardinality 5,000), DIM_2 (cardinality 60,000), Measure_1 (a sum of the count metric per group).
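For reference, our query is roughly shaped like this (the dataSource name and interval here are illustrative placeholders, not our real values):

```json
{
  "queryType": "groupBy",
  "dataSource": "our_datasource",
  "granularity": "all",
  "dimensions": ["DIM_1", "DIM_2"],
  "aggregations": [
    { "type": "longSum", "name": "Measure_1", "fieldName": "count" }
  ],
  "intervals": ["2016-01-01/2017-01-01"]
}
```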

Any ideas?

Have you tried the new groupBy engine in 0.9.2? We’ve been working a lot recently on improving groupBy performance and most of the work has gone into the v2 engine.

See the docs here: http://druid.io/docs/0.9.2/querying/groupbyquery.html

You can try it out by:

  • Setting druid.processing.numMergeBuffers to a non-zero number

  • Setting "groupByStrategy": "v2" in your query context

Possible tunings once you do that are druid.processing.buffer.sizeBytes, druid.query.groupBy.maxMergingDictionarySize, and druid.query.groupBy.maxOnDiskStorage.
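To make that concrete, here's a minimal sketch (the values are illustrative starting points, not tuned recommendations). In the service's runtime.properties:

```
druid.processing.numMergeBuffers=2
druid.processing.buffer.sizeBytes=536870912
```

and in the query itself:

```json
"context": { "groupByStrategy": "v2" }
```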

Hi Gian.

Which server configuration should be changed? (Historical, Broker?, Router?, MiddleManager?)

Thanks.

Jay

Those settings all apply to the broker, the historicals, and, if you use realtime tasks, the middleManagers (where they're inherited by the realtime tasks). On the broker they're only used for the outer query of a nested query (i.e., when its dataSource is of type "query"). On historicals and middleManagers they're used for all groupBy queries.
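For example, the broker's merge buffers only come into play for a nested query shaped like this (the dataSource name, dimensions, and interval are illustrative):

```json
{
  "queryType": "groupBy",
  "dataSource": {
    "type": "query",
    "query": {
      "queryType": "groupBy",
      "dataSource": "our_datasource",
      "granularity": "all",
      "dimensions": ["DIM_1"],
      "aggregations": [
        { "type": "longSum", "name": "Measure_1", "fieldName": "count" }
      ],
      "intervals": ["2016-01-01/2017-01-01"]
    }
  },
  "granularity": "all",
  "dimensions": [],
  "aggregations": [
    { "type": "longSum", "name": "total", "fieldName": "Measure_1" }
  ],
  "intervals": ["2016-01-01/2017-01-01"],
  "context": { "groupByStrategy": "v2" }
}
```

The inner query runs on the historicals as usual; the outer aggregation over its results is what the broker's merge buffers are used for.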