How to improve the performance of "groupBy"

Hi all,

I want to get the counts for all combinations of several dimensions at “hour” granularity over the most recent week, so I used a “groupBy” query to get them.

But the latency is about 15s~20s, and I want to decrease it dramatically. How can I improve it?

The machine types and Druid settings follow this doc: http://druid.io/docs/latest/Production-Cluster-Configuration.html

There are 3 historical nodes serving this cluster, and the data volume is about 70 million rows per week.

Can increasing the value of “druid.processing.buffer.sizeBytes” improve it? It seems its maximum value is 2147483647, because its type is integer.

By the way, if the maximum value (2147483647) of “druid.processing.buffer.sizeBytes” is not enough for the cluster, how should that situation be handled?

Another question is about “druid.query.groupBy.maxResults”: if a “groupBy” result exceeds it, the query returns “Maximum number of rows reached”.

Can it be set dynamically in the JSON body of the “groupBy” query, or is there another way to avoid this error?

Thanks,

Tao

Hi Tao, I am curious about your use case with groupBy. Is it possible to accomplish the query you are asking for using iterated topNs or another combination of queries? There are things that can be done to improve the performance of groupBys, but I suspect using Druid’s other queries will solve your use case while putting much less strain on the cluster.

Hi, Fangjin:

I read the doc description: topN supports only one dimension, but I want to know the metrics for combinations of several dimensions. For example, with the dataset in your doc, I want to get the “added” total for every combination of (“page”, “user”, “country”) over the most recent week at “day” granularity. How can I use topN to get that?
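To illustrate, a topN body only takes a single “dimension” field, roughly like the sketch below (based on the Wikipedia example in the docs; the dataSource name and interval here are my assumptions):

{
  "queryType": "topN",
  "dataSource": "wikipedia",
  "granularity": "day",
  "dimension": "page",
  "metric": "added",
  "threshold": 25,
  "intervals": ["2015-05-25T00:00/2015-06-01T00:00"],
  "aggregations": [
    {"type": "longSum", "name": "added", "fieldName": "added"}
  ]
}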

Thanks,

Tao

On Tuesday, June 2, 2015 at 11:12:56 AM UTC+8, Fangjin Yang wrote:

You can try increasing the value of “druid.processing.numThreads” on the historicals. See if that improves things for you.

“druid.query.groupBy.maxResults” cannot be changed dynamically.

– Himanshu

Hi Tao, do you care about every individual value, or are you just trying to get a count?

If you are only after the count metric, you can take a look at the cardinality aggregator: http://druid.io/docs/0.7.3/Aggregations.html

I believe the cardinality aggregator cannot do combinations, though. For that use case, it is probably best to try some of Himanshu’s suggestions on performance. It would also be nice to know where the bottlenecks in your groupBys are (broker vs historical).

Thanks very much, Fangjin and Himanshu. The machine capacity and the settings follow the doc’s suggestions: I chose “r3.8xlarge” and set “druid.processing.numThreads” to 31 on the historical and broker nodes. I also added “context”: {“chunkPeriod”: “P1D”} to the JSON query body, but the latency is still unacceptable. Should I decrease the value of “targetPartitionSize” (it was set to 500000) when I ingest the data, so that there are more segment partitions, and would that improve “groupBy” performance?
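For reference, the query body with that context looks roughly like the sketch below (the dataSource, dimension, and metric names are just examples from my data):

{
  "queryType": "groupBy",
  "dataSource": "rcv_data",
  "granularity": "hour",
  "dimensions": ["country", "pkgname"],
  "aggregations": [
    {"type": "longSum", "name": "count", "fieldName": "count"}
  ],
  "intervals": ["2015-05-24T00:00/2015-06-01T00:00"],
  "context": {"chunkPeriod": "P1D"}
}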

I also tried the cardinality aggregator, but it can’t return the values for every combination of the dimensions.

Do you have any other suggestions for meeting my requirement? Thanks very much.

On Wednesday, June 3, 2015 at 5:41:38 AM UTC+8, Fangjin Yang wrote:

Hi Tao, the cardinality aggregator does support counting combinations of dimensions.
If you give it the list of dimensions and set the “byRow” flag to true, it will count the number of combinations, similar to what groupBy would do.

Hi Tao, can you try a timeseries query with a cardinality aggregator?

Thanks, Xavier and Fangjin. I’m not sure the query I used is right; it is shown below.
{
  "queryType": "timeseries",
  "dataSource": "rcv_data",
  "intervals": ["2015-05-24T00:00/2015-06-01T00:00"],
  "granularity": "hour",
  "aggregations": [
    {"type": "longSum", "fieldName": "count", "name": "count"},
    {"type": "cardinality", "fieldNames": ["country", "pkgname"], "name": "result", "byRow": true}
  ]
}

Then it returns results like this:

[
  {
    "timestamp": "2015-05-24T00:00:00.000Z",
    "result": {
      "result": 8843.6648325824,
      "count": 3151028
    }
  },
  {
    "timestamp": "2015-05-24T01:00:00.000Z",
    "result": {
      "result": 8881.780564855593,
      "count": 3071602
    }
  },
  {
    "timestamp": "2015-05-24T02:00:00.000Z",
    "result": {
      "result": 8855.22371424502,
      "count": 3112279
    }
  }
]

But we don’t know which combination the value of “result” comes from.

For example, in the first returned JSON:

    "result": {

        "result": 8843.6648325824,

        "count": 3151028

    }

Maybe it is from the combination where country is “jp” and pkgname is “youtube”.

My requirement is to generate a report with the metrics of many combinations of several dimensions in my data.

Thanks,

Tao

On Wednesday, June 3, 2015 at 8:22:45 AM UTC+8, Fangjin Yang wrote:

Hi, Fangjin:

Sorry for the unclear description.

From my point of view, the cardinality aggregator only returns the number of combinations of several dimensions.

For my requirement, I want to generate a report like this:

For every hour, I want to know the number of impressions for each pkg_name and country:

Date           pkg_name   country   impression
2015-06-02T01  youtube    jp        20000
2015-06-02T01  youtube    hk        10000
2015-06-02T01  tudou      us        30000
2015-06-02T23  cs         uk        30000

Here “pkg_name” and “country” are dimensions, and “impression” is defined as a metric.

We want to know the number of impressions for many combinations of “pkg_name” and “country”.

There are so many values of “pkg_name” that we can’t know all of them in advance.
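A groupBy corresponding to this report would look roughly like the sketch below (the dataSource “rcv_data” is the one from my timeseries query above; the column names “pkg_name”, “country”, and “impression” follow the table, so treat the exact names as assumptions):

{
  "queryType": "groupBy",
  "dataSource": "rcv_data",
  "granularity": "hour",
  "dimensions": ["pkg_name", "country"],
  "aggregations": [
    {"type": "longSum", "name": "impression", "fieldName": "impression"}
  ],
  "intervals": ["2015-06-02T00:00/2015-06-03T00:00"]
}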

Thanks,

Tao

On Wednesday, June 3, 2015 at 8:22:45 AM UTC+8, Fangjin Yang wrote:

Hi Luotao,

It doesn’t look like you can use timeseries to get the report you are looking for.

In case your use case involves repeated queries, you can also try enabling caching for groupBy queries.

Caching of groupBy results might also help improve performance.
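In practice that means adding something like the following to the query “context” (the “useCache”/“populateCache” flags are the standard query context parameters; groupBy caching may also need to be enabled in the node cache configuration, so treat this as a sketch to merge into the query body):

{
  "context": {
    "useCache": true,
    "populateCache": true
  }
}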

Hi Tao, a follow-up question to understand your use case.

If you are generating a report that is several million rows of results, what do you plan to do with the results? If you need to do fast, interactive queries to power a data visualization, how do you plan to visualize so many results? If you are doing a report to be exported to another system, why are interactive results required?

Second question, do you need every possible combination for the dimensions, or do you actually want to know who the top pkg_names are, and for those top pkg_names, who the countries are, and for those countries, what the different impressions are?

Hi, Fangjin:

If you are generating a report that is several million rows of results, what do you plan to do with the results? If you need to do fast, interactive queries to power a data visualization, how do you plan to visualize so many results? If you are doing a report to be exported to another system, why are interactive results required?

These detailed results are generated for marketers, to see the quality of impressions, clicks, conversions, or other metrics for some combination of dimensions, for example “pkg_name” and “country”. Your question makes sense: a person will not spend so much time checking so many results. I think I should provide another interface (like Hive), which is not real-time, for them to get the report.

Second question, do you need every possible combination for the dimensions, or do you actually want to know who the top pkg_names are, and for those top pkg_names, who the countries are, and for those countries, what the different impressions are?

I don’t need every possible combination of the dimensions; a topN is enough, for example the top 1000 combinations of “pkg_name” and “country” by impressions. I also wanted to use “topN” to get it, but it seems that “topN” supports only one dimension according to its description.

On Thursday, June 4, 2015 at 4:55:25 AM UTC+8, Fangjin Yang wrote:

For interactive or visual workloads, you could consider using iterated topNs instead of a single big groupBy. The idea is that instead of grouping by (pkg_name, country) and taking the top 1000 tuples, you’d first find the top 20 (or whatever) pkg_names, and then for each pkg_name, find the top 100 countries using 20 filtered topN queries (each one filtered on a particular pkg_name in the top 20); a sketch of these two steps follows the list below. This has some advantages:

  • You can draw a visualization progressively by showing results as individual queries return.

  • You can maybe even delay some of the later queries until a user clicks on something, like an “expand” or “show country breakdown” button for a pkg_name.

  • With a groupBy + limit, you never really know if you have the full picture for a given pkg_name, since some of the rows might have been limited out. With iterated topNs, you know that your per-pkg_name metrics show the whole story for those pkg_names.
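A rough sketch of the two steps, reusing the dataSource and column names from earlier in the thread (adjust to your actual schema). First, a topN over the whole week to find the top 20 pkg_names:

{
  "queryType": "topN",
  "dataSource": "rcv_data",
  "granularity": "all",
  "dimension": "pkg_name",
  "metric": "impression",
  "threshold": 20,
  "aggregations": [
    {"type": "longSum", "name": "impression", "fieldName": "impression"}
  ],
  "intervals": ["2015-05-24T00:00/2015-06-01T00:00"]
}

Then, for each returned pkg_name, a filtered topN on country (shown here filtered on “youtube” as an example value), which can use “hour” granularity if you want the hourly breakdown:

{
  "queryType": "topN",
  "dataSource": "rcv_data",
  "granularity": "hour",
  "dimension": "country",
  "metric": "impression",
  "threshold": 100,
  "filter": {"type": "selector", "dimension": "pkg_name", "value": "youtube"},
  "aggregations": [
    {"type": "longSum", "name": "impression", "fieldName": "impression"}
  ],
  "intervals": ["2015-05-24T00:00/2015-06-01T00:00"]
}

Each of these filtered queries is much smaller than the full (pkg_name, country) groupBy, and they can be issued independently.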

Thanks, Gian, for the good suggestion.

On Wednesday, June 17, 2015 at 2:08:35 AM UTC+8, Gian Merlino wrote:

You can also look at the facet.js library for doing iterated topNs.