Refactor GroupBy query to TopN

Hi all,

We’re currently running a small Druid setup to evaluate it as an analytics engine. All nodes run on a single server with 16 GB of RAM and mechanical disks…

I have a query that has to return the top 5 of an aggregated metric, along with a dimension_name and dimension_id.

Sadly, the GroupBy seems to be a bit slow, so I wondered whether I could refactor this query into 2 TopN queries, or 1 TopN and a scan.

Is this a common practice? Or is it purely a performance/hardware issue, so I should stick with the GroupBy and just add more RAM and maybe SSDs to our machine?

Also, as a side question: if I TopN an id dimension that always maps to the same dimension_name, is there a way to get the name along with the dimension_id in a TopN query?

Thank you,

Michael

A TopN query gives you approximate results. You should check whether you can accommodate the errors.
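
To illustrate where the approximation can come from, here is a toy Python sketch (not Druid code). It assumes each segment returns only its local top rows before the results are merged. The segment contents, the function names, and the tiny per-segment cutoff are all made up for illustration, and a real Druid setup keeps far more rows per segment than this.

```python
from collections import Counter

def exact_top(segments, k):
    """Sum every value across all segments, then rank (GroupBy-style)."""
    totals = Counter()
    for seg in segments:
        totals.update(seg)
    return totals.most_common(k)

def truncated_top(segments, k, per_segment_k):
    """Keep only each segment's local top rows before merging (TopN-style)."""
    merged = Counter()
    for seg in segments:
        merged.update(dict(Counter(seg).most_common(per_segment_k)))
    return merged.most_common(k)

# Hypothetical per-segment sums of one metric, keyed by dimension value.
segments = [
    {"x": 10, "y": 9},
    {"z": 10, "y": 9},
]

print(exact_top(segments, 1))          # y wins overall with 18
print(truncated_top(segments, 1, 1))   # y was cut from both segments
```

With a larger per-segment cutoff (for example `truncated_top(segments, 1, 2)`) nothing is dropped in this toy data and both functions agree, which is why the error depends on how many distinct values the dimension has.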

On Friday, January 25, 2019 at 7:59:35 PM UTC+9, Michael P wrote:

So if you want exact results you should stick with GroupBy?

Yes

On Monday, January 28, 2019 at 10:01:23 PM UTC+9, Michael P wrote:

Hey Michael,

The docs here sum it up pretty well -> http://druid.io/docs/latest/querying/topnquery.html

A two-phase TopN as described in the example is exact in some cases and close enough in most.
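
A rough sketch of the two-phase idea, written as Python that only builds the native query JSON: phase one asks for the approximate top values, and phase two repeats the TopN with an "in" filter restricted to those values. The data source `events`, the metric `total`, the interval, the helper names, and the sample rows are hypothetical, and it assumes a longSum over a column that has the same name as the metric.

```python
import json

def topn_query(data_source, dimension, metric, threshold, intervals, dim_filter=None):
    """Build a native Druid topN query as a plain dict."""
    query = {
        "queryType": "topN",
        "dataSource": data_source,
        "dimension": dimension,
        "metric": metric,
        "threshold": threshold,
        "granularity": "all",
        "intervals": intervals,
        # Assumption: the metric is a longSum over a column of the same name.
        "aggregations": [
            {"type": "longSum", "name": metric, "fieldName": metric}
        ],
    }
    if dim_filter is not None:
        query["filter"] = dim_filter
    return query

def second_phase(first_phase_rows, data_source, dimension, metric, intervals):
    """Re-query only the values found in phase one, using an "in" filter."""
    values = [row[dimension] for row in first_phase_rows]
    in_filter = {"type": "in", "dimension": dimension, "values": values}
    return topn_query(data_source, dimension, metric, len(values), intervals, in_filter)

intervals = ["2019-01-01/2019-02-01"]  # hypothetical interval
phase1 = topn_query("events", "dimension_id", "total", 5, intervals)

# Placeholder rows standing in for the "result" list of the phase-one response.
rows = [{"dimension_id": "a", "total": 42}, {"dimension_id": "b", "total": 17}]
phase2 = second_phase(rows, "events", "dimension_id", "total", intervals)
print(json.dumps(phase2, indent=2))
```

Sending the queries to the Broker is left out on purpose; the sketch only shows how the second query is derived from the first query's results.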

Re: getting a dimension’s name/label from an ID: if the label is in a lookup, you can use a post-aggregation with an expression to retrieve the label.

Here’s an example postagg that could be included:

{"type": "expression", "expression": "lookup(dimension_id, 'dimension_id_to_name')", "name": "dimension_name"}

Cheers,

Dylan

Oh OK, I see.

Let’s say I’m OK with the approximate nature of the TopN query, and I also do not have a lookup (which looks pretty cool, actually!).

Is refactoring a 2-dimension GroupBy into 2 TopN queries a valid technique when you want performance?

Thanks,

Michael