Deciding between TopN and GroupBy

Hello community:

I am building an object that decides which query type to use depending on the parameters passed into it.

My logic so far is:

  • If we want multiple dimensions or a having clause, use GroupBy

  • If we want a single dimension (no having clause), use TopN
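A minimal sketch of that selection rule, for illustration only. The function name `choose_query_type` and its parameters are hypothetical and not part of any Druid client library; the zero-dimension case is not covered by the rule above, so the sketch simply rejects it.

```python
from typing import Optional, Sequence


def choose_query_type(dimensions: Sequence[str], having: Optional[dict] = None) -> str:
    """Pick a Druid query type from the requested dimensions and having clause."""
    if len(dimensions) == 0:
        # Not covered by the rule in the post; rejected here rather than guessed.
        raise ValueError("at least one dimension is required")
    if len(dimensions) > 1 or having is not None:
        # Multiple dimensions or a having clause: GroupBy.
        return "groupBy"
    # Exactly one dimension and no having clause: TopN.
    return "topN"
```

Keeping the rule in one small function makes it easy to change later, for example to route high-cardinality single dimensions to GroupBy.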

However, after reading about TopN’s inaccuracies on dimensions with high cardinality, I am wondering whether or not I should use it.

Hence, I have some questions:

  • Assuming I want 100% accuracy for a single dimension, should I use GroupBy, or TopN with a very high threshold?

  • For multiple dimensions, would it be more efficient to use one GroupBy or many TopNs (as Pivot does)?

Thanks in advance!

Rafael

Depends on what you’re looking for.

The internal rule here is “if you can use topN you should”, so I’ll address topN directly.

Note that the following applies only when the cardinality of the dimension you are querying is greater than the topN threshold (max(1000, topNLimit) by default, but this is configurable).
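As a rough illustration of that condition, the check below compares a dimension’s cardinality with the effective threshold. The function names are hypothetical, and the default of 1000 is taken from the sentence above. I believe the configurable part corresponds to the `minTopNThreshold` setting, but treat that as an assumption and check the docs for your Druid version.

```python
def effective_topn_threshold(limit: int, min_threshold: int = 1000) -> int:
    """Effective threshold as described above: max(1000, topNLimit) by default."""
    return max(min_threshold, limit)


def topn_may_be_approximate(cardinality: int, limit: int, min_threshold: int = 1000) -> bool:
    """True when the dimension has more values than the effective threshold,
    which is the case in which the accuracy caveats apply."""
    return cardinality > effective_topn_threshold(limit, min_threshold)
```

For example, a dimension with 500 values and a limit of 10 stays below the effective threshold of 1000, so the caveats do not apply to it.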

In general, topN is really good at finding things which rise above the noise. So if your sorting metric is all noise, then you’re not going to get accurate results.

It is worth noting that you can run a topN and use its results in a second topN if you want to be incredibly paranoid about metric accuracy (the sorting will still be approximate).
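One way to read the two-pass idea is to restrict the second topN to the dimension values the first one returned. This is a sketch under assumptions, not necessarily how it is done internally: it assumes the query’s `dimension` is a plain string, that `first_rows` is the `result` array from a single-bucket (granularity `all`) topN response, and the helper name `second_pass_topn` is hypothetical.

```python
from typing import Any, Dict, List


def second_pass_topn(first_query: Dict[str, Any],
                     first_rows: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Build a second topN restricted to the values found by the first one."""
    dim = first_query["dimension"]  # assumed to be a plain string
    values = [row[dim] for row in first_rows]
    if not values:
        raise ValueError("first topN returned no rows")

    # "in" filter that keeps only the candidate values from the first pass.
    in_filter = {"type": "in", "dimension": dim, "values": values}

    query = dict(first_query)  # shallow copy; the first query is left untouched
    existing = first_query.get("filter")
    if existing:
        query["filter"] = {"type": "and", "fields": [existing, in_filter]}
    else:
        query["filter"] = in_filter
    # Only len(values) distinct values can match, so nothing gets cut off.
    query["threshold"] = len(values)
    return query
```

Because the second query can only see as many values as its threshold, its metrics are computed over every candidate, while the choice of candidates still comes from the approximate first pass.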

If you just want to download a bunch of metrics per dimension value for a high-cardinality dimension (some sort of ID), then paginating through lexicographic topNs actually works pretty well.

Since lexicographic sorting is, by its nature, NOT noisy, it returns accurate results.
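A sketch of what that pagination loop might look like. It assumes the topN `metric` is given as a dimension metric spec with `ordering` set to `lexicographic` and a `previousStop` value (older Druid versions spell this spec differently), and that `previousStop` is exclusive, so each page starts after the last value already seen. `run_query` is a hypothetical callable that sends the query to Druid and returns the list of result rows.

```python
from typing import Any, Callable, Dict, Iterator, List, Optional


def lexicographic_page_query(base_query: Dict[str, Any], page_size: int,
                             previous_stop: Optional[str] = None) -> Dict[str, Any]:
    """Copy base_query, sorting by dimension value instead of by a metric."""
    metric: Dict[str, Any] = {"type": "dimension", "ordering": "lexicographic"}
    if previous_stop is not None:
        # Assumed exclusive: results start after this value.
        metric["previousStop"] = previous_stop
    query = dict(base_query)
    query["metric"] = metric
    query["threshold"] = page_size
    return query


def paginate(base_query: Dict[str, Any], page_size: int,
             run_query: Callable[[Dict[str, Any]], List[Dict[str, Any]]]
             ) -> Iterator[Dict[str, Any]]:
    """Yield every row, one lexicographic page at a time."""
    dim = base_query["dimension"]  # assumed to be a plain string
    previous_stop: Optional[str] = None
    while True:
        rows = run_query(lexicographic_page_query(base_query, page_size, previous_stop))
        yield from rows
        if len(rows) < page_size:
            return  # a short page means there is nothing left
        previous_stop = rows[-1][dim]
```

If `previousStop` turns out to be inclusive in your version, the first row of each later page would repeat and should be dropped.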

Thanks Charles…

I will analyze our data and take your advice into consideration.

Hi,

How trustworthy are the results from the current Pivot application, which uses multiple topN queries?

Can we treat the result of a multi-dimension query built from topNs as trustworthy when each dimension has low cardinality (fewer than 1000 values), with a few metrics and a limit of 5-10 per topN?

Regards,

Chanh

A topN on a dimension with cardinality below 1000 will always be correct, whatever the limit.