Here’s a use case: in our UI, we’d like to populate a drop-down with the list of all the values contained in a particular dimension, so that the user can pick them in order to set filters. The values for these dimensions are not a fixed set, but rather something that can vary based on client and can grow over time. Is there a way to query Druid for this information? Basically like a SELECT DISTINCT query.
I can see that a TopN query is close to meeting this use case. But in a perfect world I’d want to know ALL the distinct dimensional values, not just N. (TopN may well be “good enough” if we just choose a large enough N… but in some cases we might have tens of thousands of possible distinct values in a filterable dimension.)
PS - We’re still on 0.8.2, though planning to upgrade soon.
The most efficient way to get all the distinct values is topNs with a lexicographic sort, and pull out all the values in pages (using previousStop).
Alternatively you could do something like what Pivot (https://github.com/implydata/pivot) does: it uses a limited topN to get a short list of values to show as a list, then offers search/auto-complete by doing filtered topNs on whatever a user has typed so far into a search box. This approach scales to dimensions with millions or more of distinct values, where pulling the entire list could take too long.
(resurrect old thread)
We have a different use case (not populating a select box in UI) that requires us to get all distinct values on a dimension. And in this case the number of distinct values can be around a million for our target date range.
So we can try the strategy of paged topNs using previousStop. I’m wondering, since we actually don’t care about the metric in this case - we’d just throw it away - is there an optimal aggregation to use? I’m using “count” but is that the best choice?
In 0.9.2 aggregations aren’t mandatory so just don’t provide them. Since your post, we also added groupBy v2 which can get all distinct values in a single query, without pagination. I’m not sure which one will be faster (that vs paginated topN) so give both a shot.
Thanks. Too bad we’re on 0.9.1.1! (Planning to roll forward in a couple of months but we’re heading towards product launch so we can’t do it sooner.)
FYI, using a nonpaginated TopN under 0.9.1.1 to retrieve ~1,100,000 distinct values is taking just over 2 minutes. We’ve found that we can get that in under 10 seconds using Amazon Athena. (not knocking Druid here - this kind of thing is obviously not in Druid’s sweet spot and our cluster is admittedly underpowered… but I’ll be curious to see how much it improves in later versions)
Try groupBy v2 once you upgrade, it should be better than topN for result sets that are that large.