Ability to do GroupBy aggregations without GroupBy field as dimension

Hi,

I have scoured the web for examples of Druid setups like this but have not had any luck. I wanted to make a post before ruling out Druid as a possible solution.

I have a large number of records in this format:

Timestamp, UserId, Field A

I want to know various percentiles and averages of the number of records per user. I would also like to know how many users have more than X records in a day. I assume that with 10-50 million users in a given day, there are too many to store UserId as a dimension. I’m also interested in the median or average number of unique values of Field A per user. Again, there are too many unique values of Field A to be stored as a dimension. Does Druid support this kind of group by before aggregation, or would I have to provide Druid with data that is already aggregated by user id? I am looking for day granularity.

Thanks for the help.

John

Hey John,

Check out the approximate histogram extension; it’s intended for approximate histograms and quantiles: http://druid.io/docs/latest/development/extensions-core/approximate-histograms.html

Gian,

I am aware of the approximate histogram extension, and it is what I was hoping to leverage. But first I need a way for Druid to group by user id at ingest without actually storing the user ids, so I can figure out the count of records for each user.

I am asking whether Druid has a way to group by a field before creating the approximate histogram.

Thanks for your help,

John

Ah, I think I see what you’re asking. I think you’ll need to run a by-user aggregation job first: group by user (and day), count the number of records, and collect the unique values of Field A. Then you can index the output of that job into Druid.
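
To make that suggestion concrete, here is a minimal sketch of such a by-user aggregation job in plain Python. It is only an illustration under some assumptions: the input is taken to be (timestamp, user_id, field_a) tuples with the timestamp in epoch seconds (UTC), and the names `aggregate_by_user`, `summarize`, `record_count`, and `unique_field_a` are hypothetical, not part of Druid. At 10-50 million users per day a real job would more likely run in a distributed framework, but the logic would be the same.

```python
from collections import defaultdict
from datetime import datetime, timezone
from statistics import mean, median


def aggregate_by_user(records):
    """Collapse raw (timestamp, user_id, field_a) rows to one row per (day, user).

    The output rows deliberately omit user_id, so it never has to be
    stored as a dimension downstream.
    """
    counts = defaultdict(int)    # (day, user_id) -> number of records
    uniques = defaultdict(set)   # (day, user_id) -> distinct Field A values
    for ts, user_id, field_a in records:
        # Assumption: ts is epoch seconds; bucket to a UTC calendar day.
        day = datetime.fromtimestamp(ts, tz=timezone.utc).date().isoformat()
        counts[(day, user_id)] += 1
        uniques[(day, user_id)].add(field_a)
    return [
        {
            "day": day,
            "record_count": counts[(day, user_id)],
            "unique_field_a": len(uniques[(day, user_id)]),
        }
        for (day, user_id) in sorted(counts)
    ]


def summarize(rows, day, threshold):
    """Exact per-day statistics over the pre-aggregated rows (for checking)."""
    per_user = [r["record_count"] for r in rows if r["day"] == day]
    return {
        "users": len(per_user),
        "mean_records": mean(per_user),
        "median_records": median(per_user),
        "users_over_threshold": sum(c > threshold for c in per_user),
    }
```

Each output row represents one user on one day, so a histogram or quantile over `record_count` or `unique_field_a` gives the per-user distributions asked about, and a count of rows with `record_count` above X gives the number of users over that threshold. `summarize` computes those figures exactly on a small sample, which can be handy for sanity-checking approximate results.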