Query to get count of users that have two devices.

I have sample data as follows

user device some metric another metric

A ABC 5 8

A ABC 6 10

A DEF 5 7

A DEF 7 3

B GHI 5 8

B GHI 7 6

C JKL 5 8

C MNO 3 2

C PQR 5 8

D STU 2 7

D STU 2 0

D VWX 3 4

I am trying to create a query the total count of users that have only 2 devices and the device count.

I want to get the following result.

{

“version” : “v1”,

“timestamp” : “2016-02-16T00:00:00.000Z”,

“event” : {

“user_count” : 2,

“device_count” : 4

}

I tried using group by queries, but I get a JSON object for each user. Any pointers would be helpful. Thanks!

I have bad news and good news.

The bad news: the solution I can offer is pretty clunky and inelegant.

The good news: there is work being done on multiple fronts to make this a lot better soon.

The (current) solution:

Manually run two queries, one to get the list of users and another to filter on that list and get the totals.

Query 1 ideally should be this (in PlyQL):

select user, count(distinct device) as devices_count from data group by user having devices_count > 1.5 and devices_count < 2.5

notice the ±0.5 error bound on the device_count this is to account for having approximate counts there that may not be integers.

BUT THAT WONT WORK! there is a bug in Druid where you can not have a having filter on the result of a HyperLogLog

To get around that you could do this (in PlyQL or equivalently in Druid):

select user, count(distinct device)-2 as devices_count_above_2 from data group by user having devices_count_above_2 > -0.5 and devices_count_above_2 < 0.5

No that you have a list of users the second query is straight forward.

This approach is still not optimal. You have to run two queries and you have to materialize what might be a huge list of users.

There are 3 areas which are getting fixed right now:

  1. The having on HLL bug is being worked on.

  2. There is talk of adding support for Bloom Filter based filters which would prevent you from having to materialize a huge list of users

  3. I am working on a planner feature in Plywood/PlyQL to allow you to use a query as a filter input to another query.

Best regards,

Vadim

Vadim,

Thanks for the response! Unfortunately it looks like we will have to issue two queries like you stated. Bloom filter is interesting and hopefully will make queries like this much easier.

Will