Can Druid aggregate on strings (dimensions)?

Hi, I have a question about a Druid query. My datasource structure is as follows:

```
clientMac          username  osType   traffic
00:19:F8:10:00:01  user0     Window   10
00:19:F8:10:00:01  user1     Linux    20
00:19:F8:10:00:02  user2     Android  30
00:19:F8:10:00:03  user3     ios      40
```

So when I create the indexes, the dimensions are [clientMac, username, osType] and the metric is traffic.

And what I want to do is to group the traffic by clientMac:

```
{
  "queryType": "groupBy",
  "dataSource": "binnedSessions",
  "granularity": "all",
  "dimensions": ["clientMac"],
  "aggregations": [
    {
      "type": "count",
      "name": "totalCount"
    },
    {
      "type": "doubleSum",
      "name": "traffic",
      "fieldName": "traffic"
    }
  ],
  "intervals": [
    "2015-11-01T00:00:00.000/2015-12-20T23:00:00.000"
  ]
}
```
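For anyone following along, a native groupBy like this is just JSON POSTed to the broker's `/druid/v2/` endpoint. A minimal Python sketch (the broker address in the commented call is a placeholder for your own deployment):

```python
import json
import urllib.request

# The native groupBy query from above, as a Python dict.
query = {
    "queryType": "groupBy",
    "dataSource": "binnedSessions",
    "granularity": "all",
    "dimensions": ["clientMac"],
    "aggregations": [
        {"type": "count", "name": "totalCount"},
        {"type": "doubleSum", "name": "traffic", "fieldName": "traffic"},
    ],
    "intervals": ["2015-11-01T00:00:00.000/2015-12-20T23:00:00.000"],
}

def post_query(broker_url, q):
    """POST a native Druid query to the broker and return the parsed JSON rows."""
    req = urllib.request.Request(
        broker_url + "/druid/v2/",
        data=json.dumps(q).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# rows = post_query("http://localhost:8082", query)  # hypothetical broker address
```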

I can get:

```
[
  {
    "version": "v1",
    "timestamp": "2015-11-01T00:00:00.000Z",
    "event": {
      "clientMac": "00:19:F8:10:00:01",
      "totalCount": 2,
      "traffic": 30
    }
  },
  {
    "version": "v1",
    "timestamp": "2015-11-01T00:00:00.000Z",
    "event": {
      "clientMac": "00:19:F8:10:00:02",
      "totalCount": 1,
      "traffic": 30
    }
  },
  {
    "version": "v1",
    "timestamp": "2015-11-01T00:00:00.000Z",
    "event": {
      "clientMac": "00:19:F8:10:00:03",
      "totalCount": 1,
      "traffic": 40
    }
  }
]
```

But how can I also have the "username" and "osType" dimensions inside the query results:

```
[
  {
    "version": "v1",
    "timestamp": "2015-11-01T00:00:00.000Z",
    "event": {
      "clientMac": "00:19:F8:10:00:01",
      "totalCount": 2,
      "username": ["user0", "user1"],
      "osType": ["Window", "Linux"],
      "traffic": 30
    }
  },
  {
    "version": "v1",
    "timestamp": "2015-11-01T00:00:00.000Z",
    "event": {
      "clientMac": "00:19:F8:10:00:02",
      "totalCount": 1,
      "username": ["user2"],
      "osType": ["Android"],
      "traffic": 30
    }
  },
  {
    "version": "v1",
    "timestamp": "2015-11-01T00:00:00.000Z",
    "event": {
      "clientMac": "00:19:F8:10:00:03",
      "totalCount": 1,
      "username": ["user3"],
      "osType": ["ios"],
      "traffic": 40
    }
  }
]
```
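Purely to illustrate the desired semantics, here is how that grouping would look computed client-side over the four sample rows (ordinary Python over the hypothetical data from the table above, not a Druid feature):

```python
from collections import defaultdict

# The four sample rows from the datasource table above.
rows = [
    ("00:19:F8:10:00:01", "user0", "Window", 10),
    ("00:19:F8:10:00:01", "user1", "Linux", 20),
    ("00:19:F8:10:00:02", "user2", "Android", 30),
    ("00:19:F8:10:00:03", "user3", "ios", 40),
]

# Group by clientMac, collecting distinct usernames/osTypes and summing traffic.
groups = defaultdict(lambda: {"totalCount": 0, "username": [], "osType": [], "traffic": 0})
for mac, user, os_type, traffic in rows:
    g = groups[mac]
    g["totalCount"] += 1
    if user not in g["username"]:
        g["username"].append(user)
    if os_type not in g["osType"]:
        g["osType"].append(os_type)
    g["traffic"] += traffic

# groups["00:19:F8:10:00:01"] == {"totalCount": 2, "username": ["user0", "user1"],
#                                 "osType": ["Window", "Linux"], "traffic": 30}
```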

Hi Tony,

What you are trying to do totally makes sense but unfortunately I doubt that you would be able to do it in a single Druid query any time soon. (Please, Druid peeps, prove me wrong!)

The query you want could be expressed as SQL as follows:

```
SELECT
  clientMac,
  MAKE_SET(username) AS 'username',
  MAKE_SET(osType) AS 'osType',
  COUNT(*) AS totalCount,
  SUM(traffic) AS traffic
FROM binnedSessions
WHERE '2015-11-01T00:00:00' <= __time AND __time < '2015-12-20T23:00:00.000'
```

This is not possible to do in Druid as a single query because Druid does not support a MAKE_SET aggregation. It probably won't in the near future, since aggregations are heavily assumed to be numeric.

There is a library called Plywood ( https://github.com/implydata/plywood ) that acts as a query planner for Druid; among other things, it allows you to issue SQL-like queries using a language called PlyQL.

Bad news is that Plywood/PlyQL do not support MAKE_SET either (but should pretty soon).

Good news is that you can get the data out that you need and maybe even create queries that are more useful to you.

The following query will work and will give you all the information that you need:

```
SELECT
  clientMac,
  (SELECT username AS 'val' GROUP BY username) AS 'username',
  (SELECT osType AS 'val' GROUP BY osType) AS 'osType',
  COUNT(*) AS totalCount,
  SUM(traffic) AS traffic
FROM binnedSessions
WHERE '2015-11-01T00:00:00' <= __time AND __time < '2015-12-20T23:00:00.000'
```

The only difference is that instead of returning Sets like

```
"username": ["user2"]
```

it will return Datasets like:

```
"username": [{ "val": "user2" }]
```

But you should be able to figure it out.
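If the Dataset shape is inconvenient, flattening it back to a plain list is a one-liner in whatever language consumes the result; for example, in Python:

```python
def flatten_dataset(dataset_rows):
    """Turn PlyQL Dataset rows like [{"val": "user2"}] into a plain list ["user2"]."""
    return [row["val"] for row in dataset_rows]

# flatten_dataset([{"val": "user0"}, {"val": "user1"}]) -> ["user0", "user1"]
```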

The added bonus is that you can add sorting and limiting in the nested queries like so:

```
SELECT
  clientMac,
  (SELECT username AS 'val', SUM(traffic) AS 't' GROUP BY username ORDER BY t DESC LIMIT 100) AS 'username',
  (SELECT osType AS 'val', SUM(traffic) AS 't' GROUP BY osType ORDER BY t DESC LIMIT 100) AS 'osType',
  COUNT(*) AS totalCount,
  SUM(traffic) AS traffic
FROM binnedSessions
WHERE '2015-11-01T00:00:00' <= __time AND __time < '2015-12-20T23:00:00.000'
```

And you can nest these queries as much as your heart desires.

If you want to try this, grab the PlyQL command line tool here: https://github.com/implydata/plyql and try pasting in the query above.

If you have any questions or suggestions for other queries you want to do, we would love to hear them.

Please head over to:

https://groups.google.com/forum/#!forum/imply-user-group

where we provide Plywood / PlyQL support.

Vadim is correct in that a “MAKE_SET from dimension” kind of functionality is not currently available in stock Druid. That would be a cool feature request though.

You CAN do multiple queries where the first query gets the list of mac ids and the second sequence of queries gets the sub-features with a filter of mac-id = whatever.

It is certainly clunky to do it in such a manner but can be accomplished with stock druid.
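A sketch of that two-step approach in Python (the helper name is hypothetical, but the "or"-of-"selector" filter it builds is standard stock Druid):

```python
def build_followup_query(macs):
    """Given the clientMac values returned by the first groupBy, build the
    second groupBy that fetches username/osType for just those macs, using
    a stock Druid "or" of "selector" filters."""
    return {
        "queryType": "groupBy",
        "dataSource": "binnedSessions",
        "granularity": "all",
        "dimensions": ["clientMac", "username", "osType"],
        "filter": {
            "type": "or",
            "fields": [
                {"type": "selector", "dimension": "clientMac", "value": m}
                for m in macs
            ],
        },
        "aggregations": [
            {"type": "doubleSum", "name": "traffic", "fieldName": "traffic"}
        ],
        "intervals": ["2015-11-01T00:00:00.000/2015-12-20T23:00:00.000"],
    }

# After the first query returns the mac list, the second query would be:
# q2 = build_followup_query(["00:19:F8:10:00:01", "00:19:F8:10:00:02"])
```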

Would you mind referencing this topic in a feature request submission to https://github.com/druid-io/druid/issues ?

Hi Vadim,

Thank you very much; your reply is quite helpful. I have now solved my problem by using two successive queries, just as Charles suggested: the first groups by "clientMac", and the second queries the other dimension values, filtered by the clientMac list from the first query. I also tried PlyQL. It's really cool. By the way, you missed the GROUP BY clientMac in your SQL script, so it printed:

```
There was an error getting the data: could not resolve $clientMac
```

The correct one is:

```
plyql -h ip -q "
SELECT
  clientMac,
  (SELECT username AS 'val', SUM(traffic_r) AS 't' GROUP BY username ORDER BY t DESC LIMIT 100) AS 'username',
  (SELECT osType AS 'val', SUM(traffic_r) AS 't' GROUP BY osType ORDER BY t DESC LIMIT 100) AS 'osType',
  COUNT(*) AS totalCount,
  SUM(traffic_r) AS traffic
FROM binnedSessions
WHERE '2015-11-01T00:00:00' <= __time AND __time < '2015-12-20T23:00:00.000'
GROUP BY clientMac
LIMIT 5
"
```

Hi Charles,

Thanks a lot. I solved my problem based on your suggestion. Aggregation on dimension strings would be a cool feature; I hope Druid will have it in the near future. Thank you for your efforts on the powerful Druid.

No problem, I will create a feature request issue on GitHub.