Way to index and query nested dimensions

Hi

I am new to Druid. Here is what I am trying to solve:

Input event:

event 1 ->

data: {
  "key1": "value1",
  "key2": "value2",
  "pv": 1,
  "res": {
    "img": {
      "nm": "img1",
      "p1": 123,
      "p2": 456
    },
    "xhrs": [
      {
        "nm": "xhrs1",
        "s1": 345,
        "s2": 567
      },
      {
        "nm": "xhrs2",
        "s1": 234,
        "s2": 673
      }
    ]
  }
}

event 2 ->

data: {
  "key1": "value1",
  "key2": "value3",
  "pv": 1,
  "res": {
    "src": {
      "nm": "src1",
      "p1": 222,
      "p2": 333
    },
    "xhrs": [
      {
        "nm": "xhrs11",
        "s1": 444,
        "s2": 555
      },
      {
        "nm": "xhrs2",
        "s1": 666,
        "s2": 777
      },
      {
        "nm": "xhrs3",
        "s1": 888,
        "s2": 999
      }
    ]
  }
}

Note: xhrs, nm, src, img, css, key1, key2 are dimensions, and pv, s1, s2, p1, p2 are metrics.

What we want to query:

Get the nm values and the metrics corresponding to xhrs, given a filter criterion (say key1 = value1).

The result should include:

{
  "nm": "xhrs1",
  "s1": 345,
  "s2": 567
},
{
  "nm": "xhrs2",
  "s1": 234,
  "s2": 673
},
{
  "nm": "xhrs11",
  "s1": 444,
  "s2": 555
},
{
  "nm": "xhrs2",
  "s1": 666,
  "s2": 777
},
{
  "nm": "xhrs3",
  "s1": 888,
  "s2": 999
}

How do I achieve this?

Below are the options that I could think of:

Option 1 - flatten everything out and index it as below (see the ingestion sketch after the pros and cons):

event1 -> {"key1": "value1", "key2": "value2", "pv": 1, "res_img_nm": "img1", "res_img_p1": 123, "res_img_p2": 456, "res_xhrs_1_nm": "xhrs1", "res_xhrs_1_s1": 345, "res_xhrs_1_s2": 567, "res_xhrs_2_nm": "xhrs2", "res_xhrs_2_s1": 234, "res_xhrs_2_s2": 673}

event2 -> {"key1": "value1", "key2": "value3", "pv": 1, "res_src_nm": "src1", "res_src_p1": 222, "res_src_p2": 333, "res_xhrs_1_nm": "xhrs11", "res_xhrs_1_s1": 444, "res_xhrs_1_s2": 555, "res_xhrs_2_nm": "xhrs2", "res_xhrs_2_s1": 666, "res_xhrs_2_s2": 777, "res_xhrs_3_nm": "xhrs3", "res_xhrs_3_s1": 888, "res_xhrs_3_s2": 999}

Pros:

No duplicate data is indexed.

Cons:

The client has to parse the query result, figure out which dimension names belong to xhrs, and reassemble the corresponding names and metric values.
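If we went with this option, one way to avoid flattening the events ourselves might be Druid's JSON flattenSpec at ingestion time. A rough sketch, using the column names above (only a few fields shown); the catch is that array entries have to be addressed by a fixed index, so the number of fields must be decided up front:

"flattenSpec": {
  "useFieldDiscovery": true,
  "fields": [
    { "type": "path", "name": "res_img_nm", "expr": "$.res.img.nm" },
    { "type": "path", "name": "res_img_p1", "expr": "$.res.img.p1" },
    { "type": "path", "name": "res_xhrs_1_nm", "expr": "$.res.xhrs[0].nm" },
    { "type": "path", "name": "res_xhrs_1_s1", "expr": "$.res.xhrs[0].s1" },
    { "type": "path", "name": "res_xhrs_2_nm", "expr": "$.res.xhrs[1].nm" },
    { "type": "path", "name": "res_xhrs_2_s1", "expr": "$.res.xhrs[1].s1" }
  ]
}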

Option 2 - index multiple entries for every event, as below:

event 1 ->

{"key1": "value1", "key2": "value2", "pv": 1, "res_img_nm": "img1", "res_img_p1": 123, "res_img_p2": 456}

{"key1": "value1", "key2": "value2", "res_xhrs_nm": "xhrs1", "res_xhrs_s1": 345, "res_xhrs_s2": 567}

{"key1": "value1", "key2": "value2", "res_xhrs_nm": "xhrs2", "res_xhrs_s1": 234, "res_xhrs_s2": 673}

event 2 ->

{"key1": "value1", "key2": "value3", "pv": 1, "res_src_nm": "src1", "res_src_p1": 222, "res_src_p2": 333}

{"key1": "value1", "key2": "value3", "res_xhrs_nm": "xhrs11", "res_xhrs_s1": 444, "res_xhrs_s2": 555}

{"key1": "value1", "key2": "value3", "res_xhrs_nm": "xhrs2", "res_xhrs_s1": 666, "res_xhrs_s2": 777}

{"key1": "value1", "key2": "value3", "res_xhrs_nm": "xhrs3", "res_xhrs_s1": 888, "res_xhrs_s2": 999}

Pros:

It is straightforward to get the list of xhrs names and metrics given any filter criteria.

Cons:

Duplicate data is indexed multiple times.

Can you tell me if there is a better way of achieving what I want?

If not, which of the two options above would you suggest, and why?

Thanks

Shobana

Hey Shobana,

Do you need to do anything with this xhrs data other than retrieve it for rows matching specific filters (i.e. will you be using a Select query for this and nothing else)? Or do you also need to group on them, or filter on them, or aggregate them?

Option 1 sounds fine to me. Option 2 doesn’t sound as good, since breaking up the events into multiple rows will make it difficult to do event oriented aggregations. Another couple of options to consider:

  1. If all you need is to be able to retrieve the values, not group/filter/aggregate, the easiest thing is to store them encoded in a single field called xhrs. With JSON encoding this would look like "xhrs" : "[{"nm":"xhrs11","s1":444,"s2":555},{"nm":"xhrs2","s1":666,"s2":777}]". Then when you select it out, decode it on the client side (see the Select query sketch after this list).

  2. Or you can use a multi-value dimension called xhrs, where each value is encoded as CSV, like "nm:xhrs11,s1:444,s2:555". So for example "xhrs" : ["nm:xhrs11,s1:444,s2:555", "nm:xhrs2,s1:666,s2:777"]. This can make it easier to filter on the values, if you need to.
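If you go with the encoded-single-field route, pulling it back out for rows matching a filter could look roughly like the Select query below. This is just a sketch: the datasource name, interval, and column list are placeholders, and "xhrs" here is the JSON-encoded dimension from option 1 above.

{
  "queryType": "select",
  "dataSource": "events",
  "intervals": ["2016-01-01/2016-02-01"],
  "granularity": "all",
  "filter": { "type": "selector", "dimension": "key1", "value": "value1" },
  "dimensions": ["key1", "key2", "xhrs"],
  "metrics": ["pv"],
  "pagingSpec": { "pagingIdentifiers": {}, "threshold": 100 }
}

Each returned event would carry the encoded xhrs string, which you would then decode on the client.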

Hi Gian

Thanks for your response.

We would need aggregation and group-by support on the "nm" of xhrs.

In the SQL world, it would translate to:

"select xhrs.nm, sum(xhrs.s1) from table where key1=value1 group by xhrs.nm;"

or "select distinct(xhrs.nm) from table where key1=value1;"

Is there a way to achieve this directly instead of decoding it on the client side and parsing to get the result?

Thanks

Shobana

I see. For anything you want to group on or aggregate, like xhrs.nm or xhrs.s1, you generally want “real” columns. Your data model doesn’t totally map cleanly onto Druid’s column types (you have a nested data model and Druid’s is flat) so this may require a bit of backing up and looking at the big picture of what you’re trying to achieve. One other possibility to consider is ingesting the data into two tables, one table with a row per event and one table with a row per xhrs entry. Then you can use the former table for event oriented queries and the latter table for xhrs oriented queries.
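To make that concrete, your first SQL statement could then run directly against the per-xhrs table as a native groupBy. A rough sketch, assuming that table is named xhrs_rows, uses flattened column names like your Option 2, and ingests s1 with a longSum aggregator (all of these names are placeholders):

{
  "queryType": "groupBy",
  "dataSource": "xhrs_rows",
  "granularity": "all",
  "intervals": ["2016-01-01/2016-02-01"],
  "dimensions": ["res_xhrs_nm"],
  "filter": { "type": "selector", "dimension": "key1", "value": "value1" },
  "aggregations": [
    { "type": "longSum", "name": "s1_total", "fieldName": "res_xhrs_s1" }
  ]
}

The "select distinct(xhrs.nm)" variant would be the same groupBy with the aggregations list left empty.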

Hi Gian

2 tables is a good option.

But here are the downsides of going with 2 tables:

The example I shared has only one sample dimension (key1), but the real scenario has many dimensions, and most of them also apply to xhrs. Wouldn't this mean duplicating that dimension data in the second table as well?

Also, we have 5+ nested dimensions like xhrs. With 2 tables, the second table could get 10 rows for every entry in the first table, and about 15 dimensions would be duplicated in the second table for every entry from the first table.

If we go with option #3 suggested earlier (storing the encoded value of xhrs and similar nested dimensions), the downside is that the cardinality of these dimensions would be huge. Wouldn't that hit the performance of regular queries as well? Also, decoding millions of records on the client side and then computing the metrics from them for each query request could hurt performance.

Considering the downsides of both, I think 2 tables would be the better bet. The storage cost would be higher, but that would be the case with option #3 as well (with high cardinality and every metric plus dimension detail stored as a single string). 2 tables at least brings down the complexity and time for querying.

Any thoughts?