Dynamic indexing for flattening JSON spec

Hello,
I was trying to flatten the nested json objects coming in the druid cluster. I have a nested json object of the type.

{

“abc” : “def” ,

“xyz”: [

{

“id” : 123,

“pqr”: “ghi”
},

{

“id” : 456,

“pqr”: “jkl”
}

]

}

Now the issue here is i have a json array “xyz” which many have any number of objects which i don’t know initially. Is there any possible method to create automatic indexing while flattening the nested json such that if xyz[0].id is mapped to xyz_0_id then if in future a third object comes then that xyz[2].id is automatically mapped to xyz_2_id. Is there any alternate method to achieve the intended result somehow as the number of objects present in the object array is not known beforehand and may vary from time to time.

Thanks for your help,

Shubham

here can i somehow use javascript parse spec such that i run a loop over the xyz array till it’s length and then flatten all the json which come dynamically and then ingest it.

Hi Shubham,

Yeah, this is definitely not supported with flattening via the flattenSpec as described here http://druid.io/docs/latest/ingestion/flatten-json. If you know the upper limit of occurrences, a poor workaround would be to add a json path expression for each ‘$.xyz[n].id’ mapping to ‘xyz_n_id’ up to that limit and have null values for rows which do not contain these values.

I’ve been investigating enhancing the flattening abilities of Druid ingestion when dealing with nested lists to be able to ‘explode’ out one row with nested list data into multiple rows, distributing list elements one per row to flatten, but this type of flattening lists into a single row also seems potentially useful so I’ll take it into consideration. I unfortunately don’t have a timeline on doing this work, it’s just something I’d like to take a look at in the near future, so I can’t make any promises as to when such functionality will exist.

As for the javascript parse spec (which I forgot even exists!) I looked through the code and it would appear that if you can write a function that will take a json string, parse the json string into an object inside of this function, and then transform that object to your desired flattened output map, it should work. Disclaimer: I haven’t used this functionality personally so I’m unsure if this will actually work, and even if it does I would expect some amount of performance hit for the extra overhead of running a javascript function on every row. For further information, the details of using javascript parse spec is described here http://druid.io/docs/latest/ingestion/data-formats.html#javascript and general usage of javascript with Druid here http://druid.io/docs/latest/development/javascript.html.

Good luck!

Clint

Hi Clint,

I was considering your workaround, “a poor workaround would be to add a json path expression for each ‘$.xyz[n].id’ mapping to ‘xyz_n_id’ up to that limit and have null values for rows which do not contain these values.” ;

Is this doable with druid at the ingestion step?

“path” or “jq” expressions can be given with json flatten spec in order to access a specific field in a json object. But is it possible to map these values and use dynamically constructed column names like xyz_0_id, xyz_1_pqr etc. with similar approach?

Hi Can,

I do not think there exists a mechanism today to dynamically create column names like that, I think it would have to be explicitly defined manually for each item of the array to extract, which is why I suggested it was a “poor” workaround. I’ll poke around and update if I find otherwise though.

Hi Shubham,

I am also facing the same issue with kafka json packets.Can you share me the javascript code ,whatever you are used for parsing the dynamic nested json sample like below and then ingest,

“xyz”: [

{

“id” : 123,

“pqr”: “ghi”
},

{

“id” : 456,

“pqr”: “jkl”
}

]

Prasanna.P

Similar issue, but the object to flatten is map<string, int> which size of paris varied and new pair will be added in future. Finally, I use the “poor” solution in ingest spec,

{

“type”: “path”,

“name”: “abtest_test1”,

“expr”: “$.abtest.test1”

},

{

“type”: “path”,

“name”: “abtest_test2”,

“expr”: “$.abtest.test2”

},

It’s somehow ugly even though it works.

在 2018年10月22日星期一 UTC+8下午1:49:11,Shubham Goel写道: