[druid-user] Ingesting Arrays

Good day Druid users! I am trying to figure out how to ingest data that has an array in its structure. My data looks like this:
{
  "mother_first_name": "Jane",
  "mother_last_name": "Doe",
  "mother_birthdate": "1-1-1975",
  "dependents": [
    {
      "dependent_first_name": "Child 1",
      "dependent_last_name": "Doe",
      "dependent_birthdate": "3-3-2015"
    },
    {
      "dependent_first_name": "Child 2",
      "dependent_last_name": "Doe",
      "dependent_birthdate": "4-4-2017"
    }
  ]
}

My flatten spec looks like:

"parseSpec": {
  "format": "avro",
  "flattenSpec": {
    "useFieldDiscovery": true,
    "fields": [
      {
        "name": "dependent_first_name",
        "type": "path",
        "expr": "$.dependents[0].dependent_first_name"
      },
      {
        "name": "dependent_last_name",
        "type": "path",
        "expr": "$.dependents[0].dependent_last_name"
      },
      {
        "name": "dependent_birthdate",
        "type": "path",
        "expr": "$.dependents[0].dependent_birthdate"
      }
    ]
  }
}

With [0] in the expression it pulls the first item in the array, and with [1] it pulls the second item (as expected). With the array left empty, the ingestion errors out. The array can have zero or more elements in it.

Hoping someone can help me set up ingestion of arrays.

Cheers,
Donovan

Hm, I think you might have issues here, though I bow to people who know this better than I do (!!!), as I believe Druid will want to treat one input row as one output row. Tricky one. I will make a note to see if we can get some docs around your kind of example. Also, have you already asked the question in ASF Slack or the Druid Forum? More eyeballs :slight_smile:

Hey Donovan,
Is your goal to have multiple output rows from each input row (e.g., if the array has 10 elements, are you looking to produce 10 output rows)?
If that’s the case, there’s no current way to achieve that in Druid, unfortunately. You’ll need to pre-process the data and “explode” the array into multiple rows prior to ingesting it into Druid (see this thread and this git PR).
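As a rough illustration of the "explode" pre-processing step, here is a minimal Python sketch that turns one input record into one flat row per array element. The function name `explode_dependents` and the choice to keep a parent with no dependents as a single row are assumptions made for this example, not something Druid or this thread prescribes.

```python
import json


def explode_dependents(record, array_field="dependents"):
    """Yield one flat row per element of record[array_field].

    Hypothetical helper for illustration; not part of Druid.
    """
    # Copy every top-level field except the array itself.
    parent = {k: v for k, v in record.items() if k != array_field}
    dependents = record.get(array_field) or []
    if not dependents:
        # Assumption: keep the parent as a single row when the array is empty.
        yield dict(parent)
        return
    for dependent in dependents:
        row = dict(parent)
        row.update(dependent)  # merge the element's fields into the parent row
        yield row


record = {
    "mother_first_name": "Jane",
    "mother_last_name": "Doe",
    "mother_birthdate": "1-1-1975",
    "dependents": [
        {
            "dependent_first_name": "Child 1",
            "dependent_last_name": "Doe",
            "dependent_birthdate": "3-3-2015",
        },
        {
            "dependent_first_name": "Child 2",
            "dependent_last_name": "Doe",
            "dependent_birthdate": "4-4-2017",
        },
    ],
}

rows = list(explode_dependents(record))
for row in rows:
    print(json.dumps(row))  # one JSON line per dependent, ready for ingestion
```

Each output line repeats the mother's fields alongside a single dependent's fields, so the flattened rows no longer need an array index in the flatten spec.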

Thanks,
Itai

Thanks for the reply. At the time we didn’t know what outcome to expect: one row with extra columns, or multiple rows as you suggested. After some further discussion we found that the data within the array isn’t all that valuable to have in our datasource for analytics. A count of the array items is more valuable to us than their contents, and we have adjusted our ingestion to capture that. If we need data from within the array in the future, we now know to do some pre-processing.
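The thread does not say how the count was produced. One possible way, sketched here in Python as a pre-processing step, is to replace the array with its length before ingestion. The helper name `with_dependent_count` and the output field `dependent_count` are hypothetical names chosen for this example.

```python
def with_dependent_count(record, array_field="dependents",
                         count_field="dependent_count"):
    """Return a flat copy of record with the array replaced by its length.

    Hypothetical helper for illustration; not part of Druid.
    """
    row = {k: v for k, v in record.items() if k != array_field}
    # A missing or null array is counted as zero elements.
    row[count_field] = len(record.get(array_field) or [])
    return row


sample = {
    "mother_first_name": "Jane",
    "mother_last_name": "Doe",
    "dependents": [
        {"dependent_first_name": "Child 1"},
        {"dependent_first_name": "Child 2"},
    ],
}

flat = with_dependent_count(sample)
print(flat)
```

This keeps one output row per input row, which matches how Druid treats input rows, and it handles records where the array is empty or absent.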

Thanks for the insight

Good, glad I was able to help (or at least clarify things) :slightly_smiling_face: