Batch Data Ingestion for Nested JSON Structures

Hi,

Every row in my dump file looks like the following one; in other words, it contains nested structures.

{"eType":"tp1","device":"0000-0000-0000-0000-0000","cAt":1490054400111,"dev":{"id":"0000-0000-0000-0000-0000","base":{"aKey":"1111-1111-1111-1111-1111","debug":false}}}
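(For readability, here is that same row parsed with Python's json module, just to show how deep "aKey" and "debug" sit in the structure:)

```python
import json

# The sample row from above, quotes normalized to plain ASCII:
row = ('{"eType":"tp1","device":"0000-0000-0000-0000-0000",'
       '"cAt":1490054400111,"dev":{"id":"0000-0000-0000-0000-0000",'
       '"base":{"aKey":"1111-1111-1111-1111-1111","debug":false}}}')

parsed = json.loads(row)

print(parsed["eType"])                # top-level column
print(parsed["dev"]["base"]["aKey"])  # nested two levels down
print(parsed["dev"]["base"]["debug"])
```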

I have managed to perform Hadoop batch ingestions from both
compressed and uncompressed JSON files for any top-level column.
For example, I used the following setting to let Druid accept
(store) only the columns "eType", "device", and "cAt".

{
  "parser": {
    "type": "hadoopyString",
    "parseSpec": {
      "format": "json",
      "dimensionsSpec": {
        "dimensions": [
          "eType"
        ]
      },
      "timestampSpec": {
        "format": "auto",
        "column": "cAt"
      }
    }
  },
  "metricsSpec": [
    {
      "type": "count",
      "name": "c"
    },
    {
      "type": "hyperUnique",
      "name": "device_id_count",
      "fieldName": "device"
    }
  ]
}

My question is: can Druid flatten nested structures (e.g. extract
column "aKey") on its own? If yes, how? If no, does this mean that
I have to perform some sort of pre-processing to flatten my
original dump files (i.e. remove the nested structure)?
Does this hold for Parquet input files too?

In other words, does Druid expect JSON rows to be flattened? If
not, how can I extract and store column "aKey" or column "debug"
from the above example?

Thank you!

Does anybody know?

Yes, it can. Please see

http://druid.io/docs/latest/ingestion/flatten-json.html

Something like below should work.

"flattenSpec": {
  "useFieldDiscovery": true,
  "fields": [
    {
      "type": "root",
      "name": "eType",
      "expr": "eType"
    },
    {
      "type": "path",
      "name": "aKey",
      "expr": "$.dev.base.aKey"
    },
    {
      "type": "path",
      "name": "debug",
      "expr": "$.dev.base.debug"
    }
  ]
}
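For completeness: per the linked docs, the flattenSpec sits inside parseSpec, alongside dimensionsSpec and timestampSpec. Combined with the spec from the question, the parser would look roughly like this (a sketch; the extra "aKey" and "debug" dimensions are my addition, assuming you want the flattened fields stored as dimensions):

```json
{
  "parser": {
    "type": "hadoopyString",
    "parseSpec": {
      "format": "json",
      "flattenSpec": {
        "useFieldDiscovery": true,
        "fields": [
          {"type": "path", "name": "aKey", "expr": "$.dev.base.aKey"},
          {"type": "path", "name": "debug", "expr": "$.dev.base.debug"}
        ]
      },
      "dimensionsSpec": {
        "dimensions": ["eType", "aKey", "debug"]
      },
      "timestampSpec": {
        "format": "auto",
        "column": "cAt"
      }
    }
  }
}
```

The flattened field names ("aKey", "debug") can then be referenced anywhere a regular column can, e.g. in dimensionsSpec or as a metric's fieldName.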

Awesome! It worked!

One more question although a bit off-topic:

I want to try out data ingestion from Parquet files too. Likewise, each row is nested. How can I extract (ingest) the values of a nested column (one that goes two or three levels down)?

Thank you very much, and apologies for the off-topic question!

Hi Kenji,

It would be nice if the Druid docs could be updated accordingly:

http://druid.io/docs/0.10.0/tutorials/tutorial-batch.html

It currently says:

"Druid supports TSV, CSV, and JSON out of the box. Note that nested JSON objects are not supported, so if you do use JSON, you should provide a file containing flattened objects."