Configure Druid to parse JSON files with nested structures - failing

New to Druid.
Ultimately I'd like to query lots of gzipped JSON files in S3, but I'm testing on a small local sample file first.

The JSON has a few nested levels.

Looks like…

{
    "arr": [
        {
            "data": [
                {
                    "delta_t": 1,
                    "f": 60,
                    "i": [-1, -1, -1],
                    "kw": [68.948, 79.242, 67.05],
                    "orig_t": "2015-07-28T15:19:18.769",
                    "t": "2015-07-28T15:19:18.769",
                    "v": [-1, -1, -1]
                }
            ],
            "id": "this-that-the-pther"
        }
    ],
    "ver": "1.0"
}

I configured a job schema like…

{
    "type": "index_hadoop",
    "spec": {
        "ioConfig": {
            "type": "hadoop",
            "inputSpec": {
                "type": "static",
                "paths": "/home/ubuntu/datawarehouse/data.json"
            }
        },
        "dataSchema": {
            "dataSource": "test-job",
            "granularitySpec": {
                "type": "uniform"
            },
            "parser": {
                "type": "string",
                "parseSpec": {
                    "format": "json",
                    "flattenSpec": {
                        "useFieldDiscovery": true,
                        "fields": [
                            { "type": "nested", "name": "id", "expr": "$.arr.id" },
                            { "type": "nested", "name": "t", "expr": "$.arr.data.t" },
                            { "type": "nested", "name": "orig_t", "expr": "$.arr.data.orig_t" },
                            { "type": "nested", "name": "f", "expr": "$.arr.data.f" },
                            { "type": "nested", "name": "v_0", "expr": "$.arr.data.v[0]" },
                            { "type": "nested", "name": "v_1", "expr": "$.arr.data.v[1]" },
                            { "type": "nested", "name": "v_2", "expr": "$.arr.data.v[2]" },
                            { "type": "nested", "name": "i_0", "expr": "$.arr.data.i[0]" },
                            { "type": "nested", "name": "i_1", "expr": "$.arr.data.i[1]" },
                            { "type": "nested", "name": "i_2", "expr": "$.arr.data.i[2]" },
                            { "type": "nested", "name": "kw_0", "expr": "$.arr.data.kw[0]" },
                            { "type": "nested", "name": "kw_1", "expr": "$.arr.data.kw[1]" },
                            { "type": "nested", "name": "kw_2", "expr": "$.arr.data.kw[2]" },
                            { "type": "nested", "name": "delta_t", "expr": "$.arr.data.delta_t" }
                        ]
                    },
                    "dimensionsSpec": {
                        "dimensions": ["ver", "id"]
                    },
                    "timestampSpec": {
                        "format": "auto",
                        "column": "t"
                    }
                }
            },
            "metricsSpec": [
                { "name": "views", "type": "count" }
            ]
        },
        "tuningConfig": {
            "type": "hadoop",
            "partitionsSpec": {
                "type": "hashed",
                "targetPartitionSize": 5000000
            }
        }
    }
}

It tries to run, then fails.

I can’t make sense of the logs but one thing that stands out is…

2016-06-02T21:43:52,564 INFO [LocalJobRunner Map Task Executor #0] org.apache.hadoop.mapred.MapTask - Starting flush of map output
2016-06-02T21:43:52,573 INFO [Thread-21] org.apache.hadoop.mapred.LocalJobRunner - map task executor complete.
2016-06-02T21:43:52,574 WARN [Thread-21] org.apache.hadoop.mapred.LocalJobRunner - job_local348698092_0001
java.lang.Exception: java.lang.IllegalArgumentException: Can not construct instance of io.druid.data.input.impl.JSONPathFieldType, problem: No enum constant io.druid.data.input.impl.JSONPathFieldType.NESTED
 at [Source: N/A; line: -1, column: -1] (through reference chain: java.util.ArrayList[0])
	at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462) ~[hadoop-mapreduce-client-common-2.3.0.jar:?]
	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522) [hadoop-mapreduce-client-common-2.3.0.jar:?]
Caused by: java.lang.IllegalArgumentException: Can not construct instance of io.druid.data.input.impl.JSONPathFieldType, problem: No enum constant io.druid.data.input.impl.JSONPathFieldType.NESTED
 at [Source: N/A; line: -1, column: -1] (through reference chain: java.util.ArrayList[0])

Any ideas where I'm screwing up?

Thanks, y'all!

-SK

Hi Scott,

Can you try changing the "type" property on the field definitions to use "path" instead of "nested"? The docs are out of date for that section.

Thanks,

Jon

Ah ha! Thank you!
Got past that; now it's failing with…

java.lang.Exception: com.metamx.common.RE: Failure on row[{"arr": [{"data": [{"f": 60, "i": [-1, -1, -1], "delta_t": 1, "kw": [68.948, 79.242, 67.05], "t": "2015-07-28T15:19:18.769", "v": [-1, -1, -1], "orig_t": "2015-07-28T15:19:18.769"}], "id": "pgx.hq.stem-8e-71-6b.vmonitor.site.telemetry"}], "ver": "1.0"}]
	at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462) ~[hadoop-mapreduce-client-common-2.3.0.jar:?]
	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522) [hadoop-mapreduce-client-common-2.3.0.jar:?]
Caused by: com.metamx.common.RE: Failure on row[{"arr": [{"data": [{"f": 60, "i": [-1, -1, -1], "delta_t": 1, "kw": [68.948, 79.242, 67.05], "t": "2015-07-28T15:19:18.769", "v": [-1, -1, -1], "orig_t": "2015-07-28T15:19:18.769"}], "id": "pgx.hq.stem-8e-71-6b.vmonitor.site.telemetry"}], "ver": "1.0"}]

This probably suggests my 'flattenSpec' is incorrect? It seems each JSON blob is being treated as an array. Do I need to update my paths to something like...
                            {
                                "type": "path",
                                "name": "delta_t",
                                "expr": "$[0].arr.data.delta_t"
                            }
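For what it's worth, the structural issue can be checked with plain Python (no JsonPath library needed). Walking the parsed sample shows that "arr" is a list, so a dotted step like $.arr.id has nothing named "id" to descend into; an index is needed at each array level. The snippet below uses a truncated copy of the sample document from earlier in the thread:

```python
import json

# Truncated copy of the sample row from the thread.
row = json.loads("""
{
  "arr": [
    {
      "data": [{"delta_t": 1, "t": "2015-07-28T15:19:18.769"}],
      "id": "this-that-the-pther"
    }
  ],
  "ver": "1.0"
}
""")

# "arr" is a JSON array, not an object, so a dotted path like
# $.arr.id cannot match -- an index is required at each array level.
print(type(row["arr"]))               # <class 'list'>
print(row["arr"][0]["id"])            # corresponds to $.arr[0].id
print(row["arr"][0]["data"][0]["t"])  # corresponds to $.arr[0].data[0].t
```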


The row looks like an array there, but I think the outer brackets in the exception are coming from the exception message itself:

throw new RE(e, "Failure on row[%s]", value);


Did you see any more detailed exceptions in the logs that might point to the field(s) that had errors?

Yep, I didn't look carefully enough.

Caused by: com.metamx.common.parsers.ParseException: Unparseable timestamp found!

which is:

"t": "2015-07-28T15:19:18.769",

Maybe Druid doesn't like the ".769", which is probably milliseconds.

See this link about the timestamp formats Druid supports:

That says Druid supports ISO, and this is pretty clearly ISO:

"t": "2015-07-28T15:19:18.769"
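A quick standard-library check (Python 3.7+) agrees that the millisecond suffix is valid ISO 8601, so the fractional seconds are unlikely to be the problem by themselves:

```python
from datetime import datetime

# "2015-07-28T15:19:18.769" parses cleanly as ISO 8601 with a
# fractional-seconds component, so ".769" is not inherently invalid.
ts = datetime.fromisoformat("2015-07-28T15:19:18.769")
print(ts.year, ts.microsecond)  # 2015 769000
```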

These stand out:

Caused by: com.metamx.common.parsers.ParseException: Unparseable timestamp found!

Caused by: java.lang.NullPointerException: Null timestamp in input: {ver=1.0}

A larger snippet of the log:

2016-06-05T23:20:05,534 INFO [LocalJobRunner Map Task Executor #0] io.druid.indexer.HadoopDruidIndexerConfig - Running with config:
{
  "spec" : {
    "dataSchema" : {
      "dataSource" : "vmonitor.site.telemetry",
      "parser" : {
        "type" : "string",
        "parseSpec" : {
          "format" : "json",
          "timestampSpec" : {
            "column" : "t",
            "format" : "iso"
          },
          "flattenSpec" : {
            "useFieldDiscovery" : true,
            "fields" : [ {
              "type" : "path",
              "name" : "id",
              "expr" : "$.arr.id"
            }, {
              "type" : "path",
              "name" : "t",
              "expr" : "$.arr.data.t"
            }, {
              "type" : "path",
              "name" : "orig_t",
              "expr" : "$.arr.data.orig_t"
            }, {
              "type" : "path",
              "name" : "f",
              "expr" : "$.arr.data.f"
            }, {
              "type" : "path",
              "name" : "v_0",
              "expr" : "$.arr.data.v[0]"
            }, {
              "type" : "path",
              "name" : "v_1",
              "expr" : "$.arr.data.v[1]"
            }, {
              "type" : "path",
              "name" : "v_2",
              "expr" : "$.arr.data.v[2]"
            }, {
              "type" : "path",
              "name" : "i_0",
              "expr" : "$.arr.data.i[0]"
            }, {
              "type" : "path",
              "name" : "i_1",
              "expr" : "$.arr.data.i[1]"
            }, {
              "type" : "path",
              "name" : "i_2",
              "expr" : "$.arr.data.i[2]"
            }, {
              "type" : "path",
              "name" : "kw_0",
              "expr" : "$.arr.data.kw[0]"
            }, {
              "type" : "path",
              "name" : "kw_1",
              "expr" : "$.arr.data.kw[1]"
            }, {
              "type" : "path",
              "name" : "kw_2",
              "expr" : "$.arr.data.kw[2]"
            }, {
              "type" : "path",
              "name" : "delta_t",
              "expr" : "$.arr.data.delta_t"
            } ]
          },
          "dimensionsSpec" : {
            "dimensions" : [ "ver", "id" ]
          }
        }
      },
      "metricsSpec" : [ {
        "type" : "count",
        "name" : "views"
      }, {
        "type" : "count",
        "name" : "kw_0"
      }, {
        "type" : "count",
        "name" : "kw_1"
      }, {
        "type" : "count",
        "name" : "kw_2"
      } ],
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "DAY",
        "queryGranularity" : {
          "type" : "none"
        },
        "intervals" : null
      }
    },
    "ioConfig" : {
      "type" : "hadoop",
      "inputSpec" : {
        "type" : "static",
        "paths" : "/home/ubuntu/datawarehouse/vmonitor.site.telemetry.json"
      },
      "metadataUpdateSpec" : null,
      "segmentOutputPath" : "file:/home/ubuntu/druid-0.9.0/var/druid/segments/vmonitor.site.telemetry"
    },
    "tuningConfig" : {
      "type" : "hadoop",
      "workingPath" : "var/druid/hadoop-tmp",
      "version" : "2016-06-05T23:19:59.090Z",
      "partitionsSpec" : {
        "type" : "hashed",
        "targetPartitionSize" : 5000000,
        "maxPartitionSize" : 7500000,
        "assumeGrouped" : false,
        "numShards" : -1
      },
      "shardSpecs" : { },
      "indexSpec" : {
        "bitmap" : {
          "type" : "concise"
        },
        "dimensionCompression" : null,
        "metricCompression" : null
      },
      "maxRowsInMemory" : 80000,
      "leaveIntermediate" : false,
      "cleanupOnFailure" : true,
      "overwriteFiles" : false,
      "ignoreInvalidRows" : false,
      "jobProperties" : { },
      "combineText" : false,
      "useCombiner" : false,
      "buildV9Directly" : false,
      "numBackgroundPersistThreads" : 0
    },
    "uniqueId" : "be11a9ca18e748b2b4f681ff9d42cdf7"
  }
}
2016-06-05T23:20:05,544 INFO [LocalJobRunner Map Task Executor #0] org.apache.hadoop.mapred.MapTask - Starting flush of map output
2016-06-05T23:20:05,550 INFO [Thread-21] org.apache.hadoop.mapred.LocalJobRunner - map task executor complete.
2016-06-05T23:20:05,551 WARN [Thread-21] org.apache.hadoop.mapred.LocalJobRunner - job_local761517617_0001
java.lang.Exception: com.metamx.common.RE: Failure on row[{"arr": [{"data": [{"f": 60, "i": [-1, -1, -1], "delta_t": 1, "kw": [68.948, 79.242, 67.05], "t": "2015-07-28T15:19:18.769", "v": [-1, -1, -1], "orig_t": "2015-07-28T15:19:18.769"}], "id": "pgx.hq.stem-8e-71-6b.vmonitor.site.telemetry"}], "ver": "1.0"}]
	at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462) ~[hadoop-mapreduce-client-common-2.3.0.jar:?]
	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522) [hadoop-mapreduce-client-common-2.3.0.jar:?]
Caused by: com.metamx.common.RE: Failure on row[{"arr": [{"data": [{"f": 60, "i": [-1, -1, -1], "delta_t": 1, "kw": [68.948, 79.242, 67.05], "t": "2015-07-28T15:19:18.769", "v": [-1, -1, -1], "orig_t": "2015-07-28T15:19:18.769"}], "id": "pgx.hq.stem-8e-71-6b.vmonitor.site.telemetry"}], "ver": "1.0"}]
	at io.druid.indexer.HadoopDruidIndexerMapper.map(HadoopDruidIndexerMapper.java:88) ~[druid-indexing-hadoop-0.9.0.jar:0.9.0]
	at io.druid.indexer.DetermineHashedPartitionsJob$DetermineCardinalityMapper.run(DetermineHashedPartitionsJob.java:282) ~[druid-indexing-hadoop-0.9.0.jar:0.9.0]
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764) ~[hadoop-mapreduce-client-core-2.3.0.jar:?]
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340) ~[hadoop-mapreduce-client-core-2.3.0.jar:?]
	at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243) ~[hadoop-mapreduce-client-common-2.3.0.jar:?]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[?:1.7.0_101]
	at java.util.concurrent.FutureTask.run(FutureTask.java:262) ~[?:1.7.0_101]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[?:1.7.0_101]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) ~[?:1.7.0_101]
	at java.lang.Thread.run(Thread.java:745) ~[?:1.7.0_101]
Caused by: com.metamx.common.parsers.ParseException: Unparseable timestamp found!
	at io.druid.data.input.impl.MapInputRowParser.parse(MapInputRowParser.java:72) ~[druid-api-0.3.16.jar:0.3.16]
	at io.druid.data.input.impl.StringInputRowParser.parseMap(StringInputRowParser.java:136) ~[druid-api-0.3.16.jar:0.3.16]
	at io.druid.data.input.impl.StringInputRowParser.parse(StringInputRowParser.java:131) ~[druid-api-0.3.16.jar:0.3.16]
	at io.druid.indexer.HadoopDruidIndexerMapper.parseInputRow(HadoopDruidIndexerMapper.java:98) ~[druid-indexing-hadoop-0.9.0.jar:0.9.0]
	at io.druid.indexer.HadoopDruidIndexerMapper.map(HadoopDruidIndexerMapper.java:69) ~[druid-indexing-hadoop-0.9.0.jar:0.9.0]
	at io.druid.indexer.DetermineHashedPartitionsJob$DetermineCardinalityMapper.run(DetermineHashedPartitionsJob.java:282) ~[druid-indexing-hadoop-0.9.0.jar:0.9.0]
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764) ~[hadoop-mapreduce-client-core-2.3.0.jar:?]
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340) ~[hadoop-mapreduce-client-core-2.3.0.jar:?]
	at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243) ~[hadoop-mapreduce-client-common-2.3.0.jar:?]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[?:1.7.0_101]
	at java.util.concurrent.FutureTask.run(FutureTask.java:262) ~[?:1.7.0_101]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[?:1.7.0_101]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) ~[?:1.7.0_101]
	at java.lang.Thread.run(Thread.java:745) ~[?:1.7.0_101]
Caused by: java.lang.NullPointerException: Null timestamp in input: {ver=1.0}
	at io.druid.data.input.impl.MapInputRowParser.parse(MapInputRowParser.java:63) ~[druid-api-0.3.16.jar:0.3.16]
	at io.druid.data.input.impl.StringInputRowParser.parseMap(StringInputRowParser.java:136) ~[druid-api-0.3.16.jar:0.3.16]
	at io.druid.data.input.impl.StringInputRowParser.parse(StringInputRowParser.java:131) ~[druid-api-0.3.16.jar:0.3.16]
	at io.druid.indexer.HadoopDruidIndexerMapper.parseInputRow(HadoopDruidIndexerMapper.java:98) ~[druid-indexing-hadoop-0.9.0.jar:0.9.0]
	at io.druid.indexer.HadoopDruidIndexerMapper.map(HadoopDruidIndexerMapper.java:69) ~[druid-indexing-hadoop-0.9.0.jar:0.9.0]
	at io.druid.indexer.DetermineHashedPartitionsJob$DetermineCardinalityMapper.run(DetermineHashedPartitionsJob.java:282) ~[druid-indexing-hadoop-0.9.0.jar:0.9.0]
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764) ~[hadoop-mapreduce-client-core-2.3.0.jar:?]
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340) ~[hadoop-mapreduce-client-core-2.3.0.jar:?]
	at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243) ~[hadoop-mapreduce-client-common-2.3.0.jar:?]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[?:1.7.0_101]
	at java.util.concurrent.FutureTask.run(FutureTask.java:262) ~[?:1.7.0_101]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[?:1.7.0_101]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) ~[?:1.7.0_101]
	at java.lang.Thread.run(Thread.java:745) ~[?:1.7.0_101]

Changing to 'path' stopped the errors, but the JSON object is not flattening: I can't see any of the dimensions I've flattened. The only queryable dimensions are the ones at the root of the JSON.

Hi Scott, are you sure all of your timestamps are valid?

It appears you have this row in your data: "{ver=1.0}", and that row definitely doesn't have a timestamp.

Hi Fangjin,
The timestamp is correct, but my Jayway JsonPath expressions were incorrect.

I was following the flattenSpec docs on druid.io, but the JsonPath in that example is wrong.
This https://github.com/jayway/JsonPath was a big help.
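For anyone landing here later, a corrected flattenSpec along these lines worked against the sample document above. This is an untested sketch (only a few of the fields shown): the fix is adding an array index to every step of the path that crosses an array, since "arr" and "data" are both JSON arrays:

```json
"flattenSpec": {
    "useFieldDiscovery": true,
    "fields": [
        { "type": "path", "name": "id",      "expr": "$.arr[0].id" },
        { "type": "path", "name": "t",       "expr": "$.arr[0].data[0].t" },
        { "type": "path", "name": "orig_t",  "expr": "$.arr[0].data[0].orig_t" },
        { "type": "path", "name": "f",       "expr": "$.arr[0].data[0].f" },
        { "type": "path", "name": "kw_0",    "expr": "$.arr[0].data[0].kw[0]" },
        { "type": "path", "name": "delta_t", "expr": "$.arr[0].data[0].delta_t" }
    ]
}
```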

Hello,

Will it be possible to create a parse spec for arrays of no fixed size?

My file contains multiple JSON objects with different array sizes.
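One hedged thought on that question: Jayway JsonPath supports wildcards, and if I read the flattenSpec behavior right, a path expression that evaluates to a list is ingested as a multi-value dimension, which sidesteps hard-coded indices entirely. Something like this (untested sketch) in place of the kw_0/kw_1/kw_2 fields:

```json
{ "type": "path", "name": "kw", "expr": "$.arr[0].data[0].kw[*]" }
```

The trade-off is that you get one multi-value dimension rather than separate per-index columns, so queries that care about a specific position would need to handle that differently.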