Druid Hadoop reindex task

Hi,

I am trying to run a reindex task with an inputSpec of type dataSource in order to merge segments created by the Kafka supervisor (as also recommended in the Druid docs):

http://druid.io/docs/latest/ingestion/update-existing-data.html

But from the example in the link above, it is not very clear what exactly the spec file should look like and which components it should include.

For example, should I include a parser/parseSpec, as in a regular Hadoop batch ingestion spec?
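For reference, the parser block in my plain Hadoop batch ingestion specs looks roughly like this (the dimension names below are placeholders, not my real schema):

"parser": {
  "type": "hadoopyString",
  "parseSpec": {
    "format": "json",
    "timestampSpec": {
      "column": "timestamp",
      "format": "auto"
    },
    "dimensionsSpec": {
      "dimensions": ["placeholder_dim_1", "placeholder_dim_2"]
    }
  }
}

I am not sure whether any of this is needed, or even makes sense, when the input is an existing Druid dataSource rather than raw files.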

These are my two attempts. The first includes a parser (without a dimensionsSpec):


{
  "type" : "index_hadoop",
  "spec" : {
    "dataSchema" : {
      "dataSource" : "mdp_parking_events",
      "parser": {
        "type": "hadoopyString",
        "parseSpec": {
          "format": "json",
          "timestampSpec": {
            "format": "auto",
            "column": "timestamp"
          }
        }
      },
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "DAY",
        "queryGranularity" : "DAY",
        "intervals" : [ "2017-08-23/2017-08-24" ]
      }
    },
    "ioConfig" : {
      "type" : "hadoop",
      "inputSpec" : {
        "type" : "dataSource",
        "ingestionSpec": {
          "dataSource": "mdp_parking_events",
          "intervals": ["2017-08-23T00:00:00Z/P1D"]
        }
      }
    },
    "tuningConfig" : {
      "type" : "hadoop",
      "jobProperties": {
        "mapreduce.job.classloader": "true",
        "mapreduce.job.classloader.system.classes": "-javax.validation.,java.,javax.,org.apache.commons.logging.,org.apache.log4j.,org.apache.hadoop."
      }
    }
  },
  "hadoopDependencyCoordinates": ["org.apache.hadoop:hadoop-client:2.6.0"]
}


The second drops the parser and keeps only the dataSource in the dataSchema:

{
  "type" : "index_hadoop",
  "spec" : {
    "dataSchema" : {
      "dataSource" : "mdp_parking_events"
    },
    "ioConfig" : {
      "type" : "hadoop",
      "inputSpec" : {
        "type" : "dataSource",
        "ingestionSpec": {
          "dataSource": "mdp_parking_events",
          "intervals": ["2017-08-23T00:00:00Z/P1D"]
        }
      }
    },
    "tuningConfig" : {
      "type" : "hadoop",
      "jobProperties": {
        "mapreduce.job.classloader": "true",
        "mapreduce.job.classloader.system.classes": "-javax.validation.,java.,javax.,org.apache.commons.logging.,org.apache.log4j.,org.apache.hadoop."
      }
    }
  },
  "hadoopDependencyCoordinates": ["org.apache.hadoop:hadoop-client:2.6.0"]
}

*** I don't specify the dimensions and metrics because, according to the Druid docs, they should be picked up automatically from the dataSource (a sketch of how I would specify them explicitly follows the quote):

dimensions (Array of String, not required): Name of dimension columns to load. By default, the list will be constructed from parseSpec. If parseSpec does not have an explicit list of dimensions then all the dimension columns present in stored data will be read.

metrics (Array of String): Name of metric columns to load. By default, the list will be constructed from the "name" of all the configured aggregators.
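If the defaults do not kick in for some reason, my best guess is that explicit dimensions and metrics would go directly inside the ingestionSpec of the ioConfig, something like this (only a sketch; the column names are placeholders):

"ioConfig" : {
  "type" : "hadoop",
  "inputSpec" : {
    "type" : "dataSource",
    "ingestionSpec": {
      "dataSource": "mdp_parking_events",
      "intervals": ["2017-08-23T00:00:00Z/P1D"],
      "dimensions": ["placeholder_dim_1", "placeholder_dim_2"],
      "metrics": ["placeholder_metric"]
    }
  }
}

But ideally I would rely on the defaults described above.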

And if the parseSpec does need to be present, what components should it include? (A full example would be great!)

Both attempts fail with the following error.

What am I missing?

I am able to run simple Hadoop batch index tasks without any issues.


java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
at com.google.common.base.Throwables.propagate(Throwables.java:160) ~[guava-16.0.1.jar:?]
at io.druid.indexing.common.task.HadoopTask.invokeForeignLoader(HadoopTask.java:211) ~[druid-indexing-service-0.10.0.jar:0.10.0]
at io.druid.indexing.common.task.HadoopIndexTask.run(HadoopIndexTask.java:176) ~[druid-indexing-service-0.10.0.jar:0.10.0]
at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:436) [druid-indexing-service-0.10.0.jar:0.10.0]
at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:408) [druid-indexing-service-0.10.0.jar:0.10.0]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_131]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_131]
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_131]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_131]
at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_131]
at io.druid.indexing.common.task.HadoopTask.invokeForeignLoader(HadoopTask.java:208) ~[druid-indexing-service-0.10.0.jar:0.10.0]
... 7 more
Caused by: java.lang.NullPointerException
at io.druid.indexer.path.DatasourcePathSpec.addInputPaths(DatasourcePathSpec.java:117) ~[druid-indexing-hadoop-0.10.0.jar:0.10.0]
at io.druid.indexer.HadoopDruidIndexerConfig.addInputPaths(HadoopDruidIndexerConfig.java:389) ~[druid-indexing-hadoop-0.10.0.jar:0.10.0]
at io.druid.indexer.JobHelper.ensurePaths(JobHelper.java:337) ~[druid-indexing-hadoop-0.10.0.jar:0.10.0]
at io.druid.indexer.HadoopDruidDetermineConfigurationJob.run(HadoopDruidDetermineConfigurationJob.java:55) ~[druid-indexing-hadoop-0.10.0.jar:0.10.0]
at io.druid.indexing.common.task.HadoopIndexTask$HadoopDetermineConfigInnerProcessing.runTask(HadoopIndexTask.java:306) ~[druid-indexing-service-0.10.0.jar:0.10.0]
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_131]
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_131]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_131]
at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_131]
at io.druid.indexing.common.task.HadoopTask.invokeForeignLoader(HadoopTask.java:208) ~[druid-indexing-service-0.10.0.jar:0.10.0]
... 7 more
2017-08-24T17:16:08,766 INFO [task-runner-0-priority-0] io.druid.indexing.overlord.TaskRunnerUtils - Task [index_hadoop_mdp_parking_events_2017-08-24T17:15:52.259Z] status changed to [FAILED].
2017-08-24T17:16:08,769 INFO [task-runner-0-priority-0] io.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status: {
  "id" : "index_hadoop_mdp_parking_events_2017-08-24T17:15:52.259Z",
  "status" : "FAILED",
  "duration" : 3437
}


Thanks,