HDFS output path with datasource name twice

Hello,

I am indexing with Druid (version 0.8.0) using the "index_hadoop" method, and the resulting segments are not showing up anywhere that I can query. It looks like the HDFS output directory the segment is saved to has the datasource name in its path twice.

Example:

Input index config:

{
  "type" : "index_hadoop",
  "spec" : {
    "dataSchema" : {
      "dataSource" : "example1datasrc",
      "parser" : {
        "type" : "string",
        "parseSpec" : {
          "format" : "json",
          "timestampSpec" : {
            "column" : "timestamp",
            "format" : "auto"
          },
          "dimensionsSpec" : {
            "dimensions" : ["dim1", "dim2", "dim3"],
            "dimensionExclusions" : [],
            "spatialDimensions" : []
          }
        }
      },
      "metricsSpec" : [
        {"name" : "count", "type" : "count"},
        {"fieldName" : "cid", "name" : "cid", "type" : "hyperUnique"}
      ],
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "DAY",
        "queryGranularity" : "HOUR",
        "intervals" : [ "2015-12-20/2015-12-21" ]
      }
    },
    "ioConfig" : {
      "type" : "hadoop",
      "inputSpec" : {
        "type" : "static",
        "paths" : "/tmp/example1datasrc-2015-12-20_473152148"
      }
    }
  }
}

This gets submitted to an overlord node. However, at this point in the log:

2016-02-03T01:36:44,619 INFO [LocalJobRunner Map Task Executor #0] io.druid.indexer.HadoopDruidIndexerConfig - Running with config:

the config it prints has this field:

      "segmentOutputPath" : "hdfs://localhost:9000/user/druid/storage/example1datasrc"

And then the place the segment actually ends up in HDFS is:

/user/druid/storage/example1datasrc/example1datasrc/20151220T000000.000Z_20151221T000000.000Z

The segment is then not queryable (and does not show up in the coordinator web console). Data directly in /user/druid/storage/example1datasrc/ is queryable, as you would expect.
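For anyone wanting to check their own HDFS listings for the same symptom, here is a minimal sketch in Python. The helper name `has_repeated_datasource` is made up for illustration and is not part of Druid; it only inspects the path string and reports whether the datasource name appears as two consecutive path components, as in the path above.

```python
from pathlib import PurePosixPath


def has_repeated_datasource(segment_path: str, datasource: str) -> bool:
    """Return True if `datasource` appears as two consecutive path components.

    Hypothetical helper for illustration only; it does not talk to HDFS or Druid.
    """
    parts = PurePosixPath(segment_path).parts
    # Compare each component with the one that follows it.
    return any(a == datasource and b == datasource for a, b in zip(parts, parts[1:]))


# Path observed in this thread (datasource name twice).
observed = (
    "/user/druid/storage/example1datasrc/example1datasrc/"
    "20151220T000000.000Z_20151221T000000.000Z"
)
# Path layout that was queryable (datasource name once).
expected = (
    "/user/druid/storage/example1datasrc/"
    "20151220T000000.000Z_20151221T000000.000Z"
)

print(has_repeated_datasource(observed, "example1datasrc"))
print(has_repeated_datasource(expected, "example1datasrc"))
```

Feeding it the output of an HDFS directory listing, one path at a time, should flag only the segments written under the doubled directory.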

Any ideas on what could be causing this?

FWIW, I still haven't figured out what this duplication was all about, but it turns out my basic problem was that my historical node (I have only one in my dev setup) had a max size of 10G, so the hand-off of segments to the historical node after indexing was failing. Increasing druid.server.maxSize in the historical config and restarting everything solved the funky issues I was running into.
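To make the capacity problem concrete, here is a rough sketch of the arithmetic, not Druid's actual assignment logic. It assumes druid.server.maxSize is a byte count and that a segment can only be handed off when it fits in the remaining space; the function name and all the sizes other than the 10G limit are hypothetical.

```python
def fits_on_historical(max_size_bytes: int, used_bytes: int, segment_bytes: int) -> bool:
    """Simplified capacity check: does a new segment fit under the configured max?

    Illustration only; the real coordinator logic involves more than this.
    """
    return used_bytes + segment_bytes <= max_size_bytes


GB = 10**9  # assumption: decimal gigabytes

# Hypothetical numbers: 9.8 GB already served, new segment is 0.5 GB.
used = 9_800 * 10**6
segment = 500 * 10**6

print(fits_on_historical(10 * GB, used, segment))  # 10G limit: does not fit
print(fits_on_historical(30 * GB, used, segment))  # raised limit (example value): fits
```

With a single historical node there is nowhere else for the segment to go, which would be consistent with the hand-off failing until the limit was raised.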

Best, Brad

Hmmm, I recall we fixed that issue a while ago. Can you update to 0.8.3 and see if you have the same problem?

Hi,

The dataSource appearing twice on HDFS should be fixed in druid-0.8.3.

– Himanshu

Thanks for the replies, Himanshu and Fangjin.

I can confirm that: I upgraded to 0.8.3 and the issue went away.

Best, Brad