Batch ingest of Avro file fails with "No buckets?? seems there is no data to index"

I am trying to batch ingest an Avro file using the Hadoop parser, but it fails with the following message. Could someone help? Thanks.

2019-01-11T14:48:10,977 INFO [task-runner-0-priority-0] org.apache.druid.indexer.path.StaticPathSpec - Adding paths[/tmp/druid/raven/small.avro]
2019-01-11T14:48:10,979 INFO [task-runner-0-priority-0] org.apache.druid.indexer.HadoopDruidIndexerJob - No metadataStorageUpdaterJob set in the config. This is cool if you are running a hadoop index task, otherwise nothing will be uploaded to database.
2019-01-11T14:48:10,992 INFO [task-runner-0-priority-0] org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2019-01-11T14:48:10,997 ERROR [task-runner-0-priority-0] org.apache.druid.indexing.common.task.HadoopIndexTask - Encountered exception in HadoopIndexGeneratorInnerProcessing.
java.lang.RuntimeException: java.lang.RuntimeException: No buckets?? seems there is no data to index.
    at org.apache.druid.indexer.IndexGeneratorJob.run(IndexGeneratorJob.java:226) ~[druid-indexing-hadoop-0.13.0-incubating.jar:0.13.0-incubating]
    at org.apache.druid.indexer.JobHelper.runJobs(JobHelper.java:376) ~[druid-indexing-hadoop-0.13.0-incubating.jar:0.13.0-incubating]
    at org.apache.druid.indexer.HadoopDruidIndexerJob.run(HadoopDruidIndexerJob.java:96) ~[druid-indexing-hadoop-0.13.0-incubating.jar:0.13.0-incubating]
    at org.apache.druid.indexing.common.task.HadoopIndexTask$HadoopIndexGeneratorInnerProcessingRunner.runTask(HadoopIndexTask.java:612) [druid-indexing-service-0.13.0-incubating.jar:0.13.0-incubating]
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_202-ea]

The timestamp column in the Avro file is defined as:

{
  "name" : "ts_transaction_timestamp",
  "type" : [ {
    "type" : "long",
    "logicalType" : "timestamp-micros"
  }, "null" ]
}
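
For context, a minimal small.avsc containing such a field might look like the sketch below. The full schema is not posted in this thread, so the record name and the types of the other two fields (field names taken from the dimensions and metrics in the spec further down) are only illustrative assumptions:

{
  "type" : "record",
  "name" : "transaction",
  "doc" : "illustrative sketch only; the real small.avsc is not shown in the thread",
  "fields" : [
    {
      "name" : "ts_transaction_timestamp",
      "type" : [ { "type" : "long", "logicalType" : "timestamp-micros" }, "null" ]
    },
    { "name" : "ts_card_hashed_client_id", "type" : "string", "doc" : "type assumed" },
    { "name" : "ts_amount", "type" : "double", "doc" : "type assumed from the doubleSum metric" }
  ]
}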

And my index.json file looks like:

{
  "type" : "index_hadoop",
  "spec" : {
    "dataSchema" : {
      "dataSource" : "small",
      "parser" : {
        "type" : "avro_hadoop",
        "parseSpec" : {
          "format" : "avro",
          "dimensionsSpec" : {
            "dimensions" : [
              "ts_card_hashed_client_id"
            ]
          },
          "timestampSpec" : {
            "column" : "ts_transaction_timestamp",
            "format" : "iso"
          }
        }
      },
      "metricsSpec" : [
        {
          "name" : "count",
          "type" : "count"
        },
        {
          "type" : "doubleSum",
          "name" : "Total_Sales",
          "fieldName" : "ts_amount"
        }
      ],
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "day",
        "queryGranularity" : "none",
        "intervals" : ["2017-03-01/2018-12-15"],
        "rollup" : false
      }
    },
    "ioConfig" : {
      "type" : "hadoop",
      "inputSpec" : {
        "type" : "static",
        "inputFormat" : "org.apache.druid.data.input.avro.AvroValueInputFormat",
        "paths" : "/tmp/druid/raven/small.avro"
      },
      "appendToExisting" : true
    },
    "tuningConfig" : {
      "type" : "hadoop",
      "partitionsSpec" : {
        "type" : "hashed",
        "targetPartitionSize" : 5000000
      },
      "forceExtendableShardSpecs" : true,
      "reportParseExceptions" : false,
      "jobProperties" : {
        "avro.schema.input.value.path" : "/tmp/druid/raven/small.avsc",
        "fs.default.name" : "hdfs://***",
        "fs.defaultFS" : "hdfs://***",
        "dfs.datanode.address" : "0.0.0.0:####",
        "dfs.client.use.datanode.hostname" : "true",
        "dfs.datanode.use.datanode.hostname" : "true",
        "yarn.resourcemanager.hostname" : "****",
        "yarn.nodemanager.vmem-check-enabled" : "false",
        "mapreduce.job.queuename" : "QU",
        "mapreduce.map.java.opts" : "-Duser.timezone=UTC -Dfile.encoding=UTF-8",
        "mapreduce.job.user.classpath.first" : "true",
        "mapreduce.job.classloader" : "true",
        "mapreduce.reduce.java.opts" : "-Duser.timezone=UTC -Dfile.encoding=UTF-8",
        "mapreduce.map.memory.mb" : 1024,
        "mapreduce.reduce.memory.mb" : 1024
      }
    }
  },
  "hadoopDependencyCoordinates" : ["org.apache.hadoop:hadoop-client:2.6.3"]
}

Christine,

Is this path, "paths" : "/tmp/druid/raven/small.avro", an HDFS path? You are using HDFS as your input source. Also, is "avro.schema.input.value.path" : "/tmp/druid/raven/small.avsc" in HDFS as well?

If the small.avro file is in a local file directory, it will not be visible to Druid.

Rommel Garcia

Hi, Rommel

Yes, both the .avro and .avsc files are on HDFS at the correct location. Batch ingest of a JSON file works fine, so I don't think it is an HDFS access issue.

Thanks for trying to help.

Christine

Answering my own question: I suspected that Druid can't use a column with

{
  "name" : "ts_transaction_timestamp",
  "type" : [ {
    "type" : "long",
    "logicalType" : "timestamp-micros"
  }, "null" ]
}

as the timestamp column.

So I ran an experiment: I declared the transaction timestamp field as a string type, instead of a timestamp type, when creating the Avro file.

Now the ingest works. I'm not sure whether it is a bug, though.
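
For what it's worth, a minimal sketch of the reworked field is below. The revised schema is not posted in this thread, so this is only an assumption based on the description above:

{
  "name" : "ts_transaction_timestamp",
  "doc" : "illustrative sketch; the thread only says the field was switched to a string type",
  "type" : [ "null", "string" ]
}

Assuming the string values are ISO-8601, they match the "format" : "iso" declared in the timestampSpec. With the original long/timestamp-micros field, the microsecond epoch values presumably either failed to parse as "iso" or were interpreted as timestamps far outside the 2017-03-01/2018-12-15 interval; either way, every row would be silently dropped (reportParseExceptions is false), which matches the "No buckets?? seems there is no data to index" error.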

The "null" might be the issue.

Rommel Garcia
Director, Field Engineering
rommel.garcia@imply.io
404.502.9672