Hadoop Druid indexer and "No buckets found?" error with the Wikipedia example

Hi,

I am trying to use the standalone Hadoop indexer to run the Wikipedia example that comes with Druid, and I am getting the following error after the map and reduce phases:

Caused by: java.lang.RuntimeException: No buckets?? seems there is no data to index.
    at io.druid.indexer.IndexGeneratorJob.run(IndexGeneratorJob.java:160) ~[assembly_druid-assembly-0.1-SNAPSHOT.jar:0.1-SNAPSHOT]
    at io.druid.indexer.JobHelper.runJobs(JobHelper.java:182) ~[assembly_druid-assembly-0.1-SNAPSHOT.jar:0.1-SNAPSHOT]
    at io.druid.indexer.HadoopDruidIndexerJob.run(HadoopDruidIndexerJob.java:96) ~[assembly_druid-assembly-0.1-SNAPSHOT.jar:0.1-SNAPSHOT]
    at io.druid.indexer.JobHelper.runJobs(JobHelper.java:182) ~[assembly_druid-assembly-0.1-SNAPSHOT.jar:0.1-SNAPSHOT]
    at io.druid.cli.CliInternalHadoopIndexer.run(CliInternalHadoopIndexer.java:132) ~[assembly_druid-assembly-0.1-SNAPSHOT.jar:0.1-SNAPSHOT]
    at io.druid.cli.Main.main(Main.java:91) ~[assembly_druid-assembly-0.1-SNAPSHOT.jar:0.1-SNAPSHOT]

I've looked at other topics in this group that mention the same error and attribute it to an incorrect timestamp spec, but that doesn't seem to apply here since I am using the sample Wikipedia data.

It must be some obvious mistake, but I am unable to figure it out. :frowning:

I did build a custom fat jar to get around the Jackson version conflict between Druid and Hadoop.

Here is the command:

java -Xmx256m -Dhdp.version=2.3.4.0-3485 -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath /root/lib/*:$(hadoop classpath) io.druid.cli.Main index hadoop examples/indexing/wikipedia_hadoop_config.json

The sample data is from the examples directory:

{"timestamp": "2013-08-31T01:02:33Z", "page": "Gypsy Danger", "language" : "en", "user" : "nuclear", "unpatrolled" : "true", "newPage" : "true", "robot": "false", "anonymous": "false", "namespace":"article", "continent":"North America", "country":"United States", "region":"Bay Area", "city":"San Francisco", "added": 57, "deleted": 200, "delta": -143}
{"timestamp": "2013-08-31T03:32:45Z", "page": "Striker Eureka", "language" : "en", "user" : "speed", "unpatrolled" : "false", "newPage" : "true", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"Australia", "country":"Australia", "region":"Cantebury", "city":"Syndey", "added": 459, "deleted": 129, "delta": 330}
{"timestamp": "2013-08-31T07:11:21Z", "page": "Cherno Alpha", "language" : "ru", "user" : "masterYi", "unpatrolled" : "false", "newPage" : "true", "robot": "true", "anonymous": "false", "namespace":"article", "continent":"Asia", "country":"Russia", "region":"Oblast", "city":"Moscow", "added": 123, "deleted": 12, "delta": 111}
{"timestamp": "2013-08-31T11:58:39Z", "page": "Crimson Typhoon", "language" : "zh", "user" : "triplets", "unpatrolled" : "true", "newPage" : "false", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"Asia", "country":"China", "region":"Shanxi", "city":"Taiyuan", "added": 905, "deleted": 5, "delta": 900}
{"timestamp": "2013-08-31T12:41:27Z", "page": "Coyote Tango", "language" : "ja", "user" : "cancer", "unpatrolled" : "true", "newPage" : "false", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"Asia", "country":"Japan", "region":"Kanto", "city":"Tokyo", "added": 1, "deleted": 10, "delta": -9}

Here is the slightly modified spec file:

{
  "dataSchema": {
    "dataSource": "wikipedia",
    "parser": {
      "type": "string",
      "parseSpec": {
        "format": "json",
        "timestampSpec": {
          "column": "timestamp",
          "format": "auto"
        },
        "dimensionsSpec": {
          "dimensions": [
            "page",
            "language",
            "user",
            "unpatrolled",
            "newPage",
            "robot",
            "anonymous",
            "namespace",
            "continent",
            "country",
            "region",
            "city"
          ],
          "dimensionExclusions": [],
          "spatialDimensions": []
        }
      }
    },
    "metricsSpec": [
      {
        "type": "count",
        "name": "count"
      },
      {
        "type": "doubleSum",
        "name": "added",
        "fieldName": "added"
      },
      {
        "type": "doubleSum",
        "name": "deleted",
        "fieldName": "deleted"
      },
      {
        "type": "doubleSum",
        "name": "delta",
        "fieldName": "delta"
      }
    ],
    "granularitySpec": {
      "type": "uniform",
      "segmentGranularity": "DAY",
      "queryGranularity": "NONE",
      "intervals": ["2013-08-31/2013-09-01"]
    }
  },
  "ioConfig": {
    "type": "hadoop",
    "inputSpec": {
      "type": "static",
      "paths": "hdfs:///data/wikipedia_data.json"
    },
    "metadataUpdateSpec": {
      "type": "mysql",
      "connectURI": "jdbc:mysql://localhost:3306/druid",
      "user": "druid",
      "password": "diurd",
      "segmentTable": "druid_segments"
    },
    "segmentOutputPath": "/tmp/segments"
  },
  "tuningConfig": {
    "type": "hadoop",
    "workingPath": "/tmp/working_path",
    "partitionsSpec": {
      "type": "dimension",
      "targetPartitionSize": 5000000
    }
  }
}

Thanks!

Your data appears to be in HDFS. Do you have the correct configurations in the common configs?
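For reference, when segments and input live on HDFS the common runtime properties usually include something along these lines; treat this as a sketch only, since the extension property name and the paths depend on your Druid version and install (the "namenode" host below is a placeholder):

druid.extensions.loadList=["druid-hdfs-storage", "mysql-metadata-storage"]
druid.storage.type=hdfs
druid.storage.storageDirectory=hdfs://namenode:8020/druid/segments

The Hadoop config files (core-site.xml, hdfs-site.xml) also need to be on the indexer classpath, which the $(hadoop classpath) part of your command should cover.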

Hi,

Here is one detail that is not mentioned in the Hortonworks documentation for installing Druid: there are two MapReduce2 parameters that have to be tweaked for Druid to load data successfully. The explanation is at the bottom.

The parameters are:

  • mapreduce.map.java.opts
  • mapreduce.reduce.java.opts

The following should be appended to the end of the existing values:

-Duser.timezone=UTC -Dfile.encoding=UTF-8

How it looks in Ambari:

The MapReduce2 service should then be restarted.
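For example, if the existing values already contain heap settings, the resulting properties end up looking roughly like this (the -Xmx sizes below are just placeholders; keep whatever your cluster already uses):

mapreduce.map.java.opts=-Xmx1536m -Duser.timezone=UTC -Dfile.encoding=UTF-8
mapreduce.reduce.java.opts=-Xmx3072m -Duser.timezone=UTC -Dfile.encoding=UTF-8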

I ran into this issue last week; I'm not sure exactly how it got resolved, but I think it was caused by some missing jars.

~Pratik

https://groups.google.com/forum/#!topic/druid-user/Zm-VWhl3X6Y should help you. I think it's a timezone spec problem.
I remember facing the same issue; we have the following properties in our jobProperties:

"mapreduce.map.java.opts": "-Duser.timezone=UTC -Dfile.encoding=UTF-8",
"mapreduce.reduce.java.opts": "-Duser.timezone=UTC -Dfile.encoding=UTF-8"