Trouble indexing

Hey Druids,

I’m trying to get data into Druid. I’m running on AWS, so I’m using the Hadoop indexer by launching the task on EMR. It completes the MapReduce jobs over the input, but fails with this error:

```
2015-12-17T03:15:07,436 ERROR [main] io.druid.cli.CliHadoopIndexer - failure!!!
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.7.0_91]
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) ~[?:1.7.0_91]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.7.0_91]
at java.lang.reflect.Method.invoke(Method.java:606) ~[?:1.7.0_91]
at io.druid.cli.CliHadoopIndexer.run(CliHadoopIndexer.java:120) [druid-services-0.8.2.jar:0.8.2]
at io.druid.cli.Main.main(Main.java:91) [druid-services-0.8.2.jar:0.8.2]
Caused by: java.lang.RuntimeException: java.lang.RuntimeException: No buckets?? seems there is no data to index.
at io.druid.indexer.IndexGeneratorJob.run(IndexGeneratorJob.java:207) ~[druid-indexing-hadoop-0.8.2.jar:0.8.2]
at io.druid.indexer.JobHelper.runJobs(JobHelper.java:182) ~[druid-indexing-hadoop-0.8.2.jar:0.8.2]
at io.druid.indexer.HadoopDruidIndexerJob.run(HadoopDruidIndexerJob.java:96) ~[druid-indexing-hadoop-0.8.2.jar:0.8.2]
at io.druid.indexer.JobHelper.runJobs(JobHelper.java:182) ~[druid-indexing-hadoop-0.8.2.jar:0.8.2]
at io.druid.cli.CliInternalHadoopIndexer.run(CliInternalHadoopIndexer.java:132) ~[druid-services-0.8.2.jar:0.8.2]
at io.druid.cli.Main.main(Main.java:91) ~[druid-services-0.8.2.jar:0.8.2]
… 6 more
Caused by: java.lang.RuntimeException: No buckets?? seems there is no data to index.
at io.druid.indexer.IndexGeneratorJob.run(IndexGeneratorJob.java:160) ~[druid-indexing-hadoop-0.8.2.jar:0.8.2]
at io.druid.indexer.JobHelper.runJobs(JobHelper.java:182) ~[druid-indexing-hadoop-0.8.2.jar:0.8.2]
at io.druid.indexer.HadoopDruidIndexerJob.run(HadoopDruidIndexerJob.java:96) ~[druid-indexing-hadoop-0.8.2.jar:0.8.2]
at io.druid.indexer.JobHelper.runJobs(JobHelper.java:182) ~[druid-indexing-hadoop-0.8.2.jar:0.8.2]
at io.druid.cli.CliInternalHadoopIndexer.run(CliInternalHadoopIndexer.java:132) ~[druid-services-0.8.2.jar:0.8.2]
at io.druid.cli.Main.main(Main.java:91) ~[druid-services-0.8.2.jar:0.8.2]
… 6 more
```

Below is my spec file; I assume something is wrong with it. Can you help?

```json
{
  "dataSchema" : {
    "dataSource" : "activity",
    "parser" : {
      "type" : "string",
      "parseSpec" : {
        "format" : "tsv",
        "columns" : ["timestamp", "eventid", "eventtype", "userid"],
        "timestampSpec" : {
          "column" : "timestamp",
          "format" : "auto"
        },
        "dimensionsSpec" : {
          "dimensions" : ["eventid", "eventtype", "userid"],
          "dimensionExclusions" : [],
          "spatialDimensions" : []
        }
      }
    },
    "metricsSpec" : [{
      "type" : "count",
      "name" : "count"
    }],
    "granularitySpec" : {
      "type" : "uniform",
      "segmentGranularity" : "DAY",
      "queryGranularity" : "HOUR",
      "intervals" : [ "2014-08-31/2016-01-01" ]
    }
  },
  "ioConfig" : {
    "type" : "hadoop",
    "inputSpec" : {
      "type" : "static",
      "paths" : "s3n://bucket/path/to/input/"
    },
    "metadataUpdateSpec" : {
      "type" : "mysql",
      "connectURI" : "jdbc:mysql://{DBHOST}:3306/druid",
      "password" : "d4tateam1",
      "segmentTable" : "druid_segments",
      "user" : "druid"
    },
    "segmentOutputPath" : "s3n://bucket/path/to/output/"
  },
  "tuningConfig" : {
    "type" : "hadoop",
    "workingPath" : "/tmp",
    "partitionsSpec" : {
      "type" : "dimension",
      "partitionDimension" : "eventid",
      "targetPartitionSize" : 5000000,
      "maxPartitionSize" : 7500000,
      "assumeGrouped" : false,
      "numShards" : -1
    },
    "shardSpecs" : { },
    "leaveIntermediate" : false,
    "cleanupOnFailure" : true,
    "overwriteFiles" : false,
    "ignoreInvalidRows" : false,
    "jobProperties" : { },
    "combineText" : false,
    "persistInHeap" : false,
    "ingestOffheap" : false,
    "bufferSize" : 134217728,
    "aggregationBufferRatio" : 0.5,
    "rowFlushBoundary" : 300000
  }
}
```

EMR needs additional overrides.

Info here: http://imply.io/docs/latest/ingestion.html
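For context, the overrides in question are Hadoop properties passed through `tuningConfig.jobProperties`. A commonly recommended one for Druid on EMR is `mapreduce.job.classloader`, which isolates the job’s classpath from EMR’s bundled Hadoop jars; check the linked docs for the full set your EMR version needs. A sketch of the fragment:

```json
"tuningConfig" : {
  "type" : "hadoop",
  "jobProperties" : {
    "mapreduce.job.classloader" : "true"
  }
}
```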

Thanks Fangjin, however after following the steps, I’m still facing the same error.

What does “No buckets?” mean?

Also, which nodes need to be running while the EMR indexing is taking place?

Hey Dan,

“No buckets?? seems there is no data to index.” generally means that none of your input data matched the indexing spec. The most common reason for that is that your timestamps are all out of range, or are in a format that doesn’t match your timestampSpec. Your intervals are set to “2014-08-31/2016-01-01” and your timestampSpec is set to look in “timestamp” for iso8601 format or millis format (“auto”). Can you confirm that your data has timestamps matching that format and that range?
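To make the failure mode concrete, here’s a small sketch (using a hypothetical epoch-seconds timestamp for 2014-08-31 UTC) of what happens when `"auto"` reads a seconds value as milliseconds — the row lands in 1970, outside the configured interval, so every row is dropped:

```python
from datetime import datetime, timezone

# Hypothetical input value: 2014-08-31 00:00:00 UTC in epoch *seconds*.
epoch_seconds = 1409443200

# A bare number read as epoch *milliseconds* is divided down by 1000,
# landing only ~16 days after 1970-01-01.
as_millis = datetime.fromtimestamp(epoch_seconds / 1000, tz=timezone.utc)
# Read correctly as epoch seconds, it is the intended 2014 instant.
as_seconds = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)

print(as_millis.isoformat())   # lands in January 1970
print(as_seconds.isoformat())  # the intended 2014-08-31 instant

# The spec's interval 2014-08-31/2016-01-01 excludes the 1970 reading,
# so no time bucket matches any row: "No buckets??".
interval_start = datetime(2014, 8, 31, tzinfo=timezone.utc)
interval_end = datetime(2016, 1, 1, tzinfo=timezone.utc)
print(interval_start <= as_millis < interval_end)   # False
print(interval_start <= as_seconds < interval_end)  # True
```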

Thanks Gian - that was indeed the problem: the timestamps were in seconds, not milliseconds.
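For anyone hitting the same thing: rather than converting the data, you can tell the timestampSpec to expect epoch seconds directly — Druid’s `"posix"` timestamp format parses seconds since the epoch. A sketch of the changed fragment:

```json
"timestampSpec" : {
  "column" : "timestamp",
  "format" : "posix"
}
```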