Upgrade to Druid 0.8.1 - Unable to make the Hadoop indexer work

Hi Torche, if you take a look at the 0.6.x branch, there should be a tool you can run to convert the old index task spec to the new one.

Hi all,

We are trying to upgrade our cluster from 0.7.x to 0.8.1. We are currently using the Hadoop indexer for our batch pipelines (we are using version 0.6.171).

It looks like the indexing task spec has changed, and I cannot reuse the specification I am using with the 0.6.171 jar of the Druid project.

I have looked at the documentation and updated my indexing task to follow the new spec. However, I get this error for the third job of the batch indexing process:

2015-11-02 19:11:55,911 ERROR [main] cli.CliHadoopIndexer (Logger.java:error(98)) - failure!!!
java.lang.reflect.InvocationTargetException
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:606)
  at io.druid.cli.CliHadoopIndexer.run(CliHadoopIndexer.java:120)
  at io.druid.cli.Main.main(Main.java:91)
Caused by: java.lang.RuntimeException: java.lang.RuntimeException: No buckets?? seems there is no data to index.
  at io.druid.indexer.IndexGeneratorJob.run(IndexGeneratorJob.java:207)
  at io.druid.indexer.JobHelper.runJobs(JobHelper.java:182)
  at io.druid.indexer.HadoopDruidIndexerJob.run(HadoopDruidIndexerJob.java:96)
  at io.druid.indexer.JobHelper.runJobs(JobHelper.java:182)
  at io.druid.cli.CliInternalHadoopIndexer.run(CliInternalHadoopIndexer.java:132)
  at io.druid.cli.Main.main(Main.java:91)
  ... 6 more
Caused by: java.lang.RuntimeException: No buckets?? seems there is no data to index.
  at io.druid.indexer.IndexGeneratorJob.run(IndexGeneratorJob.java:159)

  1. First of all, can someone explain to me what the purpose of each job in the batch ingestion process is (determine_partitions_groupby, determine_partitions_dimselection, …)?

There are three stages: finding the intervals of data to index (optional), determining how partitions should be created, and generating the segments.
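As a rough illustration (the exact set of jobs depends on your partitionsSpec, so this is an assumption about your setup): the two determine_partitions_* jobs you list are what single-dimension partitioning runs, first grouping rows per segment interval and then selecting the partition dimension and shard boundaries, while hashed partitioning runs a single determine-hashed-partitions job instead; the final index_generator job then builds and pushes the segments either way. A sketch of the hashed variant, with a placeholder target size:

"tuningConfig" : {
  "type" : "hadoop",
  "partitionsSpec" : {
    "type" : "hashed",
    "targetPartitionSize" : 5000000
  }
}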

  2. What does this error mean? I am sure the interval in the granularitySpec is correct, because I am using the same interval and the same input path for my data when I ingest with version 0.6.171.

The error is most likely related to an outdated Druid spec: "No buckets" means the partition-determination jobs produced no shards for index_generator to build, which usually happens when no input rows fall inside the configured intervals.

  3. Is it possible to keep ingesting my data with the Hadoop indexer from version 0.6.171 while my cluster is on 0.8.1?

No, the entire spec changed. You have to use the new ingestion spec.
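Roughly, the new format nests everything under a top-level "spec" object with dataSchema, ioConfig, and tuningConfig sections. A stripped-down sketch with placeholder datasource, columns, paths, and interval:

{
  "type" : "index_hadoop",
  "spec" : {
    "dataSchema" : {
      "dataSource" : "example_datasource",
      "parser" : {
        "type" : "hadoopyString",
        "parseSpec" : {
          "format" : "tsv",
          "timestampSpec" : { "column" : "timestamp", "format" : "auto" },
          "columns" : [ "timestamp", "dim1", "events" ],
          "dimensionsSpec" : { "dimensions" : [ "dim1" ] }
        }
      },
      "metricsSpec" : [ { "type" : "longSum", "name" : "events", "fieldName" : "events" } ],
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "HOUR",
        "queryGranularity" : "NONE",
        "intervals" : [ "2015-11-01/2015-11-02" ]
      }
    },
    "ioConfig" : {
      "type" : "hadoop",
      "inputSpec" : { "type" : "static", "paths" : "hdfs://path/to/data" }
    },
    "tuningConfig" : { "type" : "hadoop" }
  }
}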

  4. In the metadataUpdateSpec I tried to use "mysql" or "db" for the type, but it did not work, so I am now using derby. What does it stand for? Is that the right way to do it if my metadata storage is MySQL?

These release notes might help with migrating ingestion from 0.6.x to 0.7.x: https://github.com/druid-io/druid/releases/tag/druid-0.7.0. Derby should never be used in production.
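If your metadata store is MySQL, the metadataUpdateSpec inside the ioConfig should look something like the sketch below (connection details and table name are placeholders), and the MySQL metadata storage extension has to be loaded by the indexer; otherwise only derby is registered:

"metadataUpdateSpec" : {
  "type" : "mysql",
  "connectURI" : "jdbc:mysql://your-mysql-host:3306/druid",
  "user" : "druid",
  "password" : "diurd",
  "segmentTable" : "druid_segments"
}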

Hi Frangjin,

Thanks for your answer.

I have converted my old Hadoop ingestion spec to the new format. However, I still get an exception when I specify the metadata storage type.

Let me summarize what I am trying to achieve so you have a better understanding of my problem.

My batch pipeline doesn't rely on my indexing service (Hadoop is not configured there). What I do is spin up an EMR cluster whenever I want to run a batch pipeline; a bootstrap action then installs Druid on the master node, and the following command is run on that same node to submit the Hadoop ingestion task using the Druid CLI Hadoop indexer:

java -Xmx256m -Duser.timezone=PST -Dfile.encoding=UTF-8 -classpath /home/hadoop/.versions/2.4.0-amzn-7/share/hadoop/yarn/:/home/hadoop/.versions/2.4.0-amzn-7/share/hadoop/yarn/lib/:/home/hadoop/.versions/2.4.0-amzn-7/share/hadoop/tools/:/home/hadoop/.versions/2.4.0-amzn-7/share/hadoop/tools/lib/:/home/hadoop/.versions/2.4.0-amzn-7/share/hadoop/hdfs/lib/:/home/hadoop/.versions/2.4.0-amzn-7/share/hadoop/hdfs/:/home/hadoop/.versions/2.4.0-amzn-7/share/hadoop/mapreduce/:/home/hadoop/.versions/2.4.0-amzn-7/share/hadoop/common/:/home/hadoop/.versions/2.4.0-amzn-7/share/hadoop/common/lib/:/home/hadoop/conf/:/home/hadoop/druid-services/lib/ -Dhadoop.fs.s3n.impl=org.apache.hadoop.fs.s3native.NativeS3FileSystem -Dhadoop.fs.s3.impl=org.apache.hadoop.fs.s3native.NativeS3FileSystem -Dfs.s3n.awsAccessKeyId=***** -Dfs.s3n.awsSecretAccessKey=***** io.druid.cli.Main index hadoop --no-default-hadoop specFile

where specFile is:

{
  "type" : "index_hadoop",
  "spec" : {
    "dataSchema" : {
      "dataSource" : "rtb_bids",
      "parser" : {
        "type" : "hadoopyString",
        "parseSpec" : {
          "format" : "tsv",
          "timestampSpec" : {
            "column" : "timestamp",
            "format" : "auto"
          },
          "columns" : [
            "timestamp",
            "aws_region",
            "tracking_id",
            "zone_type_id",
            "domain",
            "publisher_id",
            "country_code",
            "bidder_id",
            "advertiser_id",
            "unit_type_id",
            "product_id",
            "deal_id",
            "browser_type_id",
            "vertical_id",
            "width",
            "height",
            "bids",
            "wins",
            "total_bid_price",
            "total_win_price",
            "total_settlement_price"
          ],
          "dimensionsSpec" : {
            "dimensions" : [
              "aws_region",
              "tracking_id",
              "zone_type_id",
              "domain",
              "publisher_id",
              "country_code",
              "bidder_id",
              "advertiser_id",
              "unit_type_id",
              "product_id",
              "deal_id",
              "browser_type_id",
              "vertical_id",
              "width",
              "height"
            ],
            "dimensionExclusions" : [],
            "spatialDimensions" : []
          }
        }
      },
      "metricsSpec" : [
        {
          "type" : "longSum",
          "name" : "bids",
          "fieldName" : "bids"
        },
        {
          "type" : "longSum",
          "name" : "wins",
          "fieldName" : "wins"
        },
        {
          "type" : "doubleSum",
          "name" : "total_bid_price",
          "fieldName" : "total_bid_price"
        },
        {
          "type" : "doubleSum",
          "name" : "total_win_price",
          "fieldName" : "total_win_price"
        },
        {
          "type" : "doubleSum",
          "name" : "total_settlement_price",
          "fieldName" : "total_settlement_price"
        }
      ],
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "HOUR",
        "queryGranularity" : "NONE",
        "intervals" : [ "2015-11-02T00:00:00.000-08:00/2015-11-02T04:00:00.000-08:00" ]
      }
    },
    "ioConfig" : {
      "type" : "hadoop",
      "inputSpec" : {
        "type" : "static",
        "paths" : "s3n://gumgum-elastic-mapreduce/druid/rtbevents/output/2015-11-02-00/bids/*"
      },
      "metadataUpdateSpec" : {
        "type" : "mysql",
        "connectURI" : "**********",
        "password" : "********",
        "segmentTable" : "prod_segments",
        "user" : "****"
      },
      "segmentOutputPath" : "s3n://gumgum-druid/prod-segments-druid-08"
    },
    "tuningConfig" : {
      "type" : "hadoop",
      "workingPath" : "/tmp/gumgum-druid/",
      "partitionsSpec" : {
        "type" : "dimension",
        "partitionDimension" : null,
        "targetPartitionSize" : 5000000,
        "maxPartitionSize" : 7500000,
        "assumeGrouped" : false,
        "numShards" : -1
      },
      "shardSpecs" : { },
      "leaveIntermediate" : false,
      "cleanupOnFailure" : true,
      "overwriteFiles" : false,
      "ignoreInvalidRows" : false,
      "jobProperties" : { },
      "combineText" : false,
      "persistInHeap" : false,
      "ingestOffheap" : false,
      "bufferSize" : 134217728,
      "aggregationBufferRatio" : 0.5,
      "rowFlushBoundary" : 300000
    }
  }
}

It looks to me like this spec file is correct and follows the new format. However, I get an exception saying that the provider for the SQL metadata connector is not recognized and that the only known option is derby. Here is the full trace:

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/.versions/2.4.0-amzn-7/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/druid-services/lib/log4j-slf4j-impl-2.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2015-11-05 13:57:26,321 INFO [main] util.Version (Version.java:(27)) - HV000001: Hibernate Validator 5.1.3.Final
2015-11-05 13:57:27,307 INFO [main] guice.JsonConfigurator (Logger.java:info(70)) - Loaded class[class io.druid.guice.ExtensionsConfig] from props[druid.extensions.] as [ExtensionsConfig{searchCurrentClassloader=true, coordinates=, defaultVersion='0.8.1', localRepository='/home/hadoop/.m2/repository', remoteRepositories=[https://repo1.maven.org/maven2/, https://metamx.artifactoryonline.com/metamx/pub-libs-releases-local]}]
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/.versions/2.4.0-amzn-7/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/druid-services/lib/log4j-slf4j-impl-2.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2015-11-05 13:57:28,117 INFO [main] util.Version (Version.java:(27)) - HV000001: Hibernate Validator 5.1.3.Final
2015-11-05 13:57:28,923 INFO [main] guice.JsonConfigurator (Logger.java:info(70)) - Loaded class[class io.druid.guice.ExtensionsConfig] from props[druid.extensions.] as [ExtensionsConfig{searchCurrentClassloader=true, coordinates=, defaultVersion='0.8.1', localRepository='/home/hadoop/.m2/repository', remoteRepositories=[https://repo1.maven.org/maven2/, https://metamx.artifactoryonline.com/metamx/pub-libs-releases-local]}]
2015-11-05 13:57:29,882 INFO [main] guice.JsonConfigurator (Logger.java:info(70)) - Loaded class[class io.druid.guice.ExtensionsConfig] from props[druid.extensions.] as [ExtensionsConfig{searchCurrentClassloader=true, coordinates=, defaultVersion='0.8.1', localRepository='/home/hadoop/.m2/repository', remoteRepositories=[https://repo1.maven.org/maven2/, https://metamx.artifactoryonline.com/metamx/pub-libs-releases-local]}]
2015-11-05 13:57:30,694 INFO [main] guice.JsonConfigurator (Logger.java:info(70)) - Loaded class[class io.druid.server.metrics.DruidMonitorSchedulerConfig] from props[druid.monitoring.] as [io.druid.server.metrics.DruidMonitorSchedulerConfig@737361b8]
2015-11-05 13:57:30,708 INFO [main] guice.JsonConfigurator (Logger.java:info(70)) - Loaded class[class io.druid.server.metrics.MonitorsConfig] from props[druid.monitoring.] as [MonitorsConfig{monitors=}]
2015-11-05 13:57:31,044 INFO [main] config.ConfigurationObjectFactory (ConfigurationObjectFactory.java:buildSimple(162)) - Using method itself for [druid.computation.buffer.size, ${base_path}.buffer.sizeBytes] on [io.druid.query.DruidProcessingConfig#intermediateComputeSizeBytes()]
2015-11-05 13:57:31,049 INFO [main] config.ConfigurationObjectFactory (ConfigurationObjectFactory.java:buildSimple(162)) - Using method itself for [${base_path}.numThreads] on [io.druid.query.DruidProcessingConfig#getNumThreads()]
2015-11-05 13:57:31,049 INFO [main] config.ConfigurationObjectFactory (ConfigurationObjectFactory.java:buildSimple(162)) - Using method itself for [${base_path}.columnCache.sizeBytes] on [io.druid.query.DruidProcessingConfig#columnCacheSizeBytes()]
2015-11-05 13:57:31,050 INFO [main] config.ConfigurationObjectFactory (ConfigurationObjectFactory.java:buildSimple(151)) - Assigning default value [processing-%s] for [${base_path}.formatString] on [com.metamx.common.concurrent.ExecutorServiceConfig#getFormatString()]
2015-11-05 13:57:31,234 INFO [main] guice.JsonConfigurator (Logger.java:info(70)) - Loaded class[interface io.druid.segment.data.BitmapSerdeFactory] from props[druid.processing.bitmap.] as [ConciseBitmapSerdeFactory{}]
Nov 05, 2015 1:57:31 PM com.google.inject.servlet.GuiceFilter setPipeline
WARNING: Multiple Servlet injectors detected. This is a warning indicating that you have more than one GuiceFilter running in your web application. If this is deliberate, you may safely ignore this message. If this is NOT deliberate however, your application may not work as expected.
2015-11-05 13:57:31,391 INFO [main] guice.JsonConfigurator (Logger.java:info(70)) - Loaded class[class io.druid.server.metrics.DruidMonitorSchedulerConfig] from props[druid.monitoring.] as [io.druid.server.metrics.DruidMonitorSchedulerConfig@77c143b0]
2015-11-05 13:57:31,402 INFO [main] guice.JsonConfigurator (Logger.java:info(70)) - Loaded class[class io.druid.server.metrics.MonitorsConfig] from props[druid.monitoring.] as [MonitorsConfig{monitors=}]
2015-11-05 13:57:31,422 INFO [main] guice.JsonConfigurator (Logger.java:info(70)) - Loaded class[class io.druid.server.DruidNode] from props[druid.] as [DruidNode{serviceName='druid/internal-hadoop-indexer', host='ip-10-81-200-79.ec2.internal', port=0}]
2015-11-05 13:57:31,428 ERROR [main] cli.CliHadoopIndexer (Logger.java:error(98)) - failure!!!
java.lang.reflect.InvocationTargetException
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:606)
  at io.druid.cli.CliHadoopIndexer.run(CliHadoopIndexer.java:120)
  at io.druid.cli.Main.main(Main.java:91)
Caused by: com.google.inject.ProvisionException: Guice provision errors:

1) Unknown provider[mysql] of Key[type=io.druid.metadata.SQLMetadataConnector, annotation=[none]], known options[[derby]]
  at io.druid.guice.PolyBind.createChoiceWithDefault(PolyBind.java:67)
  while locating io.druid.metadata.SQLMetadataConnector
    for parameter 2 at io.druid.metadata.IndexerSQLMetadataStorageCoordinator.<init>(IndexerSQLMetadataStorageCoordinator.java:69)
  while locating io.druid.metadata.IndexerSQLMetadataStorageCoordinator
  at io.druid.cli.CliInternalHadoopIndexer$1.configure(CliInternalHadoopIndexer.java:98)
  while locating io.druid.indexing.overlord.IndexerMetadataStorageCoordinator

1 error
  at com.google.inject.internal.InjectorImpl$4.get(InjectorImpl.java:987)
  at com.google.inject.internal.InjectorImpl.getInstance(InjectorImpl.java:1013)
  at io.druid.cli.CliInternalHadoopIndexer.run(CliInternalHadoopIndexer.java:119)
  at io.druid.cli.Main.main(Main.java:91)
  ... 6 more

Do you have an idea why I get this error? It seems like only the derby provider is supported. I know that Druid 0.7.x introduced derby as the default metadata storage, but it should be possible to use another metadata storage with the CLI Hadoop indexer, right?

Thanks for your help!

Hi Torche,

I think you need to load the MySQL metadata storage extension when starting the indexer. Can you try adding:

-Ddruid.extensions.coordinates=["io.druid.extensions:mysql-metadata-storage"]

to the launching command?
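For example, keeping the rest of your original command the same (the classpath is abridged here, and I'm assuming the extension coordinate can be resolved from your configured remote repositories), the launch would look something like:

# same command as before, plus the extensions coordinate; the single quotes keep
# the shell from interpreting the brackets and double quotes in the JSON array
java -Xmx256m -Duser.timezone=PST -Dfile.encoding=UTF-8 \
  -classpath <same classpath as in your original command> \
  -Dhadoop.fs.s3n.impl=org.apache.hadoop.fs.s3native.NativeS3FileSystem \
  -Dhadoop.fs.s3.impl=org.apache.hadoop.fs.s3native.NativeS3FileSystem \
  -Dfs.s3n.awsAccessKeyId=***** -Dfs.s3n.awsSecretAccessKey=***** \
  -Ddruid.extensions.coordinates='["io.druid.extensions:mysql-metadata-storage"]' \
  io.druid.cli.Main index hadoop --no-default-hadoop specFile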

The following links might be helpful for that error:

http://druid.io/docs/latest/dependencies/metadata-storage.html

http://druid.io/docs/latest/operations/including-extensions.html

  • Jon

Thanks, it worked!