Hadoop Batch Indexing help

I have been trying to load data using the HadoopDruidIndexer.

No Luck.

I read the thread about rebuilding Druid to separate out the version dependency on fasterxml.jackson. The build.sbt file worked out fine. I built a standalone jar and put it at the front of the classpath. However, it errors out.

Here is the thread that discusses the strategy of getting around the fasterxml dependencies.
https://groups.google.com/forum/#!searchin/druid-user/hadoop|sort:date/druid-user/UM-Cgj750sY/dQU4zdhhExEJ

I have the jar in the directory druid_assembly and I am trying to load the wikipedia example.

Here is the error:

java -server -Xmx1g -XX:+UseConcMarkSweepGC -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Djava.io.tmpdir=/tmp -classpath /home/ubuntu/druid_assembly/:/opt/shaman/packages/druid/lib/:/opt/shaman/config/overlord:/etc/hadoop/conf.shaman:/home/ubuntu/druid_conf io.druid.cli.Main index hadoop wikipedia_index_hadoop_task.json

2015-07-07T06:11:08,678 INFO [main] io.druid.initialization.Initialization - Adding local module[class io.druid.metadata.storage.mysql.MySQLMetadataStorageModule]

2015-07-07T06:11:08,678 INFO [main] io.druid.initialization.Initialization - Adding local module[class io.druid.storage.s3.S3StorageDruidModule]

2015-07-07T06:11:08,679 INFO [main] io.druid.initialization.Initialization - Adding local module[class io.druid.firehose.s3.S3FirehoseDruidModule]

2015-07-07T06:11:08,679 INFO [main] io.druid.initialization.Initialization - Adding local module[class io.druid.query.aggregation.histogram.ApproximateHistogramDruidModule]

2015-07-07T06:11:08,679 INFO [main] io.druid.initialization.Initialization - Adding local module[class io.druid.storage.hdfs.HdfsStorageDruidModule]

2015-07-07T06:11:09,093 INFO [main] io.druid.guice.JsonConfigurator - Loaded class[class io.druid.server.metrics.DruidMonitorSchedulerConfig] from props[druid.monitoring.] as [io.druid.server.metrics.DruidMonitorSchedulerConfig@47c653ee]

2015-07-07T06:11:09,104 INFO [main] io.druid.guice.JsonConfigurator - Loaded class[class io.druid.server.metrics.MonitorsConfig] from props[druid.monitoring.] as [MonitorsConfig{monitors=}]

2015-07-07T06:11:09,257 INFO [main] io.druid.guice.PropertiesModule - Loading properties from common.runtime.properties

2015-07-07T06:11:09,258 INFO [main] io.druid.guice.PropertiesModule - Loading properties from runtime.properties

2015-07-07T06:11:09,265 INFO [main] org.skife.config.ConfigurationObjectFactory - Using method itself for [druid.computation.buffer.size, ${base_path}.buffer.sizeBytes] on [io.druid.query.DruidProcessingConfig#intermediateComputeSizeBytes()]

2015-07-07T06:11:09,267 INFO [main] org.skife.config.ConfigurationObjectFactory - Using method itself for [${base_path}.numThreads] on [io.druid.query.DruidProcessingConfig#getNumThreads()]

2015-07-07T06:11:09,267 INFO [main] org.skife.config.ConfigurationObjectFactory - Using method itself for [${base_path}.columnCache.sizeBytes] on [io.druid.query.DruidProcessingConfig#columnCacheSizeBytes()]

2015-07-07T06:11:09,267 INFO [main] org.skife.config.ConfigurationObjectFactory - Assigning default value [processing-%s] for [${base_path}.formatString] on [com.metamx.common.concurrent.ExecutorServiceConfig#getFormatString()]

2015-07-07T06:11:09,341 INFO [main] io.druid.guice.JsonConfigurator - Loaded class[interface io.druid.segment.data.BitmapSerdeFactory] from props[druid.processing.bitmap.] as [ConciseBitmapSerdeFactory{}]

2015-07-07T06:11:09,462 ERROR [main] io.druid.cli.CliHadoopIndexer - failure!!!

java.lang.reflect.InvocationTargetException

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.7.0_79]

at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) ~[?:1.7.0_79]

at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.7.0_79]

at java.lang.reflect.Method.invoke(Method.java:606) ~[?:1.7.0_79]

at io.druid.cli.CliHadoopIndexer.run(CliHadoopIndexer.java:120) [DruidAssembly-SBT-assembly-1.0.jar:1.0]

at io.druid.cli.Main.main(Main.java:88) [DruidAssembly-SBT-assembly-1.0.jar:1.0]

Caused by: com.google.inject.CreationException: Guice creation errors:

  1. Binding to null instances is not allowed. Use toProvider(Providers.of(null)) if this is your intended behaviour.

at io.druid.cli.CliInternalHadoopIndexer$1.configure(CliInternalHadoopIndexer.java:83)

1 error

at com.google.inject.internal.Errors.throwCreationExceptionIfErrorsExist(Errors.java:448) ~[DruidAssembly-SBT-assembly-1.0.jar:1.0]

at com.google.inject.internal.InternalInjectorCreator.initializeStatically(InternalInjectorCreator.java:155) ~[DruidAssembly-SBT-assembly-1.0.jar:1.0]

at com.google.inject.internal.InternalInjectorCreator.build(InternalInjectorCreator.java:107) ~[DruidAssembly-SBT-assembly-1.0.jar:1.0]

at com.google.inject.Guice.createInjector(Guice.java:96) ~[DruidAssembly-SBT-assembly-1.0.jar:1.0]

at com.google.inject.Guice.createInjector(Guice.java:73) ~[DruidAssembly-SBT-assembly-1.0.jar:1.0]

at com.google.inject.Guice.createInjector(Guice.java:62) ~[DruidAssembly-SBT-assembly-1.0.jar:1.0]

at io.druid.initialization.Initialization.makeInjectorWithModules(Initialization.java:369) ~[DruidAssembly-SBT-assembly-1.0.jar:1.0]

at io.druid.cli.GuiceRunnable.makeInjector(GuiceRunnable.java:55) ~[DruidAssembly-SBT-assembly-1.0.jar:1.0]

at io.druid.cli.CliInternalHadoopIndexer.run(CliInternalHadoopIndexer.java:94) ~[DruidAssembly-SBT-assembly-1.0.jar:1.0]

at io.druid.cli.Main.main(Main.java:88) ~[DruidAssembly-SBT-assembly-1.0.jar:1.0]

… 6 more

I am not sure what the problem is. It’s clearly trying to use Guice to wire up modules, but I thought the self-contained jar solved all of that.

Any help that you can provide would be very much appreciated!

Johnny Hom

I just realized that I was using the example wikipedia_index_hadoop_task.json file, which is actually an indexing service task and not a HadoopDruidIndexer spec.

Now that I’m using the right one, it actually works :)
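For anyone who hits the same thing, the giveaway is the wrapper in the JSON. An indexing service task wraps the ingestion spec in a task object with a “type” field, roughly like this (a sketch from memory; exact fields depend on your Druid version):

{
  "type" : "index_hadoop",
  "spec" : { ... the Hadoop ingestion spec goes in here ... }
}

A HadoopDruidIndexer spec file is just the bare ingestion spec with no task wrapper, so if your JSON starts with "type" : "index_hadoop", it belongs in the indexing service, not on the HadoopDruidIndexer command line.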

Hi Johnny,
Can you tell me how you used HadoopDruidIndexer? What are the command line and spec file?
I am having the same problem, but I could not find any HadoopDruidIndexer example in the examples/ directory.

Thanks for your help.
-Vinay

Hi Vinay,
First of all, please read this thread: https://groups.google.com/forum/#!searchin/druid-user/hadoop|sort:date/druid-user/UM-Cgj750sY/dQU4zdhhExEJ

Apparently there are version conflicts involving fasterxml, and the way around them is to compile your own fat jar that is free of the conflicting dependencies.

The authors of the post were really wonderful in supplying the build.sbt file you need to compile your own jar.

Google sbt; it is fairly quick to get started. Once you have compiled your fat jar, you can run it from the command line.

Then you need to build a “spec” file to run it with: http://druid.io/docs/latest/ingestion/batch-ingestion.html
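To give you a rough idea of the shape, a minimal Hadoop spec file looks something like the following (a sketch only; field names and required sections vary across Druid versions, so copy from the docs page above rather than from here):

{
  "dataSchema" : {
    "dataSource" : "wikipedia",
    "parser" : { ... how to parse your input rows ... },
    "metricsSpec" : [ ... your aggregators ... ],
    "granularitySpec" : { ... segment/query granularity and intervals ... }
  },
  "ioConfig" : {
    "type" : "hadoop",
    "inputSpec" : {
      "type" : "static",
      "paths" : "/path/to/your/input/data"
    }
  },
  "tuningConfig" : {
    "type" : "hadoop"
  }
}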

Key point: there are TWO ways of batch loading using Hadoop. One is HadoopDruidIndexer, and the other is using the indexing service and sending it an indexing task.

In the documentation, use the format described under the Hadoop “specFile” section linked above, and NOT the one with “type” : “index_hadoop”.

Once you have configured your jar and your specFile, you can run a command line something like:

java -server -Xmx1g -XX:+UseConcMarkSweepGC -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Djava.io.tmpdir=/tmp -classpath $DRUID_ASSEMBLY_HOME/DruidAssembly-1.0.jar io.druid.cli.Main index hadoop $HADOOP_JOB

Of course, fill in your own configs.

Good luck!!!

Thanks Johnny. Still trying to figure out what the difference is between an index_hadoop job and a HadoopDruidIndexer job.

Thanks,

-Vinay

Hi Vinay, the index_hadoop job runs in the indexing service, and HadoopDruidIndexer runs the same code standalone; see the sketch after the link below.

http://druid.io/docs/latest/ingestion/batch-ingestion.html
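Concretely, the same ingestion spec ends up running in two different harnesses. Standalone, you hand the bare spec straight to the CLI, as in Johnny’s command above (my_hadoop_spec.json is just a placeholder name, and the JVM flags are elided):

java ... io.druid.cli.Main index hadoop my_hadoop_spec.json

With the indexing service, you wrap the spec in an index_hadoop task and POST it to the overlord’s task endpoint (the host and port below are placeholders for wherever your overlord runs):

curl -X POST -H 'Content-Type: application/json' -d @wikipedia_index_hadoop_task.json http://OVERLORD_HOST:PORT/druid/indexer/v1/task

The overlord then spawns a task that runs the same Hadoop indexing code.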