Druid Hadoop Indexer Issue

I've been trying to use Druid with Hadoop via the Hadoop indexer. So far I've been able to run batch ingestion from local files and realtime ingestion from Kafka.

When I submit a task, it stays in the running state for a long time (30+ minutes so far). I'm following this tutorial:

http://druid.io/docs/latest/ingestion/batch-ingestion.html

Here's a snippet from the logs, any advice would be appreciated!


["index_hadoop_wikipedia_2016-08-01T19:51:49.918Z"],"poolKind":"heap","poolName":"PS Eden Space"}]
2016-08-01T20:27:53,356 INFO [MonitorScheduler-0] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-08-01T20:27:53.356Z","service":"druid/middleManager","host":"localhost:8101","metric":"jvm/pool/committed","value":445644800,"dataSource":["wikipedia"],"id":["index_hadoop_wikipedia_2016-08-01T19:51:49.918Z"],"poolKind":"heap","poolName":"PS Eden Space"}]
2016-08-01T20:27:53,356 INFO [MonitorScheduler-0] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-08-01T20:27:53.356Z","service":"druid/middleManager","host":"localhost:8101","metric":"jvm/pool/used","value":337379344,"dataSource":["wikipedia"],"id":["index_hadoop_wikipedia_2016-08-01T19:51:49.918Z"],"poolKind":"heap","poolName":"PS Eden Space"}]
2016-08-01T20:27:53,357 INFO [MonitorScheduler-0] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-08-01T20:27:53.356Z","service":"druid/middleManager","host":"localhost:8101","metric":"jvm/pool/init","value":131596288,"dataSource":["wikipedia"],"id":["index_hadoop_wikipedia_2016-08-01T19:51:49.918Z"],"poolKind":"heap","poolName":"PS Eden Space"}]
2016-08-01T20:27:53,357 INFO [MonitorScheduler-0] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-08-01T20:27:53.357Z","service":"druid/middleManager","host":"localhost:8101","metric":"jvm/pool/max","value":8912896,"dataSource":["wikipedia"],"id":["index_hadoop_wikipedia_2016-08-01T19:51:49.918Z"],"poolKind":"heap","poolName":"PS Survivor Space"}]
2016-08-01T20:27:53,357 INFO [MonitorScheduler-0] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-08-01T20:27:53.357Z","service":"druid/middleManager","host":"localhost:8101","metric":"jvm/pool/committed","value":8912896,"dataSource":["wikipedia"],"id":["index_hadoop_wikipedia_2016-08-01T19:51:49.918Z"],"poolKind":"heap","poolName":"PS Survivor Space"}]
2016-08-01T20:27:53,357 INFO [MonitorScheduler-0] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-08-01T20:27:53.357Z","service":"druid/middleManager","host":"localhost:8101","metric":"jvm/pool/used","value":8707200,"dataSource":["wikipedia"],"id":["index_hadoop_wikipedia_2016-08-01T19:51:49.918Z"],"poolKind":"heap","poolName":"PS Survivor Space"}]
2016-08-01T20:27:53,357 INFO [MonitorScheduler-0] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-08-01T20:27:53.357Z","service":"druid/middleManager","host":"localhost:8101","metric":"jvm/pool/init","value":21495808,"dataSource":["wikipedia"],"id":["index_hadoop_wikipedia_2016-08-01T19:51:49.918Z"],"poolKind":"heap","poolName":"PS Survivor Space"}]
2016-08-01T20:27:53,357 INFO [MonitorScheduler-0] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-08-01T20:27:53.357Z","service":"druid/middleManager","host":"localhost:8101","metric":"jvm/pool/max","value":1431830528,"dataSource":["wikipedia"],"id":["index_hadoop_wikipedia_2016-08-01T19:51:49.918Z"],"poolKind":"heap","poolName":"PS Old Gen"}]
2016-08-01T20:27:53,357 INFO [MonitorScheduler-0] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-08-01T20:27:53.357Z","service":"druid/middleManager","host":"localhost:8101","metric":"jvm/pool/committed","value":377487360,"dataSource":["wikipedia"],"id":["index_hadoop_wikipedia_2016-08-01T19:51:49.918Z"],"poolKind":"heap","poolName":"PS Old Gen"}]
2016-08-01T20:27:53,357 INFO [MonitorScheduler-0] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-08-01T20:27:53.357Z","service":"druid/middleManager","host":"localhost:8101","metric":"jvm/pool/used","value":18413048,"dataSource":["wikipedia"],"id":["index_hadoop_wikipedia_2016-08-01T19:51:49.918Z"],"poolKind":"heap","poolName":"PS Old Gen"}]
2016-08-01T20:27:53,357 INFO [MonitorScheduler-0] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-08-01T20:27:53.357Z","service":"druid/middleManager","host":"localhost:8101","metric":"jvm/pool/init","value":349700096,"dataSource":["wikipedia"],"id":["index_hadoop_wikipedia_2016-08-01T19:51:49.918Z"],"poolKind":"heap","poolName":"PS Old Gen"}]
2016-08-01T20:27:53,357 INFO [MonitorScheduler-0] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-08-01T20:27:53.357Z","service":"druid/middleManager","host":"localhost:8101","metric":"jvm/gc/count","value":0,"dataSource":["wikipedia"],"gcName":"PS Scavenge","id":["index_hadoop_wikipedia_2016-08-01T19:51:49.918Z"]}]
2016-08-01T20:27:53,357 INFO [MonitorScheduler-0] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-08-01T20:27:53.357Z","service":"druid/middleManager","host":"localhost:8101","metric":"jvm/gc/time","value":0,"dataSource":["wikipedia"],"gcName":"PS Scavenge","id":["index_hadoop_wikipedia_2016-08-01T19:51:49.918Z"]}]
2016-08-01T20:27:53,358 INFO [MonitorScheduler-0] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-08-01T20:27:53.357Z","service":"druid/middleManager","host":"localhost:8101","metric":"jvm/gc/count","value":0,"dataSource":["wikipedia"],"gcName":"PS MarkSweep","id":["index_hadoop_wikipedia_2016-08-01T19:51:49.918Z"]}]
2016-08-01T20:27:53,358 INFO [MonitorScheduler-0] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-08-01T20:27:53.358Z","service":"druid/middleManager","host":"localhost:8101","metric":"jvm/gc/time","value":0,"dataSource":["wikipedia"],"gcName":"PS MarkSweep","id":["index_hadoop_wikipedia_2016-08-01T19:51:49.918Z"]}]
2016-08-01T20:27:53,358 INFO [MonitorScheduler-0] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-08-01T20:27:53.358Z","service":"druid/middleManager","host":"localhost:8101","metric":"jvm/bufferpool/capacity","value":1716300,"bufferpoolName":"direct","dataSource":["wikipedia"],"id":["index_hadoop_wikipedia_2016-08-01T19:51:49.918Z"]}]
2016-08-01T20:27:53,358 INFO [MonitorScheduler-0] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-08-01T20:27:53.358Z","service":"druid/middleManager","host":"localhost:8101","metric":"jvm/bufferpool/used","value":1716300,"bufferpoolName":"direct","dataSource":["wikipedia"],"id":["index_hadoop_wikipedia_2016-08-01T19:51:49.918Z"]}]
2016-08-01T20:27:53,358 INFO [MonitorScheduler-0] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-08-01T20:27:53.358Z","service":"druid/middleManager","host":"localhost:8101","metric":"jvm/bufferpool/count","value":56,"bufferpoolName":"direct","dataSource":["wikipedia"],"id":["index_hadoop_wikipedia_2016-08-01T19:51:49.918Z"]}]
2016-08-01T20:27:53,358 INFO [MonitorScheduler-0] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-08-01T20:27:53.358Z","service":"druid/middleManager","host":"localhost:8101","metric":"jvm/bufferpool/capacity","value":0,"bufferpoolName":"mapped","dataSource":["wikipedia"],"id":["index_hadoop_wikipedia_2016-08-01T19:51:49.918Z"]}]
2016-08-01T20:27:53,358 INFO [MonitorScheduler-0] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-08-01T20:27:53.358Z","service":"druid/middleManager","host":"localhost:8101","metric":"jvm/bufferpool/used","value":0,"bufferpoolName":"mapped","dataSource":["wikipedia"],"id":["index_hadoop_wikipedia_2016-08-01T19:51:49.918Z"]}]
2016-08-01T20:27:53,358 INFO [MonitorScheduler-0] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-08-01T20:27:53.358Z","service":"druid/middleManager","host":"localhost:8101","metric":"jvm/bufferpool/count","value":0,"bufferpoolName":"mapped","dataSource":["wikipedia"],"id":["index_hadoop_wikipedia_2016-08-01T19:51:49.918Z"]}]

Hi Ian, is the task failing, or is it just taking a long time to run?

By default, you will use a local Hadoop cluster to index the data, which can be very slow. For larger files and more real-world workloads, we recommend using an existing remote Hadoop cluster to run the indexing.

You can configure Druid to talk to a remote Hadoop cluster by including the Hadoop configuration XML files in Druid’s startup classpath.

http://druid.io/docs/0.9.1.1/tutorials/cluster.html
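
As a rough sketch of the classpath step above, the snippet below copies the usual Hadoop client configuration files into a directory that Druid loads on startup. The file names are the standard Hadoop ones; the two directory paths in the usage comment (`/etc/hadoop/conf` and `conf/druid/_common`) and the helper name `copy_hadoop_conf` are assumptions for illustration, so adjust them to your own install.

```python
import shutil
from pathlib import Path

# Standard Hadoop client configuration file names.
HADOOP_CONF_FILES = (
    "core-site.xml",
    "hdfs-site.xml",
    "yarn-site.xml",
    "mapred-site.xml",
)


def copy_hadoop_conf(hadoop_conf_dir, druid_common_dir):
    """Copy the Hadoop config XMLs that exist into a Druid classpath directory.

    Returns a (copied, missing) pair of file-name lists so you can see
    which files were not found in the source directory.
    """
    src = Path(hadoop_conf_dir)
    dst = Path(druid_common_dir)
    dst.mkdir(parents=True, exist_ok=True)

    copied, missing = [], []
    for name in HADOOP_CONF_FILES:
        if (src / name).is_file():
            shutil.copy2(src / name, dst / name)
            copied.append(name)
        else:
            missing.append(name)
    return copied, missing


# Example usage (hypothetical paths, not taken from this thread):
# copied, missing = copy_hadoop_conf("/etc/hadoop/conf", "conf/druid/_common")
# print("copied:", copied, "missing:", missing)
```

The Druid services need to be restarted after the files are in place so that the new classpath contents are picked up.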

There’s also a tutorial here: http://druid.io/docs/0.9.1.1/tutorials/tutorial-batch.html

FWIW, most of Druid’s getting started docs are based on Imply’s getting started guide, so there may be additional information there.

The tasks haven't been failing, but some ran for over 8 hours, at which point I killed them, since the input was only a few MB of data or less.

I'm attempting to use a remote cluster presently.

Thanks,

Ian