Difficulty ingesting batch data, not able to locate an explicit error.

Hello,

I have been testing out Druid and I have run into the following issue. I am trying to batch-ingest a single day of data spread across three Hive sequence files (each 9 GB) stored in S3. The job is listed as having failed, but I am having a hard time locating the cause.

I am running Druid 0.7.0.

I am currently seeing the following in the Overlord’s logs that seems to be relevant:

2015-04-22T03:20:25,212 INFO [PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Worker[host:port] wrote FAILED status for task: index_one_day_synthetic_traffic_agg_2015-04-21T21:10:40.514Z

2015-04-22T03:20:25,212 INFO [PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Worker[host:port] completed task[index_one_day_synthetic_traffic_agg_2015-04-21T21:10:40.514Z] with status[FAILED]

2015-04-22T03:20:25,212 INFO [PathChildrenCache-0] io.druid.indexing.overlord.TaskQueue - Received FAILED status for task: index_one_day_synthetic_traffic_agg_2015-04-21T21:10:40.514Z

2015-04-22T03:20:25,213 INFO [PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Cleaning up task[index_one_day_synthetic_traffic_agg_2015-04-21T21:10:40.514Z] on worker[host:port]

2015-04-22T03:20:25,215 INFO [PathChildrenCache-0] io.druid.indexing.overlord.MetadataTaskStorage - Updating task index_one_day_synthetic_traffic_agg_2015-04-21T21:10:40.514Z to status: TaskStatus{id=index_one_day_synthetic_traffic_agg_2015-04-21T21:10:40.514Z, status=FAILED, duration=0}

2015-04-22T03:20:25,220 INFO [PathChildrenCache-0] io.druid.indexing.overlord.TaskLockbox - Removing task[index_one_day_synthetic_traffic_agg_2015-04-21T21:10:40.514Z] from TaskLock[index_one_day_synthetic_traffic_agg_2015-04-21T21:10:40.514Z]

2015-04-22T03:20:25,220 INFO [PathChildrenCache-0] io.druid.indexing.overlord.TaskLockbox - TaskLock is now empty: TaskLock{groupId=index_one_day_synthetic_traffic_agg_2015-04-21T21:10:40.514Z, dataSource=one_day_synthetic_traffic_agg, interval=2015-02-01T00:00:00.000Z/2015-02-02T00:00:00.000Z, version=2015-04-21T21:10:40.522Z}

2015-04-22T03:20:25,221 INFO [PathChildrenCache-0] io.druid.indexing.overlord.MetadataTaskStorage - Deleting TaskLock with id[7]: TaskLock{groupId=index_one_day_synthetic_traffic_agg_2015-04-21T21:10:40.514Z, dataSource=one_day_synthetic_traffic_agg, interval=2015-02-01T00:00:00.000Z/2015-02-02T00:00:00.000Z, version=2015-04-21T21:10:40.522Z}

2015-04-22T03:20:25,224 INFO [PathChildrenCache-0] io.druid.indexing.overlord.TaskQueue - Task done: IndexTask{id=index_one_day_synthetic_traffic_agg_2015-04-21T21:10:40.514Z, type=index, dataSource=one_day_synthetic_traffic_agg}

2015-04-22T03:20:25,224 INFO [PathChildrenCache-0] io.druid.indexing.overlord.TaskQueue - Task FAILED: IndexTask{id=index_one_day_synthetic_traffic_agg_2015-04-21T21:10:40.514Z, type=index, dataSource=one_day_synthetic_traffic_agg} (0 run duration)

2015-04-22T03:20:25,224 INFO [PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Task[index_one_day_synthetic_traffic_agg_2015-04-21T21:10:40.514Z] went bye bye.

And the log for the relevant task just ends with the following:

2015-04-22T03:20:18,845 INFO [task-runner-0] io.druid.indexing.common.index.YeOldePlumberSchool - Spilling index[508] with rows[500000] to: /mnt/persistent/task/index_one_day_synthetic_traffic_agg_2015-04-21T21:10:40.514Z/work/one_day_synthetic_traffic_agg_2015-02-01T00:00:00.000Z_2015-02-02T00:00:00.000Z_2015-04-21T21:10:40.522Z_0/one_day_synthetic_traffic_agg_2015-02-01T00:00:00.000Z_2015-02-02T00:00:00.000Z_2015-04-21T21:10:40.522Z/spill508

2015-04-22T03:20:18,846 INFO [task-runner-0] io.druid.segment.IndexMaker - Starting persist for interval[2015-02-01T00:00:00.000Z/2015-02-01T00:00:00.001Z], rows[500,000]

2015-04-22T03:20:19,323 INFO [task-runner-0] io.druid.segment.LoggingProgressIndicator - Starting [/mnt/persistent/task/index_one_day_synthetic_traffic_agg_2015-04-21T21:10:40.514Z/work/one_day_synthetic_traffic_agg_2015-02-01T00:00:00.000Z_2015-02-02T00:00:00.000Z_2015-04-21T21:10:40.522Z_0/one_day_synthetic_traffic_agg_2015-02-01T00:00:00.000Z_2015-02-02T00:00:00.000Z_2015-04-21T21:10:40.522Z/spill508]

2015-04-22T03:20:19,323 INFO [task-runner-0] io.druid.segment.LoggingProgressIndicator - [/mnt/persistent/task/index_one_day_synthetic_traffic_agg_2015-04-21T21:10:40.514Z/work/one_day_synthetic_traffic_agg_2015-02-01T00:00:00.000Z_2015-02-02T0

Otherwise I see no exceptions or errors out of the ordinary.

I am using a modified version of the setup suggested at http://druid.io/docs/0.7.0/Production-Cluster-Configuration.html

my common.runtime.properties:

druid.extensions.coordinates=["io.druid.extensions:druid-s3-extensions", "io.druid.extensions:druid-histogram", "io.druid.extensions:mysql-metadata-storage"]

druid.zk.service.host=#{MY_ZK_HOST}

druid.zk.paths.base=/druid/prod

druid.discovery.curator.path=/prod/discovery

druid.monitoring.monitors=["com.metamx.metrics.JvmMonitor"]

druid.emitter=noop

druid.metadata.storage.type=mysql

druid.metadata.storage.connector.connectURI=jdbc:mysql://#{HOST}:#{MYSQL_PORT}/#{DB}?characterEncoding=UTF-8

druid.metadata.storage.connector.user=#{DB_USER}

druid.metadata.storage.connector.password=#{DB_PASSWORD}

druid.storage.type=s3

druid.s3.accessKey=#{AWS_ACCESS_KEY_ID}

druid.s3.secretKey=#{AWS_SECRET_KEY}

druid.cache.type=local

druid.cache.sizeInBytes=0

druid.selectors.indexing.serviceName=druid:prod:overlord

my overlord/runtime.properties

druid.host=#{HOST}:#{PORT}
druid.port=#{PORT}
druid.service=druid/prod/overlord

druid.indexer.logs.type=s3
druid.indexer.logs.s3Bucket=#{MY_BUCKET}
druid.indexer.logs.s3Prefix=#{MY_PREFIX}
druid.indexer.runner.type=remote
druid.indexer.storage.type=metadata

my broker/runtime.properties

druid.host=#{HOST}:#{PORT}
druid.port=#{PORT}
druid.service=druid/prod/broker

druid.broker.http.numConnections=20
druid.broker.http.readTimeout=PT5M

druid.processing.buffer.sizeBytes=2147483647
druid.processing.numThreads=31
druid.server.http.numThreads=50

my middleManager/runtime.properties:
druid.host=#{HOST}:#{PORT}
druid.port=#{PORT}
druid.service=druid/prod/middlemanager

druid.indexer.logs.type=s3
druid.indexer.logs.s3Bucket=#{BUCKET}
druid.indexer.logs.s3Prefix=#{PREFIX}

druid.indexer.runner.javaOpts=-server -Xmx3g -XX:+UseG1GC -XX:MaxGCPauseMillis=100 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
druid.indexer.task.baseTaskDir=/mnt/persistent/task/

druid.indexer.fork.property.druid.monitoring.monitors=["com.metamx.metrics.JvmMonitor"]
druid.indexer.fork.property.druid.processing.buffer.sizeBytes=536870912
druid.indexer.fork.property.druid.processing.numThreads=2
druid.indexer.fork.property.druid.segmentCache.locations=[{"path": "/mnt/persistent/zk_druid", "maxSize": 0}]
druid.indexer.fork.property.druid.server.http.numThreads=50
druid.indexer.fork.property.druid.storage.archiveBaseKey=#{ARCHIVE_BASE_KEY}
druid.indexer.fork.property.druid.storage.archiveBucket=#{ARCHIVE_BUCKET}
druid.indexer.fork.property.druid.storage.baseKey=#{STORAGE_BASE_KEY}
druid.indexer.fork.property.druid.storage.bucket=#{STORAGE_BUCKET}
druid.indexer.fork.property.druid.storage.type=s3

druid.worker.capacity=9

My java commands for each of these are the same as in the production cluster docs. I believe this covers all the relevant information, but please let me know if there is anything else I can provide.

I should also mention that I have successfully ingested smaller subsets of this data consisting of a single file.

Hi Justin,

The task logs should contain details about what went wrong. These should be available from the overlord’s web console; under “running tasks” and “complete tasks” you can click “log (all)” to download the log. Would you mind posting those?

One other thing to keep in mind, separate from the debugging, is that the plain Druid “index” task is generally not very efficient on large amounts of batch data. If you have a Hadoop cluster available, the “index_hadoop” task is much more efficient and is the one recommended for production use.
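For reference, a 0.7.x "index_hadoop" task spec has roughly the shape sketched below. This is only an outline to show where things go; the bucket, paths, and dataSchema contents are placeholders, not taken from your setup, so fill them in from your existing spec and double-check field names against the batch ingestion docs:

```json
{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": {
      "dataSource": "one_day_synthetic_traffic_agg",
      "comment": "parser, metricsSpec, and granularitySpec go here, as in your existing spec"
    },
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "paths": "s3n://YOUR_BUCKET/path/file1.seq,s3n://YOUR_BUCKET/path/file2.seq"
      }
    },
    "tuningConfig": {
      "type": "hadoop"
    }
  }
}
```

When submitted to the overlord (rather than run standalone), the overlord handles metadata storage itself, so no metadata connection details are needed in the spec.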

Thank you for looking into this. I have since attempted to spin up a HadoopDruidIndexer, but I am running into complications.

My setup is the same as above, but I have also launched an Amazon EMR cluster running Amazon's 2.4.0 Hadoop distribution. On the master node, I download and extract Druid, update the common.runtime.properties to be consistent with the above, and write a Hadoop index spec containing my data schema as well as the locations of metadata storage and segment storage.

I then try running:

java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath lib/*:/home/hadoop/conf/ io.druid.cli.Main index hadoop index_hadoop_spec.json

on the master node.

I get the following error when attempting this:

WARNING: Multiple Servlet injectors detected. This is a warning indicating that you have more than one GuiceFilter running in your web application. If this is deliberate, you may safely ignore this message. If this is NOT deliberate however, your application may not work as expected.

2015-05-07T17:37:45,447 INFO [main] io.druid.guice.JsonConfigurator - Loaded class[class io.druid.server.metrics.DruidMonitorSchedulerConfig] from props[druid.monitoring.] as [io.druid.server.metrics.DruidMonitorSchedulerConfig@7778463a]

2015-05-07T17:37:45,458 INFO [main] io.druid.guice.JsonConfigurator - Loaded class[class io.druid.server.metrics.MonitorsConfig] from props[druid.monitoring.] as [MonitorsConfig{monitors=}]

2015-05-07T17:37:45,481 INFO [main] io.druid.guice.JsonConfigurator - Loaded class[class io.druid.server.DruidNode] from props[druid.] as [DruidNode{serviceName='druid/internal-hadoop-indexer', host='#{INTERNAL_IP}', port=0}]

2015-05-07T17:37:45,489 ERROR [main] io.druid.cli.CliHadoopIndexer - failure!!!

java.lang.reflect.InvocationTargetException

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.7.0_71]

at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) ~[?:1.7.0_71]

at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.7.0_71]

at java.lang.reflect.Method.invoke(Method.java:606) ~[?:1.7.0_71]

at io.druid.cli.CliHadoopIndexer.run(CliHadoopIndexer.java:120) [druid-services-0.7.0.jar:0.7.0]

at io.druid.cli.Main.main(Main.java:88) [druid-services-0.7.0.jar:0.7.0]

Caused by: com.google.inject.ProvisionException: Guice provision errors:

  1. Unknown provider[metadata] of Key[type=io.druid.metadata.SQLMetadataConnector, annotation=[none]], known options[[derby]]

at io.druid.guice.PolyBind.createChoiceWithDefault(PolyBind.java:67)

while locating io.druid.metadata.SQLMetadataConnector

for parameter 0 at io.druid.indexer.SQLMetadataStorageUpdaterJobHandler.<init>(SQLMetadataStorageUpdaterJobHandler.java:42)

while locating io.druid.indexer.SQLMetadataStorageUpdaterJobHandler

at io.druid.guice.PolyBind.createChoiceWithDefault(PolyBind.java:67)

while locating io.druid.indexer.MetadataStorageUpdaterJobHandler

1 error

at com.google.inject.internal.InjectorImpl$3.get(InjectorImpl.java:1014) ~[guice-4.0-beta.jar:?]

at com.google.inject.internal.InjectorImpl.getInstance(InjectorImpl.java:1040) ~[guice-4.0-beta.jar:?]

at io.druid.cli.CliInternalHadoopIndexer.run(CliInternalHadoopIndexer.java:103) ~[druid-services-0.7.0.jar:0.7.0]

at io.druid.cli.Main.main(Main.java:88) ~[druid-services-0.7.0.jar:0.7.0]

… 6 more

Am I doing something wrong? Do you have any suggestions? Any help is greatly appreciated!

Thank you!

This looks like a mysql metadata-storage configuration problem. You may need to add a "metadataUpdateSpec" to your index_hadoop_spec.json, or add the mysql-metadata-storage jar to your classpath (or both).
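For the standalone HadoopDruidIndexer, the metadataUpdateSpec lives inside the spec's ioConfig. A rough sketch of the shape, with placeholder connection details (field names as I recall them for 0.7.x, so verify against the batch ingestion docs):

```json
"ioConfig": {
  "type": "hadoop",
  "metadataUpdateSpec": {
    "type": "mysql",
    "connectURI": "jdbc:mysql://YOUR_HOST:3306/druid?characterEncoding=UTF-8",
    "user": "YOUR_DB_USER",
    "password": "YOUR_DB_PASSWORD",
    "segmentTable": "druid_segments"
  }
}
```

The "type" here is what the Guice error is complaining about: without the mysql-metadata-storage extension on the classpath, the only known option is "derby".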

Actually, someone in the IRC channel told me to put "derby" as the type in the metadataSpec, which seemed to resolve this initial problem.

Hi Justin, after you changed the metadata store to "derby", what errors are you hitting now? Derby is the default metadata storage and we use it for testing, but it is not suitable for production. It does seem like the extension is not getting picked up, so gian's suggestions may be worth trying.