Questions regarding the Kafka indexing service

Hey,

We have a few questions regarding the Kafka indexing service:

  1. In the Kafka indexing service, if an indexing task is unable to complete or fails for some reason, is there any data loss? Will the data that the task had already indexed be persisted by a future task?

  2. If a task hits the completion timeout, will its data still be persisted to deep storage by a future task? We have changed the supervisor spec and increased the timeout, but what about the data from tasks that already failed because of the completion timeout?

Thanks,

Saurabh

Hi Saurabh,

If the indexing task fails for any reason (including a completion timeout) another task will be spawned that will re-read the same offsets from Kafka previously read by the failed task and will try publishing the segments again. There are protections in place to prevent offsets from being skipped when tasks fail. As long as the Kafka message retention period is long enough that the retry task can still access the data, no data will be lost.

Okay, that helps. But we tried to increase the completionTimeout and it is still not updating. We updated it by posting the supervisor spec again. Do we need to restart some nodes?

You shouldn’t have to do anything other than reposting the supervisor spec.
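For reference, reposting the spec is just an HTTP POST of the full supervisor spec back to the overlord's supervisor endpoint. A minimal sketch (the overlord host, datasource, topic, and broker address below are placeholders for your own values; only the fields relevant to this thread are shown):

```python
import json
import urllib.request

# Hypothetical overlord address; substitute your own.
OVERLORD_URL = "http://overlord:8090/druid/indexer/v1/supervisor"

def build_spec(completion_timeout="PT1800S"):
    """Build a minimal Kafka supervisor spec with an increased completionTimeout.
    A real spec also needs dataSchema, tuningConfig, etc."""
    return {
        "type": "kafka",
        "dataSource": "prism-data-6",
        "ioConfig": {
            "topic": "prism-data-6",
            "consumerProperties": {"bootstrap.servers": "kafka:9092"},
            # completionTimeout bounds how long a task may spend publishing
            # its segments after its reading duration elapses.
            "completionTimeout": completion_timeout,
        },
    }

def repost_supervisor(spec):
    """POST the spec to the overlord; the running supervisor picks up the change."""
    req = urllib.request.Request(
        OVERLORD_URL,
        data=json.dumps(spec).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    return urllib.request.urlopen(req)
```

Posting to the same endpoint with the same dataSource replaces the running supervisor's spec, so no node restarts should be needed.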

Can you post the logs from your overlord and one of the tasks that isn’t completing?

Also if you’re using MySQL, make sure that you’re using the mysql-metadata-storage extension that corresponds to the version of Druid you downloaded (and not one from a previous version). There’ve been a few other reports of handoffs failing because of incompatible mysql extensions between 0.9.0 and 0.9.1.1.
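One quick way to catch a stale extension is to compare the version suffix on the extension jar filenames against the Druid distribution version. A small sketch (the version-suffix naming convention is an assumption about how the jars are named; adjust the pattern if yours differ):

```python
import re

# The Druid distribution version in use (per this thread).
DRUID_VERSION = "0.9.1.1"

def extension_versions(jar_names):
    """Map each jar filename to the trailing version number in its name."""
    pattern = re.compile(r"-(\d+(?:\.\d+)+)\.jar$")
    return {name: m.group(1) for name in jar_names if (m := pattern.search(name))}

def mismatched(jar_names, druid_version=DRUID_VERSION):
    """Return jars whose version suffix does not match the Druid distribution."""
    return [name for name, ver in extension_versions(jar_names).items()
            if ver != druid_version]
```

Running `mismatched` over the filenames in your `extensions/mysql-metadata-storage` directory would flag a leftover 0.9.0 jar.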

We have checked that we are using the current version of the MySQL metadata storage extension, i.e. 0.9.1.1. Here are the logs we are getting for the failed task:

2016-09-15T00:47:45,240 INFO [appenderator_persist_0] io.druid.curator.announcement.Announcer - unannouncing [/druid/segments/druid-middle-manager-002.c.inshorts-1374.internal:8101/druid-middle-manager-002.c.inshorts-1374.internal:8101_indexer-executor__default_tier_2016-09-14T23:33:36.463Z_04fdc625912347859257d0303042dfd70]

2016-09-15T00:47:45,252 INFO [appenderator_persist_0] io.druid.segment.realtime.appenderator.AppenderatorImpl - Removing sink for segment[prism-data-6_2016-08-23T00:00:00.000Z_2016-08-24T00:00:00.000Z_2016-09-14T13:55:58.395Z_39].

2016-09-15T00:47:45,255 ERROR [task-runner-0-priority-0] io.druid.indexing.overlord.ThreadPoolTaskRunner - Exception while running task[KafkaIndexTask{id=index_kafka_prism-data-6_1191ad8fce5b84a_befmfbbi, type=index_kafka, dataSource=prism-data-6}]

com.metamx.common.ISE: Transaction failure publishing segments, aborting

at io.druid.indexing.kafka.KafkaIndexTask.run(KafkaIndexTask.java:506) ~[?:?]

at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:436) [druid-indexing-service-0.9.1.2-SNAPSHOT.jar:0.9.1.2-SNAPSHOT]

at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:408) [druid-indexing-service-0.9.1.2-SNAPSHOT.jar:0.9.1.2-SNAPSHOT]

at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_101]

at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_101]

at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_101]

at java.lang.Thread.run(Thread.java:745) [?:1.8.0_101]

2016-09-15T00:47:45,260 INFO [task-runner-0-priority-0] io.druid.indexing.overlord.TaskRunnerUtils - Task [index_kafka_prism-data-6_1191ad8fce5b84a_befmfbbi] status changed to [FAILED].

2016-09-15T00:47:45,262 INFO [task-runner-0-priority-0] io.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status: {

"id" : "index_kafka_prism-data-6_1191ad8fce5b84a_befmfbbi",

"status" : "FAILED",

"duration" : 4449346

}

2016-09-15T00:47:45,266 INFO [main] com.metamx.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void io.druid.server.coordination.AbstractDataSegmentAnnouncer.stop()] on object[io.druid.server.coordination.BatchDataSegmentAnnouncer@6221b13b].

2016-09-15T00:47:45,266 INFO [main] io.druid.server.coordination.AbstractDataSegmentAnnouncer - Stopping class io.druid.server.coordination.BatchDataSegmentAnnouncer with config[io.druid.server.initialization.ZkPathsConfig@22e2266d]

2016-09-15T00:47:45,266 INFO [main] io.druid.curator.announcement.Announcer - unannouncing [/druid/announcements/druid-middle-manager-002.c.inshorts-1374.internal:8101]

2016-09-15T00:47:45,268 INFO [main] com.metamx.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void io.druid.server.listener.announcer.ListenerResourceAnnouncer.stop()] on object[io.druid.query.lookup.LookupResourceListenerAnnouncer@31b0f02].

Hey Saurabh,

Can you post the full task log? The full log might have more details about why there was a transaction failure.

If you shut down the supervisor, wait for / kill all related indexing tasks, and restart it, does it help?

Log file can be found at : https://drive.google.com/file/d/0B3tqEupfAPVhMEhQcVd6OEpkRGs/view?usp=sharing.
This is happening for some tasks, not all of them. What could be the possible issue?

Hey Saurabh,

Does this always happen for the same datasource or does it happen for different datasources?

Do you happen to have the overlord logs around the time when the task failed? Those would be helpful.

It is happening for the same datasource. No, we don't have any overlord logs from around that time. I think this issue was related to a connection timeout; I increased it and no task is failing because of this anymore.
But even though my tasks are now succeeding and I can see some of the segments on both the historical nodes and in deep storage, I am seeing a lot of pending segments in MySQL, and the number increases day by day. What could be the reason for this? We have checked the logs on all nodes and can't figure out much, as there are no errors. There is also space available on the historical nodes for segment handoff. Any ideas what the cause could be?

Hey Saurabh,

Glad to hear things are working for you. Regarding the pending segments table: entries are not actually removed upon handoff, so the existence of rows in the table doesn't indicate that handoff failed. If you accumulate enough entries that you start to see issues, you can periodically purge the old entries manually.
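A manual purge is just a DELETE against the pending segments table with a date cutoff. A minimal sketch that builds the statement (the table name `druid_pendingSegments` follows Druid's default metadata table naming; adjust it if you configured a different base table prefix, and always back up the metadata store before deleting rows):

```python
from datetime import datetime, timedelta, timezone

# Default Druid metadata table name; change if your table prefix differs.
PENDING_TABLE = "druid_pendingSegments"

def purge_sql(older_than_days=30, now=None):
    """Build a DELETE removing pending-segment rows older than the cutoff.
    created_date is stored as an ISO-8601 string, so a lexicographic
    string comparison against the cutoff works."""
    now = now or datetime.now(timezone.utc)
    cutoff = (now - timedelta(days=older_than_days)).strftime("%Y-%m-%dT%H:%M:%S")
    return f"DELETE FROM {PENDING_TABLE} WHERE created_date < '{cutoff}'"
```

You could run the generated statement from a cron job against MySQL to keep the table from growing indefinitely.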