Kafka indexing fails with com.metamx.common.ISE: Transaction failure publishing segments, aborting


I am using the default Derby database for metadata. All of these issues started when I tried to reload some historical data with the batch indexer for better compaction. Just to make sure, I went in and deleted everything from the following tables:

DRUID_PENDINGSEGMENTS
DRUID_TASKS
DRUID_TASKLOGS
DRUID_SUPERVISORS

and resubmitted the supervisor, but still no luck!
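
In case it helps anyone reproduce the cleanup, a minimal JDBC sketch along the lines below should do it. It assumes the Derby network server is reachable at the quickstart default connect URI (match your druid.metadata.storage.connector.connectURI), the default druid_ table prefix, and derbyclient.jar on the classpath; adjust for your setup.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class DruidMetadataCleanup
{
  public static void main(String[] args) throws Exception
  {
    // Assumed quickstart default; use the connectURI from common.runtime.properties.
    final String uri = "jdbc:derby://localhost:1527/var/druid/metadata.db";

    // The tables cleared above (default "druid_" prefix assumed).
    final String[] tables = {
        "druid_pendingSegments", "druid_tasks", "druid_tasklogs", "druid_supervisors"
    };

    try (Connection conn = DriverManager.getConnection(uri);
         Statement stmt = conn.createStatement()) {
      for (String table : tables) {
        int rows = stmt.executeUpdate("DELETE FROM " + table);
        System.out.println("Deleted " + rows + " rows from " + table);
      }
    }
  }
}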


Relevant Logs - Overlord

2016-10-04T14:47:34,392 INFO [qtp757779849-196] io.druid.metadata.IndexerSQLMetadataStorageCoordinator - Not updating metadata, existing state is not the expected start state.
2016-10-04T14:47:34,393 INFO [qtp757779849-196] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-10-04T14:47:34.393Z","service":"druid/overlord","host":"gb-slo-svb-0187.dunnhumby.co.uk:8090","metric":"segment/txn/failure","value":1,"dataSource":"tuk_real","taskType":"index_kafka"}]
2016-10-04T14:47:34,413 INFO [qtp757779849-197] io.druid.indexing.common.actions.LocalTaskActionClient - Performing action for task[index_kafka_tuk_real_f67568d067497e8_ocooldcn]: SegmentListUsedAction{dataSource='tuk_real', intervals=[2016-10-03T00:00:00.000Z/2016-10-06T00:00:00.000Z]}
2016-10-04T14:47:36,376 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Worker[gb-slo-svb-0187.dunnhumby.co.uk:8091] wrote FAILED status for task [index_kafka_tuk_real_f67568d067497e8_ocooldcn] on [TaskLocation{host='gb-slo-svb-0187.dunnhumby.co.uk', port=8100}]


2016-10-04T14:47:34,474 ERROR [task-runner-0-priority-0] io.druid.indexing.overlord.ThreadPoolTaskRunner - Exception while running task[KafkaIndexTask{id=index_kafka_tuk_real_f67568d067497e8_ocooldcn, type=index_kafka, dataSource=tuk_real}]
com.metamx.common.ISE: Transaction failure publishing segments, aborting
	at io.druid.indexing.kafka.KafkaIndexTask.run(KafkaIndexTask.java:506) ~[?:?]
	at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:436) [druid-indexing-service-0.9.1.1.jar:0.9.1.1]
	at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:408) [druid-indexing-service-0.9.1.1.jar:0.9.1.1]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_65]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_65]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_65]
	at java.lang.Thread.run(Thread.java:745) [?:1.8.0_65]
2016-10-04T14:47:34,480 INFO [task-runner-0-priority-0] io.druid.indexing.overlord.TaskRunnerUtils - Task [index_kafka_tuk_real_f67568d067497e8_ocooldcn] status changed to [FAILED].
2016-10-04T14:47:34,483 INFO [task-runner-0-priority-0] io.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status: {
  "id" : "index_kafka_tuk_real_f67568d067497e8_ocooldcn",
  "status" : "FAILED",
  "duration" : 3885581
}

Hey Giri,

Try removing the druid_dataSource table as well and see if that helps.

Hi David,

We encountered the same exception, and it seems to have led to data loss.

<1>
The issue https://github.com/druid-io/druid/issues/3600 suggests this might be a race condition in Druid. Is there any plan to fix it? And before a code fix lands, is there any workaround?

<2>
Btw, in my understanding, if we set the replicas of a task group to 2, two identical tasks run in parallel on different MiddleManagers. They consume the same Kafka data from the same offsets at the same time and generate the same segments; once one task finishes publishing a segment, the other abandons its copy. Conversely, if one task fails to publish because of the issue above, the other can still complete the publish. Is my understanding of replicas right?
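
To make sure I'm describing this right, here is a toy sketch of how I picture the transactional publish. It is only an illustration of the compare-and-swap idea, not Druid's actual code; all names in it are made up.

import java.util.HashMap;
import java.util.Map;

public class TransactionalPublishSketch
{
  // Stands in for the per-dataSource offset checkpoint kept in the metadata store.
  private static final Map<String, Long> CHECKPOINT = new HashMap<>();

  // A publish only succeeds if the stored offset still equals the offset the
  // task started from; otherwise the task abandons its segments.
  static synchronized boolean tryPublish(String dataSource, long startOffset, long endOffset)
  {
    long stored = CHECKPOINT.getOrDefault(dataSource, 0L);
    if (stored != startOffset) {
      return false; // "existing state is not the expected start state"
    }
    CHECKPOINT.put(dataSource, endOffset);
    return true;    // segments would become visible to queries here
  }

  public static void main(String[] args)
  {
    // Two replica tasks that both read offsets [0, 100) and built identical segments:
    System.out.println("replica A published: " + tryPublish("tuk_real", 0, 100)); // true
    System.out.println("replica B published: " + tryPublish("tuk_real", 0, 100)); // false -> abandons
  }
}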

Thanks in advance!

On Wednesday, October 5, 2016 at 1:06:56 AM UTC+8, David Lim wrote:

Hey Qiyun,

I raised a PR to fix #3600; you could try patching your local copy and see if that helps: https://github.com/druid-io/druid/pull/3728

Hey Qiyun,

Just to make sure you don’t miss it, Gian responded to your questions in the linked PR.