[kafka-indexing-service] Transaction failure publishing segments, aborting

Hi

I hit the same problem as https://groups.google.com/forum/#!topic/druid-user/7Tl53VovABA on Druid 0.9.2, so I upgraded my environment to 0.10.0 because https://github.com/druid-io/druid/pull/3728 is included in 0.10.0.

However, the problem persists. Please help me find the cause.

Failed task log:

2017-06-29T10:04:10,994 ERROR [task-runner-0-priority-0] io.druid.indexing.overlord.ThreadPoolTaskRunner - Exception while running task[KafkaIndexTask{id=index_kafka_ims-snapshot_46bd2ffae454391_nkgfjgnk, type=index_kafka, dataSource=ims-snapshot}]
io.druid.java.util.common.ISE: Transaction failure publishing segments, aborting
        at io.druid.indexing.kafka.KafkaIndexTask.run(KafkaIndexTask.java:517) ~[?:?]
        at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:436) [druid-indexing-service-0.10.0.jar:0.10.0]
        at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:408) [druid-indexing-service-0.10.0.jar:0.10.0]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_111]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_111]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_111]
        at java.lang.Thread.run(Thread.java:745) [?:1.8.0_111]
2017-06-29T10:04:10,995 INFO [task-runner-0-priority-0] io.druid.indexing.overlord.TaskRunnerUtils - Task [index_kafka_ims-snapshot_46bd2ffae454391_nkgfjgnk] status changed to [FAILED].
2017-06-29T10:04:10,996 INFO [task-runner-0-priority-0] io.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status: {
  "id" : "index_kafka_ims-snapshot_46bd2ffae454391_nkgfjgnk",
  "status" : "FAILED",
  "duration" : 3609518
}

Overlord log:

2017-06-29T01:04:10,823 INFO [qtp661440732-81] io.druid.indexing.overlord.MetadataTaskStorage - Logging action for task[index_kafka_ims-snapshot_46bd2ffae454391_nkgfjgnk]: SegmentInsertAction{segments=[DataSegment{size=4076695, shardSpec=NumberedShardSpec{partitionNum=0, partitions=0}, metrics=[count], dimensions=[barcode, finance_type, asset_code, asset_state, serial, spec_summary_name, cal_date, purchase_date, purchase_amount, guarantee_year, network_type, asset_type, asset_vendor, asset_model, erp_asset_depamt, erp_asset_namt, erp_asset_depreciation, erp_asset_status, erp_asset_life, erp_asset_use_life, erp_asset_nuse_life, system_type, sub_system_type, system_id, hostname, domain, loc_b, loc_f, loc_x, loc_y, loc_z, management_ip, ip, svc_id], version='2017-06-29T00:41:56.832Z', loadSpec={type=hdfs, path=hdfs://doopey/druidtest/ssd-test/segments/ims-snapshot/20170627T150000.000Z_20170627T160000.000Z/2017-06-29T00_41_56.832Z/0_index.zip}, interval=2017-06-27T15:00:00.000Z/2017-06-27T16:00:00.000Z, dataSource='ims-snapshot', binaryVersion='9'}, DataSegment{size=4089809, shardSpec=NumberedShardSpec{partitionNum=6, partitions=0}, metrics=[count], dimensions=[barcode, finance_type, asset_code, asset_state, serial, spec_summary_name, cal_date, purchase_date, purchase_amount, guarantee_year, network_type, asset_type, asset_vendor, asset_model, erp_asset_depamt, erp_asset_namt, erp_asset_depreciation, erp_asset_status, erp_asset_life, erp_asset_use_life, erp_asset_nuse_life, system_type, sub_system_type, system_id, hostname, domain, loc_b, loc_f, loc_x, loc_y, loc_z, management_ip, ip, svc_id], version='2017-06-29T00:41:56.832Z', loadSpec={type=hdfs, path=hdfs://doopey/druidtest/ssd-test/segments/ims-snapshot/20170627T150000.000Z_20170627T160000.000Z/2017-06-29T00_41_56.832Z/6_index.zip}, interval=2017-06-27T15:00:00.000Z/2017-06-27T16:00:00.000Z, dataSource='ims-snapshot', binaryVersion='9'}, DataSegment{size=4068097, shardSpec=NumberedShardSpec{partitionNum=5, partitions=0}, metrics=[count], dimensions=[barcode, finance_type, asset_code, asset_state, serial, spec_summary_name, cal_date, purchase_date, purchase_amount, guarantee_year, network_type, asset_type, asset_vendor, asset_model, erp_asset_depamt, erp_asset_namt, erp_asset_depreciation, erp_asset_status, erp_asset_life, erp_asset_use_life, erp_asset_nuse_life, system_type, sub_system_type, system_id, hostname, domain, loc_b, loc_f, loc_x, loc_y, loc_z, management_ip, ip, svc_id], version='2017-06-29T00:41:56.832Z', loadSpec={type=hdfs, path=hdfs://doopey/druidtest/ssd-test/segments/ims-snapshot/20170627T150000.000Z_20170627T160000.000Z/2017-06-29T00_41_56.832Z/5_index.zip}, interval=2017-06-27T15:00:00.000Z/2017-06-27T16:00:00.000Z, dataSource='ims-snapshot', binaryVersion='9'}, DataSegment{size=4107315, shardSpec=NumberedShardSpec{partitionNum=4, partitions=0}, metrics=[count], dimensions=[barcode, finance_type, asset_code, asset_state, serial, spec_summary_name, cal_date, purchase_date, purchase_amount, guarantee_year, network_type, asset_type, asset_vendor, asset_model, erp_asset_depamt, erp_asset_namt, erp_asset_depreciation, erp_asset_status, erp_asset_life, erp_asset_use_life, erp_asset_nuse_life, system_type, sub_system_type, system_id, hostname, domain, loc_b, loc_f, loc_x, loc_y, loc_z, management_ip, ip, svc_id], version='2017-06-29T00:41:56.832Z', loadSpec={type=hdfs, path=hdfs://doopey/druidtest/ssd-test/segments/ims-snapshot/20170627T150000.000Z_20170627T160000.000Z/2017-06-29T00_41_56.832Z/4_index.zip}, 
interval=2017-06-27T15:00:00.000Z/2017-06-27T16:00:00.000Z, dataSource='ims-snapshot', binaryVersion='9'}, DataSegment{size=4038584, shardSpec=NumberedShardSpec{partitionNum=3, partitions=0}, metrics=[count], dimensions=[barcode, finance_type, asset_code, asset_state, serial, spec_summary_name, cal_date, purchase_date, purchase_amount, guarantee_year, network_type, asset_type, asset_vendor, asset_model, erp_asset_depamt, erp_asset_namt, erp_asset_depreciation, erp_asset_status, erp_asset_life, erp_asset_use_life, erp_asset_nuse_life, system_type, sub_system_type, system_id, hostname, domain, loc_b, loc_f, loc_x, loc_y, loc_z, management_ip, ip, svc_id], version='2017-06-29T00:41:56.832Z', loadSpec={type=hdfs, path=hdfs://doopey/druidtest/ssd-test/segments/ims-snapshot/20170627T150000.000Z_20170627T160000.000Z/2017-06-29T00_41_56.832Z/3_index.zip}, interval=2017-06-27T15:00:00.000Z/2017-06-27T16:00:00.000Z, dataSource='ims-snapshot', binaryVersion='9'}, DataSegment{size=4108072, shardSpec=NumberedShardSpec{partitionNum=7, partitions=0}, metrics=[count], dimensions=[barcode, finance_type, asset_code, asset_state, serial, spec_summary_name, cal_date, purchase_date, purchase_amount, guarantee_year, network_type, asset_type, asset_vendor, asset_model, erp_asset_depamt, erp_asset_namt, erp_asset_depreciation, erp_asset_status, erp_asset_life, erp_asset_use_life, erp_asset_nuse_life, system_type, sub_system_type, system_id, hostname, domain, loc_b, loc_f, loc_x, loc_y, loc_z, management_ip, ip, svc_id], version='2017-06-29T00:41:56.832Z', loadSpec={type=hdfs, path=hdfs://doopey/druidtest/ssd-test/segments/ims-snapshot/20170627T150000.000Z_20170627T160000.000Z/2017-06-29T00_41_56.832Z/7_index.zip}, interval=2017-06-27T15:00:00.000Z/2017-06-27T16:00:00.000Z, dataSource='ims-snapshot', binaryVersion='9'}, DataSegment{size=4129711, shardSpec=NumberedShardSpec{partitionNum=2, partitions=0}, metrics=[count], dimensions=[barcode, finance_type, asset_code, asset_state, serial, spec_summary_name, cal_date, purchase_date, purchase_amount, guarantee_year, network_type, asset_type, asset_vendor, asset_model, erp_asset_depamt, erp_asset_namt, erp_asset_depreciation, erp_asset_status, erp_asset_life, erp_asset_use_life, erp_asset_nuse_life, system_type, sub_system_type, system_id, hostname, domain, loc_b, loc_f, loc_x, loc_y, loc_z, management_ip, ip, svc_id], version='2017-06-29T00:41:56.832Z', loadSpec={type=hdfs, path=hdfs://doopey/druidtest/ssd-test/segments/ims-snapshot/20170627T150000.000Z_20170627T160000.000Z/2017-06-29T00_41_56.832Z/2_index.zip}, interval=2017-06-27T15:00:00.000Z/2017-06-27T16:00:00.000Z, dataSource='ims-snapshot', binaryVersion='9'}, DataSegment{size=4103575, shardSpec=NumberedShardSpec{partitionNum=1, partitions=0}, metrics=[count], dimensions=[barcode, finance_type, asset_code, asset_state, serial, spec_summary_name, cal_date, purchase_date, purchase_amount, guarantee_year, network_type, asset_type, asset_vendor, asset_model, erp_asset_depamt, erp_asset_namt, erp_asset_depreciation, erp_asset_status, erp_asset_life, erp_asset_use_life, erp_asset_nuse_life, system_type, sub_system_type, system_id, hostname, domain, loc_b, loc_f, loc_x, loc_y, loc_z, management_ip, ip, svc_id], version='2017-06-29T00:41:56.832Z', loadSpec={type=hdfs, path=hdfs://doopey/druidtest/ssd-test/segments/ims-snapshot/20170627T150000.000Z_20170627T160000.000Z/2017-06-29T00_41_56.832Z/1_index.zip}, interval=2017-06-27T15:00:00.000Z/2017-06-27T16:00:00.000Z, dataSource='ims-snapshot', binaryVersion='9'}], 
startMetadata=KafkaDataSourceMetadata{kafkaPartitions=KafkaPartitions{topic='ims_snapshot', partitionOffsetMap={0=6115916, 1=6113677, 2=6115956, 3=6112567, 4=6119125, 5=6114555, 6=6116491, 7=6118662}}}, endMetadata=KafkaDataSourceMetadata{kafkaPartitions=KafkaPartitions{topic='ims_snapshot', partitionOffsetMap={0=6133986, 1=6132056, 2=6134271, 3=6130714, 4=6137448, 5=6132693, 6=6134688, 7=6137036}}}}
2017-06-29T01:04:10,830 INFO [qtp661440732-81] io.druid.metadata.IndexerSQLMetadataStorageCoordinator - Not updating metadata, existing state is not the expected start state.
2017-06-29T01:04:10,868 INFO [qtp661440732-80] io.druid.indexing.common.actions.LocalTaskActionClient - Performing action for task[index_kafka_ims-snapshot_46bd2ffae454391_nkgfjgnk]: SegmentListUsedAction{dataSource='ims-snapshot', intervals=[2017-06-27T15:00:00.000Z/2017-06-27T16:00:00.000Z]}
2017-06-29T01:04:11,869 INFO [Curator-PathChildrenCache-1] io.druid.indexing.overlord.RemoteTaskRunner - Worker[hbase-md16c1-005.dakao.io:8091] wrote FAILED status for task [index_kafka_ims-snapshot_46bd2ffae454391_nkgfjgnk] on [TaskLocation{host='hbase-md16c1-005.dakao.io', port=8101}]
2017-06-29T01:04:11,869 INFO [Curator-PathChildrenCache-1] io.druid.indexing.overlord.RemoteTaskRunner - Worker[hbase-md16c1-005.dakao.io:8091] completed task[index_kafka_ims-snapshot_46bd2ffae454391_nkgfjgnk] with status[FAILED]
2017-06-29T01:04:11,870 INFO [Curator-PathChildrenCache-1] io.druid.indexing.overlord.TaskQueue - Received FAILED status for task: index_kafka_ims-snapshot_46bd2ffae454391_nkgfjgnk
2017-06-29T01:04:11,870 INFO [Curator-PathChildrenCache-1] io.druid.indexing.overlord.RemoteTaskRunner - Cleaning up task[index_kafka_ims-snapshot_46bd2ffae454391_nkgfjgnk] on worker[hbase-md16c1-005.dakao.io:8091]
2017-06-29T01:04:11,872 INFO [Curator-PathChildrenCache-1] io.druid.indexing.overlord.TaskLockbox - Removing task[index_kafka_ims-snapshot_46bd2ffae454391_nkgfjgnk] from activeTasks
2017-06-29T01:04:11,872 INFO [Curator-PathChildrenCache-1] io.druid.indexing.overlord.TaskLockbox - Removing task[index_kafka_ims-snapshot_46bd2ffae454391_nkgfjgnk] from TaskLock[index_kafka_ims-snapshot]
2017-06-29T01:04:11,872 INFO [Curator-PathChildrenCache-1] io.druid.indexing.overlord.TaskLockbox - TaskLock is now empty: TaskLock{groupId=index_kafka_ims-snapshot, dataSource=ims-snapshot, interval=2017-06-27T15:00:00.000Z/2017-06-27T16:00:00.000Z, version=2017-06-29T00:41:56.832Z}
2017-06-29T01:04:11,874 INFO [Curator-PathChildrenCache-1] io.druid.indexing.overlord.MetadataTaskStorage - Deleting TaskLock with id[28396]: TaskLock{groupId=index_kafka_ims-snapshot, dataSource=ims-snapshot, interval=2017-06-27T15:00:00.000Z/2017-06-27T16:00:00.000Z, version=2017-06-29T00:41:56.832Z}
2017-06-29T01:04:11,878 INFO [Curator-PathChildrenCache-1] io.druid.indexing.overlord.MetadataTaskStorage - Updating task index_kafka_ims-snapshot_46bd2ffae454391_nkgfjgnk to status: TaskStatus{id=index_kafka_ims-snapshot_46bd2ffae454391_nkgfjgnk, status=FAILED, duration=3615967}
2017-06-29T01:04:11,881 INFO [Curator-PathChildrenCache-1] io.druid.indexing.overlord.TaskQueue - Task done: KafkaIndexTask{id=index_kafka_ims-snapshot_46bd2ffae454391_nkgfjgnk, type=index_kafka, dataSource=ims-snapshot}
2017-06-29T01:04:11,881 INFO [Curator-PathChildrenCache-1] io.druid.indexing.overlord.TaskQueue - Task FAILED: KafkaIndexTask{id=index_kafka_ims-snapshot_46bd2ffae454391_nkgfjgnk, type=index_kafka, dataSource=ims-snapshot} (3615967 run duration)
2017-06-29T01:04:11,881 INFO [Curator-PathChildrenCache-1] io.druid.indexing.overlord.TaskRunnerUtils - Task [index_kafka_ims-snapshot_46bd2ffae454391_nkgfjgnk] status changed to [FAILED].
2017-06-29T01:04:11,882 INFO [Curator-PathChildrenCache-1] io.druid.indexing.overlord.RemoteTaskRunner - Task[index_kafka_ims-snapshot_46bd2ffae454391_nkgfjgnk] went bye bye.
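
As I understand it, the key line is "Not updating metadata, existing state is not the expected start state." The Kafka indexing task publishes segments with a compare-and-swap: the overlord commits the segments and the endMetadata offsets only if the offsets currently stored in the metadata DB match the task's startMetadata. A minimal runnable sketch of that check (not Druid's actual code; all names here are made up for illustration):

```python
# Illustrative sketch of the compare-and-swap behind
# "Not updating metadata, existing state is not the expected start state."
# Not Druid's actual code; the dict stands in for the metadata DB row.

metadata_store = {  # committed partition offsets per datasource
    "ims-snapshot": {0: 6115916, 1: 6113677},
}
published_segments = []

def publish_segments(datasource, start_offsets, end_offsets, segments):
    """Atomically publish segments and advance offsets, or abort."""
    stored = metadata_store.get(datasource)
    if stored is not None and stored != start_offsets:
        # This branch is what logs "Not updating metadata, existing state
        # is not the expected start state." and fails the task with
        # "Transaction failure publishing segments, aborting".
        return False
    # In Druid, the segment insert and the offset update happen in one
    # SQL transaction; that is what makes the ingestion exactly-once.
    published_segments.extend(segments)
    metadata_store[datasource] = end_offsets
    return True

# Offsets match what is stored -> the publish commits:
assert publish_segments("ims-snapshot",
                        {0: 6115916, 1: 6113677},
                        {0: 6133986, 1: 6132056}, ["seg-0", "seg-1"])

# A task that started from now-stale offsets fails the swap:
assert not publish_segments("ims-snapshot",
                            {0: 6115916, 1: 6113677},  # no longer stored
                            {0: 6140000, 1: 6140000}, ["seg-2"])
```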

Hi,

I got the exact same error. Is there a solution for this?

Any chance this is related to Kafka indexing jobs stalling (when no worker is available) or running out of order?

Everything was working fine for a few weeks, until we recently increased the load on the system significantly by running migration tasks for historical data (using Hadoop).

Thanks

Eran

The payload partition offsets are not equal to the values stored in the MySQL metadata DB.
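
A sketch to check this yourself, assuming MySQL metadata storage with the default "druid_" table prefix (the committed offsets live in druid_dataSource.commit_metadata_payload); host and credentials below are placeholders:

```python
# Compare the committed offsets in the metadata DB with the startMetadata
# offsets printed in the overlord log. Assumes MySQL metadata storage with
# the default "druid_" table prefix; host/credentials are placeholders.
import json
import pymysql

conn = pymysql.connect(host="metadata-db-host", user="druid",
                       password="druid-password", database="druid")
try:
    with conn.cursor() as cur:
        cur.execute(
            "SELECT commit_metadata_payload FROM druid_dataSource"
            " WHERE dataSource = %s",
            ("ims-snapshot",))
        row = cur.fetchone()
    if row is None:
        print("no committed offsets stored yet")
    else:
        # The payload is JSON; the partitionOffsetMap inside it should
        # match the startMetadata in the overlord log. If it does not,
        # every publish will abort with the error above.
        print(json.dumps(json.loads(row[0]), indent=2))
finally:
    conn.close()
```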

On Thursday, June 29, 2017 at 12:23:43 PM UTC+8, Dongkyu Hwangbo wrote:

This problem has already been solved, but I forgot to write up the cause.
The cause was the same as in the last reply of this thread: I fixed it by removing the row containing the offset info from the metadata DB, as sketched below.
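
For anyone hitting this later, the fix amounts to something like the following, under the same assumptions as the earlier snippet (MySQL metadata storage, default "druid_" table prefix, placeholder host/credentials). Shut down the supervisor first; note that the next task will start from wherever your useEarliestOffset setting dictates, so this can skip or re-read data:

```python
# Sketch of the fix: remove the stale offset row so the next Kafka
# indexing task starts with no "expected start state" to mismatch.
# Shut down the supervisor before running this.
import pymysql

conn = pymysql.connect(host="metadata-db-host", user="druid",
                       password="druid-password", database="druid")
try:
    with conn.cursor() as cur:
        cur.execute("DELETE FROM druid_dataSource WHERE dataSource = %s",
                    ("ims-snapshot",))
    conn.commit()
finally:
    conn.close()
```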

But I have no idea why this problem occurred in the first place.

On Friday, July 28, 2017 at 9:51:19 PM UTC+9, hellobab...@gmail.com wrote: