Druid 0.16 ingestion task timeout issue with KafkaIndexingService

Hi,

We recently upgraded Druid to 0.16 and have since been having quite a few issues with ingestion through the Kafka indexing service. There are frequent errors due to unique key constraint violations on the pending segments table, and the currently running ingestion tasks time out. Ingestion is totally stalled, and after three task failures it stops altogether. Any help would be much appreciated.

org.apache.druid.indexing.common.actions.RemoteTaskActionClient - Exception submitting action for task[index_kafka_telemetry-events_548e525569de710_bebmfiod]

org.apache.druid.java.util.common.IOE: Error with status[500 Internal Server Error] and message[{"error":"org.skife.jdbi.v2.exceptions.UnableToExecuteStatementException: org.postgresql.util.PSQLException: ERROR: duplicate key value violates unique constraint "druid_pendingsegments_pkey"\n Detail: Key (id)=(telemetry-events_2019-07-08T00:00:00.000Z_2019-07-09T00:00:00.000Z_2019-07-08T00:00:00.441Z_864) already exists. [statement:"INSERT INTO druid_pendingSegments (id, dataSource, created_date, start, \"end\", sequence_name, sequence_prev_id, sequence_name_prev_id_sha1, payload) VALUES (:id, :dataSource, :created_date, :start, :end, :sequence_name, :sequence_prev_id, :sequence_name_prev_id_sha1, :payload)", located:"INSERT INTO druid_pendingSegments (id, dataSource, created_date, start, \"end\", sequence_name, sequence_prev_id, sequence_name_prev_id_sha1, payload) VALUES (:id, :dataSource, :created_date, :start, :end, :sequence_name, :sequence_prev_id, :sequence_name_prev_id_sha1, :payload)", rewritten:"INSERT INTO druid_pendingSegments (id, dataSource, created_date, start, "end", sequence_name, sequence_prev_id, sequence_name_prev_id_sha1, payload) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)", arguments:{ positional:{}, named:{sequence_prev_id:'',payload:[123, 34, 100, 97, 116, 97, 83, 111, 117, 114, 99, 101, 34, 58, 34, 116, 101, 108, 101, 109, 101, 116, 114, 121, 45, 101, 118, 101, 110, 116, 115, 34, 44, 34, 105, 110, 116, 101, 114, 118, 97, 108, 34, 58, 34, 50, 48, 49, 57, 45, 48, 55, 45, 48, 56, 84, 48, 48, 58, 48, 48, 58, 48, 48, 46, 48, 48, 48, 90, 47, 50, 48, 49, 57, 45, 48, 55, 45, 48, 57, 84, 48, 48, 58, 48, 48, 58, 48, 48, 46, 48, 48, 48, 90, 34, 44, 34, 118, 101, 114, 115, 105, 111, 110, 34, 58, 34, 50, 48, 49, 57, 45, 48, 55, 45, 48, 56, 84, 48, 48, 58, 48, 48, 58, 48, 48, 46, 52, 52, 49, 90, 34, 44, 34, 115, 104, 97, 114, 100, 83, 112, 101, 99, 34, 58, 123, 34, 116, 121, 112, 101, 34, 58, 34, 110, 117, 109, 98, 101, 114, 101, 100, 34, 44, 34, 112, 97, 114, 116, 105, 116, 105, 111, 110, 78, 117, 109, 34, 58, 56, 54, 52, 44, 34, 112, 97, 114, 116, 105, 116, 105, 111, 110, 115, 34, 58, 48, 125, 125],start:'2019-07-08T00:00:00.000Z',end:'2019-07-09T00:00:00.000Z',id:'telemetry-events_2019-07-08T00:00:00.000Z_2019-07-09T00:00:00.000Z_2019-07-08T00:00:00.441Z_864',created_date:'2019-11-19T18:26:36.279Z',dataSource:'telemetry-events',sequence_name:'index_kafka_telemetry-events_548e525569de710_0',sequence_name_prev_id_sha1:'3A369E4FC7F9B72A9212E3B1C7D9B342844914AA'}, finder:}]"}]. Check overlord logs for details.

at org.apache.druid.indexing.common.actions.RemoteTaskActionClient.submit(RemoteTaskActionClient.java:94) [druid-indexing-service-0.16.0-incubating.jar:0.16.0-incubating]

at org.apache.druid.indexing.appenderator.ActionBasedSegmentAllocator.allocate(ActionBasedSegmentAllocator.java:55) [druid-indexing-service-0.16.0-incubating.jar:0.16.0-incubating]

at org.apache.druid.segment.realtime.appenderator.BaseAppenderatorDriver.getSegment(BaseAppenderatorDriver.java:334) [druid-server-0.16.0-incubating.jar:0.16.0-incubating]

at org.apache.druid.segment.realtime.appenderator.BaseAppenderatorDriver.append(BaseAppenderatorDriver.java:402) [druid-server-0.16.0-incubating.jar:0.16.0-incubating]

at org.apache.druid.segment.realtime.appenderator.StreamAppenderatorDriver.add(StreamAppenderatorDriver.java:188) [druid-server-0.16.0-incubating.jar:0.16.0-incubating]

at org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner.runInternal(SeekableStreamIndexTaskRunner.java:626) [druid-indexing-service-0.16.0-incubating.jar:0.16.0-incubating]

at org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner.run(SeekableStreamIndexTaskRunner.java:259) [druid-indexing-service-0.16.0-incubating.jar:0.16.0-incubating]

at org.apache.druid.indexing.seekablestream.SeekableStreamIndexTask.run(SeekableStreamIndexTask.java:177) [druid-indexing-service-0.16.0-incubating.jar:0.16.0-incubating]

at org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:419) [druid-indexing-service-0.16.0-incubating.jar:0.16.0-incubating]

at org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:391) [druid-indexing-service-0.16.0-incubating.jar:0.16.0-incubating]

at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_222]

at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_222]

at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_222]

at java.lang.Thread.run(Thread.java:748) [?:1.8.0_222]

2019-11-19T18:26:36,296 INFO [task-runner-0-priority-0] org.apache.druid.indexing.common.actions.RemoteTaskActionClient - Will try again in [PT53.314S].

2019-11-19T18:27:29,611 INFO [task-runner-0-priority-0] org.apache.druid.indexing.common.actions.RemoteTaskActionClient - Submitting action for task[index_kafka_telemetry-events_548e525569de710_bebmfiod] to overlord: [SegmentAllocateAction{dataSource='telemetry-events', timestamp=2019-07-08T04:10:54.101Z, queryGranularity=NoneGranularity, preferredSegmentGranularity={type=period, period=P1D, timeZone=UTC, origin=null}, sequenceName='index_kafka_telemetry-events_548e525569de710_0', previousSegmentId='telemetry-events_2019-07-06T00:00:00.000Z_2019-07-07T00:00:00.000Z_2019-07-06T00:00:01.275Z_941', skipSegmentLineageCheck=true, shardSpecFactory=org.apache.druid.timeline.partition.NumberedShardSpecFactory@5d138671, lockGranularity=SEGMENT}].

2019-11-19T18:27:29,649 WARN [task-runner-0-priority-0] org.apache.druid.indexing.common.actions.RemoteTaskActionClient - Exception submitting action for task[index_kafka_telemetry-events_548e525569de710_bebmfiod]

org.apache.druid.java.util.common.IOE: Error with status[500 Internal Server Error] and message[{"error":"Unable to grant lock to inactive Task [index_kafka_telemetry-events_548e525569de710_bebmfiod]"}]. Check overlord logs for details.

at org.apache.druid.indexing.common.actions.RemoteTaskActionClient.submit(RemoteTaskActionClient.java:94) [druid-indexing-service-0.16.0-incubating.jar:0.16.0-incubating]

at org.apache.druid.indexing.appenderator.ActionBasedSegmentAllocator.allocate(ActionBasedSegmentAllocator.java:55) [druid-indexing-service-0.16.0-incubating.jar:0.16.0-incubating]

at org.apache.druid.segment.realtime.appenderator.BaseAppenderatorDriver.getSegment(BaseAppenderatorDriver.java:334) [druid-server-0.16.0-incubating.jar:0.16.0-incubating]

at org.apache.druid.segment.realtime.appenderator.BaseAppenderatorDriver.append(BaseAppenderatorDriver.java:402) [druid-server-0.16.0-incubating.jar:0.16.0-incubating]

at org.apache.druid.segment.realtime.appenderator.StreamAppenderatorDriver.add(StreamAppenderatorDriver.java:188) [druid-server-0.16.0-incubating.jar:0.16.0-incubating]

at org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner.runInternal(SeekableStreamIndexTaskRunner.java:626) [druid-indexing-service-0.16.0-incubating.jar:0.16.0-incubating]

at org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner.run(SeekableStreamIndexTaskRunner.java:259) [druid-indexing-service-0.16.0-incubating.jar:0.16.0-incubating]

at org.apache.druid.indexing.seekablestream.SeekableStreamIndexTask.run(SeekableStreamIndexTask.java:177) [druid-indexing-service-0.16.0-incubating.jar:0.16.0-incubating]

at org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:419) [druid-indexing-service-0.16.0-incubating.jar:0.16.0-incubating]

at org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:391) [druid-indexing-service-0.16.0-incubating.jar:0.16.0-incubating]

at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_222]

at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_222]

at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_222]

at java.lang.Thread.run(Thread.java:748) [?:1.8.0_222]

Regards,

-Anand

Hi Anand,
Can you try clearing the druid_pendingSegments table and then retrying the ingestion?
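A rough sketch of how that could look is below. This assumes a PostgreSQL metadata store with the default druid_ table prefix and the psycopg2 client, and that you suspend the supervisor for the datasource first; the connection settings are placeholders for your own:

# Sketch only: delete pending-segment rows for one datasource from a
# PostgreSQL metadata store. The table name assumes the default "druid_"
# prefix; Postgres folds the unquoted identifier to druid_pendingsegments.
import psycopg2

DATASOURCE = "telemetry-events"

conn = psycopg2.connect(
    host="metadata-db.example.com",  # placeholder host
    dbname="druid",                  # placeholder database name
    user="druid",                    # placeholder credentials
    password="secret",
)
try:
    # "with conn" commits on success and rolls back on error.
    with conn, conn.cursor() as cur:
        cur.execute(
            "DELETE FROM druid_pendingsegments WHERE datasource = %s",
            (DATASOURCE,),
        )
        print(f"Deleted {cur.rowcount} pending-segment rows")
finally:
    conn.close()

After that, resuming the supervisor (or hard-resetting it from the Overlord console) should kick off fresh tasks.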

I also see this error: "Unable to grant lock to inactive Task [index_kafka_…]".

You might have to clear the druid_taskLocks table as well.
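Something along these lines could work for that; again the table name is an assumption (the default is druid_tasklocks, which keys lock rows by task_id in recent Druid versions), so verify against your metadata store before deleting anything:

# Sketch only: remove stale lock rows for the failing Kafka tasks.
# Assumes the default druid_tasklocks table with a task_id column.
import psycopg2

conn = psycopg2.connect(host="metadata-db.example.com", dbname="druid",
                        user="druid", password="secret")  # placeholders
try:
    with conn, conn.cursor() as cur:
        cur.execute(
            "DELETE FROM druid_tasklocks WHERE task_id LIKE %s",
            ("index_kafka_telemetry-events%",),
        )
        print(f"Deleted {cur.rowcount} task-lock rows")
finally:
    conn.close()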

See whether you have a restore.json file somewhere in your task log paths. Try removing that restore.json file to get rid of the "Unable to grant lock to inactive Task [index_kafka_…]" error.
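To hunt those down, a sketch like this could help; the base directory is a placeholder (check druid.indexer.task.baseTaskDir in your middleManager runtime properties), and the files should only be removed while the tasks are stopped:

# Sketch only: find and delete leftover restore.json files under the
# middleManager task directory.
from pathlib import Path

base_task_dir = Path("/var/druid/task")  # placeholder; use your baseTaskDir

for restore_file in base_task_dir.rglob("restore.json"):
    print(f"Removing {restore_file}")
    restore_file.unlink()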

Thanks,

-siva