Druid Kafka index tasks are failing sometimes

Hello Druid-team,

Sometimes Kafka index tasks fail with the error log below.

2020-09-01T17:34:08,475 ERROR [task-runner-0-priority-0] org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner - Encountered exception while running task.
java.util.concurrent.ExecutionException: org.apache.druid.java.util.common.ISE: Failed to publish segments because of [java.lang.RuntimeException: Aborting transaction!]
    at com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:500) ~[guava-20.0.jar:?]
    at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:479) ~[guava-20.0.jar:?]
    at com.google.common.util.concurrent.AbstractFuture$TrustedFuture.get(AbstractFuture.java:76) ~[guava-20.0.jar:?]
    at org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner.runInternal(SeekableStreamIndexTaskRunner.java:826) ~[druid-indexing-service-0.18.0.jar:0.18.0]
    at org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner.run(SeekableStreamIndexTaskRunner.java:276) [druid-indexing-service-0.18.0.jar:0.18.0]
    at org.apache.druid.indexing.seekablestream.SeekableStreamIndexTask.run(SeekableStreamIndexTask.java:164) [druid-indexing-service-0.18.0.jar:0.18.0]
    at org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:421) [druid-indexing-service-0.18.0.jar:0.18.0]
    at org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:393) [druid-indexing-service-0.18.0.jar:0.18.0]
    at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:111) [guava-20.0.jar:?]
    at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:58) [guava-20.0.jar:?]
    at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:75) [guava-20.0.jar:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_265]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_265]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_265]
Caused by: org.apache.druid.java.util.common.ISE: Failed to publish segments because of [java.lang.RuntimeException: Aborting transaction!]
    at org.apache.druid.segment.realtime.appenderator.BaseAppenderatorDriver.lambda$publishInBackground$8(BaseAppenderatorDriver.java:646) ~[druid-server-0.18.0.jar:0.18.0]
    ... 6 more
2020-09-01T17:34:08,511 INFO [task-runner-0-priority-0] org.apache.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status: {
  "id" : "index_kafka_XXXXXXX_XXXXXX_klimopcb",
  "status" : "FAILED",
  "duration" : 3602277,
  "errorMsg" : "java.util.concurrent.ExecutionException: org.apache.druid.java.util.common.ISE: Failed to publish se...",
  "location" : {
    "host" : null,
    "port" : -1,
    "tlsPort" : -1
  }
}

We didn't find any error logs related to the above error in either the Overlord or the MiddleManager logs.

Can anyone please tell us why these errors are occurring in Kafka index tasks?

Regards,
Roopini.

I faced these issues as well. They are usually due to either an offset mismatch or a data schema mismatch between Kafka and Druid.

Hard reset the Druid supervisor once if it's an offset mismatch.

This should resolve it.
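For reference, the hard reset is issued against the Overlord's supervisor API. A minimal sketch in Python; the Overlord address and supervisor ID below are placeholders, adjust them for your cluster:

```python
import requests

# Placeholders: adjust the Overlord address and supervisor ID (usually the
# datasource name) for your deployment. A hard reset clears the offsets the
# supervisor has stored in the metadata store, so weigh the duplication /
# data-loss trade-off discussed below before running it.
OVERLORD = "http://overlord-host:8090"
SUPERVISOR_ID = "your_datasource"

resp = requests.post(f"{OVERLORD}/druid/indexer/v1/supervisor/{SUPERVISOR_ID}/reset")
resp.raise_for_status()
print(resp.json())
```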

Regards,
Poonam

Hello,

Even though the tasks are failing, currentOffsets are still progressing. With failed tasks, we don't know whether the segments got persisted to deep storage or not.

If we do a hard reset of the supervisor, it may result in either duplication or loss of data, depending on the useEarliestOffset field in the ioConfig.
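For context, the current value of useEarliestOffset can be checked by fetching the supervisor spec from the Overlord. A minimal sketch, with the Overlord address and supervisor ID as placeholders:

```python
import requests

# Placeholders: adjust the Overlord address and supervisor ID for your cluster.
OVERLORD = "http://overlord-host:8090"
SUPERVISOR_ID = "your_datasource"

spec = requests.get(f"{OVERLORD}/druid/indexer/v1/supervisor/{SUPERVISOR_ID}").json()

# Depending on how the spec was originally submitted, ioConfig may sit at the
# top level of the returned document or be nested under "spec".
io_config = spec.get("spec", spec).get("ioConfig", {})
print("useEarliestOffset =", io_config.get("useEarliestOffset"))
```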

I don't think a hard reset is the right way to solve this problem.

We need to know exactly in which scenario this kind of error occurs.

Regards,
Roopini.

As long as the segments are in the memory of running tasks, the data is available for querying.

But since the tasks fail while publishing data to deep storage and complete with FAILED status, after that point the data is available nowhere in the system.

If tasks fail while publishing the segments, the Kafka offsets should not progress.

In our case:

New tasks are starting with a new set of offsets, not with the failed tasks' offsets, which results in data loss.
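One way to confirm this is to compare the startingOffsets of the newly spawned tasks with the last currentOffsets reported for the failed ones in the supervisor status payload. A minimal sketch (Overlord address and supervisor ID are placeholders; the field names assume the Kafka supervisor status payload):

```python
import requests

# Placeholders: adjust the Overlord address and supervisor ID for your cluster.
OVERLORD = "http://overlord-host:8090"
SUPERVISOR_ID = "your_datasource"

status = requests.get(
    f"{OVERLORD}/druid/indexer/v1/supervisor/{SUPERVISOR_ID}/status"
).json()

# A gap between a new task's startingOffsets and the last currentOffsets of
# the failed task means a range of Kafka offsets was skipped.
for task in status.get("payload", {}).get("activeTasks", []):
    print(task.get("id"))
    print("  startingOffsets:", task.get("startingOffsets"))
    print("  currentOffsets: ", task.get("currentOffsets"))
```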

Were you able to resolve your issue with Kafka indexing? If not, I can help you.

Hi, just following up from Matt as well: did you get the root cause identified?

Would be interested to know whether the segments actually made it into deep storage.
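One way to check is to query the sys.segments system table through the Broker's SQL endpoint: segments marked is_published have been committed to the metadata store, which happens after the segment files are pushed to deep storage. A minimal sketch, with the Broker address and datasource name as placeholders:

```python
import requests

# Placeholders: adjust the Broker address and datasource name for your cluster.
BROKER = "http://broker-host:8082"
DATASOURCE = "your_datasource"

# List the most recent segments for the datasource along with their
# published / available flags.
sql = f"""
SELECT "start", "end", version, is_published, is_available
FROM sys.segments
WHERE datasource = '{DATASOURCE}'
ORDER BY "start" DESC
LIMIT 20
"""

rows = requests.post(f"{BROKER}/druid/v2/sql", json={"query": sql}).json()
for row in rows:
    print(row)
```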