[druid-user] Segment Publish Failure

Hi All,

We are intermittently observing failures while publishing segments from tasks running for one particular datasource.
Since there are other similar Kafka-based ingestion setups on the same cluster that are running fine, I currently do not suspect hardware constraints or misconfiguration as the root cause. The task is publishing to a GCS bucket, and other tasks are able to do so.
The stack trace looks like below:

org.apache.druid.java.util.common.ISE: Failed to publish segments because of [java.lang.RuntimeException: Aborting transaction!]
at org.apache.druid.segment.realtime.appenderator.BaseAppenderatorDriver.lambda$publishInBackground$8(BaseAppenderatorDriver.java:651) ~[druid-server-0.20.0.jar:0.20.0]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_262]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_262]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_262]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_262]

2022-04-18T16:30:41,466 ERROR [task-runner-0-priority-0] org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner - Encountered exception while running task.
java.util.concurrent.ExecutionException: org.apache.druid.java.util.common.ISE: Failed to publish segments because of [java.lang.RuntimeException: Aborting transaction!]
at com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299) ~[guava-16.0.1.jar:?]
at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286) ~[guava-16.0.1.jar:?]
at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116) ~[guava-16.0.1.jar:?]
at org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner.runInternal(SeekableStreamIndexTaskRunner.java:802) ~[druid-indexing-service-0.20.0.jar:0.20.0]
at org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner.run(SeekableStreamIndexTaskRunner.java:267) [druid-indexing-service-0.20.0.jar:0.20.0]
at org.apache.druid.indexing.seekablestream.SeekableStreamIndexTask.run(SeekableStreamIndexTask.java:145) [druid-indexing-service-0.20.0.jar:0.20.0]
at org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:451) [druid-indexing-service-0.20.0.jar:0.20.0]
at org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:423) [druid-indexing-service-0.20.0.jar:0.20.0]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_262]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_262]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_262]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_262]
Caused by: org.apache.druid.java.util.common.ISE: Failed to publish segments because of [java.lang.RuntimeException: Aborting transaction!]
at org.apache.druid.segment.realtime.appenderator.BaseAppenderatorDriver.lambda$publishInBackground$8(BaseAppenderatorDriver.java:651) ~[druid-server-0.20.0.jar:0.20.0]
... 4 more

Any pointers to debug this would be much appreciated.

Regards,
Diganta.

Is that the middle manager log? Can you share the failed task’s log?

Yes, these are the MiddleManager logs.
I checked the task logs, which I have configured to be uploaded to a GCS bucket, but the log file only contains the line “Finished peon task”. Am I missing some configuration?
Below is what I have configured:

druid.indexer.logs.type=google
druid.indexer.logs.bucket=<bucket-name>
druid.indexer.logs.prefix=druid/indexing-logs

The task logs might be gone. The druid.indexer.logs.kill properties might be the other ones to configure.
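For reference, the log-retention family looks roughly like the sketch below (the durations are illustrative values, not recommendations; check the docs for your Druid version):

```
# Enable periodic cleanup of old task logs (disabled by default)
druid.indexer.logs.kill.enabled=true
# How long to retain logs, in milliseconds (example: 7 days)
druid.indexer.logs.kill.durationToRetain=604800000
# Delay before the first cleanup run, in milliseconds
druid.indexer.logs.kill.initialDelay=300000
# Interval between cleanup runs, in milliseconds
druid.indexer.logs.kill.delay=21600000
```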

Hi Mark,

Thanks for the response. I have some more follow-up questions related to the configs.
As per the blog post, task logs are expected to be cleaned up after druid.indexer.storage.recentlyFinishedThreshold, which has a default of PT24H.
However, in my case I do not see any logs in the files uploaded to the location (which we are using for long-term storage of logs) specified by druid.indexer.logs.type, even for tasks that ran in the last hour.
The files are uploaded but only contain the line “Finished peon task”.
I can, however, see part of the stack trace under the “errorMsg” property of the task report that is uploaded to the location we use for long-term storage.
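(For comparison, the task log and task report can also be fetched directly from the Overlord API; the host, port, and task ID below are placeholders:)

```shell
# Fetch the raw task log straight from the Overlord (bypasses long-term log storage)
curl "http://<overlord-host>:8081/druid/indexer/v1/task/<task-id>/log"

# Fetch the task report, which carries the errorMsg field
curl "http://<overlord-host>:8081/druid/indexer/v1/task/<task-id>/reports"
```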

Having said that, is it somehow mandatory to set a log retention policy?

PS: The druid user does have the requisite permissions for druid.indexer.task.baseTaskDir.

Regards,
Diganta.

Hi Diganta,

No, it’s not mandatory to set a retention policy. In looking at your druid.indexer.logs.type comment, though, I wonder about a couple of other configs. By way of example, if one were to use Azure for deep storage, the log configuration should look something like this:

  • druid.indexer.logs.type=azure
  • druid.indexer.logs.container ----- The Azure Blob Store container to write logs to
  • druid.indexer.logs.prefix ----- The path to prepend to logs

Are your task logs pointing to the same location as your deep storage?

Best,

Mark

Hi Mark,

Yes, the task logs point to the same GCS bucket as the deep storage used to store the segments. Only the prefix differs between segments and logs.
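(Roughly like this; the bucket name is a placeholder, and the segments prefix below is illustrative:)

```
# Deep storage (segments)
druid.storage.type=google
druid.google.bucket=<bucket-name>
druid.google.prefix=druid/segments

# Task logs, same bucket, different prefix
druid.indexer.logs.type=google
druid.indexer.logs.bucket=<bucket-name>
druid.indexer.logs.prefix=druid/indexing-logs
```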

Hi Diganta,
The logs thing seems odd, but getting back to your original issue: have you been able to resolve this?
A few other thoughts:
Is there anything different about this ingestion when compared to the others that are not presenting the issue?
Could there be something different about the topic configuration in kafka?
Is the message throughput much larger in this ingestion?

I have not been able to pinpoint the reason for the failure, and it is still happening intermittently.
All the ingestions are similar, and the topics are hosted in a similar fashion in Kafka as well. Interestingly, this issue has started coming up for other datasources too.
“Is the message throughput much larger in this ingestion?” - Throughput varies across the topics, but since this is a non-prod environment the message volume is not high. Is there some correlation between throughput and segment publish failures? Maybe I can test it out.
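(One way I could watch for a correlation would be to poll the supervisor status around publish time; the host and supervisor ID below are placeholders:)

```shell
# Check supervisor health and any recent unhealthiness/failure counts
curl "http://<overlord-host>:8081/druid/indexer/v1/supervisor/<supervisor-id>/status"
```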