Middle manager realtime task keeps running

Hi Druid experts,

The task “index_realtime_MiddletiersBillingProd_2015-12-22T19:00:00.000+08:00_0_0” is still running, although it was supposed to finish hours ago.

The log contains only two lines that mention “exception”, and I’m not sure what causes them.

2015-12-22T11:00:09,407 INFO [main] com.metamx.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking start method[public void io.druid.curator.discovery.ServerDiscoverySelector.start() throws java.lang.Exception] on object[io.druid.curator.discovery.ServerDiscoverySelector@5316c9ca].

2015-12-22T11:00:10,541 INFO [main] com.metamx.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking start method[public void io.druid.client.ServerInventoryView.start() throws java.lang.Exception] on object[io.druid.client.BatchServerInventoryView@2e806493].


Could you help figure out the root cause?

Here is the task log file:

https://drive.google.com/file/d/0B-ISRPi3rU5cTEZ1LWNYT2ZkTjQ/view?usp=sharing

As far as I can tell, it appears the merge was still happening, which is one of the classic signs that more partitions may be required as the merging time for a single shard/partition is taking too long.

The logs starting at 2015-12-22T13:00:00,286 indicate that merging has begun. Does the task eventually complete given more time?

Gian knows better than I do, but merging 655 intermediate segment files per hour seems fairly large and I suspect that is what is happening. FWIW, the logs also seem to stop 17 mins or so after the segment merge began, so I think given a bit more time the merge and handoff will complete. Recall that handoff won't start until after the windowPeriod has passed.

It’s still running now, after more than 15 hours.

Latest log:

On Wednesday, December 23, 2015 at 9:02:01 AM UTC+8, Fangjin Yang wrote:

Yikes. What version of Druid? How much data per second and how many partitions right now?

Also, do any other tasks have this problem?

Hi Xuehui, it appears that in your ingestion pipeline the merge is starting to become the bottleneck. The most immediate workaround is to create more partitions in Tranquility. You can also try 0.8.3, which has more optimizations for merging. There are several more merge optimizations merged into master and undergoing code review that you can pull into your deploy.

Other things to try include increasing maxRowsInMemory so that fewer persists occur and less merging needs to be done. Disk performance can also be a factor if you are not using SSDs.
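To make the trade-off concrete, here is a rough back-of-the-envelope sketch in Python. It is not Druid code: the function name and all numbers are hypothetical, chosen only so that the baseline reproduces the 655 intermediate persists mentioned above. It assumes rows are spread evenly across partitions and that a persist is triggered only when maxRowsInMemory is reached (time-based persists are ignored).

```python
import math


def persists_per_task(rows_per_hour, max_rows_in_memory, partitions):
    """Estimate how many intermediate persists one task must merge per hour.

    Assumptions (not taken from Druid itself): rows are split evenly across
    partitions, and a persist happens only when max_rows_in_memory is hit.
    """
    rows_per_task = rows_per_hour / partitions
    return math.ceil(rows_per_task / max_rows_in_memory)


# Hypothetical workload picked to match the 655 persists seen in this thread.
ROWS_PER_HOUR = 49_125_000

baseline = persists_per_task(ROWS_PER_HOUR, 75_000, partitions=1)
more_partitions = persists_per_task(ROWS_PER_HOUR, 75_000, partitions=4)
bigger_buffer = persists_per_task(ROWS_PER_HOUR, 300_000, partitions=1)
both = persists_per_task(ROWS_PER_HOUR, 300_000, partitions=4)

print(baseline, more_partitions, bigger_buffer, both)
```

Under these assumptions, either quadrupling the partitions or quadrupling maxRowsInMemory cuts the number of files each task has to merge to roughly a quarter, and the two changes compound. Keep in mind that a larger maxRowsInMemory needs more heap on each task.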

Please let me know what you try and if it helps.

– FJ

Thank you, Fangjin. Will try to upgrade to 0.8.3.

The Druid version is 0.8.0, and there is only 1 partition in the Tranquility app configuration.

But for now, how should I deal with those hanging tasks? If I kill the corresponding middle manager node to update the config, the segments of these tasks will be lost and won’t be persisted into the cluster.

On Wednesday, December 23, 2015 at 11:17:35 AM UTC+8, Fangjin Yang wrote:

Hi Xuehui, if you store a copy of your raw data somewhere, you can always reindex things via batch ingestion.