Overlord in the indexing service becomes very slow when there is a large number of ingestion tasks

We suspended ingestion to our production Druid cluster for a week while incoming events were buffered. When we resumed, a large number of ingestion tasks (over 160K) were submitted to the overlord. The overlord became very slow and unresponsive, and the processing of ingestion tasks slowed to a crawl.

Our total worker (peon) capacity across the 10 hosts is 80. From the task logs, I can see that the ingestion tasks are being processed, and each task takes about 10 seconds to complete. On the middleManagers, the task log timestamps come in clusters of 80, with gaps of roughly 10 - 20 minutes between the clusters, as shown below:


-rw-r--r-- 1 root root 82746 Jan 6 21:20 event_1069_2016-12-17T09:15:00+2016-12-17T09:30:00_2016-12-19T19:05:16.386978Z.log
-rw-r--r-- 1 root root 82494 Jan 6 21:20 event_1069_2016-12-17T09:45:00+2016-12-17T10:00:00_2016-12-19T19:13:45.270662Z.log
-rw-r--r-- 1 root root 82621 Jan 6 21:20 event_1069_2016-12-17T10:15:00+2016-12-17T10:30:00_2016-12-19T19:10:11.168516Z.log
-rw-r--r-- 1 root root 82749 Jan 6 21:20 event_1069_2016-12-17T09:00:00+2016-12-17T09:15:00_2016-12-19T18:49:49.989994Z.log
-rw-r--r-- 1 root root 82875 Jan 6 21:20 event_1069_2016-12-17T10:30:00+2016-12-17T10:45:00_2016-12-19T18:51:16.587333Z.log
-rw-r--r-- 1 root root 82748 Jan 6 21:20 event_1069_2016-12-17T09:30:00+2016-12-17T09:45:00_2016-12-19T18:53:38.834551Z.log
-rw-r--r-- 1 root root 82877 Jan 6 21:20 event_1069_2016-12-17T10:00:00+2016-12-17T10:15:00_2016-12-19T18:56:32.152256Z.log
-rw-r--r-- 1 root root 51298553 Jan 6 21:21 event_1471_2016-12-19T12:45:00+2016-12-19T13:00:00_2016-12-19T18:49:12.838167Z.log
******** time gap ************
-rw-r--r-- 1 root root 71188 Jan 6 21:39 event_1004_2016-09-07T05:30:00+2016-09-07T05:45:00_2016-12-20T02:50:28.805501Z.log
-rw-r--r-- 1 root root 111853 Jan 6 21:39 event_1069_2016-12-16T04:15:00+2016-12-16T04:30:00_2016-12-19T18:50:04.955258Z.log
-rw-r--r-- 1 root root 112028 Jan 6 21:39 event_1069_2016-12-16T13:15:00+2016-12-16T13:30:00_2016-12-19T18:57:00.324180Z.log
-rw-r--r-- 1 root root 111866 Jan 6 21:39 event_1069_2016-12-16T13:00:00+2016-12-16T13:15:00_2016-12-19T18:51:59.700731Z.log
-rw-r--r-- 1 root root 111857 Jan 6 21:39 event_1069_2016-12-16T01:30:00+2016-12-16T01:45:00_2016-12-19T18:56:33.236515Z.log
-rw-r--r-- 1 root root 111857 Jan 6 21:39 event_1069_2016-12-16T03:45:00+2016-12-16T04:00:00_2016-12-19T18:51:16.163971Z.log
-rw-r--r-- 1 root root 1407226 Jan 6 21:39 event_1471_2016-12-19T13:45:00+2016-12-19T14:00:00_2016-12-19T18:51:35.465574Z.log
-rw-r--r-- 1 root root 112018 Jan 6 21:39 event_1069_2016-12-16T03:15:00+2016-12-16T03:30:00_2016-12-19T18:53:39.418384Z.log
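
In case it is useful, here is a rough sketch of how the clustering above can be reproduced from the task log mtimes (the log directory path is a placeholder for our middleManager task-log directory, and the 60-second clustering threshold is arbitrary):

# Sketch: group middleManager task-log mtimes into clusters and print the gaps.
import os
from datetime import datetime, timedelta

LOG_DIR = "/path/to/middlemanager/task-logs"   # placeholder
GAP = timedelta(seconds=60)                    # mtimes closer than this form one cluster

# (mtime, filename) for every .log file in the directory, sorted by mtime
entries = sorted(
    (datetime.fromtimestamp(os.path.getmtime(os.path.join(LOG_DIR, f))), f)
    for f in os.listdir(LOG_DIR) if f.endswith(".log")
)

clusters = []  # each cluster is a list of mtimes
for mtime, _name in entries:
    if clusters and mtime - clusters[-1][-1] <= GAP:
        clusters[-1].append(mtime)
    else:
        clusters.append([mtime])

prev_end = None
for cluster in clusters:
    if prev_end is not None:
        print("gap of %s before next cluster" % (cluster[0] - prev_end))
    print("%s: %d task logs" % (cluster[0], len(cluster)))
    prev_end = cluster[-1]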

It appears that the overlord may be very slow at assigning the ingestion tasks. In overlord.log, I see thousands of lines like the following:


2017-01-06 22:20:15,943 INFO i.d.i.c.a.LocalTaskActionClient [TaskQueue-Manager] Performing action for task[event_1471_2016-12-18T08:30:00+2016-12-18T08:45:00_2016-12-19T19:05:57.945164Z]: LockTryAcquireAction{interval=2016-12-18T08:30:00.000Z/2016-12-18T08:45:00.000Z}
2017-01-06 22:20:15,943 INFO i.d.i.c.a.LocalTaskActionClient [TaskQueue-Manager] Performing action for task[event_1069_2016-12-17T17:00:00+2016-12-17T17:15:00_2016-12-19T19:05:37.461653Z]: LockTryAcquireAction{interval=2016-12-17T17:00:00.000Z/2016-12-17T17:15:00.000Z}
2017-01-06 22:20:15,943 INFO i.d.i.c.a.LocalTaskActionClient [TaskQueue-Manager] Performing action for task[event_1069_2016-12-16T14:00:00+2016-12-16T14:15:00_2016-12-19T18:57:43.911867Z]: LockTryAcquireAction{interval=2016-12-16T14:00:00.000Z/2016-12-16T14:15:00.000Z}
2017-01-06 22:20:15,943 INFO i.d.i.c.a.LocalTaskActionClient [TaskQueue-Manager] Performing action for task[event_1069_2016-12-18T10:30:00+2016-12-18T10:45:00_2016-12-19T19:04:25.133042Z]: LockTryAcquireAction{interval=2016-12-18T10:30:00.000Z/2016-12-18T10:45:00.000Z}
2017-01-06 22:20:15,943 INFO i.d.i.c.a.LocalTaskActionClient [TaskQueue-Manager] Performing action for task[event_1069_2016-12-17T14:00:00+2016-12-17T14:15:00_2016-12-19T19:10:40.580654Z]: LockTryAcquireAction{interval=2016-12-17T14:00:00.000Z/2016-12-17T14:15:00.000Z}
2017-01-06 22:20:15,943 INFO i.d.i.c.a.LocalTaskActionClient [TaskQueue-Manager] Performing action for task[event_1069_2016-12-17T16:00:00+2016-12-17T16:15:00_2016-12-19T19:13:45.835202Z]: LockTryAcquireAction{interval=2016-12-17T16:00:00.000Z/2016-12-17T16:15:00.000Z}
2017-01-06 22:20:15,943 INFO i.d.i.c.a.LocalTaskActionClient [TaskQueue-Manager] Performing action for task[event_1069_2016-12-18T04:15:00+2016-12-18T04:30:00_2016-12-19T19:14:53.442265Z]: LockTryAcquireAction{interval=2016-12-18T04:15:00.000Z/2016-12-18T04:30:00.000Z}
2017-01-06 22:20:15,943 INFO i.d.i.c.a.LocalTaskActionClient [TaskQueue-Manager] Performing action for task[event_1069_2016-12-16T00:45:00+2016-12-16T01:00:00_2016-12-19T19:05:15.762107Z]: LockTryAcquireAction{interval=2016-12-16T00:45:00.000Z/2016-12-16T01:00:00.000Z}
2017-01-06 22:20:15,943 INFO i.d.i.c.a.LocalTaskActionClient [TaskQueue-Manager] Performing action for task[event_1471_2016-12-17T15:00:00+2016-12-17T15:15:00_2016-12-19T18:54:58.635546Z]: LockTryAcquireAction{interval=2016-12-17T15:00:00.000Z/2016-12-17T15:15:00.000Z}
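
To put a rough number on that churn, something like the following can tally the LockTryAcquireAction lines per second in overlord.log (the log path is a placeholder):

# Sketch: count LockTryAcquireAction lines per second in overlord.log
from collections import Counter

LOG_FILE = "/path/to/overlord.log"  # placeholder

per_second = Counter()
with open(LOG_FILE) as f:
    for line in f:
        if "LockTryAcquireAction" in line:
            # Lines start with e.g. "2017-01-06 22:20:15,943"; key on the
            # timestamp truncated to the second (first 19 characters).
            per_second[line[:19]] += 1

for second in sorted(per_second):
    print("%s  %d lock-acquire actions" % (second, per_second[second]))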

Is the overlord unresponsive because it is busy iterating over all of the tasks? Is there any configuration change we can make to reduce the gap between task assignments so that the peons stay busy all the time?
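
For reference, these are the kinds of settings we have been looking at on our side; the values below are only placeholders, and I am not sure which of them, if any, actually affects how quickly tasks get assigned:

# overlord runtime.properties (placeholder values, not recommendations)
# Maximum number of active tasks the overlord keeps in its queue at once
# (we are not sure whether submissions beyond this are queued or rejected).
druid.indexer.queue.maxSize=500
# How often the overlord syncs its task queue with the metadata store.
druid.indexer.queue.storageSyncRate=PT1M

# middleManager runtime.properties (per host; 10 hosts x 8 = our 80 total slots)
druid.worker.capacity=8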

We are running imply-1.3.0 with Druid 0.9.1.1. If there is any other information you need, please let me know.

Thanks.

  • Wayne