Hi all,
I ran into a strange issue last night with my Druid cluster.
I am using Druid 0.8.1 and Tranquility 0.4.2 (Scala 2.10) with Storm to ingest real-time data.
Here is an explanation of the issue and what I think happened:
Everything started with a ZooKeeper outage: our Druid ZK cluster went down for a few seconds at 23:26 (2016-03-09).
After ZooKeeper came back up, it looks like both of our Overlords took leadership at pretty much the same time (we run two Overlords for high availability):
Overlord 1:
```
2016-03-09T23:31:06,162 INFO [Curator-LeaderSelector-0] io.druid.indexing.overlord.TaskMaster - By the power of Grayskull, I have the power!
```
Overlord 2:
```
2016-03-09T23:30:45,875 INFO [Curator-LeaderSelector-0] io.druid.indexing.overlord.TaskMaster - By the power of Grayskull, I have the power!
```
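As a side note, one quick way to double-check which node each Overlord currently considers the leader would be to hit the leader endpoint on both of them, along the lines of the sketch below. I am assuming the /druid/indexer/v1/leader endpoint is available in 0.8.1 (I am not 100% sure it is), and the hostnames are placeholders:
```
# Minimal sketch: ask each Overlord which node it currently considers the leader.
# Assumes the /druid/indexer/v1/leader endpoint exists in this Druid version
# (not verified for 0.8.1); hostnames are placeholders.
import urllib.request

OVERLORDS = ["http://overlord1.example.com:8090", "http://overlord2.example.com:8090"]

for base in OVERLORDS:
    try:
        with urllib.request.urlopen(base + "/druid/indexer/v1/leader", timeout=5) as resp:
            print(base, "thinks the leader is", resp.read().decode())
    except Exception as exc:
        print(base, "did not answer:", exc)
```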
After that, both Overlords were trying to assign my real-time tasks to the workers, but it looks like whenever one of them assigned a task, the other one killed it right after:
Overlord 1:
```
2016-03-09T23:55:00,759 INFO [qtp1743770306-33] io.druid.indexing.overlord.MetadataTaskStorage - Inserting task index_realtime_rtb_bid_requests_2016-03-10T00:00:00.000-08:00_1_0 with status: TaskStatus{id=index_realtime_rtb_bid_requests_2016-03-10T00:00:00.000-08:00_1_0, status=RUNNING, duration=-1}
2016-03-09T23:55:00,768 INFO [TaskQueue-Manager] io.druid.indexing.overlord.TaskQueue - Asking taskRunner to run: index_realtime_rtb_bid_requests_2016-03-10T00:00:00.000-08:00_1_0
2016-03-09T23:55:00,768 INFO [TaskQueue-Manager] io.druid.indexing.overlord.RemoteTaskRunner - Added pending task index_realtime_rtb_bid_requests_2016-03-10T00:00:00.000-08:00_1_0
2016-03-09T23:55:00,808 INFO [pool-22-thread-1] io.druid.indexing.overlord.RemoteTaskRunner - Coordinator asking Worker[ip-172-30-103-177.ec2.internal:8080] to add task[index_realtime_rtb_bid_requests_2016-03-10T00:00:00.000-08:00_1_0]
2016-03-09T23:55:00,811 INFO [pool-22-thread-1] io.druid.indexing.overlord.RemoteTaskRunner - Task index_realtime_rtb_bid_requests_2016-03-10T00:00:00.000-08:00_1_0 switched from pending to running (on [ip-172-30-103-177.ec2.internal:8080])
2016-03-09T23:55:00,822 INFO [Curator-PathChildrenCache-3] io.druid.indexing.overlord.RemoteTaskRunner - Worker[ip-172-30-103-177.ec2.internal:8080] wrote RUNNING status for task: index_realtime_rtb_bid_requests_2016-03-10T00:00:00.000-08:00_1_0
2016-03-09T23:55:01,045 INFO [Curator-PathChildrenCache-3] io.druid.indexing.overlord.RemoteTaskRunner - Worker[ip-172-30-103-177.ec2.internal:8080] wrote FAILED status for task: index_realtime_rtb_bid_requests_2016-03-10T00:00:00.000-08:00_1_0
2016-03-09T23:55:01,045 INFO [Curator-PathChildrenCache-3] io.druid.indexing.overlord.RemoteTaskRunner - Worker[ip-172-30-103-177.ec2.internal:8080] completed task[index_realtime_rtb_bid_requests_2016-03-10T00:00:00.000-08:00_1_0] with status[FAILED]
2016-03-09T23:55:01,045 INFO [Curator-PathChildrenCache-3] io.druid.indexing.overlord.TaskQueue - Received FAILED status for task: index_realtime_rtb_bid_requests_2016-03-10T00:00:00.000-08:00_1_0
2016-03-09T23:55:01,047 INFO [Curator-PathChildrenCache-3] io.druid.indexing.overlord.RemoteTaskRunner - Cleaning up task[index_realtime_rtb_bid_requests_2016-03-10T00:00:00.000-08:00_1_0] on worker[ip-172-30-103-177.ec2.internal:8080]
2016-03-09T23:55:01,050 INFO [Curator-PathChildrenCache-3] io.druid.indexing.overlord.MetadataTaskStorage - Updating task index_realtime_rtb_bid_requests_2016-03-10T00:00:00.000-08:00_1_0 to status: TaskStatus{id=index_realtime_rtb_bid_requests_2016-03-10T00:00:00.000-08:00_1_0, status=FAILED, duration=0}
2016-03-09T23:55:01,057 INFO [Curator-PathChildrenCache-3] io.druid.indexing.overlord.TaskQueue - Task done: RealtimeIndexTask{id=index_realtime_rtb_bid_requests_2016-03-10T00:00:00.000-08:00_1_0, type=index_realtime, dataSource=rtb_bid_requests}
2016-03-09T23:55:01,057 INFO [Curator-PathChildrenCache-3] io.druid.indexing.overlord.TaskQueue - Task FAILED: RealtimeIndexTask{id=index_realtime_rtb_bid_requests_2016-03-10T00:00:00.000-08:00_1_0, type=index_realtime, dataSource=rtb_bid_requests} (0 run duration)
2016-03-09T23:55:01,057 INFO [Curator-PathChildrenCache-3] io.druid.indexing.overlord.RemoteTaskRunner - Task[index_realtime_rtb_bid_requests_2016-03-10T00:00:00.000-08:00_1_0] went bye bye.
```
Overlord 2:
```
2016-03-09T23:55:00,824 INFO [Curator-PathChildrenCache-2] io.druid.indexing.overlord.RemoteTaskRunner - Worker[ip-172-30-103-177.ec2.internal:8080] wrote RUNNING status for task: index_realtime_rtb_bid_requests_2016-03-10T00:00:00.000-08:00_1_0
2016-03-09T23:55:00,824 WARN [Curator-PathChildrenCache-2] io.druid.indexing.overlord.RemoteTaskRunner - Worker[ip-172-30-103-177.ec2.internal:8080] announced a status for a task I didn't know about, adding to runningTasks: index_realtime_rtb_bid_requests_2016-03-10T00:00:00.000-08:00_1_0
2016-03-09T23:55:00,956 INFO [TaskQueue-Manager] io.druid.indexing.overlord.RemoteTaskRunner - Sent shutdown message to worker: ip-172-30-103-177.ec2.internal:8080, status 200 OK, response: {"task":"index_realtime_rtb_bid_requests_2016-03-10T00:00:00.000-08:00_1_0"}
2016-03-09T23:55:00,956 ERROR [TaskQueue-Manager] io.druid.indexing.overlord.RemoteTaskRunner - Shutdown failed for index_realtime_rtb_bid_requests_2016-03-10T00:00:00.000-08:00_1_0! Are you sure the task was running?
2016-03-09T23:55:00,992 INFO [TaskQueue-Manager] io.druid.indexing.overlord.RemoteTaskRunner - Sent shutdown message to worker: ip-172-30-103-177.ec2.internal:8080, status 200 OK, response: {"task":"index_realtime_rtb_bid_requests_2016-03-10T00:00:00.000-08:00_1_0"}
2016-03-09T23:55:00,992 ERROR [TaskQueue-Manager] io.druid.indexing.overlord.RemoteTaskRunner - Shutdown failed for index_realtime_rtb_bid_requests_2016-03-10T00:00:00.000-08:00_1_0! Are you sure the task was running?
2016-03-09T23:55:01,059 INFO [TaskQueue-Manager] io.druid.indexing.overlord.RemoteTaskRunner - Sent shutdown message to worker: ip-172-30-103-177.ec2.internal:8080, status 200 OK, response: {"task":"index_realtime_rtb_bid_requests_2016-03-10T00:00:00.000-08:00_1_0"}
2016-03-09T23:55:01,059 ERROR [TaskQueue-Manager] io.druid.indexing.overlord.RemoteTaskRunner - Shutdown failed for index_realtime_rtb_bid_requests_2016-03-10T00:00:00.000-08:00_1_0! Are you sure the task was running?
2016-03-09T23:55:01,078 INFO [Curator-PathChildrenCache-2] io.druid.indexing.overlord.RemoteTaskRunner - Task[index_realtime_rtb_bid_requests_2016-03-10T00:00:00.000-08:00_1_0] just disappeared!
2016-03-09T23:55:01,078 INFO [TaskQueue-Manager] io.druid.indexing.overlord.RemoteTaskRunner - Can't shutdown! No worker running task index_realtime_rtb_bid_requests_2016-03-10T00:00:00.000-08:00_1_0
```
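For what it's worth, my understanding is that those "Sent shutdown message to worker" lines are plain HTTP POSTs from the Overlord to the worker's task endpoint. The sketch below reproduces what I think that call looks like; the worker host and task id are copied from the logs above, and the /druid/worker/v1/task/{id}/shutdown path is my assumption about the worker API:
```
# Sketch of the shutdown call I believe Overlord 2 kept sending to the worker.
# Worker host and task id come from the logs above; the
# /druid/worker/v1/task/{id}/shutdown path is my assumption about the worker API.
import urllib.request

worker = "ip-172-30-103-177.ec2.internal:8080"
task_id = "index_realtime_rtb_bid_requests_2016-03-10T00:00:00.000-08:00_1_0"
url = "http://{}/druid/worker/v1/task/{}/shutdown".format(worker, task_id)

req = urllib.request.Request(url, data=b"", method="POST")
with urllib.request.urlopen(req, timeout=5) as resp:
    print(resp.status, resp.read().decode())
```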
As a result, all my real-time index tasks failed right after they were created. The situation went back to normal when I restarted one of my Overlords, which then did not take leadership again.
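In the meantime, I am considering running a small watchdog along the lines of the sketch below, basically the same leader check as above but in a loop that compares the answers, so I get warned earlier if both Overlords ever claim leadership again. Same caveat about the leader endpoint in 0.8.1, and the hostnames are placeholders:
```
# Watchdog sketch: warn if the two Overlords disagree about who the leader is.
# Same assumption as above about the /druid/indexer/v1/leader endpoint;
# hostnames are placeholders.
import time
import urllib.request

OVERLORDS = ["http://overlord1.example.com:8090", "http://overlord2.example.com:8090"]

def leader_seen_by(base):
    with urllib.request.urlopen(base + "/druid/indexer/v1/leader", timeout=5) as resp:
        return resp.read().decode().strip()

while True:
    answers = {}
    for base in OVERLORDS:
        try:
            answers[base] = leader_seen_by(base)
        except Exception as exc:
            answers[base] = "unreachable ({})".format(exc)
    if len(set(answers.values())) > 1:
        print("WARNING: the Overlords disagree about leadership:", answers)
    time.sleep(60)
```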
Here are my questions:
- Do you think my understanding of the problem is correct? I have attached my two Overlord logs if you want to take a closer look.
- These two log statements are weird to me:
```
2016-03-09T23:55:00,824 WARN [Curator-PathChildrenCache-2] io.druid.indexing.overlord.RemoteTaskRunner - Worker[ip-172-30-103-177.ec2.internal:8080] announced a status for a task I didn't know about, adding to runningTasks: index_realtime_rtb_bid_requests_2016-03-10T00:00:00.000-08:00_1_0
2016-03-09T23:55:00,956 INFO [TaskQueue-Manager] io.druid.indexing.overlord.RemoteTaskRunner - Sent shutdown message to worker: ip-172-30-103-177.ec2.internal:8080, status 200 OK, response: {"task":"index_realtime_rtb_bid_requests_2016-03-10T00:00:00.000-08:00_1_0"}
```
If my understanding is correct, one of my workers announced that it was running a task my Overlord didn't know about, and right after that the Overlord asked this same worker to shut the task down. Why would that happen?
- Has anyone experienced the same issue? How can I avoid this problem in the future? I would like to keep two Overlords for high availability if possible.
Thanks a lot for your answers.
Guillaume
Archive.zip (2.04 MB)