Druid real-time ingestion issue - two overlords taking the lead at the same time

Hi all,

I ran into a strange issue last night with my Druid cluster.

I am using Druid 0.8.1 and Tranquility 0.4.2 (Scala 2.10) with Storm to ingest real-time data.

Here is an explanation of the issue and what I think happened:

Everything started with a ZooKeeper outage: our Druid ZK cluster went down for a few seconds at 23:26 (2016-03-09).

After ZooKeeper came back up, it looks like our two overlords took the lead at pretty much the same time (we run two overlords for high availability):

Overlord 1:

2016-03-09T23:31:06,162 INFO [Curator-LeaderSelector-0] io.druid.indexing.overlord.TaskMaster - By the power of Grayskull, I have the power!

Overlord 2:

2016-03-09T23:30:45,875 INFO [Curator-LeaderSelector-0] io.druid.indexing.overlord.TaskMaster - By the power of Grayskull, I have the power!

After that, they were both trying to assign my real-time tasks to my workers, but it looks like whenever one assigned a task, the other one killed it right after:

Overlord 1:

2016-03-09T23:55:00,759 INFO [qtp1743770306-33] io.druid.indexing.overlord.MetadataTaskStorage - Inserting task index_realtime_rtb_bid_requests_2016-03-10T00:00:00.000-08:00_1_0 with status: TaskStatus{id=index_realtime_rtb_bid_requests_2016-03-10T00:00:00.000-08:00_1_0, status=RUNNING, duration=-1}
2016-03-09T23:55:00,768 INFO [TaskQueue-Manager] io.druid.indexing.overlord.TaskQueue - Asking taskRunner to run: index_realtime_rtb_bid_requests_2016-03-10T00:00:00.000-08:00_1_0
2016-03-09T23:55:00,768 INFO [TaskQueue-Manager] io.druid.indexing.overlord.RemoteTaskRunner - Added pending task index_realtime_rtb_bid_requests_2016-03-10T00:00:00.000-08:00_1_0
2016-03-09T23:55:00,808 INFO [pool-22-thread-1] io.druid.indexing.overlord.RemoteTaskRunner - Coordinator asking Worker[ip-172-30-103-177.ec2.internal:8080] to add task[index_realtime_rtb_bid_requests_2016-03-10T00:00:00.000-08:00_1_0]
2016-03-09T23:55:00,811 INFO [pool-22-thread-1] io.druid.indexing.overlord.RemoteTaskRunner - Task index_realtime_rtb_bid_requests_2016-03-10T00:00:00.000-08:00_1_0 switched from pending to running (on [ip-172-30-103-177.ec2.internal:8080])
2016-03-09T23:55:00,822 INFO [Curator-PathChildrenCache-3] io.druid.indexing.overlord.RemoteTaskRunner - Worker[ip-172-30-103-177.ec2.internal:8080] wrote RUNNING status for task: index_realtime_rtb_bid_requests_2016-03-10T00:00:00.000-08:00_1_0
2016-03-09T23:55:01,045 INFO [Curator-PathChildrenCache-3] io.druid.indexing.overlord.RemoteTaskRunner - Worker[ip-172-30-103-177.ec2.internal:8080] wrote FAILED status for task: index_realtime_rtb_bid_requests_2016-03-10T00:00:00.000-08:00_1_0
2016-03-09T23:55:01,045 INFO [Curator-PathChildrenCache-3] io.druid.indexing.overlord.RemoteTaskRunner - Worker[ip-172-30-103-177.ec2.internal:8080] completed task[index_realtime_rtb_bid_requests_2016-03-10T00:00:00.000-08:00_1_0] with status[FAILED]
2016-03-09T23:55:01,045 INFO [Curator-PathChildrenCache-3] io.druid.indexing.overlord.TaskQueue - Received FAILED status for task: index_realtime_rtb_bid_requests_2016-03-10T00:00:00.000-08:00_1_0
2016-03-09T23:55:01,047 INFO [Curator-PathChildrenCache-3] io.druid.indexing.overlord.RemoteTaskRunner - Cleaning up task[index_realtime_rtb_bid_requests_2016-03-10T00:00:00.000-08:00_1_0] on worker[ip-172-30-103-177.ec2.internal:8080]
2016-03-09T23:55:01,050 INFO [Curator-PathChildrenCache-3] io.druid.indexing.overlord.MetadataTaskStorage - Updating task index_realtime_rtb_bid_requests_2016-03-10T00:00:00.000-08:00_1_0 to status: TaskStatus{id=index_realtime_rtb_bid_requests_2016-03-10T00:00:00.000-08:00_1_0, status=FAILED, duration=0}
2016-03-09T23:55:01,057 INFO [Curator-PathChildrenCache-3] io.druid.indexing.overlord.TaskQueue - Task done: RealtimeIndexTask{id=index_realtime_rtb_bid_requests_2016-03-10T00:00:00.000-08:00_1_0, type=index_realtime, dataSource=rtb_bid_requests}
2016-03-09T23:55:01,057 INFO [Curator-PathChildrenCache-3] io.druid.indexing.overlord.TaskQueue - Task FAILED: RealtimeIndexTask{id=index_realtime_rtb_bid_requests_2016-03-10T00:00:00.000-08:00_1_0, type=index_realtime, dataSource=rtb_bid_requests} (0 run duration)
2016-03-09T23:55:01,057 INFO [Curator-PathChildrenCache-3] io.druid.indexing.overlord.RemoteTaskRunner - Task[index_realtime_rtb_bid_requests_2016-03-10T00:00:00.000-08:00_1_0] went bye bye.

Overlord 2:

2016-03-09T23:55:00,824 INFO [Curator-PathChildrenCache-2] io.druid.indexing.overlord.RemoteTaskRunner - Worker[ip-172-30-103-177.ec2.internal:8080] wrote RUNNING status for task: index_realtime_rtb_bid_requests_2016-03-10T00:00:00.000-08:00_1_0
2016-03-09T23:55:00,824 WARN [Curator-PathChildrenCache-2] io.druid.indexing.overlord.RemoteTaskRunner - Worker[ip-172-30-103-177.ec2.internal:8080] announced a status for a task I didn't know about, adding to runningTasks: index_realtime_rtb_bid_requests_2016-03-10T00:00:00.000-08:00_1_0
2016-03-09T23:55:00,956 INFO [TaskQueue-Manager] io.druid.indexing.overlord.RemoteTaskRunner - Sent shutdown message to worker: ip-172-30-103-177.ec2.internal:8080, status 200 OK, response: {"task":"index_realtime_rtb_bid_requests_2016-03-10T00:00:00.000-08:00_1_0"}
2016-03-09T23:55:00,956 ERROR [TaskQueue-Manager] io.druid.indexing.overlord.RemoteTaskRunner - Shutdown failed for index_realtime_rtb_bid_requests_2016-03-10T00:00:00.000-08:00_1_0! Are you sure the task was running?
2016-03-09T23:55:00,992 INFO [TaskQueue-Manager] io.druid.indexing.overlord.RemoteTaskRunner - Sent shutdown message to worker: ip-172-30-103-177.ec2.internal:8080, status 200 OK, response: {"task":"index_realtime_rtb_bid_requests_2016-03-10T00:00:00.000-08:00_1_0"}
2016-03-09T23:55:00,992 ERROR [TaskQueue-Manager] io.druid.indexing.overlord.RemoteTaskRunner - Shutdown failed for index_realtime_rtb_bid_requests_2016-03-10T00:00:00.000-08:00_1_0! Are you sure the task was running?
2016-03-09T23:55:01,059 INFO [TaskQueue-Manager] io.druid.indexing.overlord.RemoteTaskRunner - Sent shutdown message to worker: ip-172-30-103-177.ec2.internal:8080, status 200 OK, response: {"task":"index_realtime_rtb_bid_requests_2016-03-10T00:00:00.000-08:00_1_0"}
2016-03-09T23:55:01,059 ERROR [TaskQueue-Manager] io.druid.indexing.overlord.RemoteTaskRunner - Shutdown failed for index_realtime_rtb_bid_requests_2016-03-10T00:00:00.000-08:00_1_0! Are you sure the task was running?
2016-03-09T23:55:01,078 INFO [Curator-PathChildrenCache-2] io.druid.indexing.overlord.RemoteTaskRunner - Task[index_realtime_rtb_bid_requests_2016-03-10T00:00:00.000-08:00_1_0] just disappeared!
2016-03-09T23:55:01,078 INFO [TaskQueue-Manager] io.druid.indexing.overlord.RemoteTaskRunner - Can't shutdown! No worker running task index_realtime_rtb_bid_requests_2016-03-10T00:00:00.000-08:00_1_0

As a result, all my real-time index tasks failed right after they were created. The situation went back to normal when I restarted one of my overlords, which was smart enough not to take the lead again.

Here are my questions:

  1. Do you think my understanding of the problem is correct? I have attached my two overlord logs if you want to take a closer look.

  2. These two log statements are weird to me:

2016-03-09T23:55:00,824 WARN [Curator-PathChildrenCache-2] io.druid.indexing.overlord.RemoteTaskRunner - Worker[ip-172-30-103-177.ec2.internal:8080] announced a status for a task I didn't know about, adding to runningTasks: index_realtime_rtb_bid_requests_2016-03-10T00:00:00.000-08:00_1_0
2016-03-09T23:55:00,956 INFO [TaskQueue-Manager] io.druid.indexing.overlord.RemoteTaskRunner - Sent shutdown message to worker: ip-172-30-103-177.ec2.internal:8080, status 200 OK, response: {"task":"index_realtime_rtb_bid_requests_2016-03-10T00:00:00.000-08:00_1_0"}

If my understanding is correct, one of my workers announced that it was running a task that this overlord didn't know about, and right after, the overlord asked the same worker to shut down that task. Why would that happen?

  3. Has anyone experienced the same issue? How can I avoid this problem in the future? I would like to keep two overlords for high availability if possible.

Thanks a lot for your answers.

Guillaume

Archive.zip (2.04 MB)

It is possible for two overlords to think they're leading at the same time, if one was disconnected from ZK but hasn't figured this out yet (because its session hasn't expired yet).
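
To make that window concrete, here is a minimal Curator LeaderSelector sketch (just the general recipe, not Druid's actual code; the class name and ZK path below are made up): the elected process only learns that it lost ZK through a connection-state callback, so there is a period where it can still believe it is leading while another process gets elected.

```
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderSelector;
import org.apache.curator.framework.recipes.leader.LeaderSelectorListenerAdapter;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class LeaderElectionSketch {
  public static void main(String[] args) throws Exception {
    CuratorFramework client = CuratorFrameworkFactory.newClient(
        "172.30.103.174:2181,172.30.103.25:2181,172.30.103.98:2181",
        30000,  // session timeout: how long ZK keeps the ephemeral leader node after a disconnect
        15000,  // connection timeout
        new ExponentialBackoffRetry(1000, 3));
    client.start();

    // "/overlord/leader" is a placeholder path, not the one Druid uses.
    LeaderSelector selector = new LeaderSelector(client, "/overlord/leader",
        new LeaderSelectorListenerAdapter() {
          @Override
          public void takeLeadership(CuratorFramework curator) throws Exception {
            // "By the power of Grayskull, I have the power!"
            // Leadership lasts until this method returns or is interrupted.
            // A dropped connection is only reported via a state callback
            // (the adapter responds by throwing CancelLeadershipException,
            // which interrupts this thread). If that callback lags, another
            // process can be elected as soon as the old session expires on
            // the ZK side, while this one still thinks it is the leader.
            Thread.currentThread().join();
          }
        });
    selector.autoRequeue();  // rejoin the election after losing leadership
    selector.start();
  }
}
```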

You should be able to avoid problems here by setting druid.indexer.queue.startDelay to something higher than druid.zk.service.sessionTimeoutMs. The defaults are 1 minute and 30 seconds, which should work. Is it possible you’re overriding one of them?
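
For reference, this is what that pair looks like with the documented defaults written out explicitly in the overlord runtime.properties (illustrative only, since you are not overriding them):

druid.zk.service.sessionTimeoutMs=30000
druid.indexer.queue.startDelay=PT1M

With those values, a newly elected overlord sleeps for a full minute before it starts managing the task queue, which gives a leader that was disconnected (but whose session had not yet expired) time to notice and bow out before both start assigning tasks.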

Hi Gian,

Thanks for your answer. I have double-checked and I am not overriding these parameters.

Here is my common conf:

druid.announcer.type=batch
druid.cache.sizeInBytes=1000000000
druid.cache.type=local
druid.curator.compress=true
druid.discovery.curator.path=/prod/discovery
druid.emitter=logging
druid.extensions.coordinates=["io.druid.extensions:druid-s3-extensions:0.8.1", "io.druid.extensions:druid-kafka-eight:0.8.1", "io.druid.extensions:mysql-metadata-storage:0.8.1"]
druid.metadata.storage.connector.connectURI=############################
druid.metadata.storage.connector.password=#####
druid.metadata.storage.connector.user=#########
druid.metadata.storage.connector.useValidationQuery=true
druid.metadata.storage.tables.base=prod
druid.metadata.storage.type=mysql
druid.s3.accessKey=#############
druid.s3.secretKey=#################################
druid.selectors.indexing.serviceName=druid:prod:indexer
druid.storage.baseKey=prod-segments-druid-08
druid.storage.bucket=gumgum-druid
druid.zk.paths.base=/druid/prod
druid.zk.service.host=#########

Here is my overlord conf:

druid.host=#########################
druid.indexer.firehoseId.prefix=druid:prod:chat
druid.indexer.logs.s3Bucket=gumgum-druid
druid.indexer.logs.s3Prefix=tasks-logs-druid-08
druid.indexer.logs.type=s3
druid.indexer.runner.compressZnodes=true
druid.indexer.runner.type=remote
druid.indexer.storage.type=metadata
druid.port=8080
druid.selectors.indexing.serviceName=druid:prod:indexer
druid.service=druid/prod/indexer

Here are more logs where my overlords successfully established their ZK sessions (with a 30000 ms timeout) and decided to take the lead:

Overlord 1:

2016-03-09T23:30:23,305 INFO [Curator-LeaderSelector-0-SendThread(172.30.103.98:2181)] org.apache.zookeeper.ClientCnxn - Opening socket connection to server 172.30.103.98/172.30.103.98:2181. Will not attempt to authenticate using SASL (unknown error)
2016-03-09T23:30:23,306 INFO [Curator-LeaderSelector-0-SendThread(172.30.103.98:2181)] org.apache.zookeeper.ClientCnxn - Socket connection established to 172.30.103.98/172.30.103.98:2181, initiating session
2016-03-09T23:30:23,306 INFO [Curator-LeaderSelector-0-SendThread(172.30.103.98:2181)] org.apache.zookeeper.ClientCnxn - Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect
2016-03-09T23:30:24,332 INFO [Curator-LeaderSelector-0-SendThread(172.30.103.25:2181)] org.apache.zookeeper.ClientCnxn - Opening socket connection to server 172.30.103.25/172.30.103.25:2181. Will not attempt to authenticate using SASL (unknown error)
2016-03-09T23:30:24,334 INFO [Curator-LeaderSelector-0-SendThread(172.30.103.25:2181)] org.apache.zookeeper.ClientCnxn - Socket connection established to 172.30.103.25/172.30.103.25:2181, initiating session
2016-03-09T23:30:30,656 INFO [Curator-LeaderSelector-0-SendThread(172.30.103.25:2181)] org.apache.zookeeper.ClientCnxn - Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect
2016-03-09T23:30:31,485 INFO [Curator-LeaderSelector-0-SendThread(172.30.103.174:2181)] org.apache.zookeeper.ClientCnxn - Opening socket connection to server 172.30.103.174/172.30.103.174:2181. Will not attempt to authenticate using SASL (unknown error)
2016-03-09T23:30:34,636 WARN [qtp1743770306-54] org.apache.curator.ConnectionState - Connection attempt unsuccessful after 36886 (greater than max timeout of 30000). Resetting connection and trying again with a new connection.
2016-03-09T23:30:40,863 INFO [qtp1743770306-54] org.apache.zookeeper.ZooKeeper - Session: 0x0 closed
2016-03-09T23:30:40,863 INFO [qtp1743770306-54] org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=172.30.103.174:2181,172.30.103.25:2181,172.30.103.98:2181 sessionTimeout=30000 watcher=org.apache.curator.ConnectionState@4d0e54e0
2016-03-09T23:30:40,863 INFO [Curator-LeaderSelector-0-EventThread] org.apache.zookeeper.ClientCnxn - EventThread shut down
2016-03-09T23:30:40,866 INFO [qtp1743770306-54-SendThread(172.30.103.98:2181)] org.apache.zookeeper.ClientCnxn - Opening socket connection to server 172.30.103.98/172.30.103.98:2181. Will not attempt to authenticate using SASL (unknown error)
2016-03-09T23:30:40,866 INFO [qtp1743770306-54-SendThread(172.30.103.98:2181)] org.apache.zookeeper.ClientCnxn - Socket connection established to 172.30.103.98/172.30.103.98:2181, initiating session
2016-03-09T23:30:40,870 INFO [qtp1743770306-54-SendThread(172.30.103.98:2181)] org.apache.zookeeper.ClientCnxn - Session establishment complete on server 172.30.103.98/172.30.103.98:2181, sessionid = 0x1535f6f21e80019, negotiated timeout = 30000
2016-03-09T23:30:40,871 INFO [qtp1743770306-54-EventThread] org.apache.curator.framework.state.ConnectionStateManager - State change: RECONNECTED
2016-03-09T23:30:45,875 INFO [Curator-LeaderSelector-0] io.druid.indexing.overlord.TaskMaster - By the power of Grayskull, I have the power!
2016-03-09T23:30:46,074 INFO [Curator-LeaderSelector-0] io.druid.indexing.overlord.TaskLockbox - Created new TaskLockPosse: TaskLockPosse{taskLock=TaskLock{groupId=index_realtime_ad_events, dataSource=ad_events, interval=2016-03-09T22:00:00.000-08:00/2016-03-09T23:00:00.000-08:00, version=2016-03-09T22:00:00.098-08:00}, taskIds=}
2016-03-09T23:30:46,074 INFO [Curator-LeaderSelector-0] io.druid.indexing.overlord.TaskLockbox - Added task[index_realtime_ad_events_2016-03-09T22:00:00.000-08:00_0_0] to TaskLock[index_realtime_ad_events]
2016-03-09T23:30:46,074 INFO [Curator-LeaderSelector-0] io.druid.indexing.overlord.MetadataTaskStorage - Adding lock on interval[2016-03-09T22:00:00.000-08:00/2016-03-09T23:00:00.000-08:00] version[2016-03-09T22:00:00.098-08:00] for task: index_realtime_ad_events_2016-03-09T22:00:00.000-08:00_0_0

Overlord 2:

2016-03-09T23:30:26,033 WARN [Curator-Framework-0] org.apache.curator.ConnectionState - Connection attempt unsuccessful after 30098 (greater than max timeout of 30000). Resetting connection and trying again with a new connection.
2016-03-09T23:30:30,752 INFO [Curator-Framework-0] org.apache.zookeeper.ZooKeeper - Session: 0x0 closed
2016-03-09T23:30:30,752 INFO [Curator-Framework-0] org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=172.30.103.174:2181,172.30.103.25:2181,172.30.103.98:2181 sessionTimeout=30000 watcher=org.apache.curator.ConnectionState@4d0e54e0
2016-03-09T23:30:30,753 INFO [Curator-Framework-0-EventThread] org.apache.zookeeper.ClientCnxn - EventThread shut down
2016-03-09T23:30:30,756 INFO [Curator-Framework-0-SendThread(172.30.103.174:2181)] org.apache.zookeeper.ClientCnxn - Opening socket connection to server 172.30.103.174/172.30.103.174:2181. Will not attempt to authenticate using SASL (unknown error)
2016-03-09T23:30:40,763 INFO [Curator-Framework-0-SendThread(172.30.103.174:2181)] org.apache.zookeeper.ClientCnxn - Client session timed out, have not heard from server in 10007ms for sessionid 0x0, closing socket connection and attempting reconnect
2016-03-09T23:30:40,864 INFO [Curator-Framework-0-SendThread(172.30.103.25:2181)] org.apache.zookeeper.ClientCnxn - Opening socket connection to server 172.30.103.25/172.30.103.25:2181. Will not attempt to authenticate using SASL (unknown error)
2016-03-09T23:30:40,864 INFO [Curator-Framework-0-SendThread(172.30.103.25:2181)] org.apache.zookeeper.ClientCnxn - Socket connection established to 172.30.103.25/172.30.103.25:2181, initiating session
2016-03-09T23:30:40,867 INFO [Curator-Framework-0-SendThread(172.30.103.25:2181)] org.apache.zookeeper.ClientCnxn - Session establishment complete on server 172.30.103.25/172.30.103.25:2181, sessionid = 0x3535f6f21f6000e, negotiated timeout = 30000
2016-03-09T23:30:40,868 INFO [Curator-Framework-0-EventThread] org.apache.curator.framework.state.ConnectionStateManager - State change: RECONNECTED
2016-03-09T23:30:45,867 INFO [Curator-ConnectionStateManager-0] com.metamx.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void io.druid.indexing.overlord.TaskQueue.stop()] on object[io.druid.indexing.overlord.TaskQueue@3eef61f6].
2016-03-09T23:30:45,868 INFO [Curator-ConnectionStateManager-0] com.metamx.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void io.druid.indexing.overlord.RemoteTaskRunner.stop()] on object[io.druid.indexing.overlord.RemoteTaskRunner@34affbab].
2016-03-09T23:30:45,869 INFO [Curator-LeaderSelector-0] io.druid.indexing.overlord.TaskMaster - Bowing out!
2016-03-09T23:31:06,162 INFO [Curator-LeaderSelector-0] io.druid.indexing.overlord.TaskMaster - By the power of Grayskull, I have the power!
2016-03-09T23:31:06,209 INFO [Curator-LeaderSelector-0] io.druid.indexing.overlord.TaskLockbox - Created new TaskLockPosse: TaskLockPosse{taskLock=TaskLock{groupId=index_realtime_ad_events, dataSource=ad_events, interval=2016-03-09T22:00:00.000-08:00/2016-03-09T23:00:00.000-08:00, version=2016-03-09T22:00:00.098-08:00}, taskIds=}
2016-03-09T23:31:06,209 INFO [Curator-LeaderSelector-0] io.druid.indexing.overlord.TaskLockbox - Added task[index_realtime_ad_events_2016-03-09T22:00:00.000-08:00_0_0] to TaskLock[index_realtime_ad_events]
2016-03-09T23:31:06,209 INFO [Curator-LeaderSelector-0] io.druid.indexing.overlord.MetadataTaskStorage - Adding lock on interval[2016-03-09T22:00:00.000-08:00/2016-03-09T23:00:00.000-08:00] version[2016-03-09T22:00:00.098-08:00] for task: index_realtime_ad_events_2016-03-09T22:00:00.000-08:00_0_0
2016-03-09T23:31:06,215 INFO [Curator-LeaderSelector-0] io.druid.indexing.overlord.TaskLockbox - Reacquired lock on interval[2016-03-09T22:00:00.000-08:00/2016-03-09T23:00:00.000-08:00] version[2016-03-09T22:00:00.098-08:00] for task: index_realtime_ad_events_2016-03-09T22:00:00.000-08:00_0_0
2016-03-09T23:31:06,215 INFO [Curator-LeaderSelector-0] io.druid.indexing.overlord.TaskLockbox - Task[index_realtime_ad_events_2016-03-09T22:00:00.000-08:00_0_0] already present in TaskLock[index_realtime_ad_events]

What does it mean in practice for my overlords to have druid.indexer.queue.startDelay set to 1 minute? I am not sure I understand the doc for this property, or how it would have prevented this issue if it had indeed been set to a higher value than the ZK session timeout.

How can I check the value of this property? I could not find it in the overlord logs.

Thanks!

Guillaume

Any ideas about what happened there?

We experienced the same issue.

Actually, it was a bug: https://github.com/druid-io/druid/issues/3046

I could not reproduce it on the latest 0.9.2 version, so I hope it is fixed now.