Upgrade from 0.6.173 to 0.8.2, real-time (fed via Tranquility) issue

Hey All,

I’m in the process of upgrading an active production cluster from 0.6.173 to 0.8.2. I have two real-time streams, one with 8 partitions and a replica and another with 2 partitions and a replica. Both end up showing the issue below.

The state:

  • historical nodes upgraded to 0.8.2

  • middle managers upgraded to 0.8.2

I was in the process of upgrading the overlord to 0.8.2. I had completed the process, but then we noticed that some real-time workers died with this exception:

{class=io.druid.segment.realtime.plumber.RealtimePlumber, exceptionType=class org.apache.hadoop.ipc.RemoteException, exceptionMessage=Lease mismatch on /user/bon/druid/summary/20160113T050000.000Z_20160113T060000.000Z/2016-01-13T05_02_51.044Z/4/index.zip owned by DFSClient_NONMAPREDUCE_-1401441023_1 but is accessed by DFSClient_NONMAPREDUCE_-739720424_1

So it looks to me like, for some partitions, it gets confused about which of the two workers is going to write the segment to HDFS. They both try and one fails. As far as I can tell I don’t lose any data, since only one of the two workers fails (either the primary or the replica). I backed out of my upgrade on the overlord and the middle managers and the problem seems to have disappeared. This issue happens on a variable number of partitions… some partitions complete without errors, and I did not see this in testing, albeit at a much smaller volume.

The middle managers do have a recompiled version of Druid with an older version of Jackson and Guava for Hadoop compatibility.

Any thoughts on where I should look for issues? The ZooKeeper version? Or is it just a 0.6.173 <-> 0.8.2 issue? (I did not have the 0.8.2 overlord running for very long, and we have three hours of active real-time processes to support long-standing updates, so I cannot guarantee that the 0.8.2 middle manager jobs were started with a 0.8.2 overlord.)

Thanks!

Mark

Hey Mark,

I guess you’re using HDFS deep storage? I’m wondering why this would be connected to the upgrade. Even beforehand, if you have 2 replicants, they should both be writing the same segment to deep storage at potentially the same time.
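To make the failure mode concrete, here is a minimal Java sketch of what the two replica tasks effectively do against HDFS: two independent DFS clients pushing to the same index.zip path. The namenode URI and class name are placeholders, and whether the pusher opens the file with overwrite depends on the Druid/Hadoop versions involved; the point is only that whichever writer loses the race on the file's lease sees a RemoteException like the one in your task log.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical reproduction of the race: the primary and the replica task
// each have their own DFSClient and both write the same segment path.
public class LeaseRaceSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Path shape taken from the task log above; the namenode URI is a placeholder.
    Path segment = new Path("/user/bon/druid/summary/"
        + "20160113T050000.000Z_20160113T060000.000Z/"
        + "2016-01-13T05_02_51.044Z/4/index.zip");

    // newInstance() gives each "task" its own client, like two separate JVMs would have.
    FileSystem primary = FileSystem.newInstance(URI.create("hdfs://namenode:8020"), conf);
    FileSystem replica = FileSystem.newInstance(URI.create("hdfs://namenode:8020"), conf);

    FSDataOutputStream a = primary.create(segment, true); // primary starts writing, holds the lease
    a.write(new byte[]{1, 2, 3});

    // Replica opens the same path while the primary's write is still in flight.
    FSDataOutputStream b = replica.create(segment, true);
    b.write(new byte[]{4, 5, 6});
    b.close();

    // Depending on timing, one of the two close() calls fails with a RemoteException
    // about the lease being owned by the other DFSClient, which is what the task log shows.
    a.close();
  }
}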

Is it possible this is something that happened sometimes before the upgrade too?

I also wonder if anyone else using tranquility + HDFS deep storage has similar problems?

Yes on the HDFS deep storage.

Nope, the error only happens on the 0.8.2 code base. I rolled back to 0.6.173 and the errors went away (the jobs show up as FAILED on the overlord console, so it’s quite clear when it happens).

Both the first and second workers write to HDFS? I assumed there was some form of leader election so that only one worker pushed the segment partition to deep storage/HDFS, and that there was some race condition that made both think they were the leader.

I have changed the underlying Hadoop library, which I did not have to do with 0.6.173. The Jackson incompatibility made Hadoop jobs break when a task was submitted.

I do have one other question: is it best practice to upgrade the overlord before the middle managers? I have been upgrading middle managers first and waiting for the last old middle manager to exit before upgrading the overlord.

The docs usually just say “upgrade the indexing service”, so I am a bit unclear about the order in this case. Perhaps this is my problem.

Hey Mark,

Usually folks update their overlords before their middleManagers, although that wouldn’t cause what you’re seeing.

I talked to some other folks using HDFS deep storage and they said they see this from time to time during normal operation. It’s kind of okay, in that the “winner” has generally still pushed a segment and will monitor handoff, but it would be better to avoid failed tasks during normal operation. Also, it is possible that the “winner” will die before finishing handoff, and in that case we’d prefer to have the “loser” have the opportunity to finish the job. So I filed this GitHub issue to track it: https://github.com/druid-io/druid/issues/2278

I’m not sure why you didn’t see any of these before upgrading. It’s possible an older version of HDFS deep storage was set to overwrite files, or that it used a different version of the Hadoop client that behaved differently. It’s also possible that some timing difference made your pre-upgrade tasks less likely to try to push at the same time.

Thanks Gian,

I was seeing it basically every hour, but with a variable number of partitions impacted. Given what you’ve said, we are going to change our alerting to page only if all of the workers for a partition have failed, so we can proceed with the upgrade.
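Roughly, the check I have in mind looks like the hypothetical sketch below: group task results by (interval, partition) and page only when every replica of that partition has failed. The TaskResult shape and how the results get pulled from the overlord are assumptions on my part; only the grouping logic matters.

import java.util.*;

// Hypothetical replica-aware alert check. In practice the results would be
// built from the overlord's task list plus each task's partition number.
public class ReplicaAwareAlert {
  static class TaskResult {
    final String interval;  // e.g. "2016-01-13T05/2016-01-13T06"
    final int partitionNum; // e.g. 4
    final boolean failed;

    TaskResult(String interval, int partitionNum, boolean failed) {
      this.interval = interval;
      this.partitionNum = partitionNum;
      this.failed = failed;
    }
  }

  // Returns the partitions where the primary and every replica failed,
  // i.e. the only cases worth paging on.
  static Set<String> partitionsToPage(List<TaskResult> results) {
    Map<String, List<TaskResult>> byPartition = new HashMap<>();
    for (TaskResult r : results) {
      byPartition.computeIfAbsent(r.interval + "/" + r.partitionNum, k -> new ArrayList<>()).add(r);
    }
    Set<String> page = new TreeSet<>();
    for (Map.Entry<String, List<TaskResult>> e : byPartition.entrySet()) {
      if (e.getValue().stream().allMatch(t -> t.failed)) {
        page.add(e.getKey());
      }
    }
    return page;
  }
}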

Perhaps I am getting a higher number of failures because, to keep costs down, I am running more workers per middle manager than what is shown in the production configuration documents.