Upgrading cluster from 0.10.1 to 0.12

Hi all,

I am trying to upgrade a cluster from 0.10.1 to 0.12.0 and have run into a problem with the lookups feature.

I started with one historical node and upgraded it from 0.10.1 to 0.12.0. Everything went smoothly, except that lookups do not work on queries sent to it.

I checked the log of the coordinator (which is still on 0.10.1) and found this exception:

2018-03-15T14:19:23,620 ERROR [LookupCoordinatorManager–7] io.druid.server.lookup.cache.LookupCoordinatorManager - Failed to finish lookup management loop.: {class=io.druid.server.lookup.cache.LookupCoordinatorManager, exceptionType=class java.lang.IllegalStateException, exceptionMessage=null}
    at com.google.common.base.Preconditions.checkState(Preconditions.java:161) ~[guava-16.0.1.jar:?]
    at com.google.common.net.HostAndPort.getPort(HostAndPort.java:110) ~[guava-16.0.1.jar:?]
    at io.druid.server.lookup.cache.LookupCoordinatorManager.lookupManagementLoop(LookupCoordinatorManager.java:517) ~[druid-server-0.10.1.jar:0.10.1]
    at com.google.common.util.concurrent.MoreExecutors$ScheduledListeningDecorator$NeverSuccessfulListenableFutureTask.run(MoreExecutors.java:582) [guava-16.0.1.jar:?]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_151]
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [?:1.8.0_151]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_151]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [?:1.8.0_151]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_151]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_151]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]


I went further and checked ZooKeeper, and noticed that the upgraded node shows up as http:historical-hostname-X:8083 under /druid/listeners/lookups/__default, while every other node shows up as historical-hostname-Y:8083. It seems that pull request https://github.com/druid-io/druid/pull/4270, which was included in the 0.11.0 release, changed the host format to include the scheme.
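
To make the mismatch concrete, here is a minimal sketch (plain Python, not Druid code) that tells the two name shapes apart. The function names are hypothetical, and it assumes only the two formats seen under /druid/listeners/lookups/__default above: "host:port" and "scheme:host:port". It does not handle IPv6 literals. The stack trace shows getPort failing a state check, which would be consistent with the 0.10.1 coordinator not finding a port in a name of the second shape.

```python
def split_listener_name(name):
    """Split a listener node name into (scheme, host, port).

    Assumption: names are either 'host:port' (older format) or
    'scheme:host:port' (format seen on the upgraded node).
    """
    parts = name.split(":")
    if len(parts) == 2:
        scheme = None
        host, port = parts
    elif len(parts) == 3:
        scheme, host, port = parts
    else:
        raise ValueError(f"unexpected listener name: {name!r}")
    return scheme, host, int(port)


def find_scheme_prefixed(names):
    """Return the names that carry a scheme prefix, in input order."""
    return [n for n in names if split_listener_name(n)[0] is not None]
```

Running find_scheme_prefixed over the children of the listener path would list the nodes that have already been upgraded to the newer format.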

I could not find anything about this in the 0.11.0 or 0.12.0 release notes. Is there an upgrade path from 0.10.1 to 0.12.0 that does not incur downtime on the query nodes? Ingestion can be delayed: we run batch jobs every few minutes, so those can be paused for a while.



In general, we guarantee no-downtime upgrade paths between individual
"major" releases (i.e. 0.10.x -> 0.11.x), but skipping a major release
(0.10.x -> 0.12.x) is not something that we guarantee. Specifically,
the way we deprecate things and eventually remove them is to introduce
the new thing in the next major (running side by side with the old)
and then eliminate the old in the subsequent major. So, if you jump
majors, backwards-incompatible changes can arise.

I just re-read the docs at
http://druid.io/docs/latest/operations/rolling-updates.html and this
is definitely something that could be spelled out better.

Your "supported" options at this point would be to hop through 0.11 on
your way to 0.12 or to take downtime.
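
As a rough illustration of the "hop through each major" rule, the sketch below lists the release lines a rolling upgrade would pass through. The function name is hypothetical, and it assumes that the second version component is the "major" in the sense used above and that every intermediate release line exists.

```python
def upgrade_hops(current, target):
    """List the release lines to step through, one "major" at a time.

    Assumption: versions look like '0.10.1', and the second component
    identifies the "major" release line (0.10.x, 0.11.x, ...).
    """
    cur = tuple(int(p) for p in current.split(".")[:2])
    tgt = tuple(int(p) for p in target.split(".")[:2])
    if cur[0] != tgt[0]:
        raise ValueError("sketch only covers versions with the same first component")
    if tgt[1] < cur[1]:
        raise ValueError("target is older than current")
    # One hop per release line after the current one, up to the target.
    return [f"{cur[0]}.{m}.x" for m in range(cur[1] + 1, tgt[1] + 1)]
```

For example, upgrade_hops("0.10.1", "0.12.0") returns ["0.11.x", "0.12.x"], matching the suggestion to go through 0.11 first.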

You could also explore unsupported options, which might or might not
work. For one, if lookups aren't vital to your primary query load,
you could take the hit on lookups until your cluster is updated. You
could also try updating the coordinator first and then the other nodes
(this might introduce other incompatibilities, but it also might "just