We started upgrading Druid from druid-0.10.0 to druid-0.11.0.
We upgraded the coordinator and overlord first, but the overlord was very slow on the upgraded version. The HTTP APIs for fetching running and completed tasks were very slow, and after some time they appeared to get stuck completely.
An initial thread dump (attached) revealed that the blocked threads were waiting on MySQL connections (we use MySQL for metadata storage) and that all 8 threads in the MySQL connection pool were busy. Those threads are all in the RUNNABLE state but seem to be stuck (we took the thread dump many times, and each time the threads were stuck on one of the 2 attached stack traces).
Can someone help us figure out the issue?
Java versions: 1.8.0_131 and 1.8.0_191
OS: Linux #host 4.12.0-1-amd64 #1 SMP Debian 4.12.6-1 (2017-08-12) x86_64 GNU/Linux
I am attaching the thread dumps taken while the overlord was stuck.
I am also attaching the overlord config and the common config.
common config (1.26 KB)
overlord config (1.74 KB)
overlord_thread_dump_1 (12.4 KB)
overlord_thread_dump_2 (13.1 KB)
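For anyone comparing several dumps like the ones attached above, here is a minimal sketch that tallies threads by state and top stack frame, which makes it easier to confirm that the same threads sit on the same frames in every dump. It assumes the dumps are in standard `jstack` text format; `summarize_dump` is a hypothetical helper name, not part of Druid or the JDK.

```python
import re
from collections import Counter

# Patterns for the standard jstack text layout (an assumption about the dump format).
THREAD_HEADER = re.compile(r'^"(?P<name>[^"]+)"')
STATE_LINE = re.compile(r'^\s*java\.lang\.Thread\.State: (?P<state>\w+)')
FRAME_LINE = re.compile(r'^\s*at (?P<frame>[^(]+)')


def summarize_dump(text):
    """Count threads by (state, top stack frame) in a jstack-style dump."""
    summary = Counter()
    state = None
    in_thread = False
    for line in text.splitlines():
        if THREAD_HEADER.match(line):
            # A quoted thread name starts a new thread section.
            in_thread, state = True, None
            continue
        if not in_thread:
            continue
        match = STATE_LINE.match(line)
        if match:
            state = match.group("state")
            continue
        match = FRAME_LINE.match(line)
        if match and state is not None:
            # Record only the first (top) frame, then skip the rest of this thread.
            summary[(state, match.group("frame"))] += 1
            in_thread = False
    return summary
```

Running this over each dump file and printing `summary.most_common()` should show whether the RUNNABLE pool threads cluster on the same one or two frames across dumps, as described above.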
It sounds like you tried to rule out Java and OS issues. When you mentioned running on a different Linux system, was that the same machine? If this is in testing, would it be feasible to roll back to 0.10 and see whether the behavior is the same? That may help pin down whether the issue is specific to the upgrade target version.
Q : When you mentioned running on a different linux system, was that the same machine?
It was a different machine but with the same OS (just much newer hardware; I didn't think it would make any difference, it was just a desperate try).
Q : If this is in testing, would rollback be feasible to 0.10 and see if the behavior is same?
We have reverted to 0.10.0, and it works fine there. We read the overlord code for this, and it seems that in 0.10.0, if TLS is disabled (which it is for us), these HTTP APIs return running/waiting tasks from the internal work queue instead of from the DB. This behavior changed in 0.11, where the Task object is created from the DB irrespective of the TLS property. (However, we have not checked whether the same problem appears in 0.10 with TLS enabled.)
Also note that while upgrading we did not follow the upgrade order recommended in the Druid docs; we had to upgrade the coordinator first and then the overlord (because of the change around service discovery).
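To put numbers on the difference between the two versions, a rough sketch like the following could time the overlord's task-listing endpoints on each deployment. The endpoint paths are the overlord's task-list APIs; the base URL passed in (host and port) is an assumption to adjust for your cluster, and `time_endpoints` and `http_get` are hypothetical helper names.

```python
import time
import urllib.request

# Overlord task-listing endpoints discussed in this thread.
ENDPOINTS = (
    "/druid/indexer/v1/runningTasks",
    "/druid/indexer/v1/waitingTasks",
    "/druid/indexer/v1/pendingTasks",
    "/druid/indexer/v1/completeTasks",
)


def http_get(url, timeout):
    """Fetch a URL and return the raw response body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def time_endpoints(base_url, fetch=http_get, timeout=30.0):
    """Return {path: (elapsed_seconds, error_or_None)} for each endpoint."""
    results = {}
    for path in ENDPOINTS:
        start = time.monotonic()
        error = None
        try:
            fetch(base_url + path, timeout)
        except Exception as exc:  # a timeout or connection error is itself a data point
            error = exc
        results[path] = (time.monotonic() - start, error)
    return results
```

Calling `time_endpoints("http://<overlord-host>:<port>")` against the 0.10.0 and 0.11.0 overlords in turn would show which endpoints slow down or time out, without relying on a browser or on thread dumps alone.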