Overlord Working Very Slow

Hi,
We started upgrading Druid from 0.10.0 to 0.11.0.

We upgraded the coordinator and overlord first, but the overlord was very slow on the upgraded version. The HTTP APIs for fetching running and completed tasks were very slow, and after some time they seemed to get stuck completely.

The initial thread dump (attached) revealed that the blocked threads were waiting on MySQL connections (we use MySQL for metadata storage) and that all 8 threads in the MySQL connection pool were busy. Those 8 threads are all in the RUNNABLE state but seem to be stuck: we took the thread dump many times, and every time the threads were at one of the 2 attached stack traces.
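For comparing repeated dumps like the ones described above, a small summary of "thread state and top stack frame" can make it easier to see whether the same frames keep showing up. The following is only a sketch using the standard java.lang.management API; the class name ThreadDumpSummary is made up for illustration, and it inspects the JVM it runs in (for a separate overlord process, jstack output would have to be collected and compared instead). Note that the JVM reports a thread executing a native method as RUNNABLE even when that native call is blocking, so the sketch also records the in-native flag.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Druid.
final class ThreadDumpSummary {

    // Counts threads grouped by "STATE [native] @ top stack frame".
    static Map<String, Integer> summarize(ThreadInfo[] infos) {
        Map<String, Integer> counts = new TreeMap<>();
        for (ThreadInfo info : infos) {
            if (info == null) {
                continue; // thread terminated while the dump was taken
            }
            StackTraceElement[] stack = info.getStackTrace();
            String top = stack.length > 0 ? stack[0].toString() : "<no frames>";
            String key = info.getThreadState()
                    + (info.isInNative() ? " [native]" : "")
                    + " @ " + top;
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    // Takes a dump of the current JVM (without lock details) and summarizes it.
    static Map<String, Integer> summarizeCurrentJvm() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        return summarize(mx.dumpAllThreads(false, false));
    }
}
```

If the same "RUNNABLE [native] @ ..." entry holds the same count across several dumps taken a few seconds apart, that suggests the threads are not making progress rather than just being busy.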

It seems the overlord reads the task payload from the DB and creates the task instance via reflection, which in turn triggers compilation of the JavaScript functions as well. We are not sure whether the issue is with the classloader or with the Rhino library (used to compile JavaScript functions). We are using Java 1.8.0_131; we tried upgrading Java to 1.8.0_191, but it didn’t help. We also tried running the overlord on a different Linux system, but the behaviour is the same. We are not sure whether this is a Druid issue or a Java/OS issue, and we couldn’t find much on the internet so far.

Can someone help us figure out the issue?

Java version (1.8.0_131 and 1.8.0_191)

OS: Linux #host 4.12.0-1-amd64 #1 SMP Debian 4.12.6-1 (2017-08-12) x86_64 GNU/Linux

I am attaching the thread dump logs where it got stuck.

Also attaching the overlord config and common config

common config (1.26 KB)

overlord config (1.74 KB)

overlord_thread_dump_1 (12.4 KB)

overlord_thread_dump_2 (13.1 KB)

Hi Sonesh,
It sounds like you have tried to rule out Java and OS issues. When you mentioned running on a different Linux system, was that the same machine? If this is in testing, would it be feasible to roll back to 0.10 and see whether the behavior is the same? That might help pin down whether the issue is specific to the upgrade target version.

Regards,

Robert

Hi Robert,

Q : When you mentioned running on a different linux system, was that the same machine?

It was a different machine with the same OS (just much newer hardware; I didn’t think it would make any difference, it was just a desperate try).

Q : If this is in testing, would rollback be feasible to 0.10 and see if the behavior is same?

We have reverted to 0.10.0, and it works fine there. We read the overlord code for this, and it seems that in 0.10.0, if you have TLS disabled (which we do), these HTTP APIs return running/waiting tasks from the internal work queue instead of from the DB. This behavior changed in 0.11, where the Task object is created from the DB irrespective of the TLS property. (However, we have not checked whether the same problem appears in 0.10 with TLS enabled.)

We are not sure how much time JavaScript compilation is supposed to take, or, even if it is time-consuming, why the threads are stuck in a native method.

Also note that we did not follow the upgrade order recommended in the Druid docs: we had to upgrade the coordinator first and then the overlord (this was due to the change around service discovery).

Update: v0.12.0 seems to defer JavaScript compilation until it is actually needed. This might solve the issue for the overlord, because most probably it will never need to run those JS functions. But since 0.12 cannot be rolled back to a version lower than 0.11, we still need this 0.11 upgrade.
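To illustrate what deferring compilation until necessary means in general, here is a minimal sketch of the lazy-initialization pattern in plain Java. It is not Druid's actual code: LazyCompiled, JsFunctionHolder and fakeCompile are hypothetical names, and fakeCompile merely stands in for an expensive step such as compiling a JavaScript function. The point is that constructing the holder (as happens when a task payload is deserialized) costs nothing, and the expensive step runs at most once, only if the function is actually applied.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical thread-safe memoizing supplier (double-checked locking).
final class LazyCompiled<T> implements Supplier<T> {
    private final Supplier<T> compiler;
    private volatile T value;

    LazyCompiled(Supplier<T> compiler) {
        this.compiler = compiler;
    }

    @Override
    public T get() {
        T result = value;
        if (result == null) {
            synchronized (this) {
                result = value;
                if (result == null) {
                    result = compiler.get(); // expensive step runs here, once
                    value = result;
                }
            }
        }
        return result;
    }
}

// Hypothetical holder for a JavaScript function's source text.
final class JsFunctionHolder {
    // Counts how often the (fake) compile step ran, for demonstration only.
    static final AtomicInteger COMPILE_COUNT = new AtomicInteger();

    private final String source;
    private final LazyCompiled<Function<String, String>> compiled;

    JsFunctionHolder(String source) {
        this.source = source;
        // Nothing is compiled in the constructor; only the recipe is stored.
        this.compiled = new LazyCompiled<>(() -> fakeCompile(source));
    }

    // Stand-in for real compilation; an assumption for this sketch.
    private static Function<String, String> fakeCompile(String src) {
        COMPILE_COUNT.incrementAndGet();
        return input -> src + "(" + input + ")";
    }

    String getSource() {
        return source; // reading metadata never triggers compilation
    }

    String apply(String input) {
        return compiled.get().apply(input); // first call triggers compilation
    }
}
```

With this shape, a process that only lists tasks would call the constructor and getSource() but never apply(), so the compile step would never run there.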