Overlord / Kafka Indexing Service not creating tasks in druid 0.9.2

Hi there,

Today we tried to do a rolling upgrade of our Druid cluster from 0.9.1.1 to 0.9.2.

We started with the historical nodes and everything ran just fine.

We continued with the overlord nodes. We didn’t stop the running tasks, since we thought that they would continue running and the overlord would know about them thanks to the metadata info. When we started the new overlord, all tasks got killed (we assume?) and they would never appear again. We did a shutdown of all the supervisors and even upgraded the middleManager nodes. No tasks whatsoever.

We deleted everything in the druid_tasks and druid_tasklocks tables, and still nothing.

Checking the overlord logs we cannot see anything. Maybe this is helpful:

2016-12-07 15:16:44,336 INFO (Logger.java:69): Created worker pool with [1] threads for dataSource [topic_name]
2016-12-07 15:16:44,337 INFO (Logger.java:69): Created taskClient with dataSource[topic_name] chatThreads[1] httpTimeout[PT10S] chatRetries[8]
2016-12-07 15:16:44,353 INFO (Logger.java:69): Started KafkaSupervisor[topic_name], first run in [PT5S], with spec: [KafkaSupervisorSpec{…}]

We tried changing chatRetries inside the tuningConfig of the supervisor spec. We have no idea what’s going on. We can also see some “Connection reset by peer” errors, but nothing more.

Our current (and working) setup runs the historical and middleManager nodes on 0.9.2 and everything else on 0.9.1.1.

Any hint on this?

Thank you.

Hey Fede,

You might be hitting this bug: https://github.com/druid-io/druid/pull/3760

To work around it, try setting ‘workerThreads’ in the Kafka supervisor tuning config to a value greater than the number of tasks that are normally being monitored by that supervisor. From the snippet you posted, it looks like you have taskCount=1 and replicas=1, so set workerThreads to 2.

If that doesn’t fix your issue, could you post the full overlord logs?
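As a rough illustration of the workaround above, here is a minimal Python sketch that picks a workerThreads value one above the number of monitored tasks. It assumes that number is taskCount * replicas (read from ioConfig, defaulting each to 1), which matches the taskCount=1, replicas=1 case mentioned above. The helper name with_worker_threads and the trimmed-down spec are made up for the example; a real supervisor spec also needs a dataSchema and the rest of the ioConfig.

```python
import json


def with_worker_threads(spec, headroom=1):
    """Return a copy of a Kafka supervisor spec with tuningConfig.workerThreads
    set above the number of tasks the supervisor is assumed to monitor.

    Assumption: monitored tasks = ioConfig.taskCount * ioConfig.replicas.
    """
    io_config = spec.get("ioConfig", {})
    task_count = io_config.get("taskCount", 1)
    replicas = io_config.get("replicas", 1)
    monitored = task_count * replicas

    patched = json.loads(json.dumps(spec))  # simple deep copy of plain JSON data
    tuning = patched.setdefault("tuningConfig", {"type": "kafka"})
    tuning["workerThreads"] = monitored + headroom
    return patched


# Hypothetical, incomplete spec fragment used only to show the arithmetic.
spec = {
    "type": "kafka",
    "ioConfig": {"topic": "topic_name", "taskCount": 1, "replicas": 1},
    "tuningConfig": {"type": "kafka"},
}

print(json.dumps(with_worker_threads(spec)["tuningConfig"], indent=2))
```

The original spec is left untouched, so the patched copy can be compared against it before resubmitting the supervisor.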

Hi David,

We tried that workaround but it didn’t work… I’ve attached the log as you asked.

Thank you,

overlord.log (1.76 MB)

Hey Fede,

Thanks for the logs. Could you post a jstack dump from the overlord?

Hi David, could you please explain how I can get this dump? Thank you,

Hey Fede,

When the overlord appears to be stuck and isn’t creating any new tasks, run ‘jstack -l {overlord pid}’ from the terminal. This prints a thread dump to the terminal (redirect it to a file so you can attach it), which will show if there are any deadlocks or threads waiting to make progress.

For more info: http://docs.oracle.com/javase/7/docs/technotes/tools/share/jstack.html
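If it helps, the capture step can be scripted. The following is only a sketch in Python: it assumes jstack is on the PATH and is run as the same user as the overlord process, and the helper names capture_thread_dump and summarize_dump are invented for this example. jstack writes the dump to its standard output, so the script saves that output to a file and then counts thread states and looks for the "Java-level deadlock" marker that jstack prints when it detects one.

```python
import re
import subprocess
from collections import Counter

# Matches lines such as "   java.lang.Thread.State: BLOCKED (on object monitor)".
STATE_RE = re.compile(r"java\.lang\.Thread\.State: (\w+)")


def capture_thread_dump(pid, out_path):
    """Run `jstack -l <pid>` and save its standard output to out_path."""
    result = subprocess.run(
        ["jstack", "-l", str(pid)],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        check=True,  # raise if jstack exits with an error
    )
    with open(out_path, "w") as f:
        f.write(result.stdout)
    return result.stdout


def summarize_dump(dump_text):
    """Return (Counter of thread states, whether a deadlock was reported)."""
    states = Counter(STATE_RE.findall(dump_text))
    has_deadlock = "Java-level deadlock" in dump_text
    return states, has_deadlock


# Example usage (the pid is a placeholder, not a real value):
# dump = capture_thread_dump(12345, "jstack_overlord.log")
# print(summarize_dump(dump))
```

A summary like this is only a first look; the full dump is still what is worth attaching.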

We just tried again today and it failed the same way.

I attach both the overlord log and the jstack.

By the way, we tried to compile the “fixed” kafka-indexing-service (0.9.3-SNAPSHOT version) as you suggested in another post, but we got a ClassNotFoundException…

Thank you.

jstack_overlord.log (151 KB)

overlord.log (840 KB)

Hey Fede,

I took a look at the stack dump and the logs and unfortunately I can’t see anything immediately wrong with either (you’re not hitting the bug from https://github.com/druid-io/druid/pull/3760). Can you try submitting a very simple test supervisor spec for a new dataSource and pointing it to a new Kafka topic and see if that works?

Hi,

Just a quick note that we hit the same bug, so we weren’t able to upgrade to 0.9.2 and will stick with 0.9.1.1 until the issue is solved.

Thanks.

Julien

Hey Julien,

Are you hitting the bug patched in https://github.com/druid-io/druid/pull/3760 or having similar issues to Fede? In Fede’s case, the last log entries from the supervisor were:

2016-12-16 09:00:29,442 DEBUG(Logger.java:55): Found [1] Kafka partitions for topic [dataSource_1]
2016-12-16 09:00:29,442 INFO (Logger.java:69): New partition [0] discovered for topic [dataSource_1], added to task group [0]
2016-12-16 09:00:29,447 DEBUG(Logger.java:55): Found [0] Kafka indexing tasks for dataSource [dataSource_1]

In other words, it didn’t actually create any indexing tasks for a reason we haven’t been able to figure out yet.
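For anyone wanting to check their own overlord log for the same pattern, here is a small Python sketch. It relies only on the two message formats quoted above and assumes the Kafka topic and the dataSource share a name, as they do in those lines. The function name stalled_datasources is made up for the example. A single "Found [0]" right after startup can be normal, so the sketch keeps only the latest count per name.

```python
import re

PARTITIONS_RE = re.compile(
    r"Found \[(\d+)\] Kafka partitions for topic \[([^\]]+)\]"
)
TASKS_RE = re.compile(
    r"Found \[(\d+)\] Kafka indexing tasks for dataSource \[([^\]]+)\]"
)


def stalled_datasources(log_lines):
    """Return names whose latest log entries show partitions but zero tasks.

    Assumption: topic name == dataSource name, as in the quoted log lines.
    """
    partitions = {}
    tasks = {}
    for line in log_lines:
        match = PARTITIONS_RE.search(line)
        if match:
            partitions[match.group(2)] = int(match.group(1))
            continue
        match = TASKS_RE.search(line)
        if match:
            tasks[match.group(2)] = int(match.group(1))
    return sorted(
        name
        for name, count in partitions.items()
        if count > 0 and tasks.get(name) == 0
    )


# Example usage with a log file path of your choosing:
# with open("overlord.log") as f:
#     print(stalled_datasources(f))
```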

Hi David, we tried to create a new task with a new Kafka topic and the same thing happened…

I’ll upload the logs, but you can expect the same behaviour.

Hey Fede,

I’m almost certain you’re hitting this issue: https://github.com/druid-io/druid/issues/3795 as I was able to reproduce the behavior you are seeing when I downgraded to Java 7.

Are you able to do either of the following?

a) Update your runtime to Java 8
b) Build Druid from master, which includes the fix here: https://github.com/druid-io/druid/pull/3796

Hi David!

We’ll upgrade to Java 8 today and see if it works. I’ll keep you posted.

Thank you!