Kafka supervisors frequently getting into UNHEALTHY state

Hello druid-team,

We are using Kafka for real-time streaming ingestion into our data sources. One thing we have observed is that the Kafka indexing tasks sometimes go idle: no new Kafka indexing tasks are created, and the existing task is not terminated.

When we check the status of the corresponding Kafka supervisor, it is in the UNHEALTHY_SUPERVISOR state and lists both an actively running task and a publishing task. The active task’s remainingSeconds field is 0, but the task does not close by itself. The publishing task was created two days ago, its remainingSeconds field is also 0, and it is still listed under publishing tasks.
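For anyone hitting the same symptom, here is a minimal sketch of how the overdue tasks could be picked out of the supervisor status report. It assumes the Overlord's `GET /druid/indexer/v1/supervisor/<supervisorId>/status` endpoint and the `activeTasks`, `publishingTasks` and `remainingSeconds` fields of its payload. The function names are hypothetical, and the Overlord URL and supervisor ID in the usage note are placeholders.

```python
import json
import urllib.parse
import urllib.request


def fetch_supervisor_status(overlord_url, supervisor_id):
    """Fetch the supervisor status report from the Overlord (needs a live cluster)."""
    url = "{}/druid/indexer/v1/supervisor/{}/status".format(
        overlord_url.rstrip("/"), urllib.parse.quote(supervisor_id, safe="")
    )
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def overdue_tasks(status):
    """Return (group, task id) pairs whose remainingSeconds is 0 or less."""
    payload = status.get("payload", {})
    overdue = []
    for group in ("activeTasks", "publishingTasks"):
        for task in payload.get(group, []):
            remaining = task.get("remainingSeconds")
            # A missing field is skipped rather than treated as overdue.
            if remaining is not None and remaining <= 0:
                overdue.append((group, task.get("id")))
    return overdue
```

Something like `overdue_tasks(fetch_supervisor_status("http://overlord:8090", "my-supervisor"))` would then list the tasks that have run past their time, which makes it easy to poll for the stuck state instead of noticing it days later.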

Thinking that the old publishing task was preventing the active task from completing, I tried to forcibly kill the publishing task, but I got a "Task does not exist" error.
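A sketch of checking whether the Overlord still knows a task before asking it to shut that task down. It assumes the Overlord task endpoints `GET /druid/indexer/v1/task/<taskId>/status` and `POST /druid/indexer/v1/task/<taskId>/shutdown`, and it assumes that an unknown task ID is answered with HTTP 404. The function names are hypothetical.

```python
import urllib.error
import urllib.parse
import urllib.request


def task_url(overlord_url, task_id, action):
    """Build an Overlord task URL; action is 'status' or 'shutdown'."""
    return "{}/druid/indexer/v1/task/{}/{}".format(
        overlord_url.rstrip("/"), urllib.parse.quote(task_id, safe=""), action
    )


def shutdown_request(overlord_url, task_id):
    """Prepare (but do not send) the POST request that asks for a task shutdown."""
    return urllib.request.Request(
        task_url(overlord_url, task_id, "shutdown"), method="POST"
    )


def shutdown_if_known(overlord_url, task_id):
    """Send the shutdown only if the status lookup succeeds (needs a live cluster)."""
    try:
        urllib.request.urlopen(
            task_url(overlord_url, task_id, "status"), timeout=10
        ).close()
    except urllib.error.HTTPError as err:
        if err.code == 404:
            # Assumption: 404 means the Overlord has no record of this task.
            return False
        raise
    urllib.request.urlopen(shutdown_request(overlord_url, task_id), timeout=10).close()
    return True
```

If `shutdown_if_known` returns `False` for a task that the supervisor still lists as publishing, the supervisor's view and the Overlord's task records have diverged, which would be consistent with the error described above.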

When I checked the Overlord logs, exceptions were being thrown frequently for that supervisor, saying that some data is missing.

After restarting the supervisor, however, the tasks return to a normal state.

This happens once in a while, and we have not been able to find the cause.

Could someone please explain why we run into this problem from time to time?

Regards,
Roopini

How do the individual task logs look? Maybe the tasks are in the handoff / push state for a long time for some reason (maybe unable to reach Deep Storage??) which makes them pause for a long while… That is where I would start anyway :slight_smile:

There are no specific error logs in the active running task. Even after its task period of 1 hour was over, it kept consuming Kafka data with advancing offsets and ran for more than a day without completing by itself.

We are using the file system for deep storage.

What is the sequence of things that happens when a task is in the publishing state?

What could be the reasons that tasks might go into pause mode?

Does the task status change to success before publishing has finished, or does the MiddleManager wait until the task has published its segments and only then change its status?

When you say “task period”, do you mean taskDuration in the ingestion spec?

Yes, Peter. It is "taskDuration": "PT3600S" in the ioConfig.
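For reference, `PT3600S` is an ISO 8601 duration of 3600 seconds, i.e. one hour. Below is a small sketch of that arithmetic, useful for comparing a task's actual running time against its configured duration. It handles only the hour/minute/second form (no days, fractions, or other ISO 8601 variants), and both helper names are made up for illustration.

```python
import re

# Time-only ISO 8601 durations: PT<hours>H<minutes>M<seconds>S, each part optional.
_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")


def duration_seconds(iso):
    """Convert a duration such as PT3600S or PT1H30M to seconds."""
    match = _DURATION.match(iso)
    if not match or not any(match.groups()):
        raise ValueError("unsupported duration: {!r}".format(iso))
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def overran(task_duration, running_seconds):
    """True if a task has been running longer than its configured taskDuration."""
    return running_seconds > duration_seconds(task_duration)
```

With the values from this thread, `overran("PT3600S", 24 * 3600)` is true: a task that has been reading for a day is far past its one-hour duration.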

Hey Roopini - sorry for the late reply, I have been (trying to be!!!) on holiday this last week.

Did you find the root cause?