Middle Manager Node Failure

Hi,
In case of a Middle Manager node failure, how is fault tolerance handled for the real-time ingestion tasks (Kafka ingestion) running on that node?

Regards,
Johny Nainwani

Hi,
Could somebody help resolve my query?

Regards,
Johny Nainwani

You can run real-time tasks with a replication factor of 2 or more.
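
For reference, a minimal sketch of where that replication factor lives, assuming a Kafka supervisor spec submitted to the Overlord's /druid/indexer/v1/supervisor endpoint. The topic, broker, datasource names, and Overlord address are placeholders, and the exact field layout can vary slightly between Druid versions:

    import requests

    # Minimal Kafka supervisor spec sketch. "replicas": 2 asks Druid to run two
    # replica tasks per task group, so a single Middle Manager failure does not
    # interrupt ingestion. Names and addresses below are placeholders.
    supervisor_spec = {
        "type": "kafka",
        "dataSchema": {
            "dataSource": "my_datasource",
            # ... timestampSpec, dimensionsSpec, granularitySpec go here ...
        },
        "ioConfig": {
            "topic": "my_topic",
            "consumerProperties": {"bootstrap.servers": "kafka-broker:9092"},
            "replicas": 2,          # two replica tasks for fault tolerance
            "taskCount": 1,
            "taskDuration": "PT1H"
        }
    }

    # Submit (or update) the supervisor on the Overlord.
    resp = requests.post(
        "http://overlord:8090/druid/indexer/v1/supervisor",
        json=supervisor_spec,
    )
    print(resp.status_code, resp.text)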

@Gaurav I believe that even if we haven't set the replication factor of the real-time tasks to 2, a Middle Manager node failure can, at most, cause some delay in consumption but will not result in any data loss, since the Kafka offsets are committed only when the real-time segments are published. Is my understanding correct?

Hi Siva,
I had the same understanding as you describe, but it did not hold in practice.
I tried a Kafka ingestion task without replication. Below are my observations in the case of a Middle Manager node failure.

  1. The supervisor went into an 'unhealthy_tasks' state and never recovered from it (see the status-check sketch after this list).
  2. It resulted in data loss as well.
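
For anyone reproducing this, a hedged sketch of checking the supervisor's health state via the Overlord API; the Overlord address and the supervisor id are placeholders:

    import requests

    # Sketch: poll the supervisor's health state via the Overlord API.
    # Replace "my_datasource" with your supervisor id.
    resp = requests.get(
        "http://overlord:8090/druid/indexer/v1/supervisor/my_datasource/status"
    )
    status = resp.json()
    # "state" is typically RUNNING, UNHEALTHY_TASKS, UNHEALTHY_SUPERVISOR, etc.
    print(status["payload"]["state"])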

@Gaurav, would you please let me know if I am wrong?

Regards,
Johny Nainwani

It depends on the configuration of your supervisor task. By design, a task failure should not cause data loss. If the supervisor is hard reset, the offset is reset according to the policy defined in the spec. The supervisor state is unhealthy if its running tasks error out, and it can recover from an unhealthy state if its tasks succeed at some later time. An offset can also go out of range if Kafka drops messages before they are ingested. One way this happens is if your ingestion tasks are not running for a few days (a paused supervisor, or multiple task failures over a day or two while no data is being ingested): Kafka's retention will drop the older messages, your stored offset will go out of range, and once the supervisor is hard-reset you will resume from a newer offset.
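
To make the offset-reset behaviour concrete, here is a hedged sketch, assuming the standard supervisor reset endpoint on the Overlord and that useEarliestOffset in the ioConfig is the policy the spec applies after a reset; the host and supervisor id are placeholders:

    import requests

    # In the supervisor spec's ioConfig, this flag decides where consumption
    # starts when there is no usable stored offset (first run or after a reset):
    #   "useEarliestOffset": true   -> start from the earliest offset Kafka retains
    #   "useEarliestOffset": false  -> start from the latest offset
    #
    # A hard reset clears the stored offsets so the supervisor re-applies that
    # policy. Use it only when offsets are genuinely out of range, since any
    # messages Kafka has already dropped are lost to Druid.
    resp = requests.post(
        "http://overlord:8090/druid/indexer/v1/supervisor/my_datasource/reset"
    )
    print(resp.status_code, resp.text)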