Indexing stops when a few MiddleManagers become stale

Hi,
I have noticed recently that when my Kafka consumers using tranquility-core consistently fail to send to a handful of indexing tasks, say 18 out of 60, there is almost complete data loss (out of 25 million events per minute, fewer than 10K are being pushed). The 18 failing tasks reside on two middleManagers, and when I removed those middleManagers the data flow returned to normal. If I had set replication to 2, I would have had another set of replica tasks that could potentially have saved me from this data loss.
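(For reference, by replication I mean the replicants value in Tranquility's ClusteredBeamTuning. A rough sketch of what I would have set, assuming the ClusteredBeamTuning constructor shown in the Tranquility README; the granularity, window, and partition values below are illustrative placeholders, not my actual settings:)

```scala
import com.metamx.common.Granularity
import com.metamx.tranquility.beam.ClusteredBeamTuning
import org.joda.time.Period

// Illustrative values only. replicants = 2 asks Tranquility to run two
// replica indexing tasks per partition, so losing the tasks on one
// middleManager would not lose that partition's data.
val tuning = ClusteredBeamTuning(
  segmentGranularity = Granularity.HOUR,
  windowPeriod = new Period("PT10M"),
  partitions = 3,
  replicants = 2
)
```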

Is this the intended behaviour?

Is there a configuration I am missing that could help me circumvent this to an extent?

Or is this a possible improvement: could Tranquility mark down these nodes and stop sending data to them entirely?

I would like to hear your thoughts on this.

I am running on Druid 0.8.2

Tranquility 0.7.2

Replication: 1

Thanks,

Ram

Hey Ram,

Tranquility will retry sending events to the indexing tasks if it doesn't receive a successful response; one possibility is that, because of the constant retrying, Tranquility's buffers fill up with data that never gets sent out. This would manifest as your call to send() blocking if tranquility.blockOnFull=true (the default), or as a BufferFullException if it is false. Some configuration options you can try (from https://github.com/druid-io/tranquility/blob/master/docs/configuration.md) include druidBeam.firehoseRetryPeriod, tranquility.maxBatchSize, and tranquility.maxPendingBatches, but these settings are more useful for handling burst loads and transient failures than consistent middleManager failures. Do you know why 18 of your tasks are failing?
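To make that concrete, here is a rough sketch of how those knobs map onto tranquility-core, assuming the Tranquilizer.Builder API described in that configuration doc (method availability may differ in 0.7.2, and the values below are illustrative, not recommendations):

```scala
import com.metamx.tranquility.beam.Beam
import com.metamx.tranquility.tranquilizer.Tranquilizer

// Sketch only: these builder knobs correspond to tranquility.maxBatchSize,
// tranquility.maxPendingBatches, and tranquility.blockOnFull.
def makeSender[A](beam: Beam[A]): Tranquilizer[A] =
  Tranquilizer.builder()
    .maxBatchSize(2000)     // messages per batch before a flush is attempted
    .maxPendingBatches(5)   // unsent batches to buffer before the sender is "full"
    .blockOnFull(false)     // false => send() fails fast with BufferFullException when full
    .build(beam)
```

With blockOnFull left at its default of true, send() blocks instead of throwing once maxPendingBatches batches are already waiting, which could explain the pipeline stalling when a few tasks are consistently unreachable.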

The tasks were failing because they belonged to two machines that were not accepting any incoming connections. Thank you for pointing me to the blockOnFull configuration; that should help in this situation in the future.