Need help with a large cluster

Hello Druid

We are running a fairly large cluster: about 150 peons with 75 partitions. In general, Druid ingestion works well. But after a couple of days, handoff suddenly slows down and tasks start piling up. This slows ingestion, and eventually we see an exception in Tranquility, “NoBrokerException”, i.e., none of the brokers are discoverable. After this we lose ingestion entirely.

We know that no system in the world is ideal.

My question is:

How can we detect slow handoff, and how can we stop hung tasks, either by shutting them down or by removing that middleManager node? In other words, what best practices do you suggest in this scenario? The overall goal is to keep ingestion from stopping completely after a couple of hours.

Thanks

Bhaskar

Hi, managing such a big cluster needs to be done with a monitoring/alerting system.

You can use this simple metric collector to collect metrics via HTTP and dump them to Kafka, then read them back via a small Druid cluster.

For instance, you can monitor the value of ‘ingest/handoff/count’ and set up alerts based on your system's thresholds.
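
A rough sketch of such an alert check, in Python. It assumes the collected metric events are JSON objects with "timestamp", "metric" and "value" fields, which is my understanding of what Druid's emitters send. The function names, the two-hour window and the threshold are made up for illustration; pick a window longer than your segment granularity plus window period, since no handoffs are expected in between.

```python
from datetime import datetime, timedelta

# Assumed timestamp layout of emitted events, e.g. "2016-08-01T11:30:00.000Z".
TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def handoffs_since(events, now, window):
    """Sum ingest/handoff/count values for events newer than now - window."""
    cutoff = now - window
    total = 0
    for event in events:
        if event.get("metric") != "ingest/handoff/count":
            continue  # ignore every other metric in the stream
        if datetime.strptime(event["timestamp"], TS_FORMAT) >= cutoff:
            total += event["value"]
    return total


def handoff_stalled(events, now, window=timedelta(hours=2), min_handoffs=1):
    """True when fewer than min_handoffs handoffs happened within the window."""
    return handoffs_since(events, now, window) < min_handoffs
```

You would feed this from whatever store the collector writes to and raise an alert when `handoff_stalled(...)` returns True.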

Hey Vijaya,

In addition to Slim’s point about setting up a metric collector, another good thing to do is monitor the size and capacity of historical nodes, since the most common reason for handoff not working is that the historical nodes are full. You can also monitor the number of unavailable segments, since that includes back-logged handoffs. The new metrics segment/unavailable/count and segment/underReplicated/count coming in Druid 0.9.2 will help with this, but you can also do it with the coordinator “/druid/coordinator/v1/loadstatus?simple” API today.
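
Here is a minimal sketch of polling that coordinator API from Python. It assumes the "simple" form returns a JSON map from datasource name to the number of segments still waiting to be loaded. The coordinator URL, function names and threshold are placeholders.

```python
import json
from urllib.request import urlopen


def fetch_load_status(coordinator_url):
    """GET the simple load status from the coordinator and parse the JSON body."""
    url = coordinator_url.rstrip("/") + "/druid/coordinator/v1/loadstatus?simple"
    with urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))


def backlogged_datasources(load_status, max_pending=0):
    """Keep only the datasources with more than max_pending segments left to load."""
    return {ds: n for ds, n in load_status.items() if n > max_pending}
```

For example, `backlogged_datasources(fetch_load_status("http://coordinator-host:8081"))` run every few minutes gives you something to alert on if a datasource stays in the result for longer than a normal handoff takes. The host and port here are hypothetical.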

You can also set handoffConditionTimeout in your tuning config (see http://druid.io/docs/latest/ingestion/stream-pull.html). Generally this isn’t necessary if you are monitoring historical nodes – but it is an option if you need another layer of protection.
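
As an illustration, here is a small Python snippet that builds the relevant tuning config fragment. As far as I know, handoffConditionTimeout is given in milliseconds and 0 (the default) means wait forever; please check the linked docs for your version. The helper name and the 15 minute value are only examples, not recommendations.

```python
import json


def realtime_tuning_config(handoff_timeout_minutes):
    """Build a tuningConfig fragment with the handoff timeout in milliseconds."""
    return {
        "type": "realtime",
        # Assumption: the unit is milliseconds and 0 means wait forever.
        "handoffConditionTimeout": handoff_timeout_minutes * 60 * 1000,
    }


# Print the fragment so it can be merged into the rest of the tuning config.
print(json.dumps(realtime_tuning_config(15), indent=2))
```

The output would be merged with your other tuning settings rather than used on its own.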

You could also switch gears a bit and try the new Kafka indexing service rather than Tranquility. This is a new experimental feature in Druid 0.9.1.1 but it is designed to ultimately be more robust than the previously existing realtime ingestion methods. In particular, even if handoff is stuck or if tasks fail, it can always resume from Kafka and avoid losing data. We have a tutorial here if you want to try that out: http://imply.io/docs/latest/tutorial-kafka-indexing-service.html, as well as a blog post explaining how it works: http://imply.io/post/2016/07/05/exactly-once-streaming-ingestion.html.