Hi,
I am trying to cut the cost of running Druid at my company. We run a 150-node Druid cluster, with 80 of those nodes dedicated to middle managers. Each middle manager handles 5 tasks.
I am wondering whether any type of indexing task supports checkpointing, or can otherwise resume after its host restarts. Indexing tasks that can resume would let us use AWS spot instances, and would also let us test the resilience of our infrastructure more often, since losing an instance would become a more common occurrence.
The data comes from Kafka, is transformed by Spark Streaming, and is then sent to Druid using the Tranquility framework. We are open to changing this flow if there is a way to make the middle manager tasks restart when the host restarts.
Let me know if such a feature exists, or if you have an idea of how to implement it.
Thank you!