Druid questions on MiddleManager configuration and segment handoff

Hi,

We are using Druid 0.8.1, and we need your assistance with the following questions:

  1. We are using the Indexing Service to index data at a segment granularity of 1 hour. We allocated a worker capacity of ‘X’ and went to production. Over time, due to the increase in volume and the corresponding increase in cardinality, we want to reduce this worker capacity so that we can increase the JVM parameters, especially the memory allocated to each peon task. But in order to do that, it seems we have to stop the MiddleManager, make these changes, and then restart it. Is there a way to auto-reload the config changes on the MiddleManager so that, without restarting it, the changes take effect from the next segment generation?

  2. We use Hadoop as deep storage. Once, an issue in our Hadoop cluster caused all of our tasks to exit without segment handoff, and we lost one hour of data since our segment granularity is 1 hour. Is there a way to handle this case so that we don’t lose data, for example by having a secondary storage medium, so that segments can be persisted to secondary storage when the primary has an issue and automatically moved back to the primary, via a periodic check, once it is available again?

  3. We are using the equalDistribution strategy on the Overlord to spawn tasks on MiddleManager nodes, and we generate segments at an hourly granularity. The observed behavior is that the previous hour’s segments stick around on the MiddleManager nodes for a few minutes until they are handed off to deep storage, and during that time we have two sets of tasks running on each node (one for the previous hour and one for the current hour).

In case we want to add a new MiddleManager node, it is assigned the same number of tasks as the existing nodes. The problem is that the old nodes hand off 50% of their assigned tasks within a few minutes, since those tasks belong to the previous hour, whereas the new node holds only tasks for the current hour, so it always handles 50% more tasks than the existing nodes every alternate hour. The expected behavior is that new nodes should be allocated a number of tasks equal to the number of new tasks (tasks for the current hour) allocated on the old nodes; if we did that, the distribution would be almost equal at any point in time. Do we have any solution for this problem at this point?

Thanks,

Sithik

Hi,
Answers Inline

Hi,

We are using Druid 0.8.1, and we need your assistance with the following questions:

  1. We are using the Indexing Service to index data at a segment granularity of 1 hour. We allocated a worker capacity of ‘X’ and went to production. Over time, due to the increase in volume and the corresponding increase in cardinality, we want to reduce this worker capacity so that we can increase the JVM parameters, especially the memory allocated to each peon task. But in order to do that, it seems we have to stop the MiddleManager, make these changes, and then restart it. Is there a way to auto-reload the config changes on the MiddleManager so that, without restarting it, the changes take effect from the next segment generation?

You need to do this with a rolling restart of the MiddleManagers. Refer to the instructions for gracefully restarting MiddleManagers here - http://druid.io/docs/latest/operations/rolling-updates.html
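For reference, these are the MiddleManager runtime.properties settings involved here, with purely illustrative values (druid.worker.capacity controls the number of peon task slots, druid.indexer.runner.javaOpts the JVM options passed to each peon); both are read at MiddleManager startup, which is why a restart is needed today:

# middleManager runtime.properties (illustrative values)
# number of peon task slots on this node
druid.worker.capacity=6
# JVM options for each spawned peon
druid.indexer.runner.javaOpts=-server -Xmx3g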

  2. We use Hadoop as deep storage. Once, an issue in our Hadoop cluster caused all of our tasks to exit without segment handoff, and we lost one hour of data since our segment granularity is 1 hour. Is there a way to handle this case so that we don’t lose data, for example by having a secondary storage medium, so that segments can be persisted to secondary storage when the primary has an issue and automatically moved back to the primary, via a periodic check, once it is available again?

There is no such support. The idea behind the current design is that deep storage itself should be configured for HA. For transient failures, however, Druid tasks are expected to retry a configurable number of times before failing.
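To illustrate what “deep storage configured for HA” typically looks like with HDFS, the deep storage properties would point at an HA nameservice rather than a single NameNode host; the nameservice name and path below are hypothetical:

# common.runtime.properties (illustrative)
druid.storage.type=hdfs
# HA nameservice URI, not a single NameNode host
druid.storage.storageDirectory=hdfs://my-nameservice/druid/segments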

  3. We are using the equalDistribution strategy on the Overlord to spawn tasks on MiddleManager nodes, and we generate segments at an hourly granularity. The observed behavior is that the previous hour’s segments stick around on the MiddleManager nodes for a few minutes until they are handed off to deep storage, and during that time we have two sets of tasks running on each node (one for the previous hour and one for the current hour).

In case we want to add a new MiddleManager node, it is assigned the same number of tasks as the existing nodes. The problem is that the old nodes hand off 50% of their assigned tasks within a few minutes, since those tasks belong to the previous hour, whereas the new node holds only tasks for the current hour, so it always handles 50% more tasks than the existing nodes every alternate hour. The expected behavior is that new nodes should be allocated a number of tasks equal to the number of new tasks (tasks for the current hour) allocated on the old nodes; if we did that, the distribution would be almost equal at any point in time. Do we have any solution for this problem at this point?

There seems to be no graceful way to handle this with equalDistribution at present.
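For context, the worker select strategy is part of the Overlord’s dynamic worker configuration; as I recall it is set with something like the following (please verify the endpoint and payload shape against the docs for your version):

POST /druid/indexer/v1/worker
{
  "selectStrategy": {
    "type": "equalDistribution"
  }
}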

Thanks Nishant for the reply.

  1. Yes, we know we can do a rolling restart, but I would prefer to reserve that for upgrading the Druid package itself from one version to another. Moreover, I can’t go with the rolling-restart approach when my MiddleManager capacity is fully used and I can’t take any single node out of traffic due to capacity concerns; otherwise I might end up with pending tasks followed by data loss. A way for the MiddleManager to reload its config before spawning tasks for the next segment generation would avoid any restart, e.g. send a signal to the MiddleManager so that it reloads the config and uses it from the next segment creation cycle.

  2. I agree that in case of a transient issue in deep storage we could succeed by retrying, but in case of planned maintenance of Hadoop we fail miserably. I know that we can use a lambda model to reload all the missing data, but I would prefer the Indexing Service to have some sort of secondary storage configured so that it can use it and move the data to Hadoop when it comes back. If this is a doable solution, please think about it.

  3. Sure. Are there any plans to address this in upcoming versions?

Thanks,

Sithik

Hi Sithik,
Replies Inline.

Thanks Nishant for the reply.

  1. Yes, we know we can do a rolling restart, but I would prefer to reserve that for upgrading the Druid package itself from one version to another. Moreover, I can’t go with the rolling-restart approach when my MiddleManager capacity is fully used and I can’t take any single node out of traffic due to capacity concerns; otherwise I might end up with pending tasks followed by data loss. A way for the MiddleManager to reload its config before spawning tasks for the next segment generation would avoid any restart, e.g. send a signal to the MiddleManager so that it reloads the config and uses it from the next segment creation cycle.

We have introduced restartable tasks, which allow MiddleManagers to be updated one at a time in a rolling fashion when you set druid.indexer.task.restoreTasksOnRestart=true (it’s off by default).

With this feature, realtime indexing tasks restore their state on MiddleManager restart, so they do not fail and no data is lost.
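Concretely, that is a single property in each MiddleManager’s runtime.properties:

# middleManager runtime.properties
# off by default; lets realtime tasks persist and restore state across a graceful restart
druid.indexer.task.restoreTasksOnRestart=true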

  2. I agree that in case of a transient issue in deep storage we could succeed by retrying, but in case of planned maintenance of Hadoop we fail miserably. I know that we can use a lambda model to reload all the missing data, but I would prefer the Indexing Service to have some sort of secondary storage configured so that it can use it and move the data to Hadoop when it comes back. If this is a doable solution, please think about it.

Druid’s architecture is pluggable via extensions. You can write your own custom deep storage extension to achieve a failover scenario like the one above; a rough sketch of the idea follows.
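A minimal sketch of the failover idea, assuming a hypothetical SegmentPusher interface standing in for Druid’s real deep storage pusher extension point (the actual interface and method signatures in 0.8.1 differ; this only illustrates the delegate-with-fallback pattern):

import java.io.File;
import java.io.IOException;

// Hypothetical stand-in for Druid's deep storage pusher extension point;
// the real interface and signatures differ.
interface SegmentPusher {
    void push(File segmentDir, String segmentId) throws IOException;
}

// Tries the primary store (e.g. HDFS) first and falls back to a secondary
// store (e.g. local disk or S3) when the primary is unavailable. A separate
// periodic job would move parked segments from the secondary back to the
// primary once it is reachable again.
class FailoverSegmentPusher implements SegmentPusher {
    private final SegmentPusher primary;
    private final SegmentPusher secondary;

    FailoverSegmentPusher(SegmentPusher primary, SegmentPusher secondary) {
        this.primary = primary;
        this.secondary = secondary;
    }

    @Override
    public void push(File segmentDir, String segmentId) throws IOException {
        try {
            primary.push(segmentDir, segmentId);
        } catch (IOException primaryFailure) {
            // Primary (e.g. HDFS under planned maintenance) is down: park the
            // segment in the secondary store instead of failing the task.
            secondary.push(segmentDir, segmentId);
        }
    }
}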

  3. Sure. Are there any plans to address this in upcoming versions?

No plans right now.

FWIW, Druid has support for a JavaScript worker select strategy that allows you to plug in custom JavaScript code for selecting which worker a task is assigned to. You can probably use that to overcome this.
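As a rough template of that approach, the dynamic worker config accepts a strategy of type "javascript" with the selection logic inlined as a function string; the exact arguments passed to the function and the shape of the worker objects are version-specific, so check the docs rather than treating this as working code:

POST /druid/indexer/v1/worker
{
  "selectStrategy": {
    "type": "javascript",
    "function": "function (config, zkWorkers, task) { /* inspect the available workers and the task here, e.g. prefer the worker with the fewest currently running tasks, and return the chosen worker */ }"
  }
}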

Also, feel free to submit a PR to improve the existing worker select strategies or add a new one with better logic.

Thanks Nishant for the details. Sure, let me see what I can do here.

Thanks again for all the prompt responses.