Druid Segment Propagation - Metadata Storage Down

We have a use case where we want to restart our Metadata Storage in production.

According to the flow chart at http://druid.io/docs/latest/design/realtime.html, there is a write(metadata) step in the ingestion flow.

As part of this restart, do we need to take care of any other Druid nodes as well?

Suppose the metadata storage is down and an indexing node tries to write to it. What happens in that scenario? This could occur in our case if we restart only the metadata storage.

Should we expect any downtime here?


Pravesh Gupta

Writes to the metadata storage are done as part of task actions, which have a configurable retry policy.
So if your metadata storage is temporarily unavailable, the Druid overlord will retry the insert segment operation.

See the configs under ‘druid.peon.taskActionClient.retry’ at http://druid.io/docs/latest/configuration/indexing-service.html

Apart from writes to the metadata storage, segment handoff to historicals will also not work while it is down, so tasks are expected to run longer than usual. This may lead to the middlemanagers reaching full capacity. Once that happens, newly submitted tasks will not start because no free slots are available.

Thanks, Nishant, for the explanation.

So what would you recommend?

Should we bring down the other Druid nodes as well?

Or can we be sure that no segments will be lost and no data loss will occur?


Pravesh Gupta

One more question:

At what time should we restart the metadata store, given that we have a segment granularity of 1 hour and a buffer period of 15 minutes for realtime ingestion?

Would there be any issue if we restart the metadata store while the realtime nodes are handing off segments?

Note that we are using Druid 0.9.2.

Hi Pravesh,
There is no need to bring down any Druid nodes. Existing nodes should keep working across a metadata storage restart.
There will not be any data loss as long as the restart completes within the retry period configured on the middlemanagers.
Depending on how much downtime you need for the metadata storage, you may want to first increase the retry configs on the middlemanagers and then do the restart.
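
To judge whether the retry configs need increasing, it can help to estimate how long the retries would keep going and compare that with the planned downtime. Below is a rough Python sketch, not Druid's actual code. It assumes the three settings under ‘druid.peon.taskActionClient.retry’ are minWait, maxWait and maxRetryCount (please confirm against the indexing-service configuration page for your version), and it assumes a simple backoff that doubles from minWait and is capped at maxWait, ignoring any jitter Druid may add. The values in the example are illustrative only and are not claimed to be the 0.9.2 defaults.

```python
from datetime import timedelta


def estimate_retry_window(min_wait, max_wait, max_retry_count):
    """Approximate total time spent waiting across all retries.

    Assumption (not taken from Druid source): the delay starts at
    min_wait, doubles after each attempt, and never exceeds max_wait.
    """
    total = timedelta(0)
    delay = min_wait
    for _ in range(max_retry_count):
        total += min(delay, max_wait)
        # Cap the delay here so it cannot grow without bound.
        delay = min(delay * 2, max_wait)
    return total


def covers_downtime(window, planned_downtime):
    """True if the estimated retry window outlasts the planned downtime."""
    return window >= planned_downtime


if __name__ == "__main__":
    # Illustrative values only; substitute your own settings.
    window = estimate_retry_window(
        min_wait=timedelta(minutes=1),
        max_wait=timedelta(minutes=10),
        max_retry_count=10,
    )
    print("estimated retry window:", window)
    print("covers a 30 minute restart:",
          covers_downtime(window, timedelta(minutes=30)))
```

If the estimated window is shorter than the expected restart time, that would suggest raising maxRetryCount or maxWait on the middlemanagers before restarting the metadata storage. Treat the number as a lower-confidence estimate and leave some margin.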