[druid-user] Druid on AWS


I found that the Druid reference architecture published by Imply suggests i3 instances for data servers on AWS. i3 instances use local SSDs (instance store), which are wiped when the instance is stopped or terminated.
I wonder: in a large cluster, say 1000+ nodes with a replication factor of 2, the odds of 2 or more servers failing at the same time are pretty high. The cluster can tolerate 1 node failure without producing incomplete results. But with 2 or more node failures, we would lose the whole segment cache on those instances, which would result in incomplete data. Is that right? How do we manage that risk?
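To put rough numbers on this concern, here is a minimal sketch. It assumes each node is down independently with the same probability at any given moment, which real clusters only approximate (failures in the same rack or availability zone are correlated). The function names and the example values (1000 nodes, 0.1% chance of a node being down) are hypothetical, chosen only for illustration.

```python
from math import comb


def prob_at_least_k_down(n_nodes: int, p_down: float, k: int = 2) -> float:
    """P(at least k of n_nodes are down at once), assuming independent failures."""
    # Sum the binomial probabilities of having fewer than k nodes down.
    p_fewer = sum(
        comb(n_nodes, i) * p_down**i * (1 - p_down) ** (n_nodes - i)
        for i in range(k)
    )
    return 1.0 - p_fewer


def prob_segment_unavailable(p_down: float, replicas: int = 2) -> float:
    """P(every replica of one segment is down), replicas on distinct nodes."""
    return p_down**replicas


if __name__ == "__main__":
    n, p = 1000, 0.001  # hypothetical cluster size and per-node downtime odds
    print(f"P(2+ nodes down somewhere) = {prob_at_least_k_down(n, p):.4f}")
    print(f"P(one given segment has no live replica) = {prob_segment_unavailable(p):.2e}")
```

The two numbers answer different questions: the first is how often any two nodes are down together, the second is how often both copies of one particular segment are down together. Under this model the second is far smaller, although with many segments spread over the cluster, some segment is still likely to be affected when two nodes fail.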


No, you would not lose your data. Let's say 2 nodes fail that are currently serving the same segment. In this case, as in the case of just 1 host failure, the Coordinator (I think that is the right process) will detect that the segment has gone offline and will assign it to another active node. When this occurs, the segment is loaded from deep storage (if you are on AWS, that is most likely S3).

In other words, remember that you always have a backup copy of the data sitting in S3.
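The recovery flow described above can be pictured with a toy model. This is only a sketch, not Druid's actual assignment or balancing logic: the function name, the dictionary layout, and the "least loaded node first" choice are all assumptions made for illustration.

```python
def reassign_lost_segments(assignments, deep_storage, failed, replicas=2):
    """Toy model: drop failed nodes, then re-replicate segments from deep storage.

    assignments:  dict mapping node name -> set of segment ids it serves
    deep_storage: set of every segment id (the durable copy, e.g. in S3)
    failed:       set of node names that just went offline
    """
    # Surviving nodes keep whatever they already had cached.
    live = {n: set(segs) for n, segs in assignments.items() if n not in failed}
    if not live:
        raise RuntimeError("no live nodes left to load segments onto")

    for segment in sorted(deep_storage):
        holders = [n for n in live if segment in live[n]]
        missing = max(replicas - len(holders), 0)
        # Hypothetical policy: prefer nodes currently holding the fewest segments.
        candidates = sorted(
            (n for n in live if segment not in live[n]),
            key=lambda n: (len(live[n]), n),
        )
        for node in candidates[:missing]:
            live[node].add(segment)  # stands in for a download from deep storage
    return live


if __name__ == "__main__":
    before = {"a": {"s1"}, "b": {"s1"}, "c": {"s2"}, "d": {"s2"}}
    after = reassign_lost_segments(before, {"s1", "s2"}, failed={"a", "b"})
    print(after)
```

In the example, both holders of `s1` fail at once, yet `s1` comes back on the surviving nodes because the copy in deep storage was never touched. The segment is only missing from query results during the reload.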

So in this case, your data will become unavailable for a brief time, but it will not be lost. You can of course also increase your number of replicas to reduce this temporary data unavailability.
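As far as I know, the replica count is set through the Coordinator's load rules, which are a JSON list per datasource. Below is a minimal sketch that only builds such a rule list; the helper name `build_load_rules` is hypothetical. The rule shape (`loadForever` with `tieredReplicants`) and the `_default_tier` name follow the Druid retention rule documentation as I understand it, so check them against your Druid version before relying on this.

```python
import json


def build_load_rules(replicas: int, tier: str = "_default_tier") -> list:
    """Build a one-rule list: keep all segments loaded with `replicas` copies in `tier`."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return [{"type": "loadForever", "tieredReplicants": {tier: replicas}}]


if __name__ == "__main__":
    # Example: ask for 3 copies of every segment instead of 2.
    print(json.dumps(build_load_rules(3), indent=2))
```

The resulting JSON is what you would enter as the datasource's retention rules in the web console or send to the Coordinator's rules API. Keep in mind that every extra replica costs additional local SSD space on the data servers.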