Failover with Tranquility-created tasks?

Hey guys,

I want to use Tranquility to set up our realtime pipeline, so I ran some experiments to see what Tranquility does in failover scenarios. I noticed that if I kill a task Tranquility created, Tranquility starts dropping every message it receives and never tries to recover the task. After some more digging, it turns out this happens when I kill all replicas of a segment shard. Is this a design decision, or did I miss something? It seems really dangerous if we hit this situation in prod, because it appears to be unrecoverable. Do you have any tips or best practices for running Tranquility in prod?

Thanks

Kurt

Hey Kurt,

Yes, this is a design choice in Tranquility. The main reason is that Druid's realtime system has no built-in replication of data across machines, so Tranquility provides replication itself by sending the same data to every replica task. If all replicas of a shard fail, that data cannot be recovered.
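For context on how those drops surface if you're embedding the Tranquility library: each send() returns a Future, and a message Tranquility gives up on fails with MessageDroppedException rather than being retried. Here's a minimal sketch against Tranquility's Java API (the sender is assumed to be built elsewhere via DruidBeams, and the handler bodies are just illustrative):

    import java.util.Map;

    import com.metamx.tranquility.tranquilizer.MessageDroppedException;
    import com.metamx.tranquility.tranquilizer.Tranquilizer;
    import com.twitter.util.FutureEventListener;
    import scala.runtime.BoxedUnit;

    class SendExample
    {
      static void sendOne(Tranquilizer<Map<String, Object>> sender, Map<String, Object> message)
      {
        // send() is asynchronous; the returned Future fails with
        // MessageDroppedException once Tranquility gives up on the message,
        // e.g. because every replica task for the relevant shard is gone.
        sender.send(message).addEventListener(
            new FutureEventListener<BoxedUnit>()
            {
              @Override
              public void onSuccess(BoxedUnit value)
              {
                // Accepted by the realtime task(s).
              }

              @Override
              public void onFailure(Throwable e)
              {
                if (e instanceof MessageDroppedException) {
                  // Dropped without retry; you could spill to a dead-letter
                  // queue here, or rely on batch re-ingestion to backfill
                  // (strategy 4 in the list below).
                }
              }
            }
        );
      }
    }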

Generally, people approach this with some combination of these strategies:

  1. Avoid killing tasks; instead, simply wait for them to exit (tasks exit automatically after handoff)

  2. Run enough replicas to minimize the risk of losing them all to machine failure (2 is a common choice; see the sketch after this list)

  3. Use a segmentGranularity small enough to bound the amount of data lost if you do lose all replicas (hour is a common choice)

  4. Use batch ingestion to reload any data that was missed in realtime (the lambda architecture)
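To make 2 and 3 concrete, here's a hedged sketch of where those knobs live when building a sender with the Tranquility library (the surrounding DruidBeams setup is omitted, and the specific values are just examples):

    import org.joda.time.Period;

    import com.metamx.common.Granularity;
    import com.metamx.tranquility.beam.ClusteredBeamTuning;

    class TuningExample
    {
      // Strategy 2: replicants(2) runs two replica tasks per partition.
      // Strategy 3: Granularity.HOUR bounds a total replica loss to roughly
      // one hour of data.
      static ClusteredBeamTuning tuning()
      {
        return ClusteredBeamTuning
            .builder()
            .segmentGranularity(Granularity.HOUR)
            .windowPeriod(new Period("PT10M"))
            .partitions(1)
            .replicants(2)
            .build();
      }
    }

You'd pass this to DruidBeams.builder(...).tuning(...) when constructing the Tranquilizer. If you run Tranquility Server instead, the corresponding settings are the task.partitions and task.replicants properties in the server's dataSource config.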

Btw, if you are using Kafka, we are currently working on a new Kafka-based ingestion mechanism that guarantees no data loss (by tracking Kafka offsets within Druid’s metadata store). This guarantee holds even if you kill tasks or if tasks fail. Some relevant PRs there are https://github.com/druid-io/druid/pull/2220, https://github.com/druid-io/druid/pull/2730, https://github.com/druid-io/druid/pull/2844, and https://github.com/druid-io/druid/pull/2656.

Thanks,

Gian