Hi,
I’m working on a real-time event indexing service using Druid. The setup is using Kinesis + Flink + Tranquility to ingest data in real-time. In case of a service outage, such as Flink or Druid is down, when the system recovers, all the backlog in the kinesis stream will become late events. It seems tranquility will just ignore all those events that fall out of the windowPeriod. The windowPeriod is calculated using system clock, which means if in worst case, let’s say, the system is down for a day, then a whole day’s backlog events will be dropped. Is my understanding correct?
What’s the best practice to handle late event caused by temporary system outage or network partition for real-time ingestion?
One workaround I can imagine is to specify a very large windowPeriod to reduce the possibility of data loss, but it will definitely be inefficient. I also notice it’s possible to switch a different timekeeper, has anyone tried to use a timekeeper that return the event time rather than system time? I’m not sure if it will cause problem if the timekeeper returns non-incremental timestamps.
Thanks a lot,
Jiaji