Handling late events due to service outage for real-time indexing

Hi,

I’m working on a real-time event indexing service using Druid. The setup is using Kinesis + Flink + Tranquility to ingest data in real-time. In case of a service outage, such as Flink or Druid is down, when the system recovers, all the backlog in the kinesis stream will become late events. It seems tranquility will just ignore all those events that fall out of the windowPeriod. The windowPeriod is calculated using system clock, which means if in worst case, let’s say, the system is down for a day, then a whole day’s backlog events will be dropped. Is my understanding correct?

What’s the best practice to handle late event caused by temporary system outage or network partition for real-time ingestion?

One workaround I can imagine is to specify a very large windowPeriod to reduce the possibility of data loss, but it will definitely be inefficient. I also notice it’s possible to switch a different timekeeper, has anyone tried to use a timekeeper that return the event time rather than system time? I’m not sure if it will cause problem if the timekeeper returns non-incremental timestamps.

Thanks a lot,

Jiaji

Hi Jiaji,

In general if you want to be able to handle late data with Druid you have a couple options.

  1. Lambda architecture (older style), see here for some info: http://lambda-architecture.net/stories/2014-08-27-kafka-storm-hadoop-druid. It would use Tranquility for the realtime component and Hadoop for the batch component.

  2. Kafka indexing (newer style), see here for the philosophy on that: https://imply.io/post/2016/07/05/exactly-once-streaming-ingestion.

In general I have seen that (2) applies better to a wider variety of applications.

Thanks Gian. The Kafka indexing looks neat. I’m not currently using Kafka indexing because I want to keep a minimal dependency (we already use Kinesis, but not Kafka). I haven’t read the Kafka indexing code, do you think it’s feasible to implement similar service on top of Kinesis?

Hi Jiaji,

It is, and in fact we include that as an extension in the Imply distribution of Druid, which you can get at https://imply.io/get-started (check out the druid-kinesis-indexing-service extension, with documentation at https://docs.imply.io/cloud/manage-data/ingestion-kinesis).