Druid supports streaming real-time data through real-time tasks that are created on the fly as data arrives.
However, consider the case where data that is generated in real time, captured in real time, and intended for real-time ingestion into Druid gets delayed because a sub-system goes down for a while. This sub-system is the one that forwards the real-time data (events) to another system, which uses the tranquility library to send the data to Druid. In the normal scenario, the data flows like:
A (Generates real-time data) -> B (collects and forwards real-time data) -> C (module that uses tranquility and sends data to Druid) -> Druid
Now, if B has trouble sending data to C for a brief time (say 15 minutes) due to some failure but doesn't lose the data, it will send it once it starts fully functioning again. However, by then the timestamps will be older, and such messages will be dropped.
I understand that this is by design: Druid doesn't keep a lot of real-time tasks running, for performance and other reasons, and tasks are stopped regularly after accounting for the interval, windowPeriod, etc. However, the above is a valid use case when dealing with real-time data, since the data we are dealing with isn't batch data (even the data that gets delayed). What is the best way to handle such a scenario?
A trivial approach would be to move the dropped messages to, say, Kafka and perform batch ingestion at regular intervals. However, this results in incorrect aggregates, since the data stored in Druid will not reflect the true aggregates until the batch job runs.
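The fallback described above can be sketched as follows. This is only illustrative: the handler name is hypothetical (tranquility's real API reports how many events were sent rather than calling back per event), and a plain list stands in for the Kafka "late events" topic:

```python
# Minimal sketch of the fallback: events rejected by real-time ingestion are
# parked on a side channel (standing in for a Kafka "late events" topic) and
# re-ingested later by a periodic batch job. Names here are illustrative,
# not tranquility's actual API.
late_events = []  # stands in for a Kafka topic holding dropped events


def handle_send_result(event, ingested):
    """Route an event based on whether real-time ingestion accepted it."""
    if not ingested:
        # Tranquility dropped it (timestamp outside the window period);
        # queue it for the next scheduled batch ingestion run.
        late_events.append(event)


handle_send_result({"ts": "2019-01-04T09:30:00Z", "metric": 1}, ingested=False)
print(len(late_events))  # 1
```

As noted above, aggregates in Druid will be incomplete until the batch job replays this topic.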
Try without tranquility; it works in my case.
But I do want to use tranquility to stream real-time data into Druid.
Prathamesh, Druid will only reject duplicate data with the same timestamp. As long as you have a good retention time in tranquility, the resources to accommodate it, and appropriately configured Druid parameters, the data should get ingested. Of course, the reports meanwhile might not include data from real-time ingestion.
Are you seeing it to be otherwise?
Have you had a chance to look at the Kafka indexing service? http://druid.io/docs/latest/development/extensions-core/kafka-ingestion.html
Currently, windowPeriod is one of the limitations of tranquility-based ingestion.
There is no such limitation with the Kafka indexing service, and you can get exactly-once ingestion with support for handling late-arriving data.
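For reference, a supervisor spec for the Kafka indexing service looks roughly like the sketch below (built as a Python dict for readability and submitted as JSON to the overlord's /druid/indexer/v1/supervisor endpoint). The datasource name, topic, and broker address are placeholder assumptions, not values from this thread:

```python
import json

# Rough sketch of a Kafka indexing service supervisor spec. Unlike
# tranquility, there is no windowPeriod: late-arriving data is ingested by
# default, and lateMessageRejectionPeriod can optionally bound how late.
spec = {
    "type": "kafka",
    "dataSchema": {
        "dataSource": "events",  # placeholder datasource name
        "parser": {
            "type": "string",
            "parseSpec": {
                "format": "json",
                "timestampSpec": {"column": "timestamp", "format": "iso"},
                "dimensionsSpec": {"dimensions": []},
            },
        },
        "granularitySpec": {
            "type": "uniform",
            "segmentGranularity": "HOUR",
            "queryGranularity": "NONE",
        },
        "metricsSpec": [{"type": "count", "name": "count"}],
    },
    "ioConfig": {
        "topic": "processed-events",  # placeholder topic name
        "consumerProperties": {"bootstrap.servers": "localhost:9092"},
        "taskCount": 1,
        "taskDuration": "PT1H",
    },
}

print(json.dumps(spec, indent=2))
```

Submitting this spec (e.g. with an HTTP POST) starts a supervisor that manages the indexing tasks; see the linked Kafka ingestion docs for the full set of fields.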
Currently, using tranquility, one cannot ingest data whose timestamp falls outside the defined window period. That is, if the window period is specified as 15 minutes, then at any given time t, an event with a timestamp within t +/- 15 minutes will be successfully ingested. If the timestamp falls outside this window, the event is dropped. For example, if at 10:15 AM an event is sent that was generated at 9:30 AM, it will be dropped if the window period is 15 minutes.
Of course, if the window period is increased, one can ingest data with a little more delay. However, increasing the window period means keeping the real-time task running for longer, which is undesirable. This is a small limitation of tranquility, as Nishant mentioned.
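The window-period check described above amounts to simple timestamp arithmetic; a minimal sketch using the 9:30 AM / 10:15 AM numbers from this thread:

```python
from datetime import datetime, timedelta


def within_window(event_ts, now, window=timedelta(minutes=15)):
    """True if the event timestamp falls inside now +/- windowPeriod."""
    return abs(now - event_ts) <= window


now = datetime(2019, 1, 4, 10, 15)
print(within_window(datetime(2019, 1, 4, 10, 5), now))  # True: 10 min old
print(within_window(datetime(2019, 1, 4, 9, 30), now))  # False: 45 min old, dropped
```

Widening `window` admits later events but, as noted above, keeps the real-time task alive longer.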
We are actually processing and transforming the events coming onto our Kafka topic with a Flink job before ingesting into Druid using tranquility. I was more inclined to use tranquility, but do you think that sending the processed events to another Kafka topic and ingesting from there would be a better approach?
If you need exactly-once ingestion and delayed-event processing, getting your processed events into a final Kafka topic and ingesting from that would be better.
Tranquility, however, works on a best-effort basis and does not have those ingestion guarantees. See the Guarantees section under https://github.com/druid-io/tranquility/blob/master/docs/overview.md for more details.
I used to run batch ingestion to ingest delayed events.
On Friday, January 4, 2019 at 1:43:34 PM UTC+9, prathamesh wrote: