I'm trying to get data into my Druid cluster that is as reliable as possible, as soon as possible.
- I have an unreliable realtime data source
  - it doesn't have all the necessary info at the time of event creation
  - events can arrive late
- I have a reliable data source with all the necessary data
  - data is available shortly after midnight for the previous day
I'm running realtime nodes that ingest a Kafka topic with a 1 hour window period, and every day at 2 a.m. an index job corrects the realtime data for the previous day.
- In this scenario I can end up with a gap in the realtime data if a realtime node gets stuck for more than 1 hour
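
To make the gap concrete, here is a rough sketch of how I understand the window period check. It is a simplified model, not Druid code: `is_accepted` and `dropped_after_stall` are names I made up for illustration, and I'm assuming an event is kept only if its timestamp is within the window period of the node's current time when the event is processed.

```python
from datetime import datetime, timedelta

# Assumption: same value as the window period on my realtime node.
WINDOW_PERIOD = timedelta(hours=1)


def is_accepted(event_time, now, window=WINDOW_PERIOD):
    """Simplified rule: keep an event only if its timestamp lies
    within `window` of the time at which the node processes it."""
    return now - window <= event_time <= now + window


def dropped_after_stall(events, stall_start, stall_length, window=WINDOW_PERIOD):
    """Events that pile up in Kafka during a stall are only checked
    once the node resumes, so `now` is the resume time for all of them.
    Returns the event timestamps that would be rejected."""
    resume = stall_start + stall_length
    return [t for t in events if not is_accepted(t, resume, window)]


if __name__ == "__main__":
    start = datetime(2016, 1, 1, 10, 0)
    # One event every 15 minutes between 10:00 and 11:45.
    events = [start + timedelta(minutes=15 * i) for i in range(8)]
    # Node is stuck from 10:00 to 12:00 (2 hours, longer than the window).
    lost = dropped_after_stall(events, start, timedelta(hours=2))
    print(len(lost), "events fall outside the window and leave a gap")
```

In this toy model the events from 10:00 to 10:45 are rejected after the 2 hour stall, while the same stall with a 6 hour window rejects nothing.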
Maybe widening the window period to, let's say, 6 hours would solve my problem with the unreliable realtime source, but is there any way to tell Druid to ignore segments created by a realtime node if there are already segments created by the index job?
I also tried the new supervised Kafka realtime ingestion, which has no window period, again with an index job running at 2 a.m. for the previous day.
- In this scenario I can end up with duplicated events when a realtime node gets stuck just before the end of the day: at 2 a.m. the index job finishes successfully, and then the realtime node starts again and ingests the older events
Is there any way to tell the supervised Kafka ingestion not to update segments created by the index job?
I want to run the index job as soon as possible to have 100% reliable results, but at the same time I want as wide a window period as possible on the realtime node, because events can arrive late in my realtime source.
Do you have any suggestions on how to achieve this?