I have a bit of an open-ended question here, apologies for not being more focused.
We currently have a production Druid cluster up and running that gets its data entirely via index-task batch ingestion. Our primary use case is reporting, and the batches to be ingested are generated by a data warehousing team. Their pipeline collects output from a variety of asynchronous back-end processes and creates batches following an “eventually correct” model for the data: it can take up to an hour (worst case) for an event to first appear in one of the batches, and then additional data can decorate the event over time. We just keep reingesting the same batch repeatedly as it becomes more and more correct. The hour of latency before an event first appears is okay for the reporting use case, where we’re looking at week- or month-long ranges and aggregating.
We’re now looking at using Druid for a secondary use case where we wouldn’t be working with aggregations but instead displaying the events in list form, for a drill-down sort of view of the data. The Druid Select query seems great for powering this sort of view; it’s much, much faster than the Postgres datastore that currently powers it, which can’t produce results for our bigger clients in anything close to a responsive manner. However, the hour of latency before an event first appears is not acceptable for this view. We’re looking at ways of speeding things up on the data warehouse/ingestion pipeline side, but ideally we’d like an event to show up within a minute or two, and it would be very hard to get the pipeline up to that speed. It’s okay if some of the additional data flowing in through the batch process takes longer.
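For concreteness, the drill-down view would be powered by something along these lines (the dataSource, dimension name, and values are made up for illustration):

```json
{
  "queryType": "select",
  "dataSource": "calls",
  "intervals": ["2016-03-01/2016-03-08"],
  "granularity": "all",
  "filter": { "type": "selector", "dimension": "client_id", "value": "1234" },
  "pagingSpec": { "pagingIdentifiers": {}, "threshold": 50 }
}
```

i.e., raw events for one client over a date range, paged rather than aggregated.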
It seems like some kind of Lambda architecture / hybrid batch/streaming model might work here. But I’m having a hard time figuring it out. The fundamental issue is that if we stream in an event but then later ingest a batch for that segment which doesn’t contain that streamed event, it will be overwritten and lost. (Though eventually it will show up in the batch and get back into Druid later on.)
I’ve thought about restricting the batch ingestion to avoid overwriting recent events, but it’s problematic. Our ingestion granularity is a day. Also, the events we’re tracking are phone calls, which can last up to 4 hours, and the timestamp is the local start time of the call. So if we get a realtime event seconds after the end of a 4-hour phone call, it’s going to land in a segment 4 hours in the past. Would we have to avoid ingesting batches for that entire segment, i.e. realtime only and no batch ingestion until we’re at least 4 hours past the segment boundary? Or can we restrict the range the batch ingestion operates on using the interval in its task spec? E.g., it’s noon and we’re notified there’s a fresh batch; can we run the ingestion so that it only covers midnight to 8 AM of today’s segment, leaving (streamed) events already in the segment untouched?
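To make that concrete, what I have in mind is a sub-day interval in the index task’s granularitySpec, though I don’t know whether Druid actually honors it this way when segmentGranularity is DAY — that’s part of my question. A heavily abridged sketch (parser, metricsSpec, and ioConfig omitted; names are placeholders):

```json
{
  "type": "index",
  "spec": {
    "dataSchema": {
      "dataSource": "calls",
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "NONE",
        "intervals": ["2016-03-15T00:00:00Z/2016-03-15T08:00:00Z"]
      }
    }
  }
}
```

The hope being that only rows timestamped midnight–8 AM get replaced, rather than the whole day segment.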
Even if so, it’s not that easy, because as I mentioned the timestamp is the local start time, but we format it as UTC for ingestion. So for a call that started at 1 AM local time in a UTC-12 timezone, we’ll use a timestamp of 1 AM UTC, but we won’t actually get a realtime event until 1 PM UTC at the earliest (later still depending on the call length: 5 PM UTC if it’s a four-hour call). Under that scenario we’d have to restrict the batch to timestamps more than 16 hours before the present to avoid erasing any calls. Besides getting really complicated and potentially error-prone, that’s probably too long to wait for the batch updates our main reporting use case needs.
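Put as arithmetic, the safe cutoff works out like this (the 12-hour offset and 4-hour call length are our own worst cases, not anything Druid-specific):

```python
from datetime import datetime, timedelta, timezone

# Worst-case lag between an event's Druid timestamp (local start time,
# written as if it were UTC) and the earliest moment the corresponding
# realtime event can arrive.
MAX_TZ_OFFSET = timedelta(hours=12)   # most-behind timezone we handle (UTC-12)
MAX_CALL_LENGTH = timedelta(hours=4)  # longest possible call

def safe_batch_cutoff(now):
    """Latest timestamp a batch may cover without risking erasure of a
    streamed event that hasn't arrived yet."""
    return now - (MAX_TZ_OFFSET + MAX_CALL_LENGTH)

now = datetime(2016, 3, 15, 12, 0, tzinfo=timezone.utc)
print(safe_batch_cutoff(now))  # 2016-03-14 20:00:00+00:00, i.e. 16 hours back
```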
So, any suggestions? Or anywhere my thinking is wrong? Again, apologies for the open-endedness of this question. I’ll be happy to fill in any more details that may be relevant.