Simple (?) hybrid batch/streaming question

Hi - I’m still working through whether/how a hybrid batch/streaming ingestion setup will help us.

As I understand it, as we ingest realtime events, a realtime node will periodically build segments and hand them off to historical nodes. When a batch ingestion occurs, it will completely replace those segments.
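
To check my mental model of that replacement step: I picture Druid versioning segments and serving only the highest version for an interval, so a later batch run shadows the realtime-built segments. A toy sketch of that idea (plain Python I wrote to think it through, not Druid internals - all the names are mine):

    # Toy model of segment replacement (not Druid internals): each interval can
    # hold segments at multiple versions, and only the highest version is served,
    # so a later batch run shadows the realtime-built segment.
    segments = {
        # interval -> list of (version, source) pairs
        "2016-05-01T05/PT1H": [("2016-05-01T06:05Z", "realtime")],
    }

    def add_segment(interval, version, source):
        segments.setdefault(interval, []).append((version, source))

    def served_segment(interval):
        # Queries see only the highest-versioned segment for the interval.
        return max(segments[interval])

    add_segment("2016-05-01T05/PT1H", "2016-05-02T00:00Z", "batch")
    print(served_segment("2016-05-01T05/PT1H"))  # ('2016-05-02T00:00Z', 'batch')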

Let’s say we have a one hour granularity. The goal of the hybrid setup is to not run a batch ingestion for a particular hour until we know that all of the events for that hour that were ingested by realtime are also in the batch, however the batch ends up being built - is that correct? So typically you’d just wait until the hour is “in the books” before you consider taking a batch for that hour?
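
In other words (a toy sketch of the rule I mean, written by me, nothing from Druid):

    # Only batch-index an hour once it's fully "in the books": the hour has
    # ended and we've allowed some lag for straggling realtime events.
    from datetime import datetime, timedelta, timezone

    ALLOWED_LAG = timedelta(minutes=10)  # made-up slack for late events

    def hour_is_batchable(hour_start: datetime, now: datetime) -> bool:
        hour_end = hour_start + timedelta(hours=1)
        return now >= hour_end + ALLOWED_LAG

    now = datetime(2016, 5, 1, 6, 25, tzinfo=timezone.utc)
    print(hour_is_batchable(datetime(2016, 5, 1, 5, tzinfo=timezone.utc), now))  # True
    print(hour_is_batchable(datetime(2016, 5, 1, 6, tzinfo=timezone.utc), now))  # False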

Say you do that and you’ve completed a batch ingestion for the previous hour. What if a realtime event is somehow delayed so that it is timestamped within the last hour, but only hits a realtime node now? Will the realtime node build a new segment for that hour, hand it off, and thus replace what was ingested as a batch? Is that just something you fix by configuring the realtime nodes to ignore “stale” events?

Hey Ryan,

This is what the “windowPeriod” config is for. The idea is that if your segmentGranularity is HOUR, and windowPeriod is PT10M, you should wait a couple hours before starting any batch job for that hour. Realtime nodes will then be dropping any data for that hour (since it’s outside the windowPeriod) and therefore there will be no conflict.
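
Roughly, the behavior is like this (a Python sketch for illustration, not the actual implementation):

    # Sketch of the windowPeriod rule: an event is accepted only if its
    # timestamp is within windowPeriod of the current time, so once "now"
    # is past hourEnd + windowPeriod, that hour can no longer receive
    # realtime data.
    from datetime import datetime, timedelta, timezone

    WINDOW_PERIOD = timedelta(minutes=10)  # PT10M

    def accepts(event_timestamp: datetime, now: datetime) -> bool:
        return abs(now - event_timestamp) <= WINDOW_PERIOD

    now = datetime(2016, 5, 1, 6, 25, tzinfo=timezone.utc)
    late_event = datetime(2016, 5, 1, 5, 59, tzinfo=timezone.utc)  # 05:00 hour
    print(accepts(late_event, now))  # False - the 05:00 hour is closed to realtime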

I see, thanks.

The thing I’m struggling with is that the events we are tracking are phone calls, which can last up to 4 hours. If we send the realtime event at call end but index it at call start time, then in windowPeriod terms we’d need PT250M or something like that, which would mean we’d either have to widen the granularity or live with a windowPeriod larger than the granularity, which the docs say not to do.
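
Just to show where my PT250M figure comes from (simple arithmetic, nothing Druid-specific):

    # An event timestamped at call start but emitted at call end arrives up to
    # the call duration after its timestamp, so windowPeriod must cover that
    # plus some slack.
    from datetime import timedelta

    max_call_duration = timedelta(hours=4)
    delivery_slack = timedelta(minutes=10)  # made-up headroom for delays
    required_window = max_call_duration + delivery_slack
    print(required_window)  # 4:10:00, i.e. roughly PT250M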

I know this is outside the usual use case, and the answer is probably “don’t do that” - emitting the event at call start time obviously works better with Druid, but it limits the amount of data we can put in the event. I’m trying to work through how to fit that with our requirements. Our use case definitely has some things about it that are a bit unusual. When we have it all worked out I’ll post here about it.

Hey Ryan,

Having a windowPeriod of 4 hours with a segmentGranularity of HOUR should work if you are willing to deal with a bunch of overlapping tasks (you’ll have 5 sets of tasks running at a time). The general guidance to have windowPeriod < segmentGranularity is for two reasons:

  1. Avoid capacity issues due to overlapping tasks

  2. A fiddly technical reason: avoid issues caused by re-use of serviceKeys (see https://github.com/druid-io/tranquility/pull/171). Once Druid 0.9.1 is released, this can be solved by setting druidBeam.taskLocator = overlord, which will stop the cycling serviceKey from being used for anything important.

If you are aware of these two things then it’s okay to bend the rules a bit.

For HOUR granularity, the cycling of (2) happens on a 24-hour rotation, so a 4 hour windowPeriod should be okay.
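
To make the capacity point in (1) concrete, here’s the back-of-the-envelope count behind the “5 sets of tasks” figure (an illustration, not how Druid computes anything):

    # A task for a given hour keeps running until hourEnd + windowPeriod, so
    # with windowPeriod = PT4H the tasks for the current hour plus the previous
    # four hours can all be alive at once.
    import math
    from datetime import timedelta

    def concurrent_task_sets(window_period: timedelta, segment_granularity: timedelta) -> int:
        return math.ceil(window_period / segment_granularity) + 1

    print(concurrent_task_sets(timedelta(hours=4), timedelta(hours=1)))  # 5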