We’re currently POCing a new data pipeline that uses Druid as its application-facing query store, downstream from some stateful stream processing with Samza. We’re considering using Druid to store daily history for a large number of objects, with a number of different dimensions identifying each row. The way we’re building our pipeline, our events are cumulative over time: we send metric deltas to Druid from our stream processing layer, and those deltas are aggregated within the queryGranularity specified in the ingestion spec. When we cross a queryGranularity boundary, though, we’d like to retain the current values as the base over which the new period’s aggregates are applied. For example, in our Tranquility ingestion spec we’d set a queryGranularity of one day, and all events for that day would be aggregated into their dimensional buckets. When we roll over into the next day, we’d like to pre-populate the new segments with the previous day’s totals so that all streaming aggregates are initialized from them.
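For concreteness, the relevant piece of our dataSchema looks roughly like the fragment below (dataSource, metric, and field names are made up for illustration, and the parser/dimensions sections are omitted); the key parts are the day-level granularities and an additive aggregator applied to the incoming deltas:

```json
"dataSchema": {
  "dataSource": "object_metrics",
  "granularitySpec": {
    "type": "uniform",
    "segmentGranularity": "DAY",
    "queryGranularity": "DAY"
  },
  "metricsSpec": [
    { "type": "longSum", "name": "value", "fieldName": "delta" }
  ]
}
```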
It seems like this might be a common case, and it’s basically the opposite of the handoff from the indexing nodes to the historical nodes, so I thought there might already be something in Druid to handle it and let Druid’s existing infrastructure handle the query routing, the same way a broker currently routes to indexing nodes and historical nodes based on the query interval. Is there an existing configuration, process, or best practice for this kind of situation? Or is this roll-forward something I’ll need to handle at the application level?
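If the answer is “handle it at the application level,” the roll-forward I have in mind is roughly: at the day boundary, query the previous day’s totals (e.g. with a groupBy query against the broker), then re-emit each result row as a seed event timestamped at the start of the new day so it lands in the new day’s buckets. A minimal sketch of just the re-emit step (the function, dimension, and metric names here are hypothetical; the query and Kafka/Samza plumbing are left out):

```python
from datetime import datetime, timezone

def build_rollforward_events(prev_day_rows, new_day_start):
    """Turn the previous day's aggregated totals (e.g. the rows returned by a
    Druid groupBy query) into seed events stamped at the start of the new day.

    prev_day_rows: list of dicts holding dimension values plus a "value" total.
    new_day_start: timezone-aware datetime for the first instant of the new day.
    """
    ts = new_day_start.isoformat()
    events = []
    for row in prev_day_rows:
        event = dict(row)        # carry every dimension value through unchanged
        event["timestamp"] = ts  # stamp the event into the new day's interval
        events.append(event)
    return events

# Hypothetical example: two objects with their end-of-day totals.
prev = [
    {"objectId": "a", "region": "us", "value": 42},
    {"objectId": "b", "region": "eu", "value": 7},
]
seeds = build_rollforward_events(prev, datetime(2016, 1, 2, tzinfo=timezone.utc))
```

Each seed event would then be fed back through the same delta stream, so the longSum aggregator starts the new day at yesterday’s totals rather than zero.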
It’s also possible that I’m abusing the queryGranularity configuration to gain a small amount of mutability on the metric aggregates during the current day, so if that’s the case, I’d be open to alternative suggestions.