I have a use case where I want to roll up duplicate events (same timestamp and same dimension values) with a longMax strategy, but the events arrive separated by more than the taskDuration time length. So basically, these events end up in different segments.
But if my desired rollup strategy is longMax, then no matter how I query that metric, I am bound to get an incorrect result. Say the aggregation at query time on the metric is longSum: the duplicates sitting in separate segments get summed instead of being rolled up into a single row.
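To make the arithmetic concrete, here is a small sketch (plain Python, not Druid code; the values are hypothetical) of how a longSum query goes wrong when longMax rollup cannot combine duplicates that landed in different segments:

```python
# Two duplicate events (same timestamp, same dimension values), metric value 10 each.
events = [10, 10]

# Case 1: both events land in the same segment, so ingestion-time rollup
# combines them with longMax into a single row.
same_segment = [max(events)]                              # one row: [10]
print(sum(same_segment))                                  # longSum at query time -> 10 (correct)

# Case 2: the events arrive more than taskDuration apart, so they land in
# separate segments and rollup cannot combine them; each segment keeps a row.
separate_segments = [max([events[0]]), max([events[1]])]  # two rows: [10, 10]
print(sum(separate_segments))                             # longSum at query time -> 20 (incorrect)
```

The query-time longSum has no way to know that the two rows are duplicates of one logical event, so the result is double-counted.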
Is this a known issue and limitation of the Kafka Indexing Service?
If it is, shouldn't it be mentioned in the Druid documentation?
Please help me here.
Thanks in advance,
Yes, in general, rollup is not “guaranteed” with realtime ingestion. It will happen if the events are in the same Kafka partition and arrive close together in time, but it’s not guaranteed if they arrive far apart. A note on http://druid.io/docs/latest/development/extensions-core/kafka-ingestion.html would be useful; perhaps you’d like to raise a patch adding a note like that to the “On the Subject of Segments” section of https://github.com/druid-io/druid/blob/master/docs/content/development/extensions-core/kafka-ingestion.md?
It’s possible to achieve full rollup after initial ingestion by running reindexing tasks, as the doc mentions.
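For reference, a reindexing task of roughly the following shape can re-read existing segments and apply full rollup across them. This is only a sketch: `my_datasource`, `my_metric`, and the interval are placeholders, and the exact field names depend on the Druid version, so check the batch ingestion docs before using it.

```json
{
  "type": "index",
  "spec": {
    "dataSchema": {
      "dataSource": "my_datasource",
      "metricsSpec": [
        { "type": "longMax", "name": "my_metric", "fieldName": "my_metric" }
      ],
      "granularitySpec": { "rollup": true }
    },
    "ioConfig": {
      "type": "index",
      "firehose": {
        "type": "ingestSegment",
        "dataSource": "my_datasource",
        "interval": "2017-01-01/2017-01-02"
      }
    }
  }
}
```

Because the task reads back all rows in the interval, duplicates that originally landed in different segments can finally be combined by the longMax aggregator.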
Thanks, Gian, for the confirmation.
After giving it some thought, it's not even possible to handle this kind of scenario at ingestion time, because we cannot roll up two events if one is in MiddleManager memory and the other is already in a segment in deep storage. I think of it as a trade-off between windowless and windowed Druid ingestion.