Our payment model depends on exact unique counts per day. We have segmentGranularity=1day, and all historical data is correct. However, segments from realtime return wrong (higher) figures because of how Druid stores the data internally. For example, with intermediatePersistPeriod<segmentGranularity, this is what happens:
1, a new intermediate segment (spill, I’m not sure about the right word) is created
2, a new event is received and inserted with count=1
3, the same event is received again and increments the counter, so count=2
4, intermediatePersistPeriod passes and a new intermediate segment (spill) is created
5, the same event is received again and inserted into the new spill with count=1
So now a count aggregator returns 2 and a longSum returns 3. I have seen other people run into this issue, but for them longSum is generally the solution (because they are interested in the total count, not the unique count); unfortunately that does not work for us. As I said, once handoff happens the intermediate segments (spills) are aggregated, so after that count returns the right figure again.
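To make the numbers concrete, here is a minimal Python sketch of the behaviour described above. It is only a toy model of rollup across spills, not Druid's actual implementation; the function names (`ingest`, `count_agg`, `long_sum`, `merge`) and the event key `"e"` are made up for illustration.

```python
from collections import defaultdict

def ingest(events, persist_after):
    """Roll events up into spills.

    A new spill is started after each event index listed in persist_after,
    which stands in for intermediatePersistPeriod elapsing.
    """
    spills = [defaultdict(int)]
    for i, key in enumerate(events):
        spills[-1][key] += 1                  # rollup inside the current spill
        if i in persist_after:
            spills.append(defaultdict(int))   # intermediate persist: new spill
    return [dict(s) for s in spills if s]

def count_agg(spills):
    """Like a count aggregator: number of stored rows across all spills."""
    return sum(len(s) for s in spills)

def long_sum(spills):
    """Like a longSum over the ingested count metric."""
    return sum(v for s in spills for v in s.values())

def merge(spills):
    """Like handoff: spills are merged, so identical rows roll up again."""
    merged = defaultdict(int)
    for s in spills:
        for key, v in s.items():
            merged[key] += v
    return [dict(merged)]

# The same event three times, with a persist after the second one.
spills = ingest(["e", "e", "e"], persist_after={1})
print(count_agg(spills), long_sum(spills))   # before handoff
handed_off = merge(spills)
print(count_agg(handed_off), long_sum(handed_off))   # after handoff
```

In this model the row count is inflated only while the event is split across spills; the sum is the same before and after the merge.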
We have tried nested groupBy queries, but the performance was bad (expected, I guess). Currently our solution is to set intermediatePersistPeriod=24H (which in turn requires windowPeriod=24H so that we do not lose data when it fails), which feels like an abuse. We were wondering whether Appenderators, or the epic changeset coming to Realtime, will solve this, or whether there are any plans to fix it (assuming it’s not just a ‘known way of working’, in which case we are not sure what other option we would have)?
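For anyone unfamiliar with the nested groupBy workaround, this is roughly what it computes, sketched in Python rather than as a Druid query. The row layout `(day, key, count)`, the function names and the sample date are assumptions made up for the example: the inner step collapses rows that were split across spills, and the outer step counts what is left per day.

```python
from collections import defaultdict

def inner_group_by(rows):
    """Inner query: group stored rows by (day, key) and sum their counts."""
    totals = defaultdict(int)
    for day, key, count in rows:
        totals[(day, key)] += count
    return dict(totals)

def outer_unique_count(inner):
    """Outer query: count the grouped rows per day, one per distinct key."""
    per_day = defaultdict(int)
    for day, _key in inner:
        per_day[day] += 1
    return dict(per_day)

# Hypothetical stored rows: event "e" is split across two spills.
rows = [
    ("2016-01-01", "e", 2),
    ("2016-01-01", "e", 1),
    ("2016-01-01", "f", 1),
]

inner = inner_group_by(rows)
print(outer_unique_count(inner))
```

The inner grouping has to materialise one row per distinct key per day before the outer step can count them, which would be consistent with the poor performance we saw.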