Distinct Count Aggregator with windowless ingestion

Hi All

Reading the Distinct Count Aggregator requirement at [1], we need all rows with a particular value for that dimension to go into the same segment.

But with windowless ingestion I don’t think there is any option to partition segments by a particular dimension value. Is it possible to use the distinct count aggregator with windowless ingestion?

On the publisher side we can try to make a change and publish the data to Kafka in such a way that data for a particular dimension value always goes to the same Kafka partition.
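To make that concrete, here is a minimal sketch of what a keyed producer could look like; the topic name ("events"), the dimension ("user_id"), and the JSON payload are made-up placeholders, not from our actual setup:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KeyedEventPublisher {
    public static void main(String[] args) {
        // Hypothetical producer config; adjust bootstrap servers for your cluster.
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // "user_id" stands in for the dimension used by the distinctCount aggregator.
            String dimensionValue = "user_42";
            String eventJson = "{\"timestamp\":\"2024-01-01T00:00:00Z\",\"user_id\":\"user_42\",\"clicks\":1}";

            // Using the dimension value as the record key lets Kafka's default
            // partitioner route every event with this value to the same partition.
            producer.send(new ProducerRecord<>("events", dimensionValue, eventJson));
        }
    }
}
```

Because the record key is the dimension value, every event carrying that value lands in the same Kafka partition.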

[1] http://druid.io/docs/latest/development/extensions-contrib/distinctcount.html

Thanks

Himanshu


Not sure what the question is here?

Hi Slim,

For the Distinct Count Aggregator [http://druid.io/docs/latest/development/extensions-contrib/distinctcount.html], we have these questions:

**Question 1 (just confirming my understanding of the doc):**

Following is the segment name structure used by Druid:

datasource_intervalStart_intervalEnd_version_partitionNum
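For example (a hypothetical identifier, not from any real cluster), a one-hour segment with partition number 3 might be named `events_2024-01-01T00:00:00.000Z_2024-01-01T01:00:00.000Z_2024-01-05T12:30:00.000Z_3`, where the version is the timestamp of the indexing run.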

Now, as per the following line in the doc,

… make sure all rows with a particular value for that dimension will go into the same segment, or this might over count …

Does this mean that if I have specified 4 shards (partitions) in Druid for one segment granularity period (say 1 hour), a particular value of “that” dimension should go to only one and the same partition out of those four, or can it be present in any partition of that segment interval?

**Question 2:**

If I am ingesting events into Druid with the Kafka Indexing Service (no Tranquility, no batch ingestion), will the distinctCount aggregator work properly? Because with the Kafka Indexing Service, Druid consumes directly from Kafka partitions, so how and where can we ensure that *all rows with a particular value of that dimension go into the same segment (or partition, depending on the answer to Question 1)*?

Hope the Questions are clear :slight_smile:

Thanks,

Pravesh Gupta

> Hi Slim,
>
> For the Distinct Count Aggregator [http://druid.io/docs/latest/development/extensions-contrib/distinctcount.html], we have these questions:
>
> **Question 1 (just confirming my understanding of the doc):**
>
> Following is the segment name structure used by Druid:
>
> datasource_intervalStart_intervalEnd_version_partitionNum
>
> Now, as per the following line in the doc,
>
> … make sure all rows with a particular value for that dimension will go into the same segment, or this might over count …
>
> Does this mean that if I have specified 4 shards (partitions) in Druid for one segment granularity period (say 1 hour), a particular value of “that” dimension should go to only one and the same partition out of those four, or can it be present in any partition of that segment interval?

Values of that dimension within the same segment interval need to be grouped together in the same partition of that interval.

> **Question 2:**
>
> If I am ingesting events into Druid with the Kafka Indexing Service (no Tranquility, no batch ingestion), will the distinctCount aggregator work properly? Because with the Kafka Indexing Service, Druid consumes directly from Kafka partitions, so how and where can we ensure that *all rows with a particular value of that dimension go into the same segment (or partition, depending on the answer to Question 1)*?

I think you can achieve this by using the actual dimension as the Kafka partition key, so all rows with the same value will be grouped together, but my fear is that if you have late events this will not work.
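To illustrate why keying by the dimension works: with a keyed record, the partitioner deterministically maps the key to a partition, so every row with the same dimension value lands in the same Kafka partition, and therefore in the same indexing task and segment. A rough sketch of that idea, using a simplified hash as a stand-in for Kafka's actual murmur2-based default partitioner (the values and partition count are made up):

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

public class PartitionMappingSketch {
    // Simplified stand-in for Kafka's default partitioner: a deterministic
    // hash of the key modulo the number of partitions. Kafka really uses
    // murmur2, but the property that matters is the same: identical keys
    // always map to the same partition.
    static int partitionFor(String key, int numPartitions) {
        byte[] bytes = key.getBytes(StandardCharsets.UTF_8);
        int hash = 0;
        for (byte b : bytes) {
            hash = 31 * hash + b;
        }
        return Math.floorMod(hash, numPartitions);
    }

    public static void main(String[] args) {
        int numPartitions = 4; // hypothetical topic with 4 partitions
        for (String dimensionValue : List.of("user_42", "user_42", "user_7", "user_42")) {
            System.out.println(dimensionValue + " -> partition "
                + partitionFor(dimensionValue, numPartitions));
        }
        // "user_42" always prints the same partition number, which is why
        // keying by the dimension keeps one value inside one segment, as long
        // as its events arrive before the segment hands off.
    }
}
```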

So the bottom line, I think, is that if you cannot guarantee that all the events will be available before handoff, this will not work with realtime ingestion.