Difference between LongSum Aggregation and Count

Hi,

I’m using Kafka to feed a Storm cluster with Tranquility bolts (v 4.2), which serves a Druid 0.7.1.1 cluster.

During ingestion, I index a count per minute using the count aggregator, but when I group my data by hour, I find a difference (~5 to 10%) between a raw count over my segments and a longSum over my count metric.

Since I also store the same data in a MySQL server, I can compare: the raw count appears to be the correct number.

Even weirder: if I group my data by minute, longSum(computed_count) is different from the raw count, and this time the number that matches MySQL is longSum(computed_count).

Any idea what could be the cause? Is there anything I’m missing here?

Thanks,

Pierre-Edouard

This question comes up on occasion.

It is very likely that a longSum on the “count” metric you defined upon ingestion is the number you are looking for.

A “count” aggregator at query time asks the Druid engine to count the number of rows it actually stores after rollup, not the number of events that were ingested. If Druid was able to roll events up at the segment’s query granularity, then a query-time “count” will return a smaller number than a “longSum” over the ingestion-time count metric.
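To make that concrete, here is a minimal sketch in Python (not Druid code; the event data and minute granularity are hypothetical) modeling how rollup makes the two aggregators diverge:

```python
from collections import defaultdict

# Hypothetical raw events: (minute timestamp, dimension value).
events = [
    ("2015-06-01T10:00", "a"),
    ("2015-06-01T10:00", "a"),  # same minute + same dims -> rolls into one row
    ("2015-06-01T10:00", "b"),
    ("2015-06-01T10:01", "a"),
]

# Ingestion-time rollup at minute granularity: identical (timestamp, dims)
# combinations collapse into a single stored row whose ingestion-time "count"
# metric records how many events it absorbed.
rows = defaultdict(int)
for ts, dim in events:
    rows[(ts, dim)] += 1

query_count = len(rows)        # query-time "count": 3 stored rows
long_sum = sum(rows.values())  # longSum over the count metric: 4 events
print(query_count, long_sum)   # -> 3 4
```

Here a query-time count sees only the 3 rolled-up rows, while a longSum over the ingestion-time count metric recovers all 4 ingested events.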

Knowing this, is there still a difference in what you expect vs what your query is returning?

Thanks,

Charles Allen

Hi Charles,

Thanks for your help. After investigating, we found that we had an issue with duplicates and modified our Storm topology to handle it. longSum now produces the correct results.

From what I understand, Druid’s internal rollup was collapsing the duplicate events, so a query-time count came very close to a deduplicated count, whereas longSum was faithfully reporting the real number of ingested events, duplicates included.
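As a hedged illustration of that effect (same toy model as above, hypothetical data): byte-identical duplicates share a (timestamp, dimensions) key, so rollup collapses them and a query-time count lands near the deduplicated total, while longSum adds up every ingested copy:

```python
from collections import defaultdict

# Hypothetical stream where the topology emitted the first event twice.
events = [("2015-06-01T10:00", "a")] * 2 + [("2015-06-01T10:00", "b")]

rows = defaultdict(int)
for ts, dim in events:
    rows[(ts, dim)] += 1   # duplicates share a (timestamp, dims) key

print(len(rows))           # 2: close to the deduplicated count
print(sum(rows.values()))  # 3: every ingested copy, duplicates included
```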

Thanks,

Pierre-Edouard Montabrun

Hi Pierre, you may want to read up on how Druid does rollup at ingestion time:
http://druid.io/docs/latest/

It should hopefully clarify why the number of Druid rows often does not match the number of ingested rows.