[druid-user] Re: Druid has too many rows? What could cause this?

Re: COUNT() versus SUM(count) – note that the "count" metric is created at ingestion time – I presume that you have the roll-up feature turned on. Roll-up does a GROUP BY on incoming rows to generate the metrics. Therefore, just like a GROUP BY, if 2 incoming rows have the same data, 1 row will be output. COUNT() tells you the actual number of rows stored in the table; SUM("count") gives you the total number of incoming rows.
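As an illustration (the datasource name here is just a placeholder): if two identical events arrive within the same roll-up granularity bucket, they are stored as a single row whose "count" metric is 2, so the two queries diverge:

```sql
-- With roll-up on, two identical incoming events become one stored row
-- with count = 2. Then:
SELECT COUNT(*)     FROM "my_datasource"   -- stored rows (1 in this example)
SELECT SUM("count") FROM "my_datasource"   -- ingested rows (2 in this example)
```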

E.g. comparing SUM("count") against COUNT(*) is useful to understand what roll-up ratio you are getting.
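A sketch of such a ratio query (datasource name is a placeholder; the `* 1.0` forces floating-point division in Druid SQL):

```sql
SELECT SUM("count") * 1.0 / COUNT(*) AS rollup_ratio
FROM "my_datasource"
```

A ratio of 1.0 means roll-up is collapsing nothing; higher values mean more incoming rows are being merged per stored row.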

If you disable roll-up, you will have a row-for-row match between incoming data and the table, at the cost of a performance hit.

As for 10m rows in source data versus what your SUM("count") is telling you, there are various approaches to understanding why you do not have the same number of rows source → ingestion. My first step would probably be to try to isolate time periods: e.g. one day in one system versus one day in the other – to see if I can pinpoint the point in time where there is a discrepancy. I can then move on to looking at the ingestion task logs to see if there were failures on particular files etc.
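On the Druid side, the per-day isolation step could be sketched like this (datasource name assumed), giving daily ingested-row counts to compare against the source system:

```sql
-- Ingested rows per day, to line up against the source system's daily counts
SELECT FLOOR(__time TO DAY) AS "day",
       SUM("count")         AS ingested_rows
FROM "my_datasource"
GROUP BY 1
ORDER BY 1
```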

  • Pete

(Also noting that a good Druid query always includes a time filter anyway! 🙂)

If you have 10k input rows and 26k in Druid, that’s pretty odd. Is it possible you loaded more than once, using appendToExisting: true?
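One way to check for a repeated load is to look at the segments per interval in the system tables – multiple loads into the same interval with appendToExisting tend to show up as extra segments. A sketch against the `sys.segments` metadata table (datasource name is a placeholder):

```sql
-- Segments and row counts per time chunk; unexpectedly many segments
-- (or doubled row counts) in one interval can indicate a repeated load
SELECT "start", "end",
       COUNT(*)       AS num_segments,
       SUM("num_rows") AS total_rows
FROM sys.segments
WHERE datasource = 'my_datasource'
GROUP BY 1, 2
ORDER BY 1
```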

Can you check the counts with some grouping columns? My thinking is that, since dynamic partitioning does not guarantee perfect rollup, it is possible to have duplicate rows for the same dimension values (the metrics will still aggregate correctly), and this can lead to a misleading COUNT(*). Specifying grouping columns for the count will take care of this.
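That check could look like the following sketch (dimension and datasource names are hypothetical): grouping collapses duplicate rows that dynamic partitioning left in separate segments, so the per-group sums are what to compare against the source.

```sql
-- Physical rows may exceed distinct dimension combinations under
-- dynamic partitioning; SUM("count") per group is still correct
SELECT dim1, dim2,
       COUNT(*)     AS physical_rows,
       SUM("count") AS ingested_rows
FROM "my_datasource"
GROUP BY dim1, dim2
```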