How to find the count of ingested rows in Druid

I am ingesting data into Druid using a realtime node. The total number of records ingested is 40 million, but when I query I see only 13 million rows. The logs show no rejected events, yet the count is still lower than the amount of data I ingested. The Druid documentation says:

"To count the number of ingested rows of data, include a count aggregator at ingestion time, and a longSum aggregator at query time."

What does this line mean?

I am using the following metricsSpec while ingesting data:

"metricsSpec" : [{
  "type" : "count",
  "name" : "COUNT"
},

And I am querying with:

{
  "queryType": "groupBy",
  "dataSource": "abcd",
  "granularity": "all",
  "dimensions": [],
  "aggregations": [
    {"type": "count", "name": "count"},
    {"type": "count", "name": "UNIQUE_CUSTOMERS", "fieldName": "CUSTOMER_ID"}
  ],
  "intervals": [""]
}

This gives me only 13 million counts (I am fairly sure it is returning Druid's internal row count). But if I change the aggregator from {"type": "count", "name": "count"} to {"type": "longSum", "name": "count"}, as suggested in the line quoted above, I get a syntax error.

I even tried a segmentMetadata query, but this also gives me the same 13 million count:

{
  "queryType": "segmentMetadata",
  "dataSource": "abcd",
  "intervals": [""]
}

Can anyone suggest a way to find out how many original rows were ingested by Druid?

At query time, instead of {"type": "count", "name": "count"} you want {"type": "longSum", "name": "count", "fieldName": "count"}. The idea is that at indexing time you're doing a count, but at query time you're summing an already-computed count.
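As a concrete sketch, the corrected groupBy query might look like the following, assuming the ingestion-time count metric is named "COUNT" as in the metricsSpec above (the longSum's fieldName must match that name; the output names and the interval here are placeholders):

{
  "queryType": "groupBy",
  "dataSource": "abcd",
  "granularity": "all",
  "dimensions": [],
  "aggregations": [
    {"type": "longSum", "name": "ingested_rows", "fieldName": "COUNT"},
    {"type": "count", "name": "druid_rows"}
  ],
  "intervals": ["2016-01-01/2017-01-01"]
}

Comparing the two outputs shows how much rollup collapsed the data: the longSum over the ingestion-time count gives the original row count, while the plain count aggregator gives the number of rows Druid actually stores after rollup.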

Hey everyone,

I'm also facing the same issue. The total number of rows ingested is 21731674, but the longSum count is 16570932. Can I recover the rows that were lost/collapsed, as this will affect the score I want to calculate? Is there any workaround to prevent rows from getting collapsed?

Hi Parveen,

Separate from the total-row-count question, I noticed your query has this aggregator:

{"type": "count", "name": "UNIQUE_CUSTOMERS", "fieldName": "CUSTOMER_ID"}

The "count" aggregator doesn't compute the cardinality of a dimension and doesn't take a "fieldName" parameter; it only counts the number of rows returned by a query.

To get a cardinality estimate for a column, you'd want to use a cardinality, hyperUnique, or DataSketches aggregator, e.g. http://druid.io/docs/latest/querying/aggregations.html#cardinality-aggregator
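For example, a query-time cardinality aggregator over the CUSTOMER_ID dimension would look roughly like this (a sketch, assuming CUSTOMER_ID was ingested as a dimension; note that older Druid versions use "fieldNames" instead of "fields" for the dimension list):

{
  "type": "cardinality",
  "name": "UNIQUE_CUSTOMERS",
  "fields": ["CUSTOMER_ID"],
  "byRow": false
}

With "byRow": false, the aggregator estimates the number of distinct CUSTOMER_ID values in the queried interval.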

Hi Akul,

I am not sure what you mean by "recover the data loss/collapsed", but if you are asking how to disable Druid's rollup summarization feature, you can do that by setting "rollup": false in your granularitySpec.
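For reference, a granularitySpec with rollup disabled might look like this (a sketch; the segmentGranularity and queryGranularity values are placeholders, and the rollup flag requires a Druid version that supports configurable rollup):

"granularitySpec": {
  "type": "uniform",
  "segmentGranularity": "DAY",
  "queryGranularity": "NONE",
  "rollup": false
}

With rollup disabled, Druid stores every ingested row individually, so the row count at query time matches the number of ingested rows, at the cost of larger segments.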