Row count issue while reindexing

Hi,

I am reindexing Druid segments using a non-Hadoop-based index task. The segments have a count metric, and the reindexing schema also has the count metric, but after reindexing the count metric data has changed. Before reindexing, a longSum query on count returns 543K; after reindexing it returns 253K. The longSum aggregator in the query is:

{"fieldName": "count", "type": "longSum", "name": "sum__count"}

Please suggest.

Here is some data I fetched from Druid. I grouped by dimension A and retrieved count and sum_count (the longSum of the count metric).

Before reindexing, the values were as below.

A      sum_count   count
val1   166K        126K
val2   115K        89.2K

After reindexing (added a HyperLogLog metric):

A      sum_count   count
val1   129K        126K
val2   90.5K       89.2K

After reindexing again (with two dimensions removed):

A      sum_count   count
val1   129K        90.4K
val2   90.5K       69.6K

Why is there a difference between the counts?

Are you reindexing with the count aggregator or the sum aggregator? The count aggregator is basically just a row count. If you’ve already ingested and aggregated with that, then you should switch to the sum aggregator for future operations.

—Eric
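To illustrate the difference between the two aggregators, here is a minimal Python sketch that imitates rollup on a handful of made-up rows. It is not Druid code: the `rollup` helper, the dimension names `A` and `B`, and the sample events are all assumptions for illustration only.

```python
from collections import defaultdict

def rollup(rows, dims, metrics):
    """Group rows by `dims` and apply each aggregator.

    `metrics` maps an output name to (aggregator type, input field).
    Only "count" and "longSum" are imitated here.
    """
    grouped = defaultdict(lambda: {name: 0 for name in metrics})
    for row in rows:
        key = tuple(row[d] for d in dims)
        for name, (agg, field) in metrics.items():
            if agg == "count":
                grouped[key][name] += 1           # one per input row
            elif agg == "longSum":
                grouped[key][name] += row[field]  # adds the stored value
    return [dict(zip(dims, key), **vals) for key, vals in grouped.items()]

def total(rows):
    """Query-time longSum over the stored count metric."""
    return sum(r["count"] for r in rows)

# Six hypothetical raw events.
raw = [
    {"A": "val1", "B": "x"}, {"A": "val1", "B": "x"}, {"A": "val1", "B": "y"},
    {"A": "val2", "B": "x"}, {"A": "val2", "B": "x"}, {"A": "val2", "B": "x"},
]

# Initial ingestion: the count aggregator records raw events per rolled-up row.
ingested = rollup(raw, ["A", "B"], {"count": ("count", None)})

# Reindex without dimension B, using count again: it counts the already
# rolled-up rows, so the original event total is lost.
reindexed_count = rollup(ingested, ["A"], {"count": ("count", None)})

# Reindex without dimension B, using longSum over the stored count:
# the original event total is preserved.
reindexed_sum = rollup(ingested, ["A"], {"count": ("longSum", "count")})

print(total(ingested), total(reindexed_count), total(reindexed_sum))
```

In this toy data, reindexing with count makes the stored total drop, while reindexing with longSum keeps it equal to the number of raw events even though fewer rows remain.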

Yes, I am using the count aggregator at both ingestion time and reindexing time. If I do not use it during reindexing, how will it be available in the reindexed segments? Below is the metrics section of my reindexing task. The count metric already exists from ingestion but is listed here again, along with a new HyperLogLog metric.

"metricsSpec" : [
  {
    "type" : "count",
    "name" : "count"
  },
  {
    "type": "hyperUnique",
    "name": "uniqe_user",
    "fieldName": "userid",
    "isInputHyperUnique": false,
    "round": false
  }
]

I have removed the count metric from the metricsSpec and added it to the firehose as below, but the count still mismatches. Is it because I removed two dimensions that had high cardinality? If I keep these dimensions, the count comes out okay, but the longSum of count at query time still gives a mismatch.

"ioConfig": {
  "type": "index",
  "firehose": {
    "type": "ingestSegment",
    "dataSource": "testsource",
    "interval": "2017-09-12/2017-09-13",
    "metrics": ["count", "sum_value"],
    "ignoreWhenNoSegments": true
  }
}

As per my understanding, the count may mismatch because I removed two high-cardinality dimensions during reindexing, so rollup will create fewer Druid rows, but the longSum of count should stay the same. It is also showing a different number. I am using a groupBy query for a particular date after reindexing the data for the same date.

Replace the count aggregator with {"type": "longSum", "name": "count", "fieldName": "count"} when reindexing.
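For reference, the metrics section from the reindexing task above might look like this after the swap. This is only a sketch built in Python for easy printing; it assumes the hyperUnique entry and the rest of the task spec stay exactly as posted earlier in the thread, and the variable name `metrics_spec` is just for illustration.

```python
import json

# metricsSpec from the reindexing task, with the count aggregator
# replaced by a longSum over the already-stored "count" column.
metrics_spec = [
    {"type": "longSum", "name": "count", "fieldName": "count"},
    {
        "type": "hyperUnique",
        "name": "uniqe_user",      # name kept as posted in the thread
        "fieldName": "userid",
        "isInputHyperUnique": False,
        "round": False,
    },
]

# Print the fragment as JSON so it can be compared with the task spec.
print(json.dumps({"metricsSpec": metrics_spec}, indent=2))
```

The output metric keeps the name "count", so existing queries that run a longSum over count should not need to change.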

Thanks a lot, Eric. It worked.