Filter the rows by their "count".

Hi, I am trying not to roll up data in Druid; I want all my entries to stay as they are, "raw".

In order to achieve that, I configured the ingestion spec accordingly.
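Roughly, the relevant part of my granularitySpec looks like this (the values are illustrative, and the explicit "rollup" flag is an assumption about the Druid version in use; it may not exist in older releases):

  "granularitySpec": {
    "type": "uniform",
    "segmentGranularity": "DAY",
    "queryGranularity": "NONE",
    "rollup": false,
    "intervals": ["2016-02-01/2016-03-01"]
  }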

But my batch-ingestion logs say (and I get the same numbers when I query Druid):

Map-Reduce Framework
		Map input records=8479839
		Map output records=8478194

Does that mean some roll-up has occurred?

If so, how can I see the rolled-up rows?

When I query like this:

{
  "queryType": "groupBy",
  "dataSource": "testdata",
  "granularity": "minute",
  "dimensions": ["serialnumber", "source"],
  "limitSpec": { "type": "default", "limit": 10 },
  "aggregations": [
    { "type": "count", "name": "count" }
  ],
  "intervals": ["2016-02-01T00:00:00.000/2016-03-01T00:00:00.000"],
  "having": { "type": "greaterThan", "aggregation": "count", "value": 1 },
  "context": { "skipEmptyBuckets": "true" }
}



It returns this:

{ "error": "Unknown exception" }



If I set the granularity of the same query to "all", I do get results.
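That is, this variant (identical to the query above except for the granularity) returns rows:

{
  "queryType": "groupBy",
  "dataSource": "testdata",
  "granularity": "all",
  "dimensions": ["serialnumber", "source"],
  "limitSpec": { "type": "default", "limit": 10 },
  "aggregations": [
    { "type": "count", "name": "count" }
  ],
  "intervals": ["2016-02-01T00:00:00.000/2016-03-01T00:00:00.000"],
  "having": { "type": "greaterThan", "aggregation": "count", "value": 1 },
  "context": { "skipEmptyBuckets": "true" }
}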

How else can I get the entries whose "count" value is greater than 1? I want to see where the roll-up has occurred.

Thanks.

The Map phase doesn't do rollup; that happens in the Combine and Reduce phases. Records dropped during the Map phase are usually dropped because their timestamp is out of bounds of your "intervals" or because something is wrong with their formatting.
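For example, with a parser along these lines (a hypothetical spec, adjust the names to your own), a row whose "timestamp" field is missing or unparseable, or whose parsed timestamp falls outside the granularitySpec "intervals", is thrown away during the Map phase before any rollup can happen:

  "parser": {
    "type": "string",
    "parseSpec": {
      "format": "json",
      "timestampSpec": { "column": "timestamp", "format": "auto" },
      "dimensionsSpec": { "dimensions": ["serialnumber", "source"] }
    }
  }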

Hi!

There are no misformatted or out-of-interval timestamps (AFAIK). I double-checked, changed the intervals, etc.

The rows come from MongoDB, and I checked their integrity there too. There are no unparseable timestamps, and no errors in the ingestion logs either.

So how do you think this is possible?

Out-of-bounds timestamps and unparseable rows are really the only reasons I know of for rows to get dropped during the Map phase.

It is possible you have some timestamps that look good but Druid still can’t parse, or some rows without timestamps. Are you running all your Hadoop nodes in the UTC timezone? Running in other timezones can cause time-related issues.
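If you are using the Hadoop batch indexer, one way to force UTC is to set the timezone explicitly in the jobProperties of your tuningConfig, along the lines of what the Druid docs recommend (a sketch; memory options omitted, add your own):

  "tuningConfig": {
    "type": "hadoop",
    "jobProperties": {
      "mapreduce.map.java.opts": "-Duser.timezone=UTC -Dfile.encoding=UTF-8",
      "mapreduce.reduce.java.opts": "-Duser.timezone=UTC -Dfile.encoding=UTF-8"
    }
  }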

If you are confident that your input data is correct and that the server timezone is UTC, it's possible that you're hitting a bug in the indexer. In that case, could you post your indexing spec?