Hadoop reindex job from existing data source has all metrics equal to 0

Escalating this topic, since the question now has a somewhat different focus.

In short, we want to use our existing 5-minute-granularity data source and run a Hadoop reindexing job to create a new data source with 1-hour granularity and aggregated data.

Here’s the task JSON:

```
{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": {
      "dataSource": "new-datasource-index-1h",
      "parser": {
        "type": "hadoopyString",
        "parseSpec": {
          "format": "json",
          "timestampSpec": {
            "column": "timestamp",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": ["aaa", "bbb", "ccc", "ddd", "eee"],
            "dimensionExclusions": ["xxx", "yyy"],
            "spatialDimensions": []
          }
        }
      },
      "metricsSpec": [
        { "name": "event_qty", "type": "longSum", "fieldName": "count" },
        { "name": "sum_value", "type": "longSum", "fieldName": "val" },
        { "name": "min_value", "type": "longMin", "fieldName": "val" },
        { "name": "max_value", "type": "longMax", "fieldName": "val" }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "day",
        "queryGranularity": "hour",
        "rollup": true,
        "intervals": ["2017-12-01T00:00:00.000/2017-12-01T04:00:00.000"]
      }
    },
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "dataSource",
        "ingestionSpec": {
          "dataSource": "existing-druid-datasource",
          "intervals": ["2017-12-01T00:00:00.000/2017-12-01T04:00:00.000"],
          "metrics": ["count", "val"]
        }
      }
    },
    "tuningConfig": {
      "type": "hadoop",
      "maxRowsInMemory": 15000000,
      "numBackgroundPersistThreads": 0,
      "jobProperties": {
        "mapreduce.job.classloader": "true"
      }
    }
  }
}
```

Our problem is that after this task finishes, all metrics in the new data source are equal to 0. For example, here are some rows from the 5-minute-granularity data source:

```
{
  "__time": "2017-12-01T01:05:00.000Z",
  "aaa": "some-aaa-value",
  "bbb": "some-bbb-value",
  "ccc": "some-ccc-value",
  "ddd": "some-ddd-value",
  "eee": "some-eee-value",
  "xxx": "151128900",
  "yyy": "some irrelevant text",
  "count": 1,
  "val": 59287810
},
{
  "__time": "2017-12-01T01:05:05.000Z",
  "aaa": "some-aaa-value",
  "bbb": "some-bbb-value",
  "ccc": "some-ccc-value",
  "ddd": "some-ddd-value",
  "eee": "some-eee-value",
  "xxx": "151129200",
  "yyy": "another irrelevant text here",
  "count": 1,
  "val": 17548
}
```
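Given the aggregators in the metricsSpec, we would expect the hourly rollup of these two rows to look like this (a sketch, assuming these are the only two rows for this dimension combination in that hour: event_qty sums the two count values, sum_value sums val, and min_value/max_value take the extremes of val):

```
{
  "__time": "2017-12-01T01:00:00.000Z",
  "aaa": "some-aaa-value",
  "bbb": "some-bbb-value",
  "ccc": "some-ccc-value",
  "ddd": "some-ddd-value",
  "eee": "some-eee-value",
  "event_qty": 2,
  "sum_value": 59305358,
  "min_value": 17548,
  "max_value": 59287810
}
```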

But here is the actual row we get in the new data source:

```
{
  "__time": "2017-12-01T01:00:00.000Z",
  "aaa": "some-aaa-value",
  "bbb": "some-bbb-value",
  "ccc": "some-ccc-value",
  "ddd": "some-ddd-value",
  "eee": "some-eee-value",
  "event_qty": 0,
  "sum_value": 0,
  "min_value": 0,
  "max_value": 0
}
```

As you can see, the two excluded dimensions were dropped from the new data source (as expected), as were the two original metrics, and the four new metrics were introduced. But for some unknown reason they are all equal to 0.
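As a side check, a segmentMetadata query against the new data source confirms which columns actually ended up in the segments (a minimal sketch; assuming a broker reachable at the standard /druid/v2 endpoint). In our case all four metric columns are present, just always 0:

```
{
  "queryType": "segmentMetadata",
  "dataSource": "new-datasource-index-1h",
  "intervals": ["2017-12-01T00:00:00.000/2017-12-01T04:00:00.000"]
}
```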

Why does this happen? In the 5-minute data source (from which I take the initial data), "count" and "val" are defined as metrics, and I want to aggregate them in the new data source, which is exactly what I specified in the dataSchema's metricsSpec.

Please help!

Hm, is it possible that you don't have any rows with "count" or "val" metrics in the interval "2017-12-01T00:00:00.000/2017-12-01T04:00:00.000"?
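A quick way to verify is a timeseries query over that interval against the source data source (a sketch; POST it to a broker at /druid/v2). If the totals come back 0 or empty here as well, the problem is in the input data rather than in the reindexing task:

```
{
  "queryType": "timeseries",
  "dataSource": "existing-druid-datasource",
  "granularity": "all",
  "intervals": ["2017-12-01T00:00:00.000/2017-12-01T04:00:00.000"],
  "aggregations": [
    { "type": "longSum", "name": "total_count", "fieldName": "count" },
    { "type": "longSum", "name": "total_val", "fieldName": "val" }
  ]
}
```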

For anyone who has the same issue, this is being discussed at https://github.com/druid-io/druid/issues/5277