Incorrect value for new metric in index_hadoop (reindexing segments)

Hi,

We have segments generated by the Kafka Indexing Service that we are compacting via an index_hadoop task.

However, it seems that adding new metrics (with new names) doesn't aggregate correctly (and no error is reported).

In this example, the source segments have a single metric, count, and we derive 3 metrics from it:

```json
{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": {
      "dataSource": "default__core_app_target_new_metrics",
      "parser": {
        "type": "string",
        "parseSpec": {
          "format": "json",
          "timestampSpec": {
            "column": "sys_timestamp",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": [],
            "dimensionExclusions": []
          }
        }
      },
      "metricsSpec": [
        {
          "name": "count_became_longSum",
          "type": "longSum",
          "fieldName": "count"
        },
        {
          "name": "count_became_NewCount",
          "type": "count",
          "fieldName": "count"
        },
        {
          "name": "count",
          "type": "count",
          "fieldName": "count"
        }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "HOUR",
        "queryGranularity": "NONE",
        "intervals": ["2017-02-09T04:00:00.000Z/2017-02-09T05:00:00.000Z"]
      }
    },
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "dataSource",
        "ingestionSpec": {
          "dataSource": "default__core_app",
          "intervals": ["2017-02-09T04:00:00.000Z/2017-02-09T05:00:00.000Z"],
          "segments": [
            {
              "dataSource": "default__core_app",
              "interval": "2017-02-09T04:15:00.000Z/2017-02-09T04:30:00.000Z",
              "version": "2017-02-09T04:15:10.858Z",
              "loadSpec": {
                "type": "hdfs",
                "path": "/apps/druid/segments/default__core_app/20170209T041500.000Z_20170209T043000.000Z/2017-02-09T04_15_10.858Z/0/index.zip"
              },
              "dimensions": "colA,colB",
              "metrics": "count",
              "shardSpec": {
                "type": "numbered",
                "partitionNum": 0,
                "partitions": 0
              },
              "binaryVersion": 9,
              "size": 41547,
              "identifier": "default__core_app_2017-02-09T04:15:00.000Z_2017-02-09T04:30:00.000Z_2017-02-09T04:15:10.858Z"
            },
            {
              "dataSource": "default__core_app",
              "interval": "2017-02-09T04:30:00.000Z/2017-02-09T04:45:00.000Z",
              "version": "2017-02-09T04:30:20.725Z",
              "loadSpec": {
                "type": "hdfs",
                "path": "/apps/druid/segments/default__core_app/20170209T043000.000Z_20170209T044500.000Z/2017-02-09T04_30_20.725Z/1/index.zip"
              },
              "dimensions": "colA,colB",
              "metrics": "count",
              "shardSpec": {
                "type": "numbered",
                "partitionNum": 1,
                "partitions": 0
              },
              "binaryVersion": 9,
              "size": 21471,
              "identifier": "default__core_app_2017-02-09T04:30:00.000Z_2017-02-09T04:45:00.000Z_2017-02-09T04:30:20.725Z_1"
            },
            {
              "dataSource": "default__core_app",
              "interval": "2017-02-09T04:45:00.000Z/2017-02-09T05:00:00.000Z",
              "version": "2017-02-09T04:45:10.721Z",
              "loadSpec": {
                "type": "hdfs",
                "path": "/apps/druid/segments/default__core_app/20170209T044500.000Z_20170209T050000.000Z/2017-02-09T04_45_10.721Z/0/index.zip"
              },
              "dimensions": "colA,colB",
              "metrics": "count",
              "shardSpec": {
                "type": "numbered",
                "partitionNum": 0,
                "partitions": 0
              },
              "binaryVersion": 9,
              "size": 40119,
              "identifier": "default__core_app_2017-02-09T04:45:00.000Z_2017-02-09T05:00:00.000Z_2017-02-09T04:45:10.721Z"
            },
            {
              "dataSource": "default__core_app",
              "interval": "2017-02-09T04:30:00.000Z/2017-02-09T04:45:00.000Z",
              "version": "2017-02-09T04:30:20.725Z",
              "loadSpec": {
                "type": "hdfs",
                "path": "/apps/druid/segments/default__core_app/20170209T043000.000Z_20170209T044500.000Z/2017-02-09T04_30_20.725Z/0/index.zip"
              },
              "metrics": "count",
              "dimensions": "colA,colB",
              "shardSpec": {
                "type": "numbered",
                "partitionNum": 0,
                "partitions": 0
              },
              "binaryVersion": 9,
              "size": 23962,
              "identifier": "default__core_app_2017-02-09T04:30:00.000Z_2017-02-09T04:45:00.000Z_2017-02-09T04:30:20.725Z"
            },
            {
              "dataSource": "default__core_app",
              "interval": "2017-02-09T04:00:00.000Z/2017-02-09T04:15:00.000Z",
              "version": "2017-02-09T04:00:20.623Z",
              "loadSpec": {
                "type": "hdfs",
                "path": "/apps/druid/segments/default__core_app/20170209T040000.000Z_20170209T041500.000Z/2017-02-09T04_00_20.623Z/0/index.zip"
              },
              "dimensions": "colA,colB",
              "metrics": "count",
              "shardSpec": {
                "type": "numbered",
                "partitionNum": 0,
                "partitions": 0
              },
              "binaryVersion": 9,
              "size": 40621,
              "identifier": "default__core_app_2017-02-09T04:00:00.000Z_2017-02-09T04:15:00.000Z_2017-02-09T04:00:20.623Z"
            }
          ]
        }
      }
    },
    "tuningConfig": {
      "type": "hadoop",
      "forceExtendableShardSpecs": "true"
    }
  }
}
```

Doing a select query over the resulting dataSource shows that the newly named metrics are calculated incorrectly (count_became_NewCount and count_became_longSum should each have a value of 1).
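For reference, the select query was along these lines (a sketch reconstructed from the output below; the paging threshold is arbitrary):

```json
{
  "queryType": "select",
  "dataSource": "default__core_app_target_new_metrics",
  "intervals": ["2017-02-09T04:00:00.000Z/2017-02-09T05:00:00.000Z"],
  "granularity": "all",
  "dimensions": [],
  "metrics": [],
  "pagingSpec": { "pagingIdentifiers": {}, "threshold": 10 }
}
```

which returns: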

```json
[
  {
    "timestamp": "2017-02-09T04:00:10.000Z",
    "result": {
      "pagingIdentifiers": {
        "default__core_app_target_new_metrics_2017-02-09T04:00:00.000Z_2017-02-09T05:00:00.000Z_2017-02-13T11:10:57.153Z": 1
      },
      "dimensions": [
        "colA",
        "colB"
      ],
      "metrics": [
        "count",
        "count_became_NewCount",
        "count_became_longSum"
      ],
      "events": [
        {
          "segmentId": "default__core_app_target_new_metrics_2017-02-09T04:00:00.000Z_2017-02-09T05:00:00.000Z_2017-02-13T11:10:57.153Z",
          "offset": 0,
          "event": {
            "timestamp": "2017-02-09T04:00:10.000Z",
            "colA": "Event 1",
            "colB": "Awesome",
            "count": 1,
            "count_became_NewCount": 0,
            "count_became_longSum": 0
          }
        },
        {
          "segmentId": "default__core_app_target_new_metrics_2017-02-09T04:00:00.000Z_2017-02-09T05:00:00.000Z_2017-02-13T11:10:57.153Z",
          "offset": 1,
          "event": {
            "timestamp": "2017-02-09T04:00:10.000Z",
            "colA": "Event 10",
            "colB": "Best",
            "count": 1,
            "count_became_NewCount": 0,
            "count_became_longSum": 0
          }
        }
      ]
    }
  }
]
```

Is this a limitation of index_hadoop when reading from existing segments?

Thanks

Same problem. Can anyone help?

Same problem +1

On Friday, March 3, 2017 at 3:38:06 PM UTC+8, Liz Lin wrote:

Same Problem …

Hey Pierre (and others),

Firstly, I think there is a misunderstanding of what the "type": "count" aggregator does. It just counts the number of input rows and doesn't read from any particular column, so "fieldName" is ignored. {"name": "count_became_NewCount", "type": "count", "fieldName": "count"} is equivalent to just {"name": "count_became_NewCount", "type": "count"}.
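As a side note, if the goal is to carry the number of original input events through a re-index of already rolled-up data, the usual pattern (as far as I know) is a longSum over the existing count column rather than a count aggregator, something like:

```json
{ "name": "count", "type": "longSum", "fieldName": "count" }
```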

But it's still unexpected that they end up as 0. They should be at least 1 for any output row, since any output row should have at least one input row. Are you sure the columns are making it into the final segments? Do they show up in a segmentMetadata query?
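For example, something along these lines should list the columns that actually made it into the new segments:

```json
{
  "queryType": "segmentMetadata",
  "dataSource": "default__core_app_target_new_metrics",
  "intervals": ["2017-02-09T04:00:00.000Z/2017-02-09T05:00:00.000Z"]
}
```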

Hmm, it looks like this is working as intended, although it actually surprised me too. The Hadoop reindexing mechanism is being sneaky: it doesn't apply your metricsSpec to the segments as-is, it applies the aggregators in their "combining" form. This is nice, I guess, since it lets you use the same metricsSpec while reindexing as you would on your raw data. But it also means you can't use the metricsSpec to define new aggregators. That would be useful, and it sounds like it's what you want, but it would be a new feature. The docs could also use some clarification.
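To make that concrete, my understanding is that while re-indexing from a Druid dataSource each aggregator is swapped for its combining form, which reads from the column named by the aggregator's own output name. So the metricsSpec above effectively behaves roughly like:

```json
[
  { "name": "count_became_longSum", "type": "longSum", "fieldName": "count_became_longSum" },
  { "name": "count_became_NewCount", "type": "longSum", "fieldName": "count_became_NewCount" },
  { "name": "count", "type": "longSum", "fieldName": "count" }
]
```

The first two read from columns that don't exist in the source segments and come out as 0, while the last one lines up with the existing count column and works.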

See also: https://groups.google.com/d/topic/druid-user/DvWt79oRjPs/discussion

Thanks, Gian, for coming back to us on that.

Just to be sure I understand correctly, as the use case of reindexing to add new metrics is important to us:

Although we compact the segments as described, we are planning to re-index from ORC files (using the 0.9.2 ORC extension).

I suppose that in that latter case there is nothing to combine from, so we are not subject to the same limitation and can create new metrics as we see fit, right?
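For reference, the kind of ioConfig we have in mind is roughly the following (a sketch based on my reading of the druid-orc-extensions docs; the path is a placeholder, and the parser would be the "orc" type from that extension):

```json
"ioConfig": {
  "type": "hadoop",
  "inputSpec": {
    "type": "static",
    "inputFormat": "org.apache.hadoop.hive.ql.io.orc.OrcNewInputFormat",
    "paths": "/path/to/orc/files"
  }
}
```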

Thanks

Yes, the behavior I talked about is specific to reading from Druid dataSources and wouldn't apply to "normal" files like ORC, etc. IMO, it'd be nice to have an option to treat Druid dataSources like normal files too.