Issue reindexing existing segments for HyperLogLog

Hi,

I am reindexing segments to create a HyperLogLog (hyperUnique) metric from a dimension named userid. When I keep the dimension name in the dimensions list and create the metric from that field, it works fine, but if I remove the dimension from the dimensions list then the metric "uniqe_user" has a value of zero. Below is the JSON for the reindexing task.

It also gives a memory timeout issue if I use two HyperLogLog metrics on separate dimensions in this task. My average segment size is 200 MB (the index.zip file).

Please help.

                "dimensionsSpec": {
                    "dimensions": [
                        "segmentId",
                        "userid"
                    ],
                    "dimensionExclusions": [
                        "timestamp"
                    ]
                },
                "format": "json"
            }
        },
        "granularitySpec": {
            "type": "uniform",
            "segmentGranularity": "hour",
            "queryGranularity": "hour"
        },
        "metricsSpec": [{
                "type": "count",
                "name": "count"
            },
            {
                "name": "value_sum",
                "type": "doubleSum",
                "fieldName": "value"
            },
            {
                "type": "hyperUnique",
                "name": "uniqe_user",
                "fieldName": "userid",
                "isInputHyperUnique": false,
                "round": false
            }

        ]
    },
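
(The snippet above is only the parse/metrics portion of the task. Since this is the non-Hadoop index task, the ioConfig reads the existing segments back with the ingestSegment firehose, roughly as below; the dataSource name and interval are placeholders, not my real values.)

        "ioConfig": {
            "type": "index",
            "firehose": {
                "type": "ingestSegment",
                "dataSource": "my_datasource",
                "interval": "2018-01-01/2018-02-01"
            }
        }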

This could be the same as, or related to, https://github.com/druid-io/druid/issues/5277

Hi NK Druid,

If you are using Hadoop indexing then Kyle is probably right. If you're using the non-Hadoop "index" task, there might be something else going on.

It also gives a memory timeout issue if I use two HyperLogLog metrics on separate dimensions in this task. My average segment size is 200 MB (the index.zip file).

For this, try reducing maxRowsInMemory.
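
maxRowsInMemory goes in the task's tuningConfig. Something along these lines, assuming the native index task (the value here is just an example; lower values mean more intermediate persists but less heap pressure):

        "tuningConfig": {
            "type": "index",
            "maxRowsInMemory": 25000
        }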

If you're doing this with a Hadoop-based task, then https://github.com/druid-io/druid/pull/5294 should likely fix the problem. It will be officially available in the 0.13.0 release.

It is not a Hadoop-based task, and reducing maxRowsInMemory also didn't work. I also have some additional issues:

  1. The field name must exist in the dimensions list when I create a HyperLogLog metric during re-indexing. The task details are as above.
  2. If I use another HyperLogLog metric along with "uniqe_user", it gives me "Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded".
  3. When I use longSum on the "count" field of the reindexed segments, it doesn't give me the same value as before reindexing. I use the following JSON at query time (a sketch of what I am considering changing in the metricsSpec follows this list):
    {"fieldName": "count", "type": "longSum", "name": "sum__count"}
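
For point 3, I am wondering whether the mismatch is because the reindex task re-counts the already rolled-up rows. Would replacing the count aggregator in my metricsSpec with a longSum over the existing count column, roughly as below, be the right approach? (This is only a guess on my part.)

            {
                "type": "longSum",
                "name": "count",
                "fieldName": "count"
            }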

Please suggest.