Problems with "longMax"

Hello,

We’re testing a prototype piece of software at Inbenta, and we’re having some problems with a simple “max” aggregation. We assume we’re doing something wrong, but so far we haven’t found a solution. It seems that the aggregator is doing sums rather than computing a max value, or a combination of both, because the sum aggregation shows us a much higher value.

This is our “spec”:

[
    {
        "dataSchema": {
            "dataSource": "sessions",
            "parser": {
                "type": "map",
                "parseSpec": {
                    "format" : "json",
                    "timestampSpec" : {
                        "column" : "start_ts",
                        "format" : "millis"
                    },
                    "dimensionsSpec" : {
                        "dimensions": ["result", "profile", "actions", "tags", "extra_data"],
                        "dimensionExclusions" : [],
                        "spatialDimensions" : []
                    }
                }
            },
            "metricsSpec": [
                {
                    "type" : "count",
                    "name" : "sum_sessions"
                },
                {
                    "type" : "hyperUnique",
                    "name" : "sum_unique_sessions",
                    "fieldName": "id"
                },
                {
                    "type": "longSum",
                    "name": "sum_queries",
                    "fieldName": "sum_queries"
                },
                {
                    "type": "longSum",
                    "name": "sum_clicks",
                    "fieldName": "sum_clicks"
                },
                {
                    "type": "longMax",
                    "name": "max_queries",
                    "fieldName": "max_queries"
                },
                {
                    "type": "longMax",
                    "name": "max_clicks",
                    "fieldName": "max_clicks"
                },
                {
                    "type": "longMax",
                    "name": "max_duration",
                    "fieldName": "max_duration"
                },
                {
                    "type": "longSum",
                    "name": "sum_duration",
                    "fieldName": "sum_duration"
                }
            ],
            "granularitySpec": {
                "type": "uniform",
                "segmentGranularity": "DAY",
                "queryGranularity": "fifteen_minute"
            }
        },
        "ioConfig": {
            "type": "realtime",
            "firehose": {
                "type": "receiver",
                "serviceName": "sessionsReceiver",
                "bufferSize": 16384
            },
            "plumber": {
                "type": "realtime"
            }
        },
        "tuningConfig": {
            "type": "realtime",
            "maxRowsInMemory": 50000,
            "intermediatePersistPeriod": "PT10m",
            "windowPeriod": "PT20m",
            "basePersistDirectory": "/tmp/realtime/basePersist",
            "rejectionPolicy": {
                "type": "serverTime"
            }
        }
    }
]
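For reference, this is roughly what the two aggregator types do at ingestion time: within each queryGranularity bucket (fifteen minutes in the spec above), longSum adds the field across incoming rows, while longMax keeps only the largest value seen. A minimal sketch in Python (illustrative only, not Druid code; the event values are made up):

```python
from collections import defaultdict

# Hypothetical ingested events: (timestamp_ms, per-event max_clicks value)
events = [
    (0,       5),
    (60_000,  9),   # same 15-minute bucket as the first event
    (900_000, 3),   # next bucket
    (960_000, 7),
]

BUCKET_MS = 15 * 60 * 1000  # queryGranularity: fifteen_minute

long_sum = defaultdict(int)
long_max = defaultdict(int)
for ts, value in events:
    bucket = ts // BUCKET_MS
    long_sum[bucket] += value                         # longSum: add across rows
    long_max[bucket] = max(long_max[bucket], value)   # longMax: keep the largest

print(dict(long_sum))  # {0: 14, 1: 10}
print(dict(long_max))  # {0: 9, 1: 7}
```

Each rolled-up row in the segment then stores one value per bucket, and the same choice (sum vs. max) has to be made again when re-aggregating those stored values at query time.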

We’re sending our counters in a redundant way to avoid mixing the “max” and “sum” aggregations. All of our counters are below 20, but we’re seeing values in the thousands in the final “max” aggregations.

Maybe we’ve misunderstood what the “longMax” aggregation does? Or are we parametrizing it the wrong way in our schema spec?

Thank you for your time!

What query are you using when you query this data and see results that you’re not expecting?

Hello Gian,

The “timeseries” query works OK, but Pivot does not, and I don’t know which query Pivot runs under the hood. Now I’m seeing that maybe I should be asking about Pivot and not about Druid.

Thanks anyway.

P.S.: This is one query:

curl -H 'content-type: application/json' -XPOST "http://127.0.0.1:8082/druid/v2/" -d '{
    "queryType": "timeseries",
    "dataSource": "sessions",
    "granularity": {"type": "period", "period": "P1D", "timeZone": "UTC"},
    "aggregations": [
        {"type": "longMax", "fieldName": "max_clicks", "name": "max_clicks"},
        {"type": "longMax", "fieldName": "max_queries", "name": "max_queries"},
        {"type": "longSum", "fieldName": "sum_sessions", "name": "sum_sessions"},
        {"type": "longSum", "fieldName": "sum_clicks", "name": "sum_clicks"},
        {"type": "longSum", "fieldName": "sum_queries", "name": "sum_queries"}
    ],
    "intervals": ["2016-03-01T00:00:00.0/2016-03-03T00:00:00.0"],
    "postAggregations": [
        {"type": "arithmetic", "name": "avg_clicks", "fn": "/", "fields": [
            {"type": "fieldAccess", "name": "sum_clicks", "fieldName": "sum_clicks"},
            {"type": "fieldAccess", "name": "sum_sessions", "fieldName": "sum_sessions"}
        ]},
        {"type": "arithmetic", "name": "avg_queries", "fn": "/", "fields": [
            {"type": "fieldAccess", "name": "sum_queries", "fieldName": "sum_queries"},
            {"type": "fieldAccess", "name": "sum_sessions", "fieldName": "sum_sessions"}
        ]}
    ]
}'

with the following result

[{"timestamp": "2016-03-02T00:00:00.000Z", "result": {"sum_clicks": 321176, "avg_clicks": 3.21176, "sum_sessions": 100000, "avg_queries": 6.64251, "sum_queries": 664251, "max_queries": 17, "max_clicks": 13}}]

Meanwhile, Pivot is showing me something like:

Unless you are running Druid 0.9.0_RC1, Pivot cannot know that your MAX aggregator is indeed a max. It just assumes it is a SUM unless you tell it otherwise in the config.
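That also explains the inflated numbers: each rolled-up row stores the per-bucket max, and if the query layer then sums those stored maxes instead of re-applying a max, small per-session counters multiply across buckets. A rough illustration in Python (the bucket count follows from the fifteen_minute queryGranularity; the stored value of 17 matches the max_queries result above):

```python
# Suppose every 15-minute bucket of a day stored a longMax value of 17.
buckets_per_day = 24 * 4           # 96 fifteen-minute buckets in a day
stored_maxes = [17] * buckets_per_day

# Correct re-aggregation with a max at query time (what timeseries does):
print(max(stored_maxes))           # 17

# A SUM over the same stored column (what a misconfigured UI would show):
print(sum(stored_maxes))           # 1632 -- "values in the thousands"
```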

Are you running Pivot with a config? If not, your first step is to generate one:

pivot --druid your.druid.broker.host:8082 --print-config --with-comments > config.yaml

(See Pivot readme)

Next, edit the config.yaml file, look for your measures, and adjust them accordingly. For example, look for $main.sum($max_clicks) and change it to $main.max($max_clicks), etc.
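The edited measure entry in config.yaml might end up looking something like this (the exact surrounding keys depend on your Pivot version; the name and title values here are illustrative, mirroring the spec above):

```yaml
measures:
  - name: max_clicks
    title: Max Clicks
    expression: $main.max($max_clicks)   # was: $main.sum($max_clicks)
```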

Then run:

pivot --config config.yaml

If you have any more Pivot questions please head over to: https://groups.google.com/forum/#!forum/imply-user-group

Best regards,

Vadim

Thank you Vadim