Why is this query so slow? Can I make it faster?

We’re ingesting about 2 million events a day. The data is around 94 MB a day, with 40 dimensions and 14 metrics. We have one query that’s really slow: it takes about 20 s per query.

We have about 7 historical nodes and 4 broker nodes.

Historical: r4.4xlarge

Broker: c4.4xlarge

I’m not sure whether we can make it any faster.

This is the query:

```json
{
  "queryType": "topN",
  "dataSource": "firehose-web",
  "dimension": {
    "type": "default",
    "dimension": "phr"
  },
  "context": {
    "timeout": 600000
  },
  "metric": "pageviews",
  "granularity": "all",
  "threshold": 1,
  "aggregations": [{
    "type": "hyperUnique",
    "name": "uniques",
    "fieldName": "unique"
  }, {
    "type": "filtered",
    "filter": {
      "type": "selector",
      "dimension": "_t",
      "value": "pageview"
    },
    "aggregator": {
      "type": "longSum",
      "name": "pageviews",
      "fieldName": "count"
    }
  }, {
    "type": "filtered",
    "filter": {
      "type": "selector",
      "dimension": "_t",
      "value": "scroll"
    },
    "aggregator": {
      "type": "longSum",
      "name": "scrollDepth",
      "fieldName": "valueSum"
    }
  }, {
    "type": "filtered",
    "filter": {
      "type": "selector",
      "dimension": "_t",
      "value": "timespent"
    },
    "aggregator": {
      "type": "longSum",
      "name": "timespent",
      "fieldName": "valueSum"
    }
  }],
  "filter": {
    "type": "in",
    "dimension": "_o",
    "values": ["brand1", "brand2", "brand3", "brand4", "brand5", "brand6", "brand7", ...] // about 20 brands
  },
  "intervals": ["2017-11-04/2017-11-05"]
}
```

The format of the data looks something like this.

```
column:  _t         _o         id           _v
type:    dimension  dimension  hyperUnique  doubleSum

_t         _o       id    _v
pageview   brand1   id1   1
timespent  brand1   id1   2000
scroll     brand2   id2   4000
```

And here are my aggregators:

```java
new CountAggregatorFactory("count"),
new HyperUniquesAggregatorFactory("unique", "id"),
new DoubleSumAggregatorFactory("valueSum", "_v")
```

Thanks

I’m sorry, I meant 2 billion events, and phr is the canonical URL, but its cardinality is around 100k.

Two things:

1) Try removing the hyperUnique aggregator and see how much of a difference that makes. If it makes a significant difference, then you likely need to organize your data differently. Some options are to adjust segment granularities and partitions so that you only have one segment per day (if it's only 94MB/day then one segment should be plenty). You can also start doing semantic partitioning by using the dimension shard spec in batch ingestion (if you are doing only real-time ingestion, you can re-ingest from the real-time segments using the dimension shard spec to get better data organization); see the first sketch below.

2) It looks like you are ingesting your different metrics as different rows, i.e. pageviews, scrolls, and timespent keyed by the _t dimension. I'm assuming those are pretty well-known metrics, in which case you could decrease your data size even more by ingesting them as metric columns instead of identifying them by a dimension value. The reason to identify your metrics with a dimension value would be if you don't have a priori knowledge of what metrics will be coming in; see the second sketch below.
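
For point 1, here's a rough, untested sketch of a Hadoop batch ingestion spec with one DAY segment per day and the dimension shard spec (parser, ioConfig, and metricsSpec omitted; the choice of phr as the partition dimension and the target size are just placeholders, so adjust them for your data):

```json
{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": {
      "dataSource": "firehose-web",
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "intervals": ["2017-11-04/2017-11-05"]
      }
    },
    "tuningConfig": {
      "type": "hadoop",
      "partitionsSpec": {
        "type": "dimension",
        "partitionDimension": "phr",
        "targetPartitionSize": 5000000
      }
    }
  }
}
```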
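And for point 2, the metricsSpec could declare one column per known metric instead of the shared valueSum. The input field names here (pageview, scroll, timespent) are hypothetical; they assume each event carries its value in its own field rather than in _v keyed by _t:

```json
"metricsSpec": [
  { "type": "count", "name": "count" },
  { "type": "hyperUnique", "name": "unique", "fieldName": "id" },
  { "type": "longSum", "name": "pageviews", "fieldName": "pageview" },
  { "type": "doubleSum", "name": "scrollDepth", "fieldName": "scroll" },
  { "type": "doubleSum", "name": "timespent", "fieldName": "timespent" }
]
```

With that layout, your topN could drop the three filtered aggregators and sum the metric columns directly.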

--Eric

Sorry, I mistyped again: it was 98GB, not 94MB. I took the unique aggregator out and it made a difference. Can you explain why that would make a difference?

And what do you mean by a shard spec? Do you have any documentation I can take a look at? Also, I agree that I need to change my data organization.