Inner query w/ ThetaSketch and outer query w/ sum of thetas

Hello,

I'm using groupBy subqueries: first to get the number of unique host_names per day, grouped by location, and then to sum that daily number per month for the same dimension.

The count is a thetaSketch that I'm trying to sum later, as you can see in the query below. But I'm getting the "Unknown type[class ...SketchHolder]" error, I guess because I need to extract the estimate somehow first. Since I'm not using a post-aggregator, how can I do this?

Thanks!!!

Druid Error:

{'error': 'Unknown exception', 'errorMessage': 'Unknown type[class io.druid.query.aggregation.datasketches.theta.SketchHolder] for field', 'errorClass': 'io.druid.java.util.common.parsers.ParseException', 'host': None}
Query is:

{
    "queryType": "groupBy",
    "dataSource": {
        "type": "query",
        "query": {
            "queryType": "groupBy",
            "dataSource": "host_classification",
            "dimensions": [
                "location"
            ],
            "intervals": "2017-01-01/2019-01-01",
            "granularity": "day",
            "filter": {
                "type": "not",
                "field": {
                    "type": "selector",
                    "dimension": "location",
                    "value": null
                }
            },
            "aggregations": [
                {
                    "fieldName": "host_name_sketch",
                    "type": "thetaSketch",
                    "name": "host_name:count"
                }
            ]
        }
    },
    "dimensions": [
        "work_unit"
    ],
    "intervals": "2017-01-01/2019-01-01",
    "granularity": "month",
    "aggregations": [
        {
            "type": "count",
            "name": "count"
        },
        {
            "fieldName": "host_name:count",
            "type": "doubleSum",
            "name": "host_name:sum"
        }
    ]
}

Do you really want the sum of daily estimates? Perhaps you want the estimate of the union of the sets instead, right?
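If the union is what you want, here is a minimal sketch of how the outer query could look, written as a Python dict because your error output looks like it came from a Python client. The idea is that the outer aggregator is a thetaSketch that merges the daily sketches, and a thetaSketchEstimate post-aggregator turns the merged sketch into a number. The output names `host_name:union` and `host_name:unique` and the helper `monthly_union_query` are made up for illustration, and I grouped the outer query by `location` because that is the only dimension your inner query emits. I have not run this against your cluster or Druid version, so treat it as something to verify rather than a confirmed fix.

```python
import json


def monthly_union_query(inner_query):
    """Wrap a daily thetaSketch groupBy in an outer query that unions the sketches per month.

    Hypothetical helper: the output names below are illustrative, not from the original query.
    """
    return {
        "queryType": "groupBy",
        "dataSource": {"type": "query", "query": inner_query},
        # Assumption: group by the dimension that the inner query actually outputs.
        "dimensions": ["location"],
        "intervals": "2017-01-01/2019-01-01",
        "granularity": "month",
        "aggregations": [
            {
                # Merge (union) the daily sketches instead of summing them as doubles.
                "type": "thetaSketch",
                "fieldName": "host_name:count",
                "name": "host_name:union",
            }
        ],
        "postAggregations": [
            {
                # Extract the distinct-count estimate from the merged sketch.
                "type": "thetaSketchEstimate",
                "name": "host_name:unique",
                "field": {"type": "fieldAccess", "fieldName": "host_name:union"},
            }
        ],
    }


# Inner query copied from the question (daily sketches per location).
inner_query = {
    "queryType": "groupBy",
    "dataSource": "host_classification",
    "dimensions": ["location"],
    "intervals": "2017-01-01/2019-01-01",
    "granularity": "day",
    "filter": {
        "type": "not",
        "field": {"type": "selector", "dimension": "location", "value": None},
    },
    "aggregations": [
        {
            "fieldName": "host_name_sketch",
            "type": "thetaSketch",
            "name": "host_name:count",
        }
    ],
}

query = monthly_union_query(inner_query)
print(json.dumps(query, indent=4))
```

The printed JSON is what you would send to the broker in place of the query above.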

We run a job every day that takes an inventory of hosts: hosts that died are removed, and new ones are added, sometimes multiple times (because a host can be used by more than one customer; customer is another dimension), etc. So the hosts change every day. The idea was to get an average, or an approximation, of the number of unique hosts in a month.

The important thing to note is that if you build a theta sketch for, let's say, the whole month, you can get more hosts than the actual capacity of the location (because names change, so renamed hosts appear as unique over a month period). So by computing a daily count and then averaging over the month, can I get a reasonable approximation?

Maybe there is a better way of doing this…

I guess a ‘set’ is a theta sketch cell, and the union is the addition of the cells (not of the values per se)?

I am not sure I understand your use case. I presume the goal is, as you stated, to get “an approximation to what the unique hosts are in a month”.

A Theta sketch represents (approximately) the set of distinct values it has seen (presented to it by calling update()). You build daily sets of hosts that might overlap to some extent (the same hosts might be present in several of the sets), right? If so, adding the estimates will overestimate, since you count those hosts more than once. The union of the sets, on the other hand, has no duplicates and will produce a correct estimate of the distinct hosts for the period of the union (a month). This is the whole point of Theta sketches (and other distinct-counting sketches): they effectively transform the non-additive distinct count metric into an “additive” one. The trade-off is that the metric is no longer a (small) number but a larger binary blob. In return, it opens up the possibility of having a distinct count metric in a classic data cube.
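To make the difference concrete, here is a small illustration in which exact Python sets stand in for the sketches. That is an assumption for readability: real sketches return approximate estimates, but the over-counting argument is the same. The host names and dates are invented.

```python
# Exact sets stand in for theta sketches; host names and dates are made up.
daily_hosts = {
    "2018-01-01": {"host-a", "host-b", "host-c"},
    "2018-01-02": {"host-b", "host-c", "host-d"},
    "2018-01-03": {"host-c", "host-d", "host-e"},
}

daily_counts = [len(hosts) for hosts in daily_hosts.values()]

# Adding daily counts counts a host once for every day it appears.
sum_of_daily = sum(daily_counts)

# Averaging daily counts gives the typical number of hosts on a single day.
mean_of_daily = sum_of_daily / len(daily_counts)

# The union counts each distinct host exactly once over the whole period.
distinct_in_period = len(set().union(*daily_hosts.values()))

print(sum_of_daily, mean_of_daily, distinct_in_period)  # 9 3.0 5
```

Note that the three numbers answer different questions. If host names change during the month, the union grows with every new name, while the mean of the daily counts stays close to the size of the inventory on a typical day.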