Getting distinct values (problem with search queryType)

Hello,

My Druid version is 0.7.0.

I store user events of different types in Druid. The first step of my analysis is to get the unique cookies of users who performed specific actions (filtering by event_type and other dimensions related to that event_type, e.g. event_type == “page-view” && domain == “domain.com”).

To estimate how many cookies I should expect, I use a timeseries query with a cardinality aggregation:

{
  "queryType": "timeseries",
  "dataSource": "reports_v20150517_4",
  "granularity": "all",
  "filter": {
    "fields": [{
      "dimension": "event_type",
      "value": "page-view",
      "type": "selector"
    }, {
      "dimension": "domain",
      "value": "domain.com",
      "type": "selector"
    }],
    "type": "and"
  },
  "aggregations": [
    {
      "type": "count",
      "name": "count"
    },
    {
      "type": "cardinality",
      "name": "distinct_cookies",
      "fieldNames": ["cookie"]
    }
  ],
  "intervals": ["2015-05-18/2015-05-19"]
}

which returns:

[ {
  "timestamp": "2015-05-18T00:00:00.474Z",
  "result": {
    "count": 234952,
    "distinct_cookies": 27527.74941333171
  }
} ]

A select query (with the same filter) does indeed return 234952 records, which contain 27487 unique cookies.

But when I run the corresponding search query, I get only 11325 cookies:

{
  "searchDimensions": ["cookie"],
  "filter": {
    "fields": [{
      "dimension": "event_type",
      "value": "page-view",
      "type": "selector"
    }, {
      "dimension": "domain",
      "value": "domain.com",
      "type": "selector"
    }],
    "type": "and"
  },
  "query": {
    "type": "insensitive_contains",
    "value": ""
  },
  "limit": 1000000,
  "queryType": "search",
  "dataSource": "reports_v20150517_4",
  "granularity": "all",
  "intervals": ["2015-05-18T00:00:00/2015-05-19T00:00:00"]
}

My first question is: which queryType should I use to get the unique cookies that satisfy a filter condition (in my case, users who visited a given domain)?

Should I expect a search query to return distinct cookies? Why do I get only a subset of the select query's results (not all of its unique cookies)?

Any help would be appreciated.

Best,

Tomek

Search has a hard internal limit of 10,000 results that it will return.
This is to protect memory. It's not really intended as a method
of retrieving a large list of values. It was intended more as a way
of providing auto-complete features.

If you really want everything, you'll have to do a groupBy query and
have your max rows limit set high enough to handle the number of
rows you want to return. In general, however, Druid's sweet spot is
in aggregating over lots of rows and not necessarily returning
million- or billion-row result sets.

--Eric
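
For reference, the groupBy approach suggested above might look roughly like the following. This is only a sketch: it reuses the dataSource, interval, and filter from the original question, groups on the cookie dimension so that each result row corresponds to one distinct cookie, and assumes the usual groupBy response shape in which each row carries its values under an "event" key. The helper names (build_groupby_query, cookies_from_groupby) and the sample response are made up for illustration, and the query would still need to be POSTed to a broker.

```python
import json


def build_groupby_query(data_source, interval, filters):
    """Build a groupBy query that yields one row per distinct cookie."""
    return {
        "queryType": "groupBy",
        "dataSource": data_source,
        "granularity": "all",
        "dimensions": ["cookie"],
        "filter": {
            "type": "and",
            "fields": [
                {"type": "selector", "dimension": dim, "value": val}
                for dim, val in filters
            ],
        },
        # The count is optional; it reports how many events each cookie had.
        "aggregations": [{"type": "count", "name": "count"}],
        "intervals": [interval],
    }


def cookies_from_groupby(rows):
    """Extract the cookie value from each groupBy result row."""
    return [row["event"]["cookie"] for row in rows]


query = build_groupby_query(
    "reports_v20150517_4",
    "2015-05-18/2015-05-19",
    [("event_type", "page-view"), ("domain", "domain.com")],
)
print(json.dumps(query, indent=2))

# Hypothetical response, only to show the assumed row shape.
sample_response = [
    {"version": "v1", "timestamp": "2015-05-18T00:00:00.000Z",
     "event": {"cookie": "a1", "count": 3}},
    {"version": "v1", "timestamp": "2015-05-18T00:00:00.000Z",
     "event": {"cookie": "b2", "count": 1}},
]
print(cookies_from_groupby(sample_response))
```

Because every distinct cookie becomes a result row, the max rows limit Eric mentions has to be at least as large as the expected number of cookies (about 27,500 here).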

Hi Eric,
Thank you for your response.

Good to know that I shouldn’t use a search query for my case. I’ll use groupBy as you suggested.

However, you mentioned a limit of 10K results for search queries, but I have observed two strange situations:

  • some queries return more than 10K results
  • some queries return fewer than 10K results even though more than 10K unique values exist

Maybe I don’t understand the exact semantics of the search query. Could you explain them to me?

Best,

Tomek

The more-than-10k results are likely because of how we apply the 10k
limit. IIRC, it's only applied at the historicals, so when the broker
merges, it can end up with a set larger than 10k.

Getting fewer than 10k when there are more than 10k and you are asking
for more than 10k does seem questionable. My first guess would be
that a filter is applied which puts the number below 10k?

--Eric
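
The merge behaviour described above can be illustrated with a toy model. This is not Druid's actual code, just a sketch under the assumption Eric states (the cap is applied on each historical, and the broker takes the union of the partial results). The function names and the tiny cap of 5 are stand-ins for illustration.

```python
def historical_search(values, limit):
    """Toy historical: return at most `limit` distinct values from its data."""
    return sorted(set(values))[:limit]


def broker_merge(partial_results):
    """Toy broker: union the per-historical results without re-applying the cap."""
    merged = set()
    for part in partial_results:
        merged.update(part)
    return sorted(merged)


LIMIT = 5  # stands in for the 10,000 cap

# Two nodes with partially overlapping cookies (12 distinct values in total).
node_a = [f"c{i:02d}" for i in range(0, 8)]
node_b = [f"c{i:02d}" for i in range(4, 12)]

partials = [historical_search(node_a, LIMIT), historical_search(node_b, LIMIT)]
merged = broker_merge(partials)

print(len(partials[0]), len(partials[1]))  # each partial respects the cap
print(len(merged))                         # the merged set can exceed it
```

In this model each node returns 5 values, yet the merged result has 9, which is more than the cap but still fewer than the 12 distinct values that exist. That matches the "more than 10K" observation; it does not by itself explain a result below the cap.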