Search query on user_ids

I am trying to see if we an use Druid to return a list of user_ids that meet certain criteria.

I have issued a timeseries query with specific filtering criteria on a single day and determined that there are 11M+ user_ids. However when I run a search query, with search dimension as user_id and same filters/interval as time series I only get 70K results.

{

“queryType”: “search”,

“dataSource”: “mydatasource”,

“granularity”: “day”,

“searchDimension”: [“user_id”],

“query”:{“type”:“insensitive_contains”, “value”: “-”},

“filter”: {

“type”: “and”,

“fields”: [

{ “type”: “selector”, “dimension”: “condition1”, “value”: “abc” },

{ “type”: “selector”, “dimension”: “condition2”, “value”: “def” }]},

“intervals”: [ “2015-12-12/2015-12-13” ]

}

We have set druid.query.search.maxSearchLimit to 2500000, this increased results from 1K to about 70K. Are there other setting we can change to get the full 11M+ user_ids?

Hi, are there 11M+ unique user ids? If the user IDs are not unique, Druid will roll up your data. One other thing to consider is that we strongly recommend using hyperUnique or theta sketches where possible for highly unique dimensions, as storing every unique user ID will murder your rollup and you will end up storing much more data than you have to (potentially 100x more data than you have to).

Yes there are more than 11M+ unique user ids. I found in another thread that group by is the only way to get the exact list of user ids. We were exploring using this as a distribution tool if it could also give a list of users that met certain criteria.

Hi,

Druid undergoes rapid development so posts can become outdated quickly. I think the easiest way to reason about the problem you want to solve is to think about the SQL query you want to issue and use a tool like plyql https://github.com/implydata/plyql to see what the corresponding Druid query would be. I don’t think using the search query is going to give you the count distinct query I think you are actually trying to make. With 0.8.3, Druid offers theta sketches, which can given exact counts below certain thresholds, and approximate counts beyond them. Including every unique user ID in a dimension can really hurt your rollup (leading you to store 100x more data than you have to, with 100x slower performance and cost).

If you absolutely need exact counts (I would really love to understand more about why), groupBy is the best way to go as described in other threads.

I’m not looking for counts, but actual user_ids. I found group by is the only solution to get each individual user_id when the list exceeds the 10,000 limit threshold.

Will

Ah, I see, my mistake. You can also look into using topN queries as they can be paginated and are generally much faster than groupBys for single dimension Group Bys.

One common way of returning all values in a dimension is to use a lexicographic topN and paginate through the results.

Sounds good, I will look into this. Are you aware of any limitations in number of results topN will generate? Can it exceed 15M?

There’s no limit because topNs can be paginated. They are actually much better than groupBys in that sense because each query can be a few thousand results and not eat up too many cluster resources. Out of curiosity, are you feeding the results to another system? I can’t imagine being able to visualize 15M results.