Filtering by a large number of values in Druid

We have a use-case where we need to filter our dataset with a “user segment”.

For example, if this was our main dataset:

{"userId": "123", "dimension1": "xxx", "dimension2": "yyy", "metric1": 5}
{"userId": "456", "dimension1": "xx1", "dimension2": "yy1", "metric1": 2}
{"userId": "123", "dimension1": "xx2", "dimension2": "yy2", "metric1": 4}
{"userId": "789", "dimension1": "xx3", "dimension2": "yy3", "metric1": 5}
{"userId": "789", "dimension1": "xx4", "dimension2": "yy4", "metric1": 5}

And we have another table, say “user-segment”:

{"segmentId": "1", "userId": "123"}
{"segmentId": "1", "userId": "456"}
{"segmentId": "1", "userId": "872"}

We need to filter our main dataset down to the users that belong to a given segment.

For segmentId 1, we have userIds 123, 456, and 872, so the result should be:

{"userId": "123", "dimension1": "xxx", "dimension2": "yyy", "metric1": 5}
{"userId": "456", "dimension1": "xx1", "dimension2": "yy1", "metric1": 2}
{"userId": "123", "dimension1": "xx2", "dimension2": "yy2", "metric1": 4}

In the above example there are only 3 users; in practice, this list can run into the millions.

Is there a recommended way to filter by a large number of userIds without significantly affecting query performance?

Hi Kiran,

I’m unsure off-hand what the upper limit is on the number of selector filters that can be combined with an or filter, but I imagine pushing into the millions might be a breaking point.
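For reference, the exact-match approach would be an or of selector filters, or more compactly Druid’s in filter, attached to the native query spec. A sketch, using the userIds from the example above:

```json
{
  "type": "in",
  "dimension": "userId",
  "values": ["123", "456", "872"]
}
```

This works fine for small lists, but the query spec itself grows linearly with the number of values, which is where very large lists become a problem.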

If approximate filtering is acceptable for your use case, the bloom filter extension might be your best bet. It was introduced in Druid 0.13 with a query filter that can be constructed externally in Java and attached to a query as a base64-encoded string, allowing very large numbers of values to be filtered. The upcoming 0.14 release expands on this by adding a bloom filter aggregator, which allows a bloom filter to be constructed from the results of one Druid query and then used as a sort of manual semi-join filter in further queries. 0.14 also adds Druid SQL support; see the documentation for the upcoming additions, and the docs of the currently released version.
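As a sketch of what this looks like in a native query (based on my reading of the extension docs, with a placeholder standing in for the actual base64 payload), the filter spec references the serialized BloomKFilter built externally in Java:

```json
{
  "type": "bloom",
  "dimension": "userId",
  "bloomKFilter": "<base64-encoded serialized BloomKFilter>"
}
```

The client builds the filter once from the user-segment table, serializes it, and reuses the same encoded string across queries against the main dataset.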

When the filter gets into the millions-of-values range it can still impact performance, as the overhead of hashing every dimension value to test against the bloom filter can get quite expensive, but it does at least allow this style of query to be executed.
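To make the semi-join idea concrete, here is a minimal, self-contained Python sketch of the same pattern: build a bloom filter from the user-segment table, then test each row of the main dataset against it. This is illustrative only; it is not Druid’s BloomKFilter, and the sizing parameters are arbitrary.

```python
import hashlib

class BloomFilter:
    """Toy bloom filter: a bit array (stored as a big int) plus k hash positions."""
    def __init__(self, size_bits=1 << 20, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, value):
        # Derive num_hashes positions by salting the value with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        # False means definitely absent; True means present or a false positive.
        return all(self.bits >> pos & 1 for pos in self._positions(value))

# Build the filter from the "user-segment" table (segmentId 1).
segment_users = ["123", "456", "872"]
bf = BloomFilter()
for uid in segment_users:
    bf.add(uid)

# Apply it as a semi-join filter over the main dataset.
rows = [
    {"userId": "123", "dimension1": "xxx", "dimension2": "yyy", "metric1": 5},
    {"userId": "456", "dimension1": "xx1", "dimension2": "yy1", "metric1": 2},
    {"userId": "123", "dimension1": "xx2", "dimension2": "yy2", "metric1": 4},
    {"userId": "789", "dimension1": "xx3", "dimension2": "yy3", "metric1": 5},
    {"userId": "789", "dimension1": "xx4", "dimension2": "yy4", "metric1": 5},
]
filtered = [r for r in rows if bf.might_contain(r["userId"])]
```

Note the hashing cost the answer mentions: every row requires num_hashes hash computations on the dimension value, which is the dominant overhead at scale, and false positives can let a small number of extra rows through.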