Slow query performance when filtering based on a regex

Hello,

I am currently using Druid 0.8.0. Following is the query that I'm firing in order to query one segment's data:

{
  "queryType": "timeseries",
  "dataSource": "sample",
  "dimension": "sample_dim",
  "granularity": "all",
  "intervals": ["2016-07-23T14:00:00.000Z/2016-07-23T14:59:59.999Z"],
  "filter": {
    "type": "and",
    "fields": [{
      "dimension": "sample_dim",
      "pattern": ".:588425;.:41104;.*",
      "type": "regex"
    }, null]
  },
  "aggregations": [{
    "name": "count",
    "fieldName": "sample_dim",
    "type": "count"
  }]
}

The data is segmented by hour, and the segment being queried has a size of 235MB (1 shard) and comprises 36 dimensions and 1,119,831 records.

The response is taking 143172ms. However, on removing the filter parameter, the result gets fetched in 23181ms. This suggests that the query is spending most of its time filtering the data based on the regex.

Further, I tried running the same query on the latest version of Druid and it takes around 60 seconds, which is still considerable. Following are the metrics emitted by the latest Druid version:

"metric":"query/segment/time","value":59901

"metric":"query/segmentAndCache/time","value":59902

"metric":"query/wait/time","value":2

"metric":"query/cpu/time","value":59294775

Could you shed some light on why performance takes such a hit on the 0.8.0 version and what can be done to optimize it, other than the possible solutions of upgrading the version and enabling roaring bitmaps for compression?

Hey Asra,

As you found out, regex filter performance has improved since 0.8.0, and upgrading would help. If the filter matches a lot of distinct values, switching to roaring should help too; it tends to be much faster than concise for filters that hit many distinct values.
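
For reference, roaring is selected per datasource at ingestion time through the indexSpec in the tuningConfig. A minimal sketch, assuming a batch ("index") ingestion spec with everything unrelated omitted:

"tuningConfig": {
  "type": "index",
  "indexSpec": {
    "bitmap": { "type": "roaring" }
  }
}

Existing segments keep whatever bitmap type they were built with, so only data (re)ingested after the change gets roaring indexes.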

Another thing you could try, which may or may not be useful, is multi-value dimensions. Your regex ".:588425;.:41104;.*" looks sort of like you're trying to match values that contain both "588425" and "41104". You could consider storing those as a multi-value dimension rather than a delimiter-separated single-value dimension. Filtering on a multi-value dimension is much faster than filtering with a regex. The semantics are a bit different though (ordering is not preserved), so double-check first that they meet your needs. Docs are here: http://druid.io/docs/latest/querying/multi-value-dimensions.html
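
To make that concrete, here is a rough sketch of what the filter could look like if sample_dim were ingested as a multi-value dimension holding the individual IDs (the exact values depend on how you split the delimited string at ingestion):

"filter": {
  "type": "and",
  "fields": [
    { "type": "selector", "dimension": "sample_dim", "value": "588425" },
    { "type": "selector", "dimension": "sample_dim", "value": "41104" }
  ]
}

A selector filter on a multi-value dimension matches a row if any of the row's values match, so AND-ing two selectors requires both IDs to be present, which is what the regex appears to be doing today.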

Hey Gian,

Thanks for getting back on this. But since my use case is centered around page-flow analytics, we need the numbers to be in order, so a multi-value dimension won't meet our needs.

I tested with roaring enabled on the 0.9 version of Druid. Although the results were better compared to Druid 0.8 with concise compression, the query still ended up taking around 153402ms for just one hour's segment, which is still pretty huge. Any suggestions from your end on how I can optimise the regex-based filtering, or whether any other approach can be adopted?

Hi Asra, what is the query you are issuing? Those times seem extremely slow, even for a regex filter.

Generally, making big dimensions is not a good idea for Druid. https://github.com/druid-io/druid/pull/2541 could be a starting point for fixing this.

Hey Fangjin,

Following is the query I’m issuing:

{
  "queryType": "timeseries",
  "dataSource": "sample",
  "dimension": "sample_dim",
  "granularity": "all",
  "intervals": ["2016-07-23T14:00:00.000Z/2016-07-23T14:59:59.999Z"],
  "filter": {
    "type": "and",
    "fields": [{
      "dimension": "sample_dim",
      "pattern": ".:588425;.:41104;.*",
      "type": "regex"
    }, null]
  },
  "aggregations": [{
    "name": "count",
    "fieldName": "sample_dim",
    "type": "count"
  }]
}

Hi Asra, thanks.

Seeing as this is a fairly simple timeseries query, how many rows are you actually scanning? As in, if you remove the filter, how fast does the query respond? It seems like there may be other configuration settings that are impacting the performance beyond just the filter.

On removing the filter, the query takes just around 1-2 seconds, and the average number of rows scanned for an interval of one hour is in the range of 1-1.5 million.

One thing to look at is whether you are writing your regexes in the best possible way. Druid uses the built-in Java regex engine; check out https://www.loggly.com/blog/five-invaluable-techniques-to-improve-regex-performance/ for some tips on optimizing for that.
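
For example, if the intent of your pattern is to match both IDs in that order with arbitrary delimited text around them, anchoring the pattern and replacing greedy wildcards with a negated character class usually cuts down on backtracking in the Java engine. Something along these lines, as a sketch only, since the right pattern depends on your exact data format:

"filter": {
  "type": "regex",
  "dimension": "sample_dim",
  "pattern": "^[^;]*:588425;[^;]*:41104;"
}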

If your matching problem is not amenable to being done quickly with regexes, you may get better performance out of the javascript aggregator (and writing a little javascript function to do the checking).
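
A rough sketch of that approach, assuming the two IDs can be checked with simple substring tests instead of a regex (the check function is purely illustrative, and JavaScript-based functionality may need to be enabled in your Druid configuration):

"aggregations": [{
  "type": "javascript",
  "name": "count",
  "fieldNames": ["sample_dim"],
  "fnAggregate": "function(current, dim) { return current + ((dim != null && dim.indexOf(':588425;') >= 0 && dim.indexOf(':41104;') >= 0) ? 1 : 0); }",
  "fnCombine": "function(partialA, partialB) { return partialA + partialB; }",
  "fnReset": "function() { return 0; }"
}]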