Poor performance on regex search

Hey,

I am currently working on a fallout-report use case, wherein I need to determine, for a set of 2 or more pages, how many users navigated from one page to another. I am using Druid 0.9.1.1 (with roaring bitmaps) as my backend to store the related data.

The data that I am indexing contains a dimension which holds the pages the user visited in a particular session, delimited by colons. For example, for a particular session the ‘page_path’ dimension will be equal to p1:p4:p7:p3:p2 (where p1…pn are page names). This data is indexed in hourly segments with a size range of 200 MB-400 MB.

In order to query the data, I am doing a regex search. For a use case where I need to find out the count of users who went from p1 to p3, I do a regex search on the ‘page_path’ dimension with the value p1::p3*

However, querying a small interval (1-5 hours) of data is taking a considerable amount of time (on the order of 3-5 minutes), while the query is timing out when the interval is 3-4 days or greater.

The query that I am using is of the following format:

{
  "queryType": "groupBy",
  "dataSource": "sample",
  "dimension": "page_path",
  "granularity": "all",
  "intervals": ["2016-07-23T14:00:00.000Z/2016-07-23T14:59:59.999Z"],
  "filter": {
    "type": "and",
    "fields": [{
      "dimension": "page_path",
      "pattern": ".:588425;.:41104;.*",
      "type": "regex"
    }, null]
  },
  "aggregations": [{
    "name": "count",
    "fieldName": "page_path",
    "type": "count"
  }]
}

I have tried timeseries instead of groupBy as well, but to no benefit. I have also raised the question under the following topic: https://groups.google.com/forum/#!searchin/druid-user/asra|sort:relevance/druid-user/IhAEnJEKJew/LIthQ5TNAAAJ but started a new discussion so as to lay out better context for the problem at hand.

It would be great if someone from the Druid team could have a look at this use case to see if Druid has the capability to support it. If yes, how could I optimise the query?

Hi Asra,

Druid 0.9.2 will provide support for an entirely new groupBy engine that is benchmarked to be 5-10x faster for many queries.

In terms of your perf problem, I was wondering how long a simple scan of the data takes (i.e., a timeseries query with just a count aggregator)?
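
For reference, a minimal sketch of such a timeseries query (reusing the "sample" datasource and interval from the query above) would look something like:

{
  "queryType": "timeseries",
  "dataSource": "sample",
  "granularity": "all",
  "intervals": ["2016-07-23T14:00:00.000Z/2016-07-23T14:59:59.999Z"],
  "aggregations": [{ "type": "count", "name": "count" }]
}

If that comes back quickly, the time is likely going into the regex filter rather than the scan itself.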

Hello Asra,

Isn’t it possible to use multi-value dimensions for your case?

http://druid.io/docs/latest/querying/multi-value-dimensions.html

Hey Carlos,

My understanding is that multi-value dimensions do not preserve the order. However, I need the page names indexed in the order in which they were visited. Correct me if I’m wrong.

You are right, multi-value dimensions do not preserve order, so it doesn’t sound like they’re useful here.

One thing to check is whether you are writing your regexes in the most efficient way. Druid uses the built-in Java regex engine, so check out https://www.loggly.com/blog/five-invaluable-techniques-to-improve-regex-performance/ for some tips on optimizing for it.
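
As a rough illustration only (assuming colon-delimited page names such as the p1 and p3 from the earlier example), an anchored pattern that confines each wildcard to a single segment tends to backtrack far less than unanchored .* wildcards:

{
  "type": "regex",
  "dimension": "page_path",
  "pattern": "^(?:[^:]*:)*p1:(?:[^:]*:)*p3(?::.*)?$"
}

The [^:]* pieces can never run past a delimiter, so a failed match does not force the engine to re-scan the whole path.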

If your matching problem is not amenable to being done quickly with regexes, you may get better performance out of the javascript aggregator (and writing a little javascript function to do the checking).

The javascript filter being faster than regex probably comes down to the fact that it is easy to write inefficient regexes that backtrack heavily. The Loggly blog post goes into more detail about that.
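
For example, a sketch of such a javascript filter, using the hypothetical page names p1 and p3 and splitting on the colon delimiter:

{
  "type": "javascript",
  "dimension": "page_path",
  "function": "function(value) { var pages = value.split(':'); var i = pages.indexOf('p1'); return i >= 0 && pages.indexOf('p3', i + 1) >= 0; }"
}

Splitting and checking indexes avoids backtracking entirely, at the cost of invoking the function once per row.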