I am currently working on a fallout-report use case in which I want to
determine, for a set of two or more pages, how many users went from one page
to another. I am using Druid 0.9.1.1 (with roaring bitmaps) as my backend to
store the related data.
The data that I am indexing contains a dimension that lists the pages the
user visited in a particular session, delimited by colons. For example, for a
particular session the ‘page_path’ dimension will be equal to p1:p4:p7:p3:p2
(where p1 … pn are page
names). This data is indexed in hourly segments with a size range of
In order to query the data, I am doing a regex search. For a use case where I
need to find the count of users who went from p1 to p3, I do a regex search on
the dimension ‘page_path’ with the value p1::p3*
The query for a small interval (1-5 hours) of data takes a considerable amount
of time (on the order of 3-5 minutes), while it times out when the interval is
3-4 days or greater.
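To make the intended match concrete, here is a minimal sketch in Python. It assumes the goal is to match any session whose colon-delimited path contains p1 somewhere before p3; the pattern below is an illustration of that intent, not the exact value used in my query, and the helper names (`path_pattern`, `went_from_to`, `regex_filter`) are hypothetical. The last helper builds a Druid regex filter object for the ‘page_path’ dimension.

```python
import re


def path_pattern(src: str, dst: str) -> str:
    """Build an anchored regex matching a colon-delimited path in which
    page `src` occurs as a whole element somewhere before page `dst`."""
    s, d = re.escape(src), re.escape(dst)
    # (?:.*:)?  optional leading pages, each followed by a colon
    # (?::.*)?  optional trailing pages, each preceded by a colon
    return rf"^(?:.*:)?{s}:(?:.*:)?{d}(?::.*)?$"


def went_from_to(page_path: str, src: str, dst: str) -> bool:
    """Return True if the session path visits `src` and later `dst`."""
    return re.search(path_pattern(src, dst), page_path) is not None


def regex_filter(src: str, dst: str) -> dict:
    """Druid regex filter on the 'page_path' dimension (illustrative)."""
    return {
        "type": "regex",
        "dimension": "page_path",
        "pattern": path_pattern(src, dst),
    }


if __name__ == "__main__":
    for path in ["p1:p4:p7:p3:p2", "p3:p1", "p11:p3", "p1:p30"]:
        print(path, went_from_to(path, "p1", "p3"))
```

The pattern is anchored and matches whole page names only, so a page such as p11 or p30 is not mistaken for p1 or p3.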
The query that I am
using is of the following format:
I have tried timeseries instead of groupBy as well, but to no benefit. I have
also raised the question under the following topic: https://groups.google.com/forum/#!searchin/druid-user/asra|sort:relevance/druid-user/IhAEnJEKJew/LIthQ5TNAAAJ
but started a new discussion so as to give better context for the problem at
hand.
It would be great if someone from the Druid team could have a look at this use
case to see whether Druid is capable of supporting it. If yes, how could I
optimise the query?