We need to create a Druid datasource for interactively analyzing incoming traffic. We have customer requests hitting our systems and each request breaks up into a set of customer services rendered.
For each filter expression, we need to be able to count the number of services rendered and also the number of requests. The following table depicts the data model.
As Druid can only handle flat data structures, we flattened our data into records on service level meaning that Druid ingests one record per service, not one record per request.
This lead us to the issue of how
to count the number of requests, which we handled by having the service-level records contain a unique request-ID which we pass into a hyperloglog unique count which gave us at least an estimate of the number of requests and does the right thing regardless of wichever filter expression a user clicks together.
The issue we now ran into is that we need to let our users perform ad-hoc queries over time ranges of several weeks or even months and we found that queries that are over a hyperloglog count are exremely slow (i.e. a single query takes several minutes to complete).
Now we need a way to fix this somehow and I would like to ask for help and exchange of ideas.
idea we have is that generally speaking a unique count is a count that must also be able to de-duplicate and this de-duplication must work even
for records that are weeks apart. We have some
of these unique counts for example to count the number of unique users. Obviously, two records that are weeks apart can pertain to the same user and a unique count should only increment by one.
However, for our case, we only need Druid to de-duplicate records within a very small window, basically
only within a segment-partition. If two records are an hour apart in time, they cannot belong to the same request anyways.
So one question would be if it is possible to advise Druid to unique-count within a partition or even a smaller scope and then use a normal count when merging intermediate results and whether it would be reasonable to assume that this might lead to a faster query performance.
would it make sense to use a theta-sketch? Are theta sketches faster than hyperloglogs in situations where faster query latency matters more than resulting data-volume?
Could theta-sketches help to model the stratified, combined unique- and normal counting.
Is there any other way to handle the use-case described?
Any help is much appreciated.