Pre-computing queries with Druid

Hi,

It's a given that Druid is fast at computing queries, but it surely wouldn't be faster than querying something that's already been pre-computed. So if we have a query that we know will be used later, why not have Druid pre-compute the results for that query during ingestion using async jobs? There are two cases for pre-computation:

  1. incoming new data

  2. existing data

If we don't know the query beforehand and there is already data in Druid, then pre-computing becomes slightly tricky, but not impossible. If the data already exists and we do know the query, why not bootstrap the pre-computation by running that query against Druid once and storing its results back in Druid, in a format that can be appended to as new data is ingested? If the query has filters, they need to be applied to the incoming data during ingestion in both of the above cases, including bootstrapping.
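To make the bootstrap-then-append idea concrete, here is a toy sketch. None of these names (`bootstrap`, `on_ingest`, the store layout) are Druid APIs; they just model the flow described above for a simple filtered count:

```python
# Hypothetical sketch of "bootstrap once, then fold in new data".
# Nothing here is a real Druid API.

def bootstrap(run_query, store, query_id):
    """One-off: run the known query against existing data and persist
    its result in an appendable form."""
    store[query_id] = run_query()

def on_ingest(store, query_id, query_filter, fold, event):
    """Per-event: apply the query's filter to incoming data and fold
    the event into the bootstrapped result."""
    if query_filter(event):
        store[query_id] = fold(store[query_id], event)

# Toy usage: pre-compute "count of events where country == 'US'".
store = {}
existing = [{"country": "US"}, {"country": "IN"}, {"country": "US"}]
bootstrap(lambda: sum(1 for e in existing if e["country"] == "US"),
          store, "us_count")
for event in [{"country": "US"}, {"country": "FR"}]:
    on_ingest(store, "us_count",
              query_filter=lambda e: e["country"] == "US",
              fold=lambda acc, e: acc + 1,
              event=event)
print(store["us_count"])  # 2 from bootstrap + 1 new matching event = 3
```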

Count-distinct queries are even trickier, but again not impossible. Druid can return unfinalized theta sketches for a query via {"context": {"finalize": false}}. So we can use that during bootstrapping to extract theta sketches for a given query, and then merge new data into those sketches during ingestion.
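For reference, the unfinalized-sketch request is just an ordinary query with that context flag set. A minimal sketch of what the request body might look like (the datasource, field name, and interval are made-up examples):

```python
import json

# Hypothetical timeseries query asking Druid to return the raw theta
# sketch rather than the finalized distinct count. The datasource
# "events" and field "user_sketch" are made-up example names.
query = {
    "queryType": "timeseries",
    "dataSource": "events",
    "granularity": "day",
    "aggregations": [
        {"type": "thetaSketch", "name": "unique_users",
         "fieldName": "user_sketch"}
    ],
    "intervals": ["2016-01-01/2016-01-31"],
    "context": {"finalize": False}  # return the sketch, not the count
}
print(json.dumps(query, indent=2))
```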

Do you guys see any problems with this approach?

Thanks,

Arjun

Arjun,

The way we generally do that is with query caching. The thing I think we need to add to enhance the caching that already exists is a result cache. The current mechanism stores intermediate, per-segment results so that the cache can be reused at the intermediate layers, but the intermediate form of things like theta sketches is generally too large to leverage the cache. A result-level cache should be useful even in those cases, though its entries would not be mergeable.
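To make the trade-off concrete, here is a toy model of the two cache levels. Plain Python sets stand in for sketches; real theta sketches are bounded-size summaries, but still far larger than one finalized number:

```python
# Toy model of per-segment (intermediate) vs. result-level caching.
# A "sketch" here is just a set of values; nothing Druid-specific.

segments = [{"a", "b", "c"}, {"b", "c", "d"}, {"d", "e"}]

# Intermediate cache: one mergeable sketch per segment. Reusable
# across overlapping queries, but each entry is bulky.
intermediate_cache = {i: s for i, s in enumerate(segments)}
merged = set().union(*intermediate_cache.values())
print(len(merged))  # finalized distinct count over all segments: 5

# Result-level cache: stores only the finalized number. Tiny, but two
# finalized counts cannot be merged to answer a wider query.
result_cache = {("distinct_users", "2016-01"): len(merged)}
```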

–Eric

Hi Eric,

Consider queries with moving time windows over very large datasets (billions of events). By moving time windows, I mean queries whose start time and end time advance by a fixed increment every day, for example queries that show results for the last 30 days. New data is constantly coming in, and old data needs to be expired. Will Druid's caching mechanism be able to help with this?

Thanks,

Arjun

Absolutely, that’s exactly what the caching mechanism was built for. If you were to implement a result cache as well, I would recommend that it store results individually for each time grain so that they can be reused.
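The per-grain suggestion can be sketched like this: cache one finalized result per day, so that when the 30-day window slides forward only the newly arrived day has to be computed. The `query_day` function and cache layout are hypothetical, not Druid internals:

```python
from datetime import date, timedelta

cache = {}      # day -> finalized per-day result
computed = []   # track which days were actually computed "live"

def query_day(day):
    computed.append(day)
    return 1    # stand-in for an expensive per-day aggregation

def last_30_days(end):
    days = [end - timedelta(days=i) for i in range(30)]
    for d in days:
        if d not in cache:
            cache[d] = query_day(d)   # cache miss: compute this grain
    return sum(cache[d] for d in days)

last_30_days(date(2016, 1, 30))  # cold: computes all 30 days
last_30_days(date(2016, 1, 31))  # warm: computes only the new day
print(len(computed))             # 30 + 1 = 31 live computations
```

Because results are stored per grain rather than per window, the 29 overlapping days are served from the cache when the window moves.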

–Eric

Is there more documentation on how that is done? This was not very helpful: http://druid.io/docs/latest/querying/caching.html

Arjun

So does that mean that Druid maintains a list of queries and computes the queries for new segments as and when new segments are created? That seems very expensive. Is Druid actually doing that?

Arjun

It doesn’t do that. The cache is populated when a query actually happens, rather than in advance. So the very first instance of a query is done “live” against uncached data, and subsequent queries are done against the cache. If new data comes in, then the cache for that query is invalidated, and the next instance of that query will again behave as if it were uncached.
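That populate-on-read, invalidate-on-new-data behaviour is the classic read-through cache pattern. A minimal sketch of the idea (again, a toy model, not Druid's actual implementation):

```python
class ReadThroughCache:
    """Toy read-through cache: the first read computes "live", later
    reads hit the cache, and new data invalidates the affected entry."""

    def __init__(self, compute):
        self.compute = compute
        self.cache = {}
        self.live_runs = 0

    def query(self, key):
        if key not in self.cache:        # first time: run live
            self.live_runs += 1
            self.cache[key] = self.compute(key)
        return self.cache[key]           # subsequent reads: cached

    def ingest(self, key):
        self.cache.pop(key, None)        # new data invalidates entry

c = ReadThroughCache(compute=lambda k: f"result({k})")
c.query("q1")      # live
c.query("q1")      # served from cache
c.ingest("q1")     # new data arrives; entry invalidated
c.query("q1")      # live again
print(c.live_runs)  # 2
```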