Druid Cuboids / query rewriting?

I would like to discuss options / strategies for optimizing Druid query performance. Let me first briefly describe our setup:

We are trying to deploy a Druid cluster that takes in roughly 100TB of compressed data per day. The segments use HOUR segment and query granularity, and the compressed segment sizes in S3 are about 50GB per day. Our main datasource has about 30 dimensions and 30 measures at a rollup ratio of 22. It is served by 20 4xlarge historicals and one 4xlarge broker. Even when we send only one query at a time, the responses are too slow (15 to 30 seconds if we query more than a couple of days).
We’ll be looking into the metrics and playing around with some options, but I wanted to ask a few questions about query performance optimization to understand which strategies might yield the biggest speedup.

I have read up on Druid routers, tiers, and different query granularities. However, I cannot help thinking that the biggest speedup in query performance should come from a better rollup ratio. If so, it would seem to make a whole lot of sense if one could construct a set of peer datasources such that, for example, a main datasource has all 30 dimensions and one peer datasource is based on the same data but has only a carefully selected subset of dimensions, let's say the 10 most used ones.

This peer datasource would have a much better rollup ratio and would probably be able to serve queries much faster, without an unwanted tradeoff such as a coarser time granularity. This is a similar concept to precomputing a partial cube in OLAP or precomputing aggregation groups in RDBMS systems. I know that the ingestion part of such a setup could be done via re-indexing tasks, which re-ingest existing Druid segments and re-index them on fewer dimensions. However, as far as I understand, the segments that now have a reduced number of dimensions would need to go into a separate datasource, because a given datasource cannot serve up multiple granularities of the same time frame.
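To make the re-indexing part concrete, here is a rough sketch of what I have in mind. The datasource and dimension names are made up and the exact task spec layout varies between Druid versions, so please treat this as an outline rather than a working spec:

```python
# Schematic sketch only: a re-indexing task that reads segments of the full
# datasource and writes them into a dimension-reduced peer datasource.
# Names are hypothetical; verify the spec layout against the re-indexing docs.
import json
import requests

REINDEX_TASK = {
    "type": "index",
    "spec": {
        "dataSchema": {
            "dataSource": "events_top10",      # hypothetical peer datasource
            # parser / timestampSpec omitted for brevity; see the batch
            # ingestion docs for the full dataSchema layout
            "dimensionsSpec": {
                # keep only the ~10 most frequently queried dimensions
                "dimensions": ["country", "device", "campaign"]
            },
            "metricsSpec": [
                {"type": "longSum", "name": "count", "fieldName": "count"}
                # ... the remaining measures ...
            ],
            "granularitySpec": {
                "type": "uniform",
                "segmentGranularity": "HOUR",
                "queryGranularity": "HOUR",
                "intervals": ["2016-05-01/2016-05-02"]
            }
        },
        "ioConfig": {
            "type": "index",
            "firehose": {
                "type": "ingestSegment",
                "dataSource": "events",        # the full 30-dimension datasource
                "interval": "2016-05-01/2016-05-02"
            }
        }
    }
}

# Submit to the Overlord's task endpoint (host/port are placeholders).
resp = requests.post(
    "http://overlord:8090/druid/indexer/v1/task",
    data=json.dumps(REINDEX_TASK),
    headers={"Content-Type": "application/json"},
)
print(resp.status_code, resp.text)
```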

Given, then, that the dimension-reduced data resides in peer datasources, Druid would need functionality that inspects incoming queries and rewrites them as follows: if a query only concerns the subset of dimensions that the peer datasource contains, route the query to that peer datasource; otherwise, route it to the full datasource. In this process, only the datasource name of the query would have to be rewritten, while the rest of the query could remain unchanged.
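As a minimal sketch of the rewrite logic I have in mind (hypothetical datasource and dimension names, and only plain dimensions plus simple and/or/not/selector filters are handled), something like the following could sit in front of the broker:

```python
# Minimal sketch: inspect which dimensions a native Druid JSON query touches
# and, if a reduced peer datasource covers them all, point the query at that
# datasource instead of the full one. Names here are hypothetical.
FULL_DATASOURCE = "events"
PEER_DATASOURCES = {
    # peer datasource name -> the subset of dimensions it retains
    "events_top10": {"country", "device", "campaign"},
}

def filter_dimensions(flt):
    """Dimensions referenced by a (simplified) Druid filter tree."""
    if not flt:
        return set()
    if flt["type"] in ("and", "or"):
        return set().union(*(filter_dimensions(f) for f in flt["fields"]))
    if flt["type"] == "not":
        return filter_dimensions(flt["field"])
    return {flt["dimension"]} if "dimension" in flt else set()

def referenced_dimensions(query):
    """Dimensions referenced by a groupBy/topN/timeseries query body."""
    dims = set()
    for d in query.get("dimensions", []):                  # groupBy
        dims.add(d if isinstance(d, str) else d["dimension"])
    if "dimension" in query:                               # topN
        d = query["dimension"]
        dims.add(d if isinstance(d, str) else d["dimension"])
    return dims | filter_dimensions(query.get("filter"))

def rewrite(query):
    """Return a copy of the query pointed at the smallest datasource that covers it."""
    needed = referenced_dimensions(query)
    for name, dims in sorted(PEER_DATASOURCES.items(), key=lambda kv: len(kv[1])):
        if needed <= dims:
            return dict(query, dataSource=name)
    return dict(query, dataSource=FULL_DATASOURCE)
```

A thin proxy could apply rewrite() to each incoming query body and forward the result to the broker otherwise unchanged; that is also how I imagine the routing part would work.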

Using a different tier for each datasource seems wasteful to me, and I also wouldn’t know how to do the routing / query rewriting.

So I guess my questions are:

  • does something like this make sense?

  • is it possible to do something like this with Druid’s out-of-the-box featureset?

Hi Sascha,
The idea of having multiple datasources seems cool. The broker does not support query rewriting out of the box, but I think it exposes the necessary information to write an app that rewrites queries on top of it.

Having said that, 15-30 seconds for a few days of data is quite a high query time.
I believe the slowness might be due to suboptimal configurations.

I recommend having a look at http://druid.io/docs/latest/operations/performance-faq.html to set your JVM configs.
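For reference, a historical JVM config in the spirit of that FAQ looks roughly like the snippet below; the sizes are placeholders and need to be tuned to your 4xlarge nodes and processing buffer settings, so take the actual numbers from the FAQ:

```
-server
-Xms8g
-Xmx8g
-XX:MaxDirectMemorySize=16g
-XX:+UseConcMarkSweepGC
-Duser.timezone=UTC
-Dfile.encoding=UTF-8
```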

Do you keep all data in memory, or is memory paging slowing down the queries?

Also, have you tried query caching at all?

FWIW, Druid brokers and historicals have the ability to cache intermediate per-segment results in either a local cache or a distributed memcached.

In general, as you run queries over your segments and warm up the cache, queries for the same set of dimensions will hit the cache the next time and be much faster.
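For example, segment-level caching on the broker and historicals can be switched on with properties roughly like the ones below (the memcached host is a placeholder; please verify the property names against the caching docs for your Druid version):

```
# broker runtime.properties (illustrative)
druid.broker.cache.useCache=true
druid.broker.cache.populateCache=true
druid.cache.type=memcached
druid.cache.hosts=memcached1.example.com:11211

# historical runtime.properties (illustrative)
druid.historical.cache.useCache=true
druid.historical.cache.populateCache=true
```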

Hi Sascha,

I have thought about similar things and plan to support this, in a way, in Plywood. I have a proof of concept running (not open source yet, but it is about 20 lines of code using the Plywood query intelligence APIs).
This might be cool in a tool like Pivot or a UI in general.

V

Thanks to both of you for your feedback! Much appreciated.

The Plywood prototype sounds very interesting. We are also using Pivot and have customized it quite a bit.
Do you plan to open-source the proof of concept once it’s done?

thanks again,
Sascha

My plan is to work it into Plywood (and Pivot) once it is done.