I would like to discuss options / strategies for optimizing Druid query performance. Let me first briefly describe our setup:
We are trying to deploy a Druid cluster that ingests roughly 100TB of compressed data per day. The segments have a segment and query granularity of HOUR, and the compressed segment sizes in S3 come to about 50GB per day. Our main datasource has about 30 dimensions and 30 measures at a rollup ratio of 22, and it is served by 20 4xlarge historicals and one 4xlarge broker. Even when we send only one query at a time, responses are too slow (15 to 30 seconds when we query more than a couple of days).
We’ll be looking into the metrics and experimenting with some options, but I wanted to ask a couple of questions about query performance optimization to find out which strategies might yield the biggest speedup.
I read up on the possibilities of Druid routers, tiers, and different query granularities. However, I cannot help thinking that the biggest speedup in query performance should come from a better rollup ratio. If so, it would seem to make a lot of sense to construct a set of peer datasources such that, for example, the main datasource has all 30 dimensions while a peer datasource is based on the same data but carries only a carefully selected subset of dimensions, say the 10 most used ones.
This peer datasource would have a much better rollup ratio and could probably serve queries much faster without an unwanted tradeoff such as a coarser time granularity. The concept is similar to precomputing a partial cube in OLAP, or precomputing aggregation groups in an RDBMS. As far as I know, the ingestion side of such a setup could be handled via re-indexing tasks that re-ingest existing Druid segments and index them on fewer dimensions. However, as I understand it, the segments with the reduced dimension set would need to go into a separate datasource, because a given datasource cannot serve multiple versions (here: dimension sets) of the same time interval.
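For the ingestion side, a native batch re-indexing task that reads the full datasource back through the `druid` input source and writes a dimension-reduced peer could look roughly like the sketch below. All datasource, dimension, and metric names here are made up for illustration; the key idea is that rollup collapses rows that become identical once the dropped dimensions are gone, and that every metric must be re-aggregatable (the original ingest-time count becomes a longSum over the stored count column):

```json
{
  "type": "index_parallel",
  "spec": {
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "druid",
        "dataSource": "events_full",
        "interval": "2023-06-01/2023-06-02"
      }
    },
    "dataSchema": {
      "dataSource": "events_top10",
      "timestampSpec": { "column": "__time", "format": "millis" },
      "dimensionsSpec": {
        "dimensions": ["country", "device", "app_version"]
      },
      "metricsSpec": [
        { "type": "longSum", "name": "count", "fieldName": "count" },
        { "type": "doubleSum", "name": "revenue", "fieldName": "revenue" }
      ],
      "granularitySpec": {
        "segmentGranularity": "HOUR",
        "queryGranularity": "HOUR",
        "rollup": true
      }
    }
  }
}
```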
Given, then, that the dimension-reduced data resides in peer datasources, Druid would need functionality that inspects incoming queries and rewrites them as follows: if a query only touches the subset of dimensions the peer datasource contains, route it to the peer datasource; otherwise route it to the full datasource. In this process only the datasource name of the query would have to be rewritten, while the rest of the query could remain unchanged.
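The routing check described above boils down to "referenced dimensions are a subset of the peer's dimensions". A minimal sketch of that decision, with hypothetical datasource and dimension names, assuming native queries whose dimensions are plain strings and whose filters form a simple tree (to my knowledge Druid has no such hook out of the box, so this logic would have to live in a proxy in front of the broker):

```python
# Hypothetical dimension subset kept in the narrow peer datasource.
TOP_DIMENSIONS = {"country", "device", "app_version"}

def referenced_dimensions(query: dict) -> set:
    """Collect dimension names referenced by a (simplified) native query:
    grouping dimensions plus any dimensions mentioned in the filter tree."""
    dims = set(query.get("dimensions", []))
    stack = [query["filter"]] if query.get("filter") else []
    while stack:
        node = stack.pop()
        if "dimension" in node:          # e.g. selector / bound filters
            dims.add(node["dimension"])
        if "field" in node:              # "not" filter wraps a single child
            stack.append(node["field"])
        stack.extend(node.get("fields", []))  # "and" / "or" children
    return dims

def route(query: dict, full_ds: str = "events_full",
          peer_ds: str = "events_top10") -> dict:
    """Return a copy of the query pointed at the peer datasource when it
    only touches TOP_DIMENSIONS, otherwise at the full datasource."""
    rewritten = dict(query)
    if referenced_dimensions(query) <= TOP_DIMENSIONS:
        rewritten["dataSource"] = peer_ds
    else:
        rewritten["dataSource"] = full_ds
    return rewritten
```

The rest of the query body passes through untouched, which matches the observation that only the datasource name needs rewriting.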
Using a different tier for each datasource seems wasteful to me, and I also wouldn’t know how to do the routing / query rewriting.
So I guess my questions are:
1. Does something like this make sense?
2. Is it possible to do something like this with Druid’s out-of-the-box feature set?