I found that druid-0.13.0 has a new feature, “result-level broker caching”. I have a concern: if a result is cached on broker01 but the same query is then sent to broker02, how does Druid make the cache work? Thanks.
It depends on the type of cache you are using. As long as you’re using a remote cache type such as memcached or redis, both brokers would be able to use the cache.
With local cache types, results cached by one broker would not be accessible to the other broker.
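To make the difference concrete, here is a toy sketch, not Druid code. The `Broker` class, its `run` method, and the dicts standing in for cache backends are all hypothetical names made up for illustration; a shared dict plays the role of a remote memcached or redis instance.

```python
class Broker:
    """Toy stand-in for a broker (hypothetical, not Druid's implementation)."""

    def __init__(self, name, cache):
        self.name = name
        self.cache = cache  # a dict standing in for the cache backend

    def run(self, cache_key, compute):
        # Serve from the cache when possible, otherwise compute and store.
        if cache_key in self.cache:
            return self.cache[cache_key], "hit"
        result = compute()
        self.cache[cache_key] = result
        return result, "miss"


def compute():
    # Placeholder for actually running the query.
    return [{"count": 42}]


# Local caches: each broker owns a private dict.
local_a = Broker("broker01", {})
local_b = Broker("broker02", {})
local_first = local_a.run("q1", compute)[1]    # "miss"
local_second = local_b.run("q1", compute)[1]   # "miss": broker02 cannot see broker01's entry

# Remote cache: both brokers point at the same backend.
shared = {}
remote_a = Broker("broker01", shared)
remote_b = Broker("broker02", shared)
remote_first = remote_a.run("q1", compute)[1]   # "miss"
remote_second = remote_b.run("q1", compute)[1]  # "hit": the entry is shared
```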
Hope this helps.
OK, understood. I have another question: if I use redis, would the redis key be the query JSON and the value be the result? If so, I’d guess the hit rate would not be very high, because most queries are real-time queries, which means the interval parameter is usually different. Thanks a lot!
The cache key is computed from the query properties and does not include the interval parameter in the computation. Therefore, the same query executed multiple times with different intervals would use the cache if the query results are identical.
Thank you. But how does Druid use the cache to respond to two queries with the same query body but different intervals?
The cache key is generated from query properties such as the query type, granularity, dimension filter and aggregator specs. As mentioned before, this key generation logic does not include the interval parameter. Therefore two queries with the same body but different intervals would generate the same cache key.
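A rough sketch of that idea, assuming the key is simply a hash over a fixed set of query fields. The function `result_cache_key`, the `KEY_FIELDS` tuple, and the use of SHA-1 over JSON are assumptions for illustration only, not Druid's actual key format.

```python
import hashlib
import json

# Assumed set of fields feeding the key (the four mentioned above).
# "intervals" is deliberately left out.
KEY_FIELDS = ("queryType", "granularity", "filter", "aggregations")


def result_cache_key(query):
    """Hash only the selected query properties (hypothetical helper)."""
    key_parts = {field: query.get(field) for field in KEY_FIELDS}
    encoded = json.dumps(key_parts, sort_keys=True).encode("utf-8")
    return hashlib.sha1(encoded).hexdigest()


base = {
    "queryType": "timeseries",
    "granularity": "hour",
    "filter": {"type": "selector", "dimension": "country", "value": "US"},
    "aggregations": [{"type": "count", "name": "rows"}],
}
query_a = dict(base, intervals=["2019-01-01T00:00/2019-01-01T06:00"])
query_b = dict(base, intervals=["2019-01-01T06:00/2019-01-01T12:00"])

# Same body, different intervals: the keys match.
same_key = result_cache_key(query_a) == result_cache_key(query_b)  # True
```

Changing any of the hashed fields, for example the granularity, would produce a different key.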
Yes, right, but if the intervals are not the same, the results should be different, so how can Druid return a cached value to the client?
Druid uses the cached result only if the results of both queries are identical. So if queryA produces resultA, and queryB has the same body but different intervals and produces resultB, the cache would not be used for queryB, since the results are different.
So the broker result-level cache can’t reduce the query load on historical nodes, since it needs to compare the results?
Druid uses etags for comparing results. An etag is generated from the segment identifiers of the segments that need to be queried. Therefore, etag generation and comparison don’t require the query to actually be executed.
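As a minimal sketch of that flow, assuming the etag is just a Base64-encoded hash over the segment identifiers: `compute_etag`, `lookup`, the cache entry layout, and the simplified segment identifier strings are hypothetical and only illustrate the compare-before-execute idea.

```python
import base64
import hashlib


def compute_etag(segment_ids):
    """Derive an etag from segment identifiers only (assumed scheme)."""
    hasher = hashlib.sha1()
    for segment_id in sorted(segment_ids):
        hasher.update(segment_id.encode("utf-8"))
    return base64.b64encode(hasher.digest()).decode("ascii")


def lookup(cache, cache_key, segment_ids, run_query):
    """Return the cached result if the etag matches, else run the query."""
    etag = compute_etag(segment_ids)  # no query execution needed here
    entry = cache.get(cache_key)
    if entry is not None and entry["etag"] == etag:
        return entry["result"], "hit"
    result = run_query()
    cache[cache_key] = {"etag": etag, "result": result}
    return result, "miss"


cache = {}
calls = []


def run_query():
    calls.append(1)  # count how often the query really runs
    return [{"count": 42}]


segments = ["events_2019-01-01_2019-01-02_v1"]
outcomes = [
    lookup(cache, "key1", segments, run_query)[1],
    lookup(cache, "key1", segments, run_query)[1],
    lookup(cache, "key1", segments + ["events_2019-01-02_2019-01-03_v1"], run_query)[1],
]
# outcomes == ["miss", "hit", "miss"]; the query ran twice
```

In this sketch the second lookup is answered without calling `run_query`, and a change in the set of segments invalidates the entry.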
Even if the same segments are involved in the query, how is the result guaranteed to be the same for different time intervals? Let’s say the segment granularity is DAY and the query granularity is HOUR. When the queries cover different hour ranges of the same day, the same segment will be involved, but the query results could be different, right?
Sorry, I asked stupid questions. I got the point.
Sorry to bother you again. I am still confused about this question:
"Even if the same segments are involved in the query, how is the result guaranteed to be the same for different time intervals? Let’s say the segment granularity is DAY and the query granularity is HOUR. When the queries cover different hour ranges of the same day, the same segment will be involved, but the query results could be different, right?"
I thought the IDs of the relevant rows were taken into account when generating the etag. When I checked the source code, it seems only the segment identifiers are involved.
OK. As you said, the etag is generated from the identifiers of the segments that need to be queried, so as I understand it, the segment identifier is a segment-level Base64 value. For example, take a datasource whose segment granularity is DAY and whose query granularity is HOUR. Two queries have intervals within the same day but with different hour ranges. Their results should be different, but the segment identifier will be the same, so how can the cache be used to respond?
Even if the segment granularity is DAY, there can be multiple shards, and the segment identifier can vary accordingly, which in turn would give you a different etag. But yes, if you have a large enough segment such that queries with different intervals use the same shard, they would generate the same etag.
So getting the same shards for different queries means the same query result, is that right?