Questions about Druid cache and OS page cache

Hello guys!

I’m a Big Data researcher and I’ve been exploring Druid for a few months.

I’m currently benchmarking Druid, and I’ve observed that query performance benefits from previously executed queries. I’m trying to understand why.

I know that Druid uses a cache (I’m using the cache on the Broker), but this cache only stores query results per segment (right?). **However, I have noticed that performance also improves when subsequent, different queries touch the same segments.**

Example:

  1. Select sum(metric), dimteste2, dimteste3 from table x where dimteste='x' group by dimteste2, dimteste3 -> 2 seconds

  2. Select sum(metric), dimteste2, dimteste3 from table x where dimteste='y' group by dimteste2, dimteste3 -> 0.5 seconds

I searched and found that this behavior can be explained by the OS page cache. Based on my research, I think that during the first query against the datasource, Druid loads the required segments into memory (the OS page cache), so subsequent queries can read those segments faster.

Am I right?

I looked in the Druid documentation and I was unable to find anything helpful.

This is really important to my study. Can you please give me some help explaining this awesome behavior?

Best regards,

José Correia

I asked a similar question here.

It would be nice if you could confirm my explanation.

I think that the documentation isn’t clear enough regarding this topic.

Best regards,

José Correia

On Saturday, 30 June 2018 at 23:57:57 UTC+1, José Correia wrote:

Hi,

You can find some info about this in the white paper http://static.druid.io/docs/druid.pdf (section 4.2, Storage Engine).

But yes, Druid relies on the OS page cache to keep hot segments in memory.
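
If you want to see the effect in isolation, here is a minimal Python sketch. It is not Druid code: the file is just a hypothetical stand-in for a segment, and `scan_mapped_file` is a name made up for this example. It memory-maps a file and scans it twice, so the second scan can be served from pages the OS already holds in memory.

```python
import mmap
import os
import tempfile
import time

CHUNK = 1 << 20  # touch the mapping 1 MiB at a time


def scan_mapped_file(path):
    """Memory-map `path` read-only, touch every byte, return (checksum, seconds)."""
    start = time.perf_counter()
    checksum = 0
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            for offset in range(0, len(mapped), CHUNK):
                # Slicing the mapping faults the pages in: from disk if they
                # are cold, from the OS page cache if they are already hot.
                checksum += sum(mapped[offset:offset + CHUNK])
    return checksum, time.perf_counter() - start


if __name__ == "__main__":
    # Hypothetical stand-in for a segment file: 8 MiB of random bytes.
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, os.urandom(8 * CHUNK))
        os.close(fd)
        first_sum, first_secs = scan_mapped_file(path)
        second_sum, second_secs = scan_mapped_file(path)
        print(f"first scan:  {first_secs:.4f} s")
        print(f"second scan: {second_secs:.4f} s")
        print("same data both times:", first_sum == second_sum)
    finally:
        os.remove(path)
```

Note that a freshly written file is usually already in the page cache, so both scans here may be fast. To see a clear cold versus warm difference, point the function at a large file that has not been read recently. In your example the two queries differ only in the filter value, so the Broker's result cache cannot be reused, but the segment data read by the first query is probably still in the page cache when the second one runs.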

Best