Populating segment cache from deep storage for a new node

My team is currently using the Hadoop batch ingestion task to index segments, with S3 as our deep storage. At the moment we have just a single historical node in a test Druid cluster.

In a “segment-cache-empty” historical node scenario, we are having difficulty correctly bootstrapping the segment cache based on our understanding of Druid’s load/drop rules. Our desired behavior is to keep the last n=3 days of data in the segment cache. We have experimented with several variations of the rules in https://groups.google.com/forum/#!topic/druid-user/ve9hb1K6RV4, but in most cases not all of the expected segments are pulled from deep storage. For example, our current rule is “load P3D”, yet I see only 20 hourly segments in our cache (72 segments expected). We are not running into disk space issues, and I see no obvious errors in the coordinator or historical logs.
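For reference, something like the following can be used to check how much of each datasource the coordinator believes is loaded onto historicals. This is only a rough sketch; the coordinator address (localhost:8081) is an assumption based on the default port.

```python
# Rough sketch: ask the coordinator what fraction of each datasource's
# "used" segments it considers loaded onto historicals.
# Assumes the coordinator is reachable at http://localhost:8081.
import requests

COORDINATOR = "http://localhost:8081"

resp = requests.get(f"{COORDINATOR}/druid/coordinator/v1/loadstatus")
resp.raise_for_status()

# The response maps datasource name -> percentage of used segments loaded.
for datasource, pct_loaded in resp.json().items():
    print(f"{datasource}: {pct_loaded}% loaded")
```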

Any suggestion on how to debug segment cache loading further?

Norbert

Hi,

Are you sure that the historical has enough space to load all the segments? This depends on the Druid property `druid.server.maxSize=XXXX`.
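As a quick way to compare the configured maxSize against what each historical is actually holding, the coordinator's servers endpoint can help; the sketch below is only illustrative, and the coordinator address is an assumption.

```python
# Sketch: list every server the coordinator knows about, with its configured
# maximum size and the total size of segments currently assigned to it.
# Assumes the coordinator is reachable at http://localhost:8081; field names
# (currSize, maxSize) reflect the "simple" server view.
import requests

COORDINATOR = "http://localhost:8081"

resp = requests.get(f"{COORDINATOR}/druid/coordinator/v1/servers?simple")
resp.raise_for_status()

for server in resp.json():
    # currSize and maxSize are reported in bytes.
    print("{host} ({tier}): {currSize} / {maxSize} bytes".format(**server))
```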


Thanks Slim. Yes, druid.server.maxSize is set to roughly 75% of total disk space (and there is 30% free on that partition).

Norbert

Hi folks, can anyone shed some further light on this issue? It seems like Druid should be able to automatically manage a P3D (last 3 days) local segment cache for us, but we have not been successful in implementing this.

Norbert

Hi Norbert, I’m also not entirely sure what you mean by segment cache, but I assume you are talking about retaining data within a Druid cluster for 3 days. Also, ensure that your timezone is UTC.

In general, rules in Druid are used to configure a time to live, and cannot be used as-is to reload data that has already been dropped in Druid. I recently added some more docs to clarify how to reload data that has been configured to be dropped: https://github.com/druid-io/druid/pull/3369
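For what it's worth, re-enabling a datasource whose segments were dropped by a drop rule goes through the coordinator; the sketch below shows one way this might look. The coordinator address and datasource name are placeholders, and the exact endpoint behavior can vary between Druid versions.

```python
# Sketch: mark all of a datasource's segments as "used" again so that,
# given matching load rules, the coordinator will reload them from deep
# storage onto historicals. Assumes the coordinator is at localhost:8081;
# "my_datasource" is a placeholder name.
import requests

COORDINATOR = "http://localhost:8081"
DATASOURCE = "my_datasource"

resp = requests.post(f"{COORDINATOR}/druid/coordinator/v1/datasources/{DATASOURCE}")
resp.raise_for_status()
print("Re-enabled", DATASOURCE)
```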

Thanks Fangjin. Probably my characterization of this as a “node bootstrapping” problem is confusing the question. Let me try again:

We have an hourly batch indexing job running on our Hadoop cluster generating Druid segments into S3 deep storage. What combination of load/drop rules should we implement on our Druid cluster to only keep the last 3 days’ worth of segments (P3D) on our historical node, and automatically purge any data that’s older?

We have unsuccessfully tried load=P3D and also load=P3D/drop=P3D.

In my mind, it seems architecturally desirable for the batch indexing operation to be separate from the task of loading segments into historical nodes - in other words, indexing jobs only impact deep storage, and historical nodes transparently “discover” new data in deep storage and load it automatically. Is this incorrect?

Norbert

Hi Norbert,

The rules you need to set are: loadByPeriod=P3D, dropForever

The idea is that the rules form an ordered list, and every segment is matched against the first rule that applies to it. So in this case, segments from the last 3 days match the first rule, and everything older matches the second.
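Concretely, the rule chain can be set per datasource through the coordinator's rules endpoint. The sketch below is one way to do it; the coordinator address, datasource name, and tier name are assumptions to adapt to your setup.

```python
# Sketch: set a retention chain of loadByPeriod(P3D) followed by dropForever
# for a single datasource. Assumes the coordinator is at localhost:8081;
# "my_datasource" and "_default_tier" are placeholders.
import requests

COORDINATOR = "http://localhost:8081"
DATASOURCE = "my_datasource"

rules = [
    # Keep one replica of the most recent 3 days on the default tier.
    {"type": "loadByPeriod", "period": "P3D",
     "tieredReplicants": {"_default_tier": 1}},
    # Anything that does not match the rule above is dropped.
    {"type": "dropForever"},
]

resp = requests.post(
    f"{COORDINATOR}/druid/coordinator/v1/rules/{DATASOURCE}",
    json=rules,
)
resp.raise_for_status()
print("Rules updated for", DATASOURCE)
```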

Fangjin,

I am hitting the same issue. Druid 0.10.1, S3 as deep storage. I loaded a test datasource into the cluster and was able to query it right after. After a few days, the data is no longer in the cluster. I changed the retention policy (1. loadByPeriod P30D, 2. dropForever). The timestamp on the test data is within the past week, so this should cover the data fully. Next I queried the datasource, but the data is not getting loaded into the cluster. The dataSource was never in a disabled state. I disabled and re-enabled it, and the result is the same.
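For reference, the rules actually in effect (including the cluster-wide defaults) can be read back from the coordinator; this is a small sketch, with the coordinator address and datasource name as placeholders.

```python
# Sketch: read back the retention rules stored on the coordinator and show
# the datasource-specific rules alongside the cluster default rules.
# Assumes the coordinator is at localhost:8081; "my_datasource" is a placeholder.
import requests

COORDINATOR = "http://localhost:8081"
DATASOURCE = "my_datasource"

resp = requests.get(f"{COORDINATOR}/druid/coordinator/v1/rules")
resp.raise_for_status()

rules_by_datasource = resp.json()  # map: datasource -> list of rules
print("Datasource rules:", rules_by_datasource.get(DATASOURCE))
print("Default rules:   ", rules_by_datasource.get("_default"))
```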

What am I missing?

Thank you.