Missing intervals after Hadoop ingestion

Hey,

After some tuning of day-by-day Hadoop ingestion (using the CLI index hadoop task), I decided to import my data month by month. No problem in the MapReduce phase: everything went smoothly, and the data were properly generated for the whole month in HDFS:

1.4 K 2016-08-20 16:26 /tmp/hadoop_output/ds/20160701T000000.000Z_20160701T060000.000Z/batch-one-month/0/descriptor.json

284.6 M 2016-08-20 16:26 /tmp/hadoop_output/ds/20160701T000000.000Z_20160701T060000.000Z/batch-one-month/0/index.zip

Some segments were up to a GB in size (by the way, I forced numShards=1).
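
For reference, the relevant part of my ingestion spec looks roughly like this (the interval and segment granularity are illustrative, matching the 6-hour July segments above; numShards is the setting I mentioned):

"granularitySpec": {
  "type": "uniform",
  "segmentGranularity": "SIX_HOUR",
  "queryGranularity": "NONE",
  "intervals": ["2016-07-01/2016-08-01"]
},
"tuningConfig": {
  "type": "hadoop",
  "partitionsSpec": {
    "type": "hashed",
    "numShards": 1
  }
}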

So I checked /druid/coordinator/v1/metadata/datasources/ds/segments and I can see all the new segments there, meaning they are registered in the metadata store but just not loaded into Druid.
My historicals are at 40% percentUsed. My realtime nodes are also waiting for handoff of their segments now. It definitely looks like an issue with the coordinator or the historicals; I'll keep looking.

I had a logging issue on the historical nodes; it's now fixed. Thanks to that, I saw these errors:

io.druid.segment.loading.SegmentLoadingException: Exception loading segment[xxx]

Caused by: com.metamx.common.ISE: Segment[xxx:1,085,728,955] too large for storage[/opt/druid-segment-cache:481].

The coordinator was asking the historical node to load the segment and just logged a simple "done processing", even though the historical node rejected the request with the errors above. (In that error, the number after the segment id is its size in bytes, roughly 1 GB, and the number after the cache path is the space still available in that location, so the segment cache was effectively full.)

Anyway, druid.segmentCache.locations is sized at 200GB on each node. I have 2 historical nodes, i.e. 400GB in total.
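
Concretely, each historical's runtime.properties has something along these lines (maxSize is in bytes, value illustrative; the path matches the error above):

druid.segmentCache.locations=[{"path": "/opt/druid-segment-cache", "maxSize": 200000000000}]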

Question: I don't understand why the historical nodes can't just drop some of their content to load the data being queried. (It's a cache, right? The data are already on HDFS.)

All right, I think I get it now. Druid won't drop the data by itself; I need to add drop rules if I want that behavior.
But if I want to be able to query a whole year of data with only 2 historical nodes, for instance, I either need a lot of space for druid.segmentCache.locations, or I have to reduce the granularity of the segments (so that they take up less space and can be loaded on the historical nodes).
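
For the record, the drop/load rules I have in mind would be something like the following, posted to the coordinator at /druid/coordinator/v1/rules/ds (the period and replicant counts are just an example):

[
  { "type": "loadByPeriod", "period": "P1M", "tieredReplicants": { "_default_tier": 2 } },
  { "type": "dropForever" }
]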

If I’m wrong, please correct me.

Anyway, thanks, and sorry for the monologue. :slight_smile:

Hi Stephane,

Do you have enough capacity to load all the segments into Druid? Are there any exceptions that explain why certain segments cannot be loaded?

Historical nodes must download segments locally in order to serve them. Downloading on-demand from deep storage introduces too much overhead.
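
If it is a capacity problem, it's also worth checking druid.server.maxSize on each historical: that is the total segment size the node advertises to the coordinator, and it should line up with the segment cache size, e.g. (value illustrative):

druid.server.maxSize=200000000000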