Hadoop indexing very slow, need tips

I’m trying to index one day of data using the Hadoop index task, but the way I’m doing it seems to be very slow: it takes over an hour to index a single day. Does anyone have tips on what to improve?

I’m using a custom build of druid 0.9.0 with some changes to make it work with Google Dataproc: https://github.com/atomx/druid/commit/3564de516cf6932ba8f1d5a1f02ff0ab2330c30e

The hadoop cluster consists of 5 machines, each with 8 cores and 30 GB memory (40 cores and 120 GB usable memory total).

My index task looks like this: https://gist.github.com/erikdubbelboer/b17d24b513c6233747c7

I’m setting numShards to 0 to force a NoneShardSpec so the Coordinator can merge segments. Is this smart? Does merging a lot of small segments make querying faster?
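The relevant part of the tuningConfig is roughly this (just a sketch, assuming the hashed partitionsSpec form; the exact spec is in the gist above):

    "tuningConfig": {
      "type": "hadoop",
      "partitionsSpec": {
        "type": "hashed",
        "numShards": 0
      }
    }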

The input consists of 405 gzip compressed csv files.
Compressed this is 1.6GB.
Uncompressed this is 15.7GB.
It’s a total of 94,369,283 lines.

The files themselves are sorted ascending by time and each contains only one hour of data. There can be up to 20 files covering the same hour.

As you can see from the task, I have 24 dimensions (most with very low cardinality).
There are only 7 metrics (just count and doubleSums).

The produced output is only 50MB (as I said, very low cardinality).

I noticed the reduce step takes by far the longest: it needs 15GB of memory and only seems to use 1 core (I guess because of the numShards setting).

What exactly does segmentGranularity do, and how does it affect indexing and querying? As you can see, I currently have it set to DAY. Setting it to HOUR produces 24 segments and is slightly faster, I guess because it can use 24 reducers. Would generating 24 segments instead of one affect query time? Right after that task finished, the coordinator merged those 24 segments into one. Is that merged segment the same as the single segment a DAY-granularity task would generate? Keep in mind queryGranularity is still HOUR in both cases.
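For reference, both settings live in the granularitySpec of the task; the DAY variant I’m running is roughly this sketch (the interval shown here is just a made-up example):

    "granularitySpec": {
      "type": "uniform",
      "segmentGranularity": "DAY",
      "queryGranularity": "HOUR",
      "intervals": ["2016-04-01/2016-04-02"]
    }

Switching segmentGranularity to "HOUR" while leaving queryGranularity at "HOUR" is the change that produced the 24 segments.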

Hey Erik,

There will be one reducer per generated Druid segment. The idea is that if you set segmentGranularity = DAY and targetPartitionSize = 3M, and your job is going to generate 9M rows per day, that’s 3 segments and so you’ll get 3 reducers for that day. For most folks this is enough to keep the amount of data per reducer reasonable. But I could see it not being good enough if your pre-aggregation is actually really effective (like hundreds or thousands of input rows going into each Druid row).
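In spec terms, that sizing comes from the partitionsSpec in the tuningConfig; a minimal sketch using the 3M figure from the example above:

    "tuningConfig": {
      "type": "hadoop",
      "partitionsSpec": {
        "type": "hashed",
        "targetPartitionSize": 3000000
      }
    }

With this, the number of segments (and therefore reducers) per segmentGranularity interval is driven by how many rows land in that interval.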

It sounds like this is the case you’re in, as your dataset is crunching down from 15GB to 50MB. So some things you could do are:

  • Change segmentGranularity from DAY to HOUR. This will get you 24 reducers per day instead of 1.

  • Pre-aggregate the data yourself, so you’re only feeding the rolled up data to Druid instead of the raw data.

  • At some point, ideally Druid should do this on its own. There’s an issue for implementing a pre-indexing aggregation job in Druid: https://github.com/druid-io/druid/issues/2594

Hi Gian,

Two more questions:

  1. Does having 1 or 24 segments per day affect query time, even when queryGranularity is HOUR?
  2. Is making 1 segment per day the same as making 24 and then merging them with a merge task? Again, queryGranularity is HOUR in both cases.

Thanks,
Erik

I just tried setting tuningConfig.useCombiner to true. With segmentGranularity set to DAY, the task now takes only 13 minutes (instead of 68), and the reducer uses only 5.1GB of memory instead of 15GB.
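For anyone who finds this thread later, it’s a single flag in the tuningConfig; a minimal sketch with the rest of the spec unchanged:

    "tuningConfig": {
      "type": "hadoop",
      "useCombiner": true
    }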