I’m trying to index one day of data using the hadoop index task, but it seems to be very slow: it takes over an hour to index a single day. Does anyone have any tips on what to improve?
I’m using a custom build of druid 0.9.0 with some changes to make it work with Google Dataproc: https://github.com/atomx/druid/commit/3564de516cf6932ba8f1d5a1f02ff0ab2330c30e
The hadoop cluster consists of 5 machines, each with 8 cores and 30 GB memory (40 cores and 120 GB usable memory total).
My index task looks like this: https://gist.github.com/erikdubbelboer/b17d24b513c6233747c7
I’m setting numShards to 0 to force a NoneShardSpec so the Coordinator can merge segments. Is this smart? Is querying faster after a lot of small segments have been merged?
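For reference, here is a sketch of the tuningConfig I understand the Hadoop indexer accepts, using a hashed partitionsSpec with targetPartitionSize instead of a fixed numShards, so Druid picks the shard count (and reducer count) itself. The size value is purely illustrative:

```json
"tuningConfig": {
  "type": "hadoop",
  "partitionsSpec": {
    "type": "hashed",
    "targetPartitionSize": 5000000
  }
}
```

With targetPartitionSize set, multiple shards (and thus multiple reducers) can be produced per segment interval, which is what I suspect my numShards setting is preventing.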
The input consists of 405 gzip compressed csv files.
Compressed this is 1.6GB.
Uncompressed this is 15.7GB.
It’s a total of 94,369,283 lines.
The files themselves are sorted ascending by time and each contains only one hour of data. There can be up to 20 files covering the same hour.
As you can see from the task, I have 24 dimensions (most with very low cardinality).
There are only 7 metrics (just count and doubleSum aggregators).
The produced output is only 50MB (as I said, very low cardinality).
I noticed the reduce step takes by far the longest. It needs 15GB of memory and only seems to use 1 core (I guess because of numShards).
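If it helps, the reducer memory can be raised through the standard Hadoop MapReduce properties passed via jobProperties in the tuningConfig. A sketch, with the sizes as placeholder assumptions for my cluster:

```json
"tuningConfig": {
  "type": "hadoop",
  "jobProperties": {
    "mapreduce.reduce.memory.mb": "8192",
    "mapreduce.reduce.java.opts": "-Xmx6g"
  }
}
```

This only changes the per-reducer memory, though; it doesn’t address the fact that a single reducer is doing all the work.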
What exactly does segmentGranularity do, and how does it affect indexing and querying? I currently have it set to DAY. Setting it to HOUR produces 24 segments and is slightly faster, I guess because it can use 24 reducers. Would generating 24 segments instead of one affect query time?

Right after the HOUR task finished, the Coordinator decided to merge those 24 segments into one. Is that merged segment the same as one generated directly by the DAY granularity task? Keep in mind queryGranularity is still HOUR in both cases.
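To make the two setups concrete, this is the granularitySpec shape I mean for the HOUR variant (the interval is just an example date, not my actual data range):

```json
"granularitySpec": {
  "type": "uniform",
  "segmentGranularity": "HOUR",
  "queryGranularity": "HOUR",
  "intervals": ["2016-03-01/2016-03-02"]
}
```

The DAY variant is identical except segmentGranularity is "DAY"; queryGranularity stays "HOUR" in both, so the rollup inside the segments should be the same either way.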