How to make indexing (index_hadoop) faster?

hey everyone,

I’m using Druid 0.12.1 and trying to ingest a file using index_hadoop.

file size: 39 GB
segment size: 6.15 GB
partitions: 11
size per partition: ~560 MB
time for indexing: ~2.9 hours (10566806 millis)
segment granularity: DAY
targetPartitionSize: 2500000
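
For reference, the relevant fragments of my spec look roughly like this (assuming the default hashed partitionsSpec; everything else elided):

  "dataSchema": {
    "granularitySpec": {
      "type": "uniform",
      "segmentGranularity": "DAY"
    }
  },
  "tuningConfig": {
    "type": "hadoop",
    "partitionsSpec": {
      "type": "hashed",
      "targetPartitionSize": 2500000
    }
  }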

I have tried changing sizeBytes and numThreads, but it didn't result in any change in ingestion time.

I have also considered breaking the file into smaller files so that I can feed each one to a different peon, but how can I feed different files to different peons when the segment granularity is DAY? I could also change the segment granularity to something smaller (HOUR), but then I would fail to create segments of 600 MB - 900 MB.

Is there any way to decrease the time of ingestion?

Hi Sunil,
Have you specified ‘numShards’ in your ingestion spec?

Try manually setting numShards if you are not already doing so; see http://druid.io/docs/latest/ingestion/hadoop.html for more details on the numShards config.
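
Something along these lines in the tuningConfig, for example (the value 11 here just mirrors the partition count you mentioned, so tune it for your data; note that exactly one of numShards or targetPartitionSize should be set):

  "tuningConfig": {
    "type": "hadoop",
    "partitionsSpec": {
      "type": "hashed",
      "numShards": 11
    }
  }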

Hi Nishant,

Thanks for the advice.

I have tried ingesting the data with numShards = 11, which resulted in the task completing in 2.4 hours (8880186 millis).

- Are there any other config parameters that could alter the ingestion time?

- Is there any way to complete the task in less than an hour?

- Specifically, I was hoping to use more than one core per task, even if that uses more RAM.
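
If I'm looking at the right knobs, I imagine that would be something like this in the tuningConfig's jobProperties (property names assume a YARN/MR2 cluster; the values are just a guess):

  "jobProperties": {
    "mapreduce.map.cpu.vcores": "2",
    "mapreduce.reduce.cpu.vcores": "2"
  }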

A few pointers to try -

  1. Check GC on the MR containers. If this is causing slowness, try increasing the memory allocated to the containers (there is a sketch of the relevant jobProperties after this list).

  2. There may not be enough containers to run things in parallel; adding more Hadoop nodes can help here.

  3. How many rows are in each shard? If you want more parallelism, try increasing numShards to 20.
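
For point 1, the container memory can be bumped through the tuningConfig's jobProperties, something like this (the numbers are only examples; size them to your cluster and keep the Xmx a bit below the container size):

  "jobProperties": {
    "mapreduce.map.memory.mb": "4096",
    "mapreduce.map.java.opts": "-Xmx3276m",
    "mapreduce.reduce.memory.mb": "8192",
    "mapreduce.reduce.java.opts": "-Xmx6553m"
  }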

Hey Nishant,

-> After looking at the logs of the ingestion, I can see that the reducers (numShards of them) are being created but are being executed sequentially.

-> Also, reading various posts suggests that a Hadoop cluster should be used for a production environment; is this the reason the reducers are being executed sequentially?

-> Also, setting numShards should make ingestion faster, as per point 3 in your previous post.

-> Here are some stats for ingesting wikiticker-2015-09-12-sampled.json.gz:

-> without numShards: 31245 millis
-> numShards 3: 45987 millis
-> numShards 10: 97317 millis

PFA the logs for the ingestion runs specified above:

numShards_3.log (231 KB)

numShards_10.log (404 KB)

without_numShards.log (185 KB)