I’m using Druid 0.12.1 and trying to ingest a file using an index_hadoop task.
file size: 39 GB
segment size: 6.15 GB
time for indexing: ~2.9 hours (10566806 millis)
segment granularity: DAY
I have tried changing sizeBytes and numThreads, but it didn’t result in any change in the ingestion time.
I have also considered breaking the file into smaller files so that I can feed each one to a different peon, but how can I feed different files into different peons when the segment granularity is DAY (a sketch of what I mean follows below)? I could also change the segment granularity to something smaller (HOUR), but then I would fail to create segments of 600 MB - 900 MB.
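For context, this is the kind of per-day split I was considering: submit one index_hadoop task per day, each pointing at its own file and carrying a one-day interval in the granularitySpec. A minimal sketch (the datasource name, paths, and dates are placeholders; parser, metricsSpec, and tuningConfig omitted for brevity):

{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": {
      "dataSource": "my_datasource",
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "NONE",
        "intervals": ["2018-01-01/2018-01-02"]
      }
    },
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "paths": "/path/to/split/day-2018-01-01.json"
      }
    }
  }
}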
Is there any way to decrease the time of ingestion?
Have you specified ‘numShards’ in your ingestion spec?
Try manually setting numShards if you are not already doing so; see http://druid.io/docs/latest/ingestion/hadoop.html for more details on the numShards config.
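It would be a hashed partitionsSpec inside the tuningConfig of your spec, something like this (a minimal sketch; 11 shards is just an illustration, tune the count so each shard lands near your target segment size):

"tuningConfig": {
  "type": "hadoop",
  "partitionsSpec": {
    "type": "hashed",
    "numShards": 11
  }
}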
Thanks for the advice.
I have tried ingesting the data with numShards = 11, which brought the task down to 2.4 hours (8880186 millis).
- Are there any other config parameters that could alter the ingestion time?
- Is there any way to complete the task in less than an hour?
-- I was looking for something more like using more than one core per task, even if it uses more RAM.
-> After looking at the logs of the ingestion, I can see that reducers (numShards of them) are being created, but they are executed sequentially.
--> Also, reading various posts suggests that a Hadoop cluster should be used for production environments; is this the reason the reducers are being executed in sequential order? (See the jobProperties sketch at the end of this post.)
-> Also, including numShards should make ingestion faster, as per point 3) of your previous post, but here are some stats for ingesting wikiticker-2015-09-12-sampled.json.gz:
--> without numShards: 31245 millis
--> numShards 3: 45987 millis
--> numShards 10: 97317 millis
PFA the logs for the ingestions specified above:
numShards_3.log (231 KB)
numShards_10.log (404 KB)
without_numShards.log (185 KB)
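One more thing I’m checking on my side: if the task runs with Hadoop’s LocalJobRunner instead of on a real cluster, I believe the reducers execute one at a time no matter what numShards is set to, which would match what I see in the logs. I’m passing Hadoop properties through jobProperties in the tuningConfig to verify; a minimal sketch (the memory values are assumptions, and mapreduce.framework.name = yarn is only meaningful if a YARN cluster is actually available):

"tuningConfig": {
  "type": "hadoop",
  "jobProperties": {
    "mapreduce.framework.name": "yarn",
    "mapreduce.map.memory.mb": "4096",
    "mapreduce.reduce.memory.mb": "8192"
  }
}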