How can we index data into Druid faster using batch ingestion?

Hi,
I am ingesting 10TB of CSV into Druid on a 5-node cluster using Hortonworks 2.6.1.

I am running into "No space left on device" errors after 12 hours of ingestion, due to:

(1) large log files generated by YARN at /hadoop/yarn/log/

(2) the application cache at /druid/hadoop/yarn/local/usercache/druid/appcache filling up

- How can we avoid this issue?

- Is there any way to make the ingestion faster?

I set "useCombiner": true in the JSON spec. It helped improve ingestion performance.
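For reference, this is roughly where it sits in my tuningConfig ("type" is the only other entry shown; everything else is left at defaults):

  "tuningConfig": {
    "type": "hadoop",
    "useCombiner": true
  }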

Are there other parameters that could make it faster?

Thanks,

Tas

Hi Tas,

For (1), perhaps you could split the 10TB ingestion into smaller partial tasks, with the split intervals aligned to your segment granularity (e.g., if the 10TB of data covers 100 days and you have DAY granularity, split the source data into 10-day intervals and run a separate task for each interval).
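For example, each partial task's granularitySpec would cover only its own slice (the dates below are hypothetical):

  "granularitySpec": {
    "type": "uniform",
    "segmentGranularity": "DAY",
    "queryGranularity": "NONE",
    "intervals": ["2017-01-01/2017-01-11"]
  }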

Maybe there are also some YARN configuration properties for logging that you can set in the task jobProperties, though I’m not familiar with what’s available there.
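If you experiment with that, standard MapReduce logging properties might be a starting point; these names come from stock Hadoop rather than Druid, and I haven't verified they address the log growth you're seeing:

  "jobProperties": {
    "mapreduce.map.log.level": "WARN",
    "mapreduce.reduce.log.level": "WARN",
    "mapreduce.task.userlog.limit.kb": "10240"
  }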