We are running a few Druid indexing jobs using Hadoop-based indexing to load data consumed by our visualization layer. Query performance is very good, but unfortunately we are struggling with poor indexing performance. Here are the details:
Druid cluster - 5-node cluster with 3 data nodes. Each data node has 32 vCPUs and 125 GB of memory
Input data to be ingested - Parquet data of size 300 GB with 250 columns
Segment granularity DAY
Hadoop cluster has 1TB memory
The job took ~12 hours to complete the indexing
While the job runs, I can see that the MapReduce stage takes most of the time, even though it uses the full resources of the Hadoop cluster (1 TB memory)
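For reference, this is roughly the shape of the tuningConfig in our ingestion spec (the values below are representative placeholders, not our exact production settings):

```json
{
  "tuningConfig": {
    "type": "hadoop",
    "partitionsSpec": {
      "type": "hashed",
      "targetRowsPerSegment": 5000000
    },
    "maxRowsInMemory": 1000000,
    "useCombiner": true,
    "numBackgroundPersistThreads": 1,
    "jobProperties": {
      "mapreduce.map.memory.mb": "4096",
      "mapreduce.reduce.memory.mb": "8192",
      "mapreduce.map.java.opts": "-Xmx3g",
      "mapreduce.reduce.java.opts": "-Xmx6g"
    }
  }
}
```

Happy to share the full spec if that helps; mainly wondering which of these knobs (partitioning, row limits, MapReduce memory) matter most for a wide 250-column dataset.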
Can someone suggest anything to optimize the indexing performance? Are there configurations we can change, or is it related to the infrastructure we have?
Appreciate your help!