Druid indexing HDFS data: poor performance

Hi all,

We are running a few Druid indexing jobs using Hadoop-based indexing to load data that is consumed by the visualization layer. Query performance is very good, but unfortunately we are struggling with poor data indexing performance. Here are the details:

Druid cluster: 5 nodes, 3 of which are data nodes. Each data node has 32 vCPUs and 125 GB of memory.

Input data to be ingested: a 300 GB Parquet file with 250 columns.

Segment granularity: DAY

Hadoop cluster: 1 TB of memory

The job took ~12 hours to complete the indexing.

When the job runs, I can see that the MapReduce job takes a long time even though it uses the full resources of the Hadoop cluster (1 TB of memory).
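
For reference, this is a plain index_hadoop task reading Parquet from HDFS. A heavily trimmed sketch of the shape of the spec is below; all column names, paths, intervals, and values here are placeholders rather than our real ones, and exact class/field names can differ a bit between Druid versions and parquet extensions:

{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": {
      "dataSource": "example_source",
      "parser": {
        "type": "parquet",
        "parseSpec": {
          "format": "timeAndDims",
          "timestampSpec": { "column": "event_time", "format": "auto" },
          "dimensionsSpec": { "dimensions": [] }
        }
      },
      "metricsSpec": [ { "type": "count", "name": "count" } ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "NONE",
        "intervals": [ "2023-01-01/2023-02-01" ]
      }
    },
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "org.apache.druid.data.input.parquet.DruidParquetInputFormat",
        "paths": "hdfs://namenode:8020/path/to/data/*.parquet"
      }
    },
    "tuningConfig": {
      "type": "hadoop",
      "partitionsSpec": { "type": "hashed", "targetRowsPerSegment": 5000000 }
    }
  }
}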

Can someone suggest anything to optimize the performance? Are there any configurations we can change, or is it related to the infrastructure we have?

Appreciate your help!

Manu

Hey Manu,

12 hours sounds fairly excessive. I don’t have specifics, but if you have sufficient cores to parallelize the tasks, 1-2 hours (or better) should be achievable while still producing reasonably sized segments.

If you have a look at the task’s logs, you’ll find a section called “Counters” that provides some useful information and may help explain why the job took so long. Feel free to share it if you’d like some help interpreting it.

There’s also some pretty good information in this blog post by Imply: https://imply.io/post/hadoop-indexing-apache-druid-configuration-best-practices
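
To give a concrete idea of the knobs that post covers, here’s a rough sketch of a tuningConfig with the settings that usually matter most for Hadoop indexing throughput. The values are illustrative starting points only, not recommendations for your data, and some field names differ between Druid versions (e.g. older releases use targetPartitionSize instead of targetRowsPerSegment):

"tuningConfig": {
  "type": "hadoop",
  "partitionsSpec": {
    "type": "hashed",
    "targetRowsPerSegment": 5000000
  },
  "useCombiner": true,
  "maxRowsInMemory": 500000,
  "jobProperties": {
    "mapreduce.map.memory.mb": 4096,
    "mapreduce.map.java.opts": "-Xmx3276m",
    "mapreduce.reduce.memory.mb": 8192,
    "mapreduce.reduce.java.opts": "-Xmx6553m",
    "mapreduce.job.user.classpath.first": "true"
  }
}

With hashed partitioning, the reduce-side parallelism of the index-generate job roughly tracks the number of segments being produced, so the target segment size is the main lever: smaller targets mean more reducers running in parallel, at the cost of more (and smaller) segments to manage afterwards.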

Best regards,

Dylan

Thanks, Dylan

Let me check the logs, and I will get back if I find something. From a hardware configuration perspective, do you think this is enough to handle this much data, or do we need to increase it? I know it’s a tricky question, but I still wanted to check.

Thanks!

Manu