Need help on production configuration

Hi

I need help with production configuration for a Druid cluster.

I need to set up a Druid cluster with HDFS as deep storage and run Hadoop batch indexing jobs on TB-scale data.

Questions:

  1. What should the machine configuration (CPU, memory, disk) of the ingestion service be in order to perform fast ingestion? (Only batch indexing using the Hadoop indexing task is required; no real-time ingestion.)

  2. How many historical and broker nodes should I run, and what machine configuration (CPU, memory, disk) should they have?

  3. Should I use the same Hadoop setup for the indexing jobs as well as for HDFS deep storage?

  4. Is there any impact on query performance if I use a multi-node HDFS cluster for deep storage, compared to a single-node Hadoop cluster?

  5. What is the advantage of using multiple historical nodes? How will it impact query performance?

Thanks and Regards,

Ankit Gupta

Hi Ankit,

  1. Druid’s usage of Hadoop is not too different from a generic aggregation-oriented Hadoop job. I think this guideline from Hortonworks is pretty reasonable: http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-1.3.1/bk_cluster-planning-guide/content/balanced-workloads-deployments.html.

  2. The documentation has some example production configs for historical and broker nodes: http://druid.io/docs/latest/Production-Cluster-Configuration.html. The number of historicals you need will depend on the size of your indexed data, which varies from dataset to dataset. For most datasets, it will be smaller than the raw data. But you should still do some test indexing to find out for yourself. Most modestly sized deployments will be fine with 1 broker (or 2 for high-availability). Larger deployments may want to add a broker for every 20–50 historical nodes, depending on the kinds of queries being run. A rough sketch of the kinds of historical properties involved follows after this list.

  3. Sure, that would work fine.

  4. No, since historical nodes always pull down segments and cache them locally before serving queries. In other words: performance of your HDFS cluster affects how long it will take for newly indexed data to appear in your Druid cluster, but won’t affect Druid query performance.

  5. Having multiple historical nodes is useful for scaling out computation and for improving availability.
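To make point 2 a little more concrete: the historical-node properties that matter most for sizing are the processing threads, the per-thread processing buffers, and the local segment cache (which is also the local copy of segments mentioned in point 4). The runtime.properties fragment below is only a sketch with placeholder values and a hypothetical cache path, not a recommendation; the linked production config page has worked examples.

    # one processing thread per core is a common starting point (often cores - 1)
    druid.processing.numThreads=7
    # per-thread buffer used for query processing
    druid.processing.buffer.sizeBytes=536870912
    # total bytes of segments this historical is allowed to serve; size to local disk
    druid.server.maxSize=130000000000
    # where segments pulled from deep storage are cached locally (hypothetical path)
    druid.segmentCache.locations=[{"path":"/mnt/druid/segment-cache","maxSize":130000000000}]

Adding historical nodes (point 5) effectively multiplies these numbers: more total segment-cache space and more processing threads across the tier.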

Hi Gian,

Thanks for the information.

I need to ingest 100 GB of benchmarking data as described in

http://druid.io/blog/2014/03/17/benchmarking-druid.html

Initially I tried with 2–3 GB of data; the indexing service is taking a long time, around 30 minutes, for ingestion.

I have a machine with 8 cores and 32 GB RAM, and I am using Hadoop batch indexing with HDFS as deep storage. I am also using a single-node Hadoop cluster.

Questions:

How can I improve ingestion performance?

What settings should I change on the single Hadoop node and in the overlord config to improve performance?

Do I need to add middle managers, given that I don’t need to execute many jobs in parallel?

Thanks and Regards,

Ankit Gupta

For Hadoop performance, the usual approach is to figure out your bottleneck and then adjust for that. You can identify it with tools like “top” (CPU/memory usage) and “iostat -x 5” (disk usage). If you are actually maxing out some hardware resource (CPU/memory/disk), then you can consider getting more or different hardware, or possibly tuning the job depending on what’s being maxed out. If CPU is maxed, that might be GC-related, so you could look into that. If disk is maxed while reducing, but you have spare memory, you could increase your memory allotment per reducer and also increase the rowFlushBoundary.
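For that last case (disk maxed while reducing, memory to spare), both knobs live in the Hadoop index task’s tuningConfig: rowFlushBoundary controls how many rows are buffered in memory before being persisted to disk, and jobProperties passes the usual Hadoop memory settings through to the job. A minimal sketch of the relevant fragment, with placeholder values rather than tuned recommendations:

    "tuningConfig": {
      "type": "hadoop",
      "rowFlushBoundary": 500000,
      "jobProperties": {
        "mapreduce.reduce.memory.mb": "6144",
        "mapreduce.reduce.java.opts": "-Xmx5g"
      }
    }

Keep the -Xmx value comfortably below the container size so there is headroom for non-heap memory.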

If you’re not maxing out a hardware resource, that probably means you simply aren’t getting enough parallelism from Hadoop. That might mean you’re giving each job too much YARN memory, or that you don’t have enough memory capacity allocated to your NodeManager. It might also mean you don’t have “enough” reducers, which you can get more of by either indexing more data at a time or by using a smaller targetPartitionSize.
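On the Druid side, the reducer count for the Hadoop index task follows from the partitioning: assuming the hashed partitionsSpec, a smaller targetPartitionSize produces more (smaller) segments and hence more reducers. A sketch with a placeholder value:

    "tuningConfig": {
      "type": "hadoop",
      "partitionsSpec": {
        "type": "hashed",
        "targetPartitionSize": 3000000
      }
    }

On the YARN side, the memory a single node offers to containers is set by yarn.nodemanager.resource.memory-mb in yarn-site.xml, and each map/reduce task’s request by mapreduce.map.memory.mb / mapreduce.reduce.memory.mb; if each task asks for a large share of the node, only a few of them can run in parallel.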

Another option is just to wait for it to finish :wink: