Efficiency of Druid nodes

I am trying to speed up the process of loading batch data (I use the Hadoop indexer).

I have the following questions:

  1. When should I use more than one machine?

  2. Which node types (coordinator, broker, historical, Hadoop indexer) should run on more than one machine?

  3. What about memory for the broker node? I saw that it needs 244 GB of RAM. Really?

Could you help me?

Hey Tomek,

The hadoop indexing speed is mostly bounded by the speed of your Hadoop cluster, not the Druid cluster. If it seems to be taking a lot longer than you think it should, you could double check to make sure the number of reducers being generated is reasonable, and that none of them are running at or near OOM limits.
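If you want to adjust those knobs, the reducer count and memory are influenced through the tuningConfig of your ingestion spec. As a rough sketch (the partition size and memory values below are illustrative assumptions, not recommendations; the jobProperties keys are standard Hadoop MapReduce settings):

```json
"tuningConfig": {
  "type": "hadoop",
  "partitionsSpec": {
    "type": "hashed",
    "targetPartitionSize": 5000000
  },
  "jobProperties": {
    "mapreduce.reduce.memory.mb": "8192",
    "mapreduce.reduce.java.opts": "-Xmx6g"
  }
}
```

With hashed partitioning, targetPartitionSize controls how many rows go into each segment, which in turn drives how many reduce tasks the indexer generates; the jobProperties give each reducer more headroom if they are running near OOM.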

The coordinator can run on a single machine, but you might want two in production for failover. For small clusters one broker is enough (although you might want two for availability). You should have as many historicals as you need to store your data. The coordinator web console will tell you how full your historicals are.

The broker doesn’t need 244GB of RAM for most typical workloads. You can make do with a lot less. For a moderately sized cluster with some concurrency, 8 CPU / 64 GB RAM should be enough. For a small testing cluster, less is okay. You can run an entire Druid cluster, every node type, on a machine with just a few GB of RAM.
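To make that concrete, broker memory is split between JVM heap and off-heap processing buffers. A sketch for a mid-sized machine might look like the following (the numbers are illustrative assumptions to be tuned for your workload, not recommendations):

```properties
# jvm.config (sketch)
-Xmx8g
-XX:MaxDirectMemorySize=16g

# runtime.properties (sketch)
druid.processing.buffer.sizeBytes=1073741824
druid.processing.numThreads=7
```

As a rule of thumb, MaxDirectMemorySize should be at least druid.processing.buffer.sizeBytes * (druid.processing.numThreads + 1), since each processing thread gets its own off-heap buffer.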

For very simple testing purposes all nodes can run on the same laptop.

Actual cluster resource needs depend a lot on what your use cases and data patterns are.

I also want to add that 0.8.3 included numerous improvements to batch ingestion performance, so it should be much faster.