Druid Cluster Using HDFS

Ok, so I have been reading all morning about setting up a cluster, and before I dive in I wanted to make sure I was on the right track.

My goal is to ingest approximately 50 GB of data and be able to query it with sub-second latency as you all advertise.

I have basically no experience with Hadoop, but understand the general idea of what it is used for.

I see that you all have several options for batch ingestion: index service and HadoopDruidIndexer.

Since I don’t already have a Hadoop cluster running, I believe I need to use the index service, and since I have a decent amount of data (>> 1 GB) I need to use the hadoop index task?

My first question is: if I want to use HDFS, do I need to install it myself, or does it already come with Druid? Next, do you all recommend having HDFS on a separate node or on the same node as the indexing service? Also, to tell the hadoop task to write segments to HDFS, do I only need to update the deep storage properties in the common configuration?

The machines I have access to have 30 GB RAM and 8 cores.

I am thinking that I will probably need 2 historical nodes, 1 broker node, and 1 node to host ZooKeeper, MySQL (metadata store), and the Coordinator, as well as the node(s) for HDFS and the Indexing Service. How does this sound?

So, I presume that I need to download Druid onto each of these machines. Then I update the common configuration file on each node with the IP addresses of ZooKeeper and HDFS (should this common configuration file be the same for each node, irrespective of whether the node will be historical, broker, etc.?). Then I can tune the runtime properties of each node.

Is there anything that I am really missing or need to think about?

Also, I am a bit confused about the indexing service.

For the hadoop index task, do I need to worry about the Middle Manager and the Peons? Do I need to set up separate nodes for these managers and workers, or does the hadoop index task spawn them itself?

What are some of the properties I should tune to use my 30 GB RAM and 8 cores most effectively in the indexing process? (The production cluster configuration docs are a bit unclear to me, since you all create an entirely separate middle manager node.)


You have two options:

You can use an indexing task and index the data directly. If you
chunk the data up into daily, weekly, or monthly chunks (whatever
makes sense for your data size) and then ingest each of them
individually, it should all load successfully.
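For example, a batch index task spec chunked to one month might look roughly like this (field names and layout vary by Druid version; the dataSource, paths, and schema here are made-up placeholders):

```json
{
  "type": "index",
  "spec": {
    "dataSchema": {
      "dataSource": "my_events",
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "intervals": ["2014-01-01/2014-02-01"]
      }
    },
    "ioConfig": {
      "type": "index",
      "firehose": {
        "type": "local",
        "baseDir": "/data/events",
        "filter": "events-2014-01-*.json"
      }
    }
  }
}
```

Submitting one such task per chunk keeps each task's working set small.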

You can also use the hadoop indexing task, but this requires access to
a Hadoop cluster. It sounds like you don't have one, so in order
to go this route, you would need to actually start by setting up a Hadoop
cluster. If you have access to a Hadoop cluster but just don't have
much experience with it, that's another story, and this option is a
viable alternative.

For deep storage, you probably want HDFS. But, in order to make HDFS
work, you will need access to an HDFS cluster. If you have access to
a Hadoop cluster, it will likely have HDFS already running and you can
just use that. If you do not have access to a Hadoop cluster, then
you will need to set one up for HDFS. That is separate work from
setting up the Druid cluster (at one point I thought that everyone
would have access to a Hadoop cluster, so it would be fine and
wonderful to just assume that one exists; I have since learned that I
was very naive :wink: ).
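As a sketch, pointing Druid at HDFS for deep storage is done in the common configuration; the extension name and the namenode URL below are placeholders that depend on your Druid and Hadoop versions:

```properties
# common runtime properties (all nodes)
# load the HDFS deep storage extension (exact name varies by version)
druid.extensions.loadList=["druid-hdfs-storage"]

# write segments to HDFS instead of local disk
druid.storage.type=hdfs
druid.storage.storageDirectory=hdfs://namenode.example.com:9000/druid/segments
```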

Since you are only doing batch ingestion right now, for the indexing
service you can run in "local" mode and have it just spawn tasks
locally on the overlord node. You do not need middle managers (they
are required for distributing the work across multiple machines).
You should not have to worry about the peons either; those are spawned
automatically when a task runs.
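In other words, on the overlord you'd want something like the following (the "local" runner is the relevant bit; property names may differ slightly between versions):

```properties
# overlord runtime properties
# run tasks in-process on this node rather than farming them
# out to middle managers
druid.indexer.runner.type=local
```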

You are spot on regarding configuration. In general, the common
configs are something that you want to be the same for all nodes
regardless of type.
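A minimal common configuration sketch, assuming ZooKeeper and MySQL at the host names shown (all placeholders; note that older releases used druid.db.* instead of druid.metadata.storage.* for the metadata store):

```properties
# shared by every node type
druid.zk.service.host=zk.example.com:2181

druid.metadata.storage.type=mysql
druid.metadata.storage.connector.connectURI=jdbc:mysql://meta.example.com:3306/druid
druid.metadata.storage.connector.user=druid
# placeholder credential
druid.metadata.storage.connector.password=changeme
```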

Given your setup, the high-level stuff I'd start with for node
configuration is probably to set the historicals to -Xmx4g -Xms4g. For
your brokers, maybe -Xmx25g -Xms25g. Other than those, hopefully the
default tuning parameters will be good enough to get started with.
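For instance, the launch commands might look like this (the classpath and main class shown are illustrative of Druid releases from this era; adjust for your layout):

```shell
# historical node: modest heap, leaving most of the 30 GB free for
# the OS page cache that backs memory-mapped segments
java -Xms4g -Xmx4g -cp "config/historical:lib/*" io.druid.cli.Main server historical

# broker node: large heap for merging and caching query results
java -Xms25g -Xmx25g -cp "config/broker:lib/*" io.druid.cli.Main server broker
```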

Hope that helps.