Ok, so I have been reading all morning about setting up a cluster, and before I dive in I wanted to make sure I am on the right track.
My goal is to ingest approximately 50 GB of data and be able to query it with the sub-second latency you all advertise.
I have basically no experience with Hadoop, but I understand the general idea of what it is used for.
I see that you all have two options for batch ingestion: the indexing service and the HadoopDruidIndexer.
Since I don't already have a Hadoop cluster running, I believe I need to use the indexing service, and since I have a decent amount of
data (>> 1 GB) I need to use the Hadoop index task. Is that right?
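Just so you can check my understanding, this is roughly the shape of the Hadoop index task spec I was planning to submit to the indexing service. The datasource name, input path, and granularities are placeholders I made up, so please correct me if the structure is off:

```json
{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": {
      "dataSource": "my_datasource",
      "granularitySpec": {
        "segmentGranularity": "DAY",
        "queryGranularity": "NONE"
      }
    },
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "paths": "/path/to/my/raw/data"
      }
    },
    "tuningConfig": {
      "type": "hadoop"
    }
  }
}
```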
My first question is: if I want to use HDFS, do I need to install it myself, or does it come bundled with Druid? Next, do you all recommend
running HDFS on a separate node or on the same node as the indexing service? Also, to tell the Hadoop task to write segments to HDFS,
do I only need to update the deep storage properties in the common configuration?
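For example, this is what I was planning to add to the common configuration for deep storage. The hostname and port are placeholders, and I'm not 100% sure these property names are exactly right for the current version:

```properties
# Load the HDFS deep storage extension (assuming this is the right extension name)
druid.extensions.loadList=["druid-hdfs-storage"]

# Write segments to HDFS (namenode host/port are placeholders)
druid.storage.type=hdfs
druid.storage.storageDirectory=hdfs://namenode-host:9000/druid/segments
```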
The machines I have access to have 30 GB RAM and 8 cores.
I am thinking that I will probably need 2 historical nodes, 1 broker node, and 1 node to host ZooKeeper, MySQL (metadata store), and the Coordinator,
as well as node(s) for HDFS and the indexing service. How does this sound?
So I presume I need to download Druid onto each of these machines, then update the common configuration file on each node with
the addresses of ZooKeeper and HDFS. (Should this common configuration file be identical on every node, regardless of whether the node will
be historical, broker, etc.?) Then I can tune the runtime properties of each node.
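Concretely, here is the kind of thing I was expecting to put in that shared common file, pointing every node at the same ZooKeeper and metadata store. Hostnames are placeholders, and I may have the metadata property names slightly wrong:

```properties
# ZooKeeper ensemble (placeholder host)
druid.zk.service.host=zk-host:2181

# MySQL metadata store (placeholder host and credentials)
druid.metadata.storage.type=mysql
druid.metadata.storage.connector.connectURI=jdbc:mysql://mysql-host:3306/druid
druid.metadata.storage.connector.user=druid
druid.metadata.storage.connector.password=diurd
```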
Is there anything that I am really missing or need to think about?