I’ve been researching lambda architectures (http://lambda-architecture.net/) and I’ve been tasked with setting up a production environment for such an architecture. I’ve decided to use an existing RabbitMQ message broker as an event source, which forwards events into Apache Storm (for the real-time part) and Apache Hadoop (for the batch part). To index both parts, Druid seemed like the best choice available.
I’ve set up a pseudo-distributed Hadoop cluster and I’m trying to connect Druid to it, but I have some doubts. First of all, here’s my Druid _common configuration:
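(Only the deep-storage lines should matter here; mine look roughly like this, with the exact path discussed below:)

    # common.runtime.properties (deep storage; path illustrative)
    druid.storage.type=hdfs
    druid.storage.storageDirectory=/tmp/hadoop-atnogcloud/dfs/data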
So, I’m guessing the path I put here is where my Hadoop is storing data (which defaults to “/tmp/hadoop-<username>/dfs/data”). However, I ran some examples from your Twitter feed and, when I turn on my historical node and query it, it reads the data previously stored on the local FS. 1) Why? Shouldn’t it read data from Hadoop and ignore the local copy, since we’ve changed the storage location and restarted the node? Or are locally cached files never reset?
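For reference, I haven’t touched the historical node’s segment cache, so it still points at local disk, something like this (values illustrative):

    # runtime.properties on the historical node
    druid.segmentCache.locations=[{"path": "/tmp/druid/indexCache", "maxSize": 10000000000}]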
Next, I’ve followed http://druid.io/docs/latest/Batch-ingestion.html and I’m using the HadoopDruidIndexer to index Hadoop data into my Druid cluster. According to the tutorial, I don’t need the indexing service running if I use this, so I should have no need for Overlords, MiddleManagers or Peons. 2) But then, where does this run? Since I only have one machine, all the services live there and I execute the indexer there as well; but in a distributed cluster, which machine runs it? A dedicated one? A historical one?
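In case it matters, I’m launching it with the standalone invocation from the docs, roughly like this (the classpath and spec file name are mine):

    java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 \
      -classpath lib/*:config/_common \
      io.druid.cli.Main index hadoop my_spec_file.json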
As far as I understand it, this indexer loads data from Hadoop and indexes it so that Druid can understand the data. In the specFile, I have two paths: the inputSpec one and the segmentOutputPath. According to their descriptions, the inputSpec path is “A String of input paths indicating where the raw data is located.” and the segmentOutputPath is “the path to dump segments into.”. 3) These are both in HDFS, right? Something like “/tmp/hadoop-atnogcloud/dfs/data/raw/” and “/tmp/hadoop-atnogcloud/dfs/data/patched/” (see the snippet below). Does Druid handle all the bookkeeping of knowing which data has been processed, which hasn’t, etc.?
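Concretely, the relevant part of my specFile looks something like this (with the two paths above):

    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "paths": "/tmp/hadoop-atnogcloud/dfs/data/raw/"
      },
      "segmentOutputPath": "/tmp/hadoop-atnogcloud/dfs/data/patched/"
    }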
I’ve only recently come into contact with these technologies and I’m feeling overwhelmed by all the new concepts. Any guidance would be appreciated!