MapReduce: existing (remote) Hadoop cluster vs. dedicated Druid MapReduce workers


What if we already have a good Hadoop cluster that stores our raw data and runs all our jobs? Why would we need to buy extra, expensive servers just to duplicate MapReduce capacity exclusively for Druid? Yes, I know that Druid can run its jobs on a remote cluster, so the main question is: what are the benefits of using separate servers for Druid's MapReduce jobs?

An equally important question: what about segment merging? As I understand it, merging is done the same way, but does the mapper job need the segment to be unpacked somewhere first, and only then merged with the raw data? Or do you have some codec for merging segments with raw data directly? In other words, how can we merge raw data with existing segments as fast as possible, and can that run on a remote Hadoop cluster at the same speed?


Max, Druid is designed to create segments in your existing Hadoop cluster. You shouldn’t need a dedicated Hadoop cluster for Druid.
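
For illustration only, here is a rough sketch of what a Hadoop-based indexing task pointed at raw data on an existing cluster might look like. It assumes the `index_hadoop` task type with a `static` input spec; the helper name `build_hadoop_index_task`, the datasource name, and the HDFS path are made up for the example, and the parser and metrics sections of `dataSchema` are left out for brevity. Field names can differ between Druid versions, so check the ingestion docs for the version you run.

```python
import json


def build_hadoop_index_task(data_source, input_paths, interval):
    """Build a skeleton Hadoop indexing task spec as a plain dict.

    Hypothetical helper for illustration; not part of Druid itself.
    """
    return {
        "type": "index_hadoop",
        "spec": {
            "dataSchema": {
                "dataSource": data_source,
                # Parser and metricsSpec omitted; a real spec needs them.
                "granularitySpec": {
                    "type": "uniform",
                    "segmentGranularity": "DAY",
                    "queryGranularity": "NONE",
                    "intervals": [interval],
                },
            },
            "ioConfig": {
                "type": "hadoop",
                # Raw data is read where it already lives on the cluster.
                "inputSpec": {"type": "static", "paths": input_paths},
            },
            "tuningConfig": {"type": "hadoop"},
        },
    }


# Example values below are assumptions, not taken from this thread.
task = build_hadoop_index_task(
    data_source="events",
    input_paths="hdfs://namenode:8020/raw/events/2013-01-01",
    interval="2013-01-01/2013-01-02",
)
print(json.dumps(task, indent=2))
```

The point of the sketch is that the spec only names where the raw data is; the MapReduce job itself runs on whichever Hadoop cluster Druid is configured to submit to, so no separate set of servers appears anywhere in it.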