Today we have to submit the Hadoop reprocessing job to the MiddleManager and configure every node in Druid with the Hadoop cluster info.
Is there a way to have an independent Hadoop cluster execute the reprocessing job without the MM being involved?
That way, scaling the reprocessing tasks becomes easier: I could schedule and run multiple reprocessing tasks, then terminate the clusters, without having to change configuration anywhere.
Is this possible, or am I looking at it in the wrong direction?
Today the batch job runs two MapReduce tasks, and the MM then updates the segment info.
I'm unsure how to do this from within MapReduce.
Yes, you could use that CLI to run Hadoop indexing jobs without an MM. However, I'm not sure what problem that would be solving. The MMs do allow you to run multiple Hadoop tasks simultaneously. And they handle locking for you, so you don't need to worry about data consistency with simultaneous jobs for the same interval (the MMs will only let one job run at a time for a particular interval).
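For reference, the CLI mentioned here is Druid's command-line Hadoop indexer, which runs the indexing job directly from a machine that has the Hadoop client config, with no MM in the loop. A minimal sketch — the classpath entries, heap size, and spec filename are placeholders for your installation, and on older Druid releases the main class is `io.druid.cli.Main` rather than `org.apache.druid.cli.Main`:

```shell
# Run the Hadoop indexing job directly, bypassing the MiddleManager.
# The classpath must include the Druid jars plus the Hadoop config dir;
# "lib/*", "/etc/hadoop/conf", and my_hadoop_spec.json are placeholders.
java -Xmx512m \
  -Duser.timezone=UTC -Dfile.encoding=UTF-8 \
  -classpath "lib/*:/etc/hadoop/conf" \
  org.apache.druid.cli.Main index hadoop my_hadoop_spec.json
```

Because the job runs wherever you launch this command, only that machine needs the Hadoop config — which is what makes the ephemeral-cluster workflow described in this thread possible. The trade-off, as noted above, is that you lose the MM's interval locking.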
Thanks Gian. We are going to do this only for historical data, so it shouldn't result in locking issues.
We run a reprocess on Druid every night; that involves sending a task to the coordinator, which in turn sends the request to the Hadoop cluster. But this Hadoop cluster always needs to be alive, since the Hadoop cluster config is stored on all nodes.
If I want to isolate the reprocessing, or run 100 reprocessing jobs in parallel, it would impact my realtime processing. But if I do it on an independent cluster, I can keep reprocessing on as many machines as I have without having to change config anywhere, and I can also shut the cluster down after the reprocessing is done.
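For contrast, the nightly submission described above is the indexing-service path: an `index_hadoop` task is POSTed to the indexing service's task endpoint, and the MMs then drive the MapReduce jobs on the always-on Hadoop cluster named in the spec. A minimal sketch — the hostname, port, and spec file are placeholders for this deployment:

```shell
# Submit a Hadoop batch-reprocessing task to the indexing service,
# which forwards the work to the Hadoop cluster configured in the spec.
# "overlord:8090" and reprocess_task.json are placeholders.
curl -X POST -H 'Content-Type: application/json' \
  -d @reprocess_task.json \
  http://overlord:8090/druid/indexer/v1/task
```

This is the path that requires the Hadoop config on the Druid nodes; the CLI indexer discussed earlier sidesteps it at the cost of the locking the MMs provide.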