Can I run the indexing job on 2 Hadoop clusters?


I'm migrating my Hadoop cluster from on-premises to Google Dataproc. During the migration period, I want to run the same indexing job on both Hadoop clusters (on-premises and Dataproc), with each one indexing into a different datasource, and then compare the data to check that everything is working correctly.

The Druid docs say to copy the Hadoop XML configuration files to Druid. In this situation, which XML files should I keep on Druid, and how can I run the same job on 2 different Hadoop clusters?

You can do that using worker affinity, with one MiddleManager per datasource.
My colleague and I actually talked about doing parallel execution of ingestion tasks from separate Hadoop clusters (see this part of our video and slide 32 here).
You will need to copy the Hadoop XML config files from your on-prem Hadoop cluster to MiddleManager A, which should be mapped to datasource X (using the aforementioned affinity config), and the config files from the Google Dataproc cluster to MiddleManager B, which should be mapped to datasource Y.
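To make the mapping concrete, here is a minimal sketch of what the affinity config could look like, built in Python so it can be printed or posted as JSON. The datasource names and MiddleManager host:port values are placeholders I made up, and the layout follows my understanding of the Overlord's dynamic worker configuration (submitted to `/druid/indexer/v1/worker` on the Overlord), so please check it against the docs for your Druid version.

```python
import json


def build_affinity_config(affinity, strong=True):
    """Build a worker select strategy payload with datasource affinity.

    affinity: dict mapping datasource name -> list of MiddleManager "host:port".
    strong:   if True, tasks for a mapped datasource should only run on the
              listed workers (they wait rather than spill onto other workers).
    """
    return {
        "type": "default",
        "selectStrategy": {
            "type": "equalDistribution",
            "affinityConfig": {
                "affinity": affinity,
                "strong": strong,
            },
        },
    }


# Placeholder names: replace with your real datasources and MiddleManager hosts.
config = build_affinity_config(
    {
        "datasource_x": ["middlemanager-a.example.com:8091"],  # has on-prem Hadoop XMLs
        "datasource_y": ["middlemanager-b.example.com:8091"],  # has Dataproc Hadoop XMLs
    }
)

print(json.dumps(config, indent=2))
```

Strong affinity seems like the safer choice here, since a task landing on the wrong MiddleManager would pick up the other cluster's Hadoop config.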

However, note that you won't be running the same job. You'll actually be running 2 ingestion tasks (or jobs, for that matter) which are almost identical, except perhaps for the name of the target datasource.
As far as I know, they would still be 2 separate ingestion tasks, not one ingestion task writing to 2 separate datasources.
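One way to keep the two tasks in sync is to generate both from a single base spec and change only the datasource name. The sketch below assumes a heavily trimmed `index_hadoop` task spec; the input path and datasource names are placeholders, and a real spec would also need the parser/timestamp/dimension settings and granularity config.

```python
import copy
import json

# Trimmed-down base spec (placeholder values, not a complete working spec).
BASE_SPEC = {
    "type": "index_hadoop",
    "spec": {
        "dataSchema": {"dataSource": "PLACEHOLDER"},
        "ioConfig": {
            "type": "hadoop",
            "inputSpec": {"type": "static", "paths": "/path/to/input"},
        },
        "tuningConfig": {"type": "hadoop"},
    },
}


def with_datasource(base_spec, datasource):
    """Return a deep copy of base_spec that targets the given datasource."""
    spec = copy.deepcopy(base_spec)
    spec["spec"]["dataSchema"]["dataSource"] = datasource
    return spec


# Two separate tasks, identical except for the target datasource.
task_x = with_datasource(BASE_SPEC, "datasource_x")
task_y = with_datasource(BASE_SPEC, "datasource_y")

for task in (task_x, task_y):
    print(json.dumps(task, indent=2))
```

Each task is then submitted on its own, and the affinity config decides which MiddleManager (and therefore which Hadoop cluster) runs it.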

As a side note, I remember someone told me once that there is an option to include the Hadoop config as part of the ingestion task spec, but I’ve never tested it.
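For what it's worth, the Hadoop `tuningConfig` does accept a `jobProperties` map of Hadoop configuration properties, which is probably what that person meant. Below is an untested sketch of how a task spec could carry per-cluster properties. The property values are made-up placeholders, and I don't know whether these properties alone are enough to point a task at a different cluster without the XML files.

```python
import copy


def with_job_properties(task_spec, job_properties):
    """Return a copy of a Hadoop task spec with extra jobProperties merged in."""
    spec = copy.deepcopy(task_spec)
    tuning = spec["spec"].setdefault("tuningConfig", {"type": "hadoop"})
    merged = dict(tuning.get("jobProperties", {}))
    merged.update(job_properties)
    tuning["jobProperties"] = merged
    return spec


# Minimal placeholder task spec, just enough to show where the map goes.
task = {
    "type": "index_hadoop",
    "spec": {
        "dataSchema": {"dataSource": "datasource_y"},
        "tuningConfig": {"type": "hadoop"},
    },
}

# Assumption: standard Hadoop property names with hypothetical values.
dataproc_task = with_job_properties(
    task,
    {
        "fs.defaultFS": "hdfs://dataproc-master.example.com:8020",
        "yarn.resourcemanager.hostname": "dataproc-master.example.com",
    },
)

print(dataproc_task["spec"]["tuningConfig"]["jobProperties"])
```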

Good luck :slight_smile: