We stream-ingest event data through Kafka (previously we used embedded Tranquility). Due to the nature of the upstream systems and the data itself, we run a periodic job over the raw event data to deduplicate and cleanse it.
After the cleansed data has been produced, we then need to run a reindexing job for the interval.
We want to do this efficiently and, hopefully, painlessly. Being on AWS, using temporary EMR clusters seemed like a good idea: bring up the cluster when you need to run a reindexing job, run the job, shut down the cluster. But we've learned it's not quite that easy. To integrate Druid with an external Hadoop cluster, you need (if I understand this correctly) to copy the Hadoop cluster's config files, which contain the network locations of the cluster, over to the Druid nodes, and then restart those nodes so they pick up the new config. That's not something we want to be doing, say, every single day.
From what I've learned, Druid's native batch ingestion has been (or is being) significantly improved, to the point that an external Hadoop cluster will no longer be needed. Does anyone know when this work will be ready for prime time?
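For reference, my (possibly outdated) understanding is that a native reindex would be submitted as an `index_parallel` task that reads the datasource's own segments back via the `druid` input source, so no Hadoop cluster is involved. A rough sketch of what I think such a spec looks like; the datasource name, interval, and granularities are placeholders, not our real config:

```json
{
  "type": "index_parallel",
  "spec": {
    "dataSchema": {
      "dataSource": "events",
      "timestampSpec": { "column": "__time", "format": "millis" },
      "dimensionsSpec": { "dimensions": [] },
      "granularitySpec": {
        "segmentGranularity": "day",
        "queryGranularity": "none",
        "intervals": ["2020-01-01/2020-01-02"]
      }
    },
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "druid",
        "dataSource": "events",
        "interval": "2020-01-01/2020-01-02"
      }
    },
    "tuningConfig": { "type": "index_parallel" }
  }
}
```

If that is roughly right, scheduling the cleanse-then-reindex cycle would just be a matter of POSTing a spec like this to the Overlord per interval, which is exactly the painless flow we're after.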
And, in the meantime, are there any thoughts on how to use short-lived EMR clusters for scheduled reindexing tasks without having to jury-rig config copies and Druid restarts?