Scheduled Reindexing and short-lived EMR clusters (q from London Apache meetup)


We stream-ingest event data through Kafka (previously via embedded Tranquility). Due to the nature of the upstream systems and the data itself, we run a periodic job over the raw event data to deduplicate and cleanse it.

After the cleansed data has been produced, we then need to run a reindexing job for the interval.

We want to do this efficiently and, hopefully, painlessly. Being on AWS, using temporary EMR clusters seemed like a good idea: bring up the cluster when you need to run a reindexing job, run the job, shut down the cluster. But we’ve learned it’s not quite that easy. To integrate Druid with an external Hadoop cluster (if I understand this correctly), you need to copy the Hadoop cluster config files (which contain the network locations of the cluster) over to the Druid nodes, and then restart the nodes in order to load the new config files. That’s not something we want to be doing, say, every single day.
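For concreteness, the copy-and-restart dance looks roughly like this. This is a dry-run sketch only: the host name, file list, and directory paths are assumptions (the standard Hadoop client XMLs, placed in Druid's `_common` config dir), not our verified setup.

```shell
# Sketch (paths/host are assumptions): stage the EMR cluster's Hadoop client
# XMLs onto each Druid node's classpath, then restart the services.
HADOOP_CONF_FILES="core-site.xml hdfs-site.xml yarn-site.xml mapred-site.xml"
SRC=/etc/hadoop/conf                  # on the EMR master (typical location)
DST=/opt/druid/conf/druid/_common     # on each Druid node (typical location)
for f in $HADOOP_CONF_FILES; do
  echo "scp emr-master:$SRC/$f $DST/$f"   # dry run: print the copies we'd make
done
# ...then restart the Druid services so the new configs are picked up
```

Doing this on every cluster spin-up is exactly the churn we want to avoid.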

From what I’ve learned, Druid’s native ingestion is being (or has been) significantly improved, to the point that an external Hadoop cluster will no longer be needed. Do we know when this work will be ready for prime time?

And, in the meantime, are there any thoughts on how to use short-lived EMR clusters for scheduled re-indexing tasks, without having to jury-rig config copies and Druid restarts?



Too early in the morning to title correctly, it seems. London Apache Kafka meetup. Little bit of a difference :wink:

Hey Dyana,

We are so glad you enjoyed the meetup! Lots of users are moving to native ingestion, and even more to parallel native ingestion, which works great today with flat files such as CSV and JSON.

See this -

Ahh, brilliant. Looks like we now need to get over the namespace-change hump and test this out.

Any gotchas, or just non-obvious changes that need to be made, when converting a Hadoop re-indexing spec to a parallel one?
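For context, here is a rough sketch of the shape a parallel native spec takes. The task and config `type` fields become `index_parallel`, and the input is described by a firehose rather than an `inputSpec`; the datasource name, interval, bucket path, and tuning values below are made up, and the parser section (omitted here) would carry over largely unchanged from the Hadoop spec.

```json
{
  "type": "index_parallel",
  "spec": {
    "dataSchema": {
      "dataSource": "events",
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "NONE",
        "intervals": ["2019-06-01/2019-06-02"]
      }
    },
    "ioConfig": {
      "type": "index_parallel",
      "firehose": {
        "type": "static-s3",
        "prefixes": ["s3://my-bucket/cleansed/2019-06-01/"]
      },
      "appendToExisting": false
    },
    "tuningConfig": {
      "type": "index_parallel",
      "maxNumSubTasks": 4
    }
  }
}
```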

For the S3 firehose, it looks like we can provide a set of prefixes (as opposed to a pattern). I assume that if you provide the prefix “bucket/folder/” it would pick up the files:

would that be right?
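My understanding (an assumption based on how S3 listing works, since keys are flat and matched by leading string) is that a prefix like this would pick up every object whose key starts with `folder/`, including anything under deeper “subfolders” — the bucket name below is made up:

```json
"firehose": {
  "type": "static-s3",
  "prefixes": ["s3://my-bucket/folder/"]
}
```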