ETL jobs on Druid

Hi there!

I’m starting to use Druid as our OLAP database, and I have already been able to push raw event data through a Hadoop index task and retrieve some insights through the REST API.
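
For context, the kind of query we run against the broker looks roughly like this; the broker URL, datasource name, and the aggregated field are placeholders for our actual setup:

```python
import requests

# Hypothetical broker endpoint and datasource; adjust to your cluster.
BROKER_URL = "http://localhost:8082/druid/v2"

query = {
    "queryType": "timeseries",
    "dataSource": "raw_events",
    "granularity": "hour",
    "intervals": ["2016-06-01T00:00:00Z/2016-06-08T00:00:00Z"],
    "aggregations": [
        {"type": "count", "name": "event_count"},
        {"type": "doubleSum", "fieldName": "value", "name": "value_sum"},
    ],
}

# The broker answers native JSON queries posted to /druid/v2.
response = requests.post(BROKER_URL, json=query)
response.raise_for_status()

for row in response.json():
    print(row["timestamp"], row["result"])
```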

These queries on the raw data are useful for our use case, but we also need to transform the last N days of this raw data and save it to another Druid datasource (in a batch manner, hourly or daily).

So the process should be (every hour/day):

  • Query the raw events datasource in Druid.

  • Transform and generate new data based on the output of that query.

  • Insert the new data into another Druid datasource (a rough sketch of this last step follows below).
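
For that last step, I imagine writing the transformed rows to a file and submitting a batch index task to the overlord, something along these lines; the overlord URL, datasource name, paths, and schema are just placeholders for our setup:

```python
import json
import requests

# Hypothetical overlord endpoint, output path, and schema; adjust to your cluster.
OVERLORD_URL = "http://localhost:8090/druid/indexer/v1/task"
BASE_DIR = "/tmp/druid-etl"
FILE_NAME = "transformed_events.json"

def write_rows(rows):
    """Write the transformed rows as newline-delimited JSON for batch ingestion."""
    with open(f"{BASE_DIR}/{FILE_NAME}", "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")

def submit_index_task():
    """Submit a native batch index task that loads the file into the second datasource."""
    task = {
        "type": "index",
        "spec": {
            "dataSchema": {
                "dataSource": "modified_events",
                "parser": {
                    "type": "string",
                    "parseSpec": {
                        "format": "json",
                        "timestampSpec": {"column": "timestamp", "format": "iso"},
                        "dimensionsSpec": {"dimensions": ["country", "device"]},
                    },
                },
                "metricsSpec": [
                    {"type": "longSum", "fieldName": "event_count", "name": "event_count"},
                    {"type": "doubleSum", "fieldName": "value_sum", "name": "value_sum"},
                ],
                "granularitySpec": {"segmentGranularity": "day", "queryGranularity": "hour"},
            },
            "ioConfig": {
                "type": "index",
                "firehose": {"type": "local", "baseDir": BASE_DIR, "filter": FILE_NAME},
            },
        },
    }
    response = requests.post(OVERLORD_URL, json=task)
    response.raise_for_status()
    return response.json()["task"]  # task id assigned by the overlord
```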

Our front-end API that queries the Druid cluster would then run queries on both datasources: the “raw events” data and the “modified” data derived from the raw events.

After some research I found out that these are known as ETL jobs; however, I found a lot of possible technologies, such as Spark, Hadoop, Storm, etc.

Which technology is the best fit for our use case, and why?

Thanks for your time!

There are many great data processing systems and the ones you listed are all supported by Druid. You should be able to find information online about the tradeoffs with each system.

Hi,

We currently use Spark for that.

A couple of years ago, Hadoop was what most people in the big-data world used for batch ETL processing; now I believe it is fair to say that Spark is the most commonly used product, with the biggest community around it. The new kid on the block is Flink, which is comparably easy to work with as Spark and has some distinguishing features, but it might not yet be as mature as Spark. (All of that is, of course, what I hear people say and is subjective…)

If your focus now or later is on streaming data into Druid rather than batch processing it, Kafka is most commonly used as the message queue, and for “stream-ETL’ing” data from Kafka prior to streaming it into Druid, people seem to use Spark, Flink, or Samza.

best

Sascha

Thanks a lot for the feedback, Sascha, really appreciate it. Going to give Spark a try for this ETL processing use case. There is a lot of optimization to be done, but starting with Spark seems to be solid ground to build on afterwards.
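
As a starting point, I imagine the transform step looking roughly like this in Spark; paths, column names, and the aggregation are placeholders for our actual logic, and the output would then be batch-indexed into the second datasource:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("druid-hourly-etl").getOrCreate()

# Raw events exported as newline-delimited JSON (e.g. the output of the query
# against the raw datasource); paths and column names are placeholders.
raw = spark.read.json("/data/raw_events/last_n_days/*.json")

# Example transformation: roll the last N days up per hour, country, and device.
transformed = (
    raw.withColumn("hour", F.date_trunc("hour", F.to_timestamp("timestamp")))
       .groupBy("hour", "country", "device")
       .agg(
           F.count("*").alias("event_count"),
           F.sum("value").alias("value_sum"),
       )
)

# Write the result back out as JSON, ready to be batch-indexed into the
# second Druid datasource.
transformed.write.mode("overwrite").json("/data/modified_events/")

spark.stop()
```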

Regards,