I’m starting to use Druid as our OLAP database, and I have already been able to push raw event data through a Hadoop index task and retrieve some insights through the REST API.
These queries on the raw data are useful for our use case, but we also need to transform the last N days of this raw data and save the result to another Druid datasource, in a batch manner (hourly/daily).
So the process should be (every hour/day):

1. Query the raw events datasource in Druid.
2. Transform and generate new data based on the query output.
3. Insert the new data into another Druid datasource.
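To make the steps concrete, here is a minimal sketch of the hourly/daily job, assuming a Druid broker at `broker:8082` exposing the SQL endpoint (available in Druid 0.10+); the datasource name `raw_events`, the columns, and the roll-up logic are all placeholders for illustration:

```python
import json
import urllib.request
from collections import defaultdict

BROKER_SQL_URL = "http://broker:8082/druid/v2/sql/"  # assumed broker address

def query_raw_events(last_n_days=7):
    """POST a Druid SQL query fetching the last N days of raw events."""
    payload = json.dumps({
        "query": (
            "SELECT user_id, event_type, COUNT(*) AS cnt "  # example columns
            "FROM raw_events "
            f"WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '{last_n_days}' DAY "
            "GROUP BY user_id, event_type"
        )
    }).encode("utf-8")
    req = urllib.request.Request(
        BROKER_SQL_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)  # list of JSON rows

def transform(rows):
    """Example transform: roll per-event rows up to one row per user."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["user_id"]] += row["cnt"]
    return [{"user_id": u, "total_events": c} for u, c in sorted(totals.items())]

if __name__ == "__main__":
    derived = transform(query_raw_events(last_n_days=7))
    # The derived rows would then be written out (e.g. to HDFS/S3) and loaded
    # into the second datasource via a Druid batch ingestion task.
    print(json.dumps(derived, indent=2))
```

The last step (loading `derived` into the second datasource) would go through the same kind of batch index task already used for the raw data, just pointed at the transformed output files.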
Our front-end API that queries the Druid cluster would then run queries against both datasources: the raw events data and the derived data based on those raw events.
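The front-end API can simply issue one query per datasource and stitch the results together in application code. A sketch of that join step, assuming both queries return lists of JSON rows sharing a `user_id` key (all names hypothetical):

```python
def merge_by_key(raw_rows, derived_rows, key="user_id"):
    """Left-join rows from the two datasources on a shared key.

    Rows from the derived datasource enrich the raw rows; raw rows
    without a derived counterpart pass through unchanged.
    """
    derived_index = {row[key]: row for row in derived_rows}
    merged = []
    for row in raw_rows:
        combined = dict(row)
        combined.update(derived_index.get(row[key], {}))
        merged.append(combined)
    return merged
```

This keeps the two datasources independent on the Druid side; the API layer decides how to combine them per request.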
After some research I found out that these are known as ETL jobs; however, I found a lot of possible technologies, such as Spark, Hadoop, Storm, etc.
Which technology is the best fit for our use case, and why?
Thanks for the time!!