Hello druid world,
I am new to Druid (and to analytics in general), and I have a few questions.
We are trying to build real-time analytics, and Druid seems like a good fit.
One way to do this without Druid would probably be:
Http Event Stream -> Kafka -> Spark Streaming -> pushes data to a DB -> query over it.
Since we know the types of queries we will be making, we can have one or more tables into which we push the data after aggregating it over a 5-minute window (let's say).
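To make the 5-minute-window idea concrete, here is a minimal pure-Python sketch of the tumbling-window aggregation the streaming job would do before writing to the DB (the event fields and values are made up for illustration):

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical events: (timestamp, countrycode, category) -- illustrative only.
events = [
    (datetime(2016, 1, 1, 10, 1), "US", "Science"),
    (datetime(2016, 1, 1, 10, 3), "US", "Science"),
    (datetime(2016, 1, 1, 10, 7), "IN", "History"),
]

def window_start(ts, minutes=5):
    """Floor a timestamp to the start of its 5-minute tumbling window."""
    return ts.replace(second=0, microsecond=0) - timedelta(minutes=ts.minute % minutes)

# Count events per (window, countrycode, category) -- the rows the
# streaming job would flush to the summary table at the end of each window.
counts = defaultdict(int)
for ts, country, category in events:
    counts[(window_start(ts), country, category)] += 1

for key, n in sorted(counts.items()):
    print(key, n)
```

The real job would of course run continuously and flush each window as it closes; this just shows the shape of the aggregation.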
With Druid, how should our pipeline look?
1. Http Event Stream -> Kafka -> Druid -> query over it.
2. Http Event Stream -> Kafka -> Spark Streaming -> Druid -> query over it.
Should we prefer #1, since Druid can do rollups, etc.?
If we go with #1, then while querying we will have to do a lot of GROUP BYs.
But if we go with #2, then IMHO, isn't Druid similar to any other DB?
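My understanding of rollup (please correct me if I am wrong) is that at ingestion Druid collapses rows sharing the same time bucket and dimension values into one row carrying aggregates, so query-time GROUP BYs scan fewer rows. A pure-Python sketch of that idea (field names made up):

```python
from collections import defaultdict

# Hypothetical raw events as they might arrive from Kafka.
raw_rows = [
    {"time": "2016-01-01T10:00", "countrycode": "US", "category": "Science"},
    {"time": "2016-01-01T10:00", "countrycode": "US", "category": "Science"},
    {"time": "2016-01-01T10:00", "countrycode": "IN", "category": "History"},
]

# Rollup: rows with identical (time bucket, dimensions) collapse into one
# stored row with an aggregate column (here, a simple count).
rolled_up = defaultdict(int)
for row in raw_rows:
    rolled_up[(row["time"], row["countrycode"], row["category"])] += 1

segments = [
    {"time": t, "countrycode": cc, "category": cat, "count": n}
    for (t, cc, cat), n in rolled_up.items()
]
# 3 raw rows become 2 stored rows; a later GROUP BY touches fewer rows.
```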
Also, another question: in the Wikipedia example dataset that ships with Druid,
let's say in one of the graphs we wanted to get the number of users from a country who made edits to a category (assuming category is one of the fields).
The query after ingestion would be something like:
SELECT countrycode, category, COUNT(*) FROM wikipedia WHERE time > timestamp1 GROUP BY countrycode, category;
On the other hand, if we follow approach #2 above and use Spark Streaming to do the grouping and insert directly into the database,
our query would be: SELECT countrycode, category, count FROM summary_table_1 WHERE time > timestamp1;
This query will be faster than the previous one, which makes me think we should go with approach #2, but then the advantages of Druid over other DBs seem small.
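To illustrate the comparison I have in mind, here is a small sqlite3 sketch (standing in for a real DB; the table contents and timestamps are made up) showing that the GROUP BY over the raw table and the lookup over the pre-aggregated summary table return the same answer:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Raw events (approach #1 style: group at query time).
cur.execute("CREATE TABLE wikipedia (time INTEGER, countrycode TEXT, category TEXT)")
cur.executemany(
    "INSERT INTO wikipedia VALUES (?, ?, ?)",
    [(100, "US", "Science"), (101, "US", "Science"), (102, "IN", "History")],
)

# Summary table (approach #2 style: grouped by the streaming job before insert).
cur.execute(
    "CREATE TABLE summary_table_1 "
    "(time INTEGER, countrycode TEXT, category TEXT, count INTEGER)"
)
cur.execute(
    "INSERT INTO summary_table_1 "
    "SELECT MIN(time), countrycode, category, COUNT(*) FROM wikipedia "
    "WHERE time > 99 GROUP BY countrycode, category"
)

# Approach #1: aggregate at query time.
grouped = cur.execute(
    "SELECT countrycode, category, COUNT(*) FROM wikipedia "
    "WHERE time > 99 GROUP BY countrycode, category ORDER BY countrycode"
).fetchall()

# Approach #2: read the precomputed counts.
precomputed = cur.execute(
    "SELECT countrycode, category, count FROM summary_table_1 ORDER BY countrycode"
).fetchall()

# Same answer either way; #2 just pays the grouping cost at write time.
assert grouped == precomputed
```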
This assumes the normal approach is not to have multiple datasources/tables per Kafka topic, i.e. if I have a lot of raw data coming into Druid via Kafka, I should dump all the useful columns from the Kafka topic into one datasource in Druid and try to execute all kinds of queries on it.
Queries which use all/most of the columns will be fast, but narrower queries (like category + countrycode) will be slower because they will need to group.
We could ingest the data from the Kafka topic into a second datasource where we keep only countrycode and category, and roll up appropriately, but I think that would not be recommended, because it would increase the number of Kafka consumers, leading to slower performance.