Newbie to Druid, a few questions

Hello Druid world,

I am a newbie to Druid and to analytics in general, and I have a few questions.
We are working on real-time analytics, and Druid seems to be the right fit.

One way to do this without Druid would probably be:
Http Event Stream -> Kafka -> Spark Streaming -> push data to a DB -> query over it.

Since we know the types of queries we will be making, we can have one or more tables where we push the data after aggregating it over, say, a 5-minute window.

With Druid, how should our pipeline look?
#1: Http Event Stream -> Kafka -> Druid -> query over it.

Or
#2: Http Event Stream -> Kafka -> Spark Streaming -> Druid -> query over it.

Should we prefer to go with #1, since Druid can do rollups etc.?

If we go with #1, then while querying we will have to do a lot of GROUP BYs.
But if we go with #2, then IMHO, isn't Druid similar to any other DB?

Also, another question: in the Wikipedia example dataset which ships with Druid,
let's say in one of the graphs we wanted to get the number of users from a country who made edits to a category (assuming category is one of the fields).

The query after ingestion will be something like
select countrycode, category, count(*) from wikipedia where time > timestamp1 group by countrycode, category.

On the other hand, suppose we follow approach #2 above, use Spark Streaming, group there, and insert directly into the database.

Our query would be: select countrycode, category, count from summary_table_1 where time > timestamp1.

This query will be faster than the previous one, which makes me think we should go with approach #2, but then the advantages of Druid over other DBs seem smaller.

This assumes the normal approach is not to have multiple datasources/tables per Kafka topic, i.e. if I have a lot of raw data coming into Druid via Kafka, I should dump all the useful columns from the Kafka topic into one table in Druid and try to execute all kinds of queries on it.
Queries which use all/most columns will be faster, but custom queries (like category, countrycode) will be slower because they will need to group.

We could ingest the data from the Kafka topic into another datasource where we keep only countrycode and category and roll up appropriately, but I think that would not be recommended, because it increases the number of Kafka consumers, leading to slower performance.

Thanks

Druid experts, please advise on this.

Hey! OK I hope I can help! LOTS OF QUESTIONS :smiley: :smiley: :smiley:

Regarding your pipeline: both designs are right! The SIMPLEST is to do Http Event Stream -> Kafka -> Druid. That is nice and neat.
The most COMPREHENSIVE is Http Event Stream -> Kafka -> Spark Streaming -> Druid. BUT Spark will not push to Druid - Druid (right now) will want to connect to Kafka. So you will really be doing Http Event Stream -> Kafka -> Spark Streaming -> Kafka -> Druid.

I would suggest your decision really is whether you NEED to use Spark. Can you do all the things you wanted to do to your events inside Druid? Like roll-up (which you mentioned): https://www.youtube.com/watch?v=u551R7voe7w Roll-up is super efficient and will save you an entire building block in the pipeline if it does what you need - in fact it sounds like it’s doing exactly what you want. And yes, you’re right, the second query will be faster - and that’s why Druid has roll-up :smiley: :smiley:
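To make that concrete, here’s a minimal sketch of what a Kafka supervisor spec with roll-up might look like for your case. The datasource name, topic, and broker address are placeholders I made up, and I’ve assumed your events carry countrycode and category fields:

```json
{
  "type": "kafka",
  "spec": {
    "dataSchema": {
      "dataSource": "edits_rollup",
      "timestampSpec": { "column": "timestamp", "format": "iso" },
      "dimensionsSpec": { "dimensions": ["countrycode", "category"] },
      "metricsSpec": [
        { "type": "count", "name": "count" }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "HOUR",
        "queryGranularity": "FIVE_MINUTE",
        "rollup": true
      }
    },
    "ioConfig": {
      "topic": "http-events",
      "inputFormat": { "type": "json" },
      "consumerProperties": { "bootstrap.servers": "kafka-broker:9092" }
    },
    "tuningConfig": { "type": "kafka" }
  }
}
```

With queryGranularity FIVE_MINUTE and rollup true, Druid pre-aggregates rows into 5-minute buckets per (countrycode, category) at ingestion time, so a later GROUP BY just sums the count metric over far fewer rows - much like the summary table you’d build in Spark, but without the extra hop.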

And you can apply row filters and transformations at ingestion time in Druid if you are thinking of doing that in Spark, too. See https://druid.apache.org/docs/latest/tutorials/tutorial-transform-spec.html and https://druid.apache.org/docs/latest/ingestion/index.html#transformspec
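As a hedged sketch (the column names here are from the wikipedia sample data - swap in your own), the transformSpec sits inside the dataSchema and might look something like this:

```json
"transformSpec": {
  "transforms": [
    { "type": "expression", "name": "countrycode", "expression": "upper(countryIsoCode)" }
  ],
  "filter": { "type": "selector", "dimension": "isRobot", "value": "false" }
}
```

The filter drops rows you don’t want at ingestion time, and the expression transform derives a new column from an existing one - the sort of per-row work you might otherwise reach for Spark to do.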

As an aside, some people use Druid as a DB source for Spark - that’s common for real-time alerting, fraud detection, and anomaly detection when aggregated data is needed, for example.

Re: multiple tables/datasources for the same data - my advice is to start simple. Keep one datasource for one topic - remember that Druid will columnarise, index, and shard the data for you. THAT IS: UNLESS you need different tables to be secured differently. There’s a good article that, part-way through, covers the cautions around having multiple datasources:
https://druid.apache.org/docs/latest/querying/multitenancy.html

There’s a virtual summit next week - possibly someone there is doing something similar to what you are trying to do:
https://go.imply.io/Virtual-Druid-Summit-III-Registration.html

Oh, and you can see me talking in this video that’s just been published, where I talk about Druid and cover at a high level what it does to the data.
https://www.youtube.com/watch?v=34xdG8C8dbg

Keep asking - we are all here to help :slight_smile:


Thanks, Peter, for the detailed explanation.

The concern with rollup is that we cannot roll up the data too much, because there will be one table, and that table needs to keep the data at a granular enough level to support all the queries we need. So we will try to avoid Spark and hope that Druid has some good tricks up its sleeve, like clever indexes. If it can give comparable or better performance on GROUP BY queries, then it will be awesome!!

I must admit that I had only quickly gone through the Druid docs before posting the question. I have also registered for the summit!!

Thanks again!