Kafka - Druid - UI pipeline - need help designing a cluster

Hi! We are considering Druid for production, but I am new to Druid and have never run it in production, so naturally I have some questions. If anybody can help, I would appreciate it!

The system we designed is pretty basic feature-wise, but I need to be able to query it with sub-second latency.

Here’s our design:

Volume: about 10k events/sec (1mln/min)


The user API captures events and sends them to Kafka for processing by several Samza jobs (we add some metrics and do some calculations there). Events are basically transactions between multiple parties: for example, a pays b and b pays c, which gives two transactions with the same transaction hash, like [datetime, hash1, a->b $10, metric1, metric2, metric3], [datetime, hash1, b->c $10, metric1, metric2, metric3], and so on.
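To make the event shape concrete, here is a minimal Python sketch of the two example records. The field names (`ts`, `tx_hash`, `payer`, `payee`, `amount`), the `Transfer` class, the `legs_of` helper, the timestamp and the zeroed metric values are all placeholders I made up for illustration, not our actual schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List


@dataclass(frozen=True)
class Transfer:
    """One leg of a transaction; all field names are illustrative."""
    ts: datetime
    tx_hash: str   # shared by every leg of the same transaction
    payer: str
    payee: str
    amount: float  # in dollars
    metric1: float
    metric2: float
    metric3: float


def legs_of(tx_hash: str, events: List[Transfer]) -> List[Transfer]:
    """Return all legs that share the given transaction hash."""
    return [e for e in events if e.tx_hash == tx_hash]


# Arbitrary timestamp and zeroed metrics, for illustration only.
ts = datetime(2015, 1, 5, 12, 0, tzinfo=timezone.utc)
events = [
    Transfer(ts, "hash1", "a", "b", 10.0, 0.0, 0.0, 0.0),
    Transfer(ts, "hash1", "b", "c", 10.0, 0.0, 0.0, 0.0),
]
```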

The plan is to load those transactions into a Druid cluster and query them from the web interface. Transactions need to be queryable from every user's standpoint, so I plan to split them into several per-user datasources with different minimum granularities. For example, transactions where a was a participant would go to the datasources ‘user_a_hourly’ (with a minimum granularity of one hour), ‘user_a_daily’ and ‘user_a_monthly’. Aggregate metrics (sums) will be attached.
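Roughly, the routing I have in mind looks like the sketch below. It only computes datasource names following the ‘user_a_hourly’ pattern; the function name `datasources_for` is hypothetical, and this is plain Python, not a Druid or Samza API.

```python
from typing import List

# The three minimum granularities from the plan above.
GRANULARITIES = ("hourly", "daily", "monthly")


def datasources_for(payer: str, payee: str) -> List[str]:
    """Names of every datasource one transaction leg would be written to.

    Each participant gets one datasource per granularity, so a single
    leg fans out to up to 2 * 3 = 6 datasources.
    """
    names = []
    # dict.fromkeys drops the duplicate if payer and payee are the same user.
    for user in dict.fromkeys((payer, payee)):
        for granularity in GRANULARITIES:
            names.append(f"user_{user}_{granularity}")
    return names
```

For the leg a->b this gives the three ‘user_a_*’ datasources followed by the three ‘user_b_*’ ones, so the number of datasources grows with the number of users, which is part of what I am unsure about.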

First of all, is this a good design? Or is there a way to keep multiple minimum granularities in the same datasource? Sorry, I am pretty new to Druid, so a lot of this is not straightforward to me.

Then we need to query really fast and find the topN for our metrics. So the idea is: if the query covers Jan 5 - Mar 10, we query the datasource with daily granularity for Jan 5 - Jan 31, then the datasource with monthly granularity for February, and then the daily one again for Mar 1 - 10. Is this a good approach? Or is there not much to gain from having different datasources with different minimum granularities?
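The splitting logic I am picturing is sketched below in Python. It cuts a query range into a leading partial month (daily), the full calendar months in the middle (monthly), and a trailing partial month (daily). The function names are hypothetical, the year 2015 in the usage lines is arbitrary, and I am assuming half-open ranges where the end date is exclusive.

```python
from datetime import date
from typing import List, Tuple

Part = Tuple[str, date, date]  # (granularity, start, end_exclusive)


def first_of_next_month(d: date) -> date:
    """First day of the month after the one containing d."""
    return date(d.year + (d.month == 12), d.month % 12 + 1, 1)


def split_interval(start: date, end: date) -> List[Part]:
    """Split [start, end) into daily / monthly / daily parts."""
    if start >= end:
        return []
    # Where the first full calendar month could begin.
    head_end = start if start.day == 1 else first_of_next_month(start)
    # Where the last (possibly partial) month begins.
    tail_start = date(end.year, end.month, 1)
    if head_end >= tail_start:
        # No full calendar month inside the range: daily only.
        return [("daily", start, end)]
    parts: List[Part] = []
    if start < head_end:
        parts.append(("daily", start, head_end))
    parts.append(("monthly", head_end, tail_start))
    if tail_start < end:
        parts.append(("daily", tail_start, end))
    return parts


# Jan 5 - Mar 10 inclusive, written as an exclusive end of Mar 11.
for part in split_interval(date(2015, 1, 5), date(2015, 3, 11)):
    print(part)
```

The UI (or whatever sits in front of the broker) would then issue one query per part against the matching datasource and merge the results, which for topN is not necessarily exact.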

The next big question: what should the system design look like to allow sub-second query latency? Is it a lot of historicals? A lot of brokers? Or a combination of both?

Is there a way to calculate it based on the number of events flowing in, the number of events in the datastore, and the number of queries per minute?

Thank you.