Query on modeling a Druid datasource


In our legacy analytics platform, we have a metric group consisting of several metrics. All metrics within a group apply to certain dimension combinations. We plan to migrate this application and are evaluating Druid as the aggregation engine.

At first glance, a metric group seems to map readily to a datasource in Druid, but I have the following questions:

1. Metrics within a metric group vary widely in cardinality (event volume).

For example, a metric group could be website stats for a dimension combination of visitor age group, geo, etc. (up to 7-8 dimensions).

One of the metrics could be page hits (roughly billions per day).

Another metric could be the number of purchases made on the site (~10k per day).

2. Individual metrics within the same group could be loaded from upstream by different processes/threads, in any order.

We could either model all logical metrics under one dataSource (the query layer would then be a single simple query),

or we could use one dataSource per metric and combine the results of all queries at the query layer. Segment sizes should be pretty small (< 100-150 MB) as per my initial estimates.
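
To make the second option concrete, here is a minimal sketch of the merge the query layer would need. It assumes each per-metric dataSource is queried separately with a timeseries query and that each response is a list of rows with a "timestamp" and a "result" map. The sample rows, the metric names, and the merge_timeseries helper are hypothetical, for illustration only.

```python
from collections import defaultdict


def merge_timeseries(*result_sets):
    """Merge per-dataSource timeseries results into one row per timestamp.

    Each result set is assumed to be a list of
    {"timestamp": <ISO-8601 string>, "result": {<metric>: <value>}} rows.
    """
    merged = defaultdict(dict)
    for rows in result_sets:
        for row in rows:
            # Metrics from different dataSources land in the same bucket.
            merged[row["timestamp"]].update(row["result"])
    # ISO-8601 strings in the same format and zone sort chronologically.
    return [{"timestamp": ts, "result": merged[ts]} for ts in sorted(merged)]


# Hypothetical responses from two per-metric dataSources.
page_hits = [
    {"timestamp": "2015-01-01T00:00:00.000Z", "result": {"page_hits": 1200000}},
    {"timestamp": "2015-01-01T00:01:00.000Z", "result": {"page_hits": 1350000}},
]
purchases = [
    {"timestamp": "2015-01-01T00:01:00.000Z", "result": {"purchases": 7}},
]

combined = merge_timeseries(page_hits, purchases)
```

Note that a minute with page hits but no purchases simply has no "purchases" key in the merged row, so the query layer would also have to decide how to treat missing metrics.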

Can you elaborate on the pros and cons of each approach in terms of storage and performance?

(We would like to store minutely aggregated data with a TTL of 3 days for real-time querying and slicing/dicing, and precomputed hourly aggregates beyond the 3-day limit.)


Thanks and Regards


What you described is not terribly dissimilar from how we set up the cluster that receives all our monitoring data, where a large and diverse group of systems pushed whatever dimensions and metrics they happened to need into one datasource.

This was great from a management perspective, but eventually we split it into multiple datasources so that each one could be optimized for its own data.

More precisely, our systems turned out to be “well behaved” to varying degrees, and having a way to separate the screwy ones from the well-behaved ones made everyone much happier with overall performance.

The downside is that it requires more supply-side processing, since all our monitoring datasources still draw from the same tap.
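
As a rough illustration of that extra supply-side step, here is a minimal sketch of fanning one shared event stream out into per-datasource batches. The "system" field, the datasource names, and the route_events helper are all hypothetical; this is not how our pipeline is actually written.

```python
from collections import defaultdict

# Hypothetical mapping from the emitting system to its target datasource.
DATASOURCE_BY_SYSTEM = {
    "web": "monitoring_web",
    "batch": "monitoring_batch",
}
# Anything unknown (or badly behaved) goes to a catch-all datasource.
DEFAULT_DATASOURCE = "monitoring_misc"


def route_events(events):
    """Group events from one shared stream by target datasource."""
    routed = defaultdict(list)
    for event in events:
        datasource = DATASOURCE_BY_SYSTEM.get(event.get("system"), DEFAULT_DATASOURCE)
        routed[datasource].append(event)
    return dict(routed)


events = [
    {"system": "web", "metric": "page_hits", "value": 3},
    {"system": "batch", "metric": "jobs_done", "value": 1},
    {"system": "unknown", "metric": "whatever", "value": 9},
]
batches = route_events(events)
```

Each batch would then be handed to the ingestion path for its own datasource, which is where the additional processing cost comes from.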

Overall: for small data, a single datasource should be fine (1 shard per segment granularity), especially if the data is more or less well behaved. But think about how much you might need to grow, and how much you trust the various metric groups to behave.