Druid datasource for handling multiple customers?

Hi, it would be helpful if someone could suggest the best way for us to handle events from multiple customers (ranging from 10K events per month to 1 billion events per month).

a) Do you suggest a one-to-one mapping between datasources and customers?

b) If all customers come under one datasource, then re-indexing a specific customer's segments becomes difficult.

c) With separate datasources, segment sizes will differ across customers. Do you suggest this?

Hi Ram,

The advantage of separate datasources is that it makes re-indexing simpler (since you can do one at a time instead of having to do all at once) and that it improves isolation between customers during realtime ingestion (so one customer sending excessive data won’t impact others). The disadvantage is that each datasource will need at minimum one realtime task, and each of those tasks is a fixed bundle of resources that generally includes at least one CPU and a couple GB of memory.

I think that, in general, if you mostly have larger datasets per customer (more than a billion events per month), then it makes sense to have separate datasources. If you mostly have less than a billion events/month/customer, then for efficiency's sake during realtime ingestion it makes sense to combine them into some smaller number of datasources (possibly one).

To answer your other two questions,

a) Yes, new datasources are created automatically. Just asking Druid to ingest into a new datasource is enough to create it (see the sketch after these answers).

b) Yes, if you use tranquility then you can send data to as many datasources as you want from the same Bolt. Keep in mind that since each datasource has its own tasks, each datasource will require at least one outgoing TCP connection (one per task).
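To illustrate (a): a minimal index task spec along these lines is enough, and Druid will create the datasource on first ingestion. (The "customer_events" name, columns, and file paths below are hypothetical, and the exact spec layout varies somewhat by Druid version.)

```json
{
  "type": "index",
  "spec": {
    "dataSchema": {
      "dataSource": "customer_events",
      "parser": {
        "type": "string",
        "parseSpec": {
          "format": "json",
          "timestampSpec": { "column": "timestamp", "format": "auto" },
          "dimensionsSpec": { "dimensions": ["customer_id", "event_type"] }
        }
      },
      "metricsSpec": [ { "type": "count", "name": "count" } ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "HOUR",
        "intervals": ["2016-01-01/2016-01-02"]
      }
    },
    "ioConfig": {
      "type": "index",
      "firehose": { "type": "local", "baseDir": "/tmp/events", "filter": "events.json" }
    }
  }
}
```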

Thanks Gian, that clears things up.

Just one follow-up question.

We will create new datasources for the large customers and combine the small customers. Can we model the data in such a way that the customer is a dimension (a low-cardinality column) and shard the segments by that dimension? Will that ease the re-indexing?

–Ram

Yes, you can partition your data by the customer dimension. However, if you want to re-index and modify the data, you will have to re-index all the small customers clubbed together in that datasource at once, since re-indexing rewrites whole segments for a time interval rather than individual dimension values.
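As an illustration of that "all at once" point, a re-index of such a combined datasource could look like the sketch below (the "small_customers" name, columns, and interval are hypothetical): an index task reads the existing segments back through the ingestSegment firehose and rewrites the entire interval, so every small customer in it is rewritten together.

```json
{
  "type": "index",
  "spec": {
    "dataSchema": {
      "dataSource": "small_customers",
      "parser": {
        "type": "string",
        "parseSpec": {
          "format": "json",
          "timestampSpec": { "column": "timestamp", "format": "auto" },
          "dimensionsSpec": { "dimensions": ["customer_id", "event_type"] }
        }
      },
      "metricsSpec": [ { "type": "count", "name": "count" } ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "HOUR",
        "intervals": ["2016-01-01/2016-02-01"]
      }
    },
    "ioConfig": {
      "type": "index",
      "firehose": {
        "type": "ingestSegment",
        "dataSource": "small_customers",
        "interval": "2016-01-01/2016-02-01"
      }
    }
  }
}
```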

In batch indexing, you can use a single-dimension partitionsSpec to partition your data by the customer dimension (sketched below).
If you use Tranquility, you can provide a beamMergeFn to the DruidBeams builder to achieve this.
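A minimal sketch of that partitionsSpec, inside a Hadoop index task's tuningConfig (the "customer_id" dimension and target size are hypothetical; the type is called "dimension" in older Druid releases and "single_dim" in newer ones):

```json
{
  "tuningConfig": {
    "type": "hadoop",
    "partitionsSpec": {
      "type": "dimension",
      "partitionDimension": "customer_id",
      "targetPartitionSize": 5000000
    }
  }
}
```

With this, each segment within a time chunk covers a contiguous range of customer_id values, so a given customer's rows are concentrated in a small number of segments.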