My team is evaluating Druid for realtime analytics for clients, and we’re trying to figure out the practical limits of running a multi-tenant Druid cluster with a datasource for each of our clients. Based on our application requirements, a datasource per client looks like a necessity, so that we can easily control data retention on a per-client basis. We’re looking at indexing traffic for a few thousand clients, so we’re trying to work out how reasonable it is to run a cluster that indexes data into thousands of distinct datasources.
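For concreteness, the per-client retention we have in mind would use Druid’s per-datasource retention rules, something like this (the period and replicant counts are just placeholders, and the datasource name is hypothetical):

```json
[
  {"type": "loadByPeriod", "period": "P90D", "tieredReplicants": {"_default_tier": 2}},
  {"type": "dropForever"}
]
```

posted to the coordinator for a single client’s datasource (e.g. `/druid/coordinator/v1/rules/client_1234`, if I’m reading the docs right), so each client can keep a different window of data.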
I found a couple of prior threads on this topic (here & here), but I was wondering if anyone with experience running this kind of architecture could share their experiences. In particular, are there any headaches we’d be likely to run into, or any scaling limitations that might cause problems? I’m also a little concerned about the resource requirements of having a realtime task per datasource; it would be great to hear some real-life experience on how expensive this ends up getting.
Has anyone run a cluster with thousands of datasources who would be willing to share their experiences with me?
Thanks in advance!