Dynamically Provisioning Datasources

Hey everyone,

I’m evaluating Druid for a multi-tenant application. In our current SQL setup, TenantId is part of our primary key and we use it whenever we query the data. When a tenant leaves the application, we simply run a DELETE on that TenantId. Since Druid segments are immutable, this doesn’t seem practical: it looks like we’d have to reindex all of the data for everyone whenever a tenant leaves. The other option I looked into was having a datasource per tenant, but from what I’ve read, Druid requires a separate realtime configuration for each datasource, which seems like it would be really difficult to scale.

Anyone done anything like this? Any advice on how we might be able to implement this kind of pattern?

Thanks,

Zak Kristjanson

I would probably recommend just dealing with the "garbage" sitting
around. Likely, your active tenants will greatly outweigh the
inactive ones. If a "whale" goes inactive, it would probably be worth
the extra cost to reindex and remove them.

If this does become a significant source of cost for you, there are
some code changes that could likely be done to enable this sort of
cleanup.

--Eric

It may also be worth doing some sort of poor man’s garbage collection. You could periodically check how many of your Druid rows are garbage on a day-by-day basis by doing a topN or groupBy on the tenant dimension with a “count” aggregator, and comparing the result to a list of active tenants. If the garbage ratio for a day is above, say, 50%, then reindex the data for that day, leaving out the inactive tenants.
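
Here is a rough Python sketch of that check, in case it helps. Everything named here is an assumption for illustration: the datasource name, the "tenantId" dimension, the "rows" output name, and the helper functions are all made up, and the result parsing assumes the usual groupBy response shape (a list of entries with "timestamp" and "event" fields). It only builds the query body and processes results you have already fetched; posting the query to a broker is left out.

```python
from collections import defaultdict


def build_groupby_query(datasource, interval, tenant_dimension="tenantId"):
    """Build a groupBy query body counting Druid rows per tenant per day.

    The datasource and dimension names are hypothetical placeholders.
    """
    return {
        "queryType": "groupBy",
        "dataSource": datasource,
        "granularity": "day",
        "dimensions": [tenant_dimension],
        "aggregations": [{"type": "count", "name": "rows"}],
        "intervals": [interval],
    }


def garbage_ratio_by_day(results, active_tenants, tenant_dimension="tenantId"):
    """Return {day: fraction of rows belonging to tenants that are not active}."""
    total = defaultdict(int)
    garbage = defaultdict(int)
    for row in results:
        day = row["timestamp"][:10]  # keep only the YYYY-MM-DD part
        event = row["event"]
        total[day] += event["rows"]
        if event[tenant_dimension] not in active_tenants:
            garbage[day] += event["rows"]
    return {day: garbage[day] / total[day] for day in total if total[day] > 0}


def days_to_reindex(ratios, threshold=0.5):
    """Return the days whose garbage ratio is above the threshold, sorted."""
    return sorted(day for day, ratio in ratios.items() if ratio > threshold)


# Made-up results: tenant "b" has left, tenant "a" is still active.
sample_results = [
    {"timestamp": "2015-03-01T00:00:00.000Z", "event": {"tenantId": "a", "rows": 100}},
    {"timestamp": "2015-03-01T00:00:00.000Z", "event": {"tenantId": "b", "rows": 300}},
    {"timestamp": "2015-03-02T00:00:00.000Z", "event": {"tenantId": "a", "rows": 400}},
    {"timestamp": "2015-03-02T00:00:00.000Z", "event": {"tenantId": "b", "rows": 100}},
]
ratios = garbage_ratio_by_day(sample_results, active_tenants={"a"})
print(days_to_reindex(ratios))
```

With this sample data, the first day is 75% garbage and the second is 20%, so only the first day is flagged. Note that the "count" aggregator counts Druid rows after any rollup, so the ratio is measured in stored rows rather than in original events.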