We plan to use the Tranquility stream push model for real-time ingestion of data for multiple Tenants. As discussed in the docs (http://druid.io/docs/latest/querying/multitenancy.html), we have 2 options here:
- Use the same datasource for all Tenants.
- Use a different datasource for each Tenant.
We also foresee a use case where we migrate a Tenant's data from one Druid cluster to another Druid cluster.
Question 1) With this requirement, does it make sense to keep each Tenant's data in a separate datasource?
Question 2) If this is the right approach, there seems to be a scalability issue with respect to the number of indexing tasks in the setup. For example, the following ingestion spec definition:
- segment granularity = 1hr
- window period = 10 minutes
- partitions = 3
- replication = 2
will result in 6 concurrent indexing tasks (3 partitions times 2 replicas) for each Tenant, and since we have to account for the period in which tasks for consecutive segment intervals overlap, the Druid cluster should have a capacity of 12 workers for each Tenant.
This doesn't seem scalable as the number of tenants keeps increasing, since we are adding 12 tasks (cores) for each Tenant. So what is the preferred way of segregating multiple tenants' data in Druid, keeping in mind that we might have to migrate one tenant's data to another cluster?
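To make the arithmetic concrete, here is a rough sketch of how we are estimating worker capacity. The function names are made up for illustration, and the factor of 2 for overlapping segment intervals is our own assumption about the worst case around the hour boundary, not something taken from the Druid docs.

```python
def worker_slots_per_tenant(partitions: int, replicants: int,
                            overlapping_intervals: int = 2) -> int:
    """Estimate peak concurrent indexing tasks for one tenant's datasource.

    overlapping_intervals is an assumption: while the window period of the
    previous hour is still open, tasks for two segment intervals run at once.
    """
    tasks_per_interval = partitions * replicants
    return tasks_per_interval * overlapping_intervals


def worker_slots_for_cluster(tenants: int, partitions: int = 3,
                             replicants: int = 2) -> int:
    """Total worker slots if every tenant gets its own datasource."""
    return tenants * worker_slots_per_tenant(partitions, replicants)


if __name__ == "__main__":
    # Print the estimated capacity for a few tenant counts.
    for tenants in (1, 10, 100):
        print(tenants, worker_slots_for_cluster(tenants))
```

With the settings above, the estimate grows linearly: 12 worker slots for 1 tenant and 1200 for 100 tenants, which is the growth we are worried about.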
Question 3) If we go with option 1, the same datasource for all Tenants, is there a mechanism by which we can export a specific tenant's data (selected by some dimension value) from Druid and later import it into another Druid cluster?