[Multi Tenancy] Migrating tenant data from one Druid cluster to another

Hi All

We plan to use the Tranquility stream-push model for real-time ingestion of data for multiple tenants. As discussed in the docs (http://druid.io/docs/latest/querying/multitenancy.html), we have two options here:

  1. Use the same datasource for all tenants.
  2. Use a different datasource for each tenant.

We also foresee a use case of migrating a tenant's data from one Druid cluster to another.

Question 1) With this requirement, does it make sense to keep each tenant's data in a separate datasource?

Question 2) If this is the right approach, there seems to be a scalability issue with respect to the number of indexing tasks in the setup. For example, the following ingestion spec:

  1. segment granularity = 1hr
  2. window period = 10 minutes
  3. partitions = 3
  4. replication = 2

will result in 6 concurrent indexing tasks for each tenant (3 partitions × 2 replicas), and because tasks for adjacent segment intervals overlap during the window period, the cluster needs worker capacity for 12 tasks per tenant.

This does not seem scalable, since we are adding 12 tasks (cores) for every tenant as the tenant count grows. What is the preferred way to segregate multiple tenants' data in Druid, keeping in mind that we might have to migrate one tenant's data to another cluster?

Question 3) If we go with approach 1, the same datasource for all tenants, is there a mechanism to export a specific tenant's data (based on some dimension value) from Druid and later import it into another Druid cluster?

Thanks

Himanshu

Hi Himanshu,
See Inline.

Hi All

We plan to use the Tranquility stream-push model for real-time ingestion of data for multiple tenants. As discussed in the docs (http://druid.io/docs/latest/querying/multitenancy.html), we have two options here:

  1. Use the same datasource for all tenants.
  2. Use a different datasource for each tenant.

We also foresee a use case of migrating a tenant's data from one Druid cluster to another.

Question 1) With this requirement, does it make sense to keep each tenant's data in a separate datasource?

Yeah, IMO it makes sense to have each tenant as a different datasource. This makes moving the segments from one cluster to another quite a bit simpler; it would just be a matter of copying over the segment metadata for those datasources to the new cluster.
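For instance, a sketch of the per-tenant approach: every tenant's ingestion spec is identical except for the dataSource name. (The tenant name "tenant_acme" and the dimension/metric names below are hypothetical, not from this thread.)

```json
{
  "dataSchema": {
    "dataSource": "tenant_acme",
    "parser": {
      "type": "string",
      "parseSpec": {
        "format": "json",
        "timestampSpec": { "column": "timestamp", "format": "auto" },
        "dimensionsSpec": { "dimensions": ["client_id", "event_type"] }
      }
    },
    "metricsSpec": [ { "type": "count", "name": "count" } ],
    "granularitySpec": {
      "type": "uniform",
      "segmentGranularity": "HOUR",
      "queryGranularity": "NONE"
    }
  }
}
```

With this layout, migrating a tenant means moving only the segments belonging to that one datasource.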

Question 2) If this is the right approach, there seems to be a scalability issue with respect to the number of indexing tasks in the setup. For example, the following ingestion spec:

  1. segment granularity = 1hr
  2. window period = 10 minutes
  3. partitions = 3
  4. replication = 2

will result in 6 concurrent indexing tasks for each tenant (3 partitions × 2 replicas), and because tasks for adjacent segment intervals overlap during the window period, the cluster needs worker capacity for 12 tasks per tenant.

This does not seem scalable, since we are adding 12 tasks (cores) for every tenant as the tenant count grows. What is the preferred way to segregate multiple tenants' data in Druid, keeping in mind that we might have to migrate one tenant's data to another cluster?

You do not need to have 3 partitions for every tenant; for smaller ones you can have only 1 partition, which would mean as few as 2 tasks per tenant (1 partition × 2 replicas). You can also host multiple tasks per worker node, so the number of actual workers can be less than the number of tenants.
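If you run Tranquility server, partitions and replicas can be tuned per datasource through the task properties, so small tenants stay cheap. A rough sketch, assuming one datasource per tenant (tenant names hypothetical):

```json
{
  "dataSources": {
    "tenant_small": {
      "properties": {
        "task.partitions": "1",
        "task.replicants": "2"
      }
    },
    "tenant_large": {
      "properties": {
        "task.partitions": "3",
        "task.replicants": "2"
      }
    }
  }
}
```

Here the small tenant runs 2 concurrent tasks instead of 6, and several such tasks can share one middle-manager worker.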

Question 3) If we go with approach 1, the same datasource for all tenants, is there a mechanism to export a specific tenant's data (based on some dimension value) from Druid and later import it into another Druid cluster?

If you really want to have all your data in a single datasource with the tenant name as a dimension, then to move the data to a separate cluster you can re-index your data, using a filter to select only the data for the tenants to be moved. (More details on re-indexing can be found here: http://druid.io/docs/latest/ingestion/update-existing-data.html)
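As a rough sketch of such a re-index (the datasource name, interval, and tenant dimension below are hypothetical), the batch task's ioConfig can read from the existing datasource with the "dataSource" inputSpec and apply a filter:

```json
{
  "ioConfig": {
    "type": "hadoop",
    "inputSpec": {
      "type": "dataSource",
      "ingestionSpec": {
        "dataSource": "shared_events",
        "intervals": ["2016-01-01/2016-02-01"],
        "filter": {
          "type": "selector",
          "dimension": "tenant_id",
          "value": "tenant_acme"
        }
      }
    }
  }
}
```

The re-indexed output (containing only the filtered tenant's rows) can then be loaded into the target cluster.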


Hi All,

Need help creating a multi-tenant configuration.

1/ Batch index task with partitionsSpec: I expected a segment partition to be created separately for each client, but that is not happening.

"tuningConfig": {
  "type": "hadoop",
  "partitionsSpec": {
    "type": "dimension",
    "partitionDimension": "client_id",
    "partitionDimensions": [
      "client_id"
    ],
    "targetPartitionSize": 2,
    "maxPartitionSize": 3,
    "assumeGrouped": false
  },
  "jobProperties": {}
}

When I checked the logs:

- the shardSpec JSON is not getting picked up;
- the partitionsSpec JSON is picked up, but partitionDimensions is empty.

2/ Can a shardSpec configuration be given for a batch task? If yes, how can I configure one batch task so that it generates segment partitions on a per-client basis?