Exact semantics of partitionSpec in tuningConfig for Hadoop Indexing

Hi,
We are about to implement a multi-tenant Druid schema, so we have a tenant_id column in our data source. What does setting the partitionSpec on an index task achieve?

Specifically, does it imply that we can index each tenant's data independently? That is, could we index the data for tenant 1 for hour x in one indexing job and run another indexing job for tenant 2 for the same hour x? That would give us the flexibility to index or reindex the data for a particular tenant independently.

We don’t want to have 1 data source per client, since that leads to substantial overhead.

"Index task" above refers to either the Hadoop index task or the overlord index task.

Thanks

Roshan PV

The partitionSpec specifies how data is partitioned within Druid.
Have you taken a look at:

http://druid.io/docs/0.9.0/querying/multitenancy.html
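
For reference, here is a minimal sketch of what partitioning on the tenant column might look like in a Hadoop index task's tuningConfig. It assumes the field names from the Druid batch ingestion docs, where the key is spelled `partitionsSpec` and single-dimension partitioning uses `"type": "dimension"` with a `partitionDimension`; the `targetPartitionSize` value is purely illustrative. Check the names against the docs for your Druid version. The spec only controls how rows inside a time chunk are split into segments.

```python
import json

# Hypothetical tuningConfig fragment for a Hadoop index task.
# Field names are assumed from the Druid batch ingestion docs;
# verify them against the documentation for your Druid version.
tuning_config = {
    "type": "hadoop",
    "partitionsSpec": {
        "type": "dimension",                # single-dimension range partitioning
        "partitionDimension": "tenant_id",  # the tenant column from the question
        "targetPartitionSize": 5000000,     # illustrative rows-per-segment target
    },
}

# Print the fragment as JSON so it can be pasted into an ingestion spec.
print(json.dumps(tuning_config, indent=2))
```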

Hi Yang,

Right now we are facing many scenarios in which the data of only 1 of 100 tenants needs to be re-indexed, because the upstream data source edited some old events. Instead of re-indexing the data for all 100 clients, is it possible to re-index the data of only 1 client and retain the indexed data of the other 99 clients, if we partition the data on the tenant_id column using partitionSpec?

Thanks

Roshan PV

Hey Roshan,

No, this is not possible. Currently, re-indexing any data for a time range means re-indexing all of the datasource's data for that same time range. One possible compromise is to have shared datasources for groups of tenants, rather than a single datasource for all tenants. That way, when you want to re-index one tenant’s data, you only have to re-index that group and not everything.
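
A rough sketch of the grouping idea, assuming tenants are assigned to shared datasources by a stable hash of the tenant id. The datasource naming scheme (`events_group_NN`), the group count, and both helper functions are hypothetical and not part of Druid; any fixed tenant-to-group mapping would work just as well.

```python
import zlib

NUM_GROUPS = 10  # hypothetical number of shared datasources


def datasource_for(tenant_id, num_groups=NUM_GROUPS):
    """Map a tenant to a shared datasource name using a stable hash."""
    group = zlib.crc32(tenant_id.encode("utf-8")) % num_groups
    return f"events_group_{group:02d}"


def tenants_to_reindex(changed_tenant, all_tenants, num_groups=NUM_GROUPS):
    """Return the datasource to re-index and every tenant that shares it."""
    target = datasource_for(changed_tenant, num_groups)
    cohort = sorted(
        t for t in all_tenants if datasource_for(t, num_groups) == target
    )
    return target, cohort


# Example: 100 tenants, one of which had old events edited upstream.
tenants = [f"tenant_{i}" for i in range(1, 101)]
datasource, cohort = tenants_to_reindex("tenant_42", tenants)
print(f"Re-index {datasource}: {len(cohort)} of {len(tenants)} tenants affected")
```

With a layout like this, a re-index job for the affected time range reads the input for every tenant in the cohort and writes only to that one datasource, leaving the other groups untouched.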

Hi Gian,

I'm wondering what the cost of having a datasource per tenant would be, as a way to solve this issue?

thanks

Guru