Replays per tenant

Hi Guys

We need to be able to replay and replace data per tenant. Druid’s multi-tenancy support does not seem to handle this well, and we need some advice on how to approach it.

Druid’s multi-tenancy does not create a separate segment for each tenant, so a single tenant’s data cannot be replaced independently.

Having a datasource per tenant seems like a large overhead, as we have a lot of small tenants (and a few large ones), and Druid apparently needs JVMs running per datasource. Is there a way to optimize?

What are the recommendations here?

Thanks

Guru

You do not need a new JVM per data source. Please point out the documentation you read that in (or wherever you heard it) so that we can correct it.

The basic choices are to either have independent datasources per tenant or to mix tenants together in one datasource. Both choices have their pluses and minuses: mix them together and updating a single tenant is less efficient; separate them and you have a long tail of small tenants to manage.
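
With the “mix them together” approach, each row carries a tenant dimension and every query filters on it. A minimal sketch of such a query, where the “shared_events” datasource name, the “tenant_id” dimension, and the interval are all illustrative rather than anything Druid requires:

    {
      "queryType": "timeseries",
      "dataSource": "shared_events",
      "granularity": "day",
      "filter": { "type": "selector", "dimension": "tenant_id", "value": "tenant_42" },
      "aggregations": [ { "type": "longSum", "name": "events", "fieldName": "count" } ],
      "intervals": [ "2015-09-01/2015-10-01" ]
    }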

A number of Druid installations support multi-tenant workloads both ways, so the choice is really about which tradeoff you prefer. Both can be made to work.

–Eric

Eric, Guru is probably referring to http://druid.io/docs/latest/querying/multitenancy.html, which I wrote and which includes the sentence “Each datasource requires its own JVMs for realtime indexing”. I was referring to the task peon JVMs.

Maybe we should change that to “each data source is ingested independently by its own set of tasks”?

–Eric

But as per this: http://druid.io/docs/latest/design/peons.html
“Peons run a single task in a single JVM.” So in essence, the number of datasources is roughly the number of JVMs.

What am I missing here?

Here is what we plan to do to get, hopefully, the best of both worlds. It would be great if you could let us know your thoughts:

  1. Have a single dataSource for all tenants, with the usual hash-based sharding to manage segment size (nothing related to tenancy).

  2. When we get data that needs to be replayed for a specific tenant only, for specific time intervals:

a. Create new Druid segments out of the existing ones, filtering out only that tenant’s rows. We would use the “multi” inputSpec to do this and merge the result with the tenant’s new data available in HDFS (see the sketch below).
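
A sketch of the kind of ioConfig and tuningConfig we have in mind for the Hadoop indexing task; the datasource name, “tenant_id” dimension, HDFS path, and interval are illustrative:

    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "multi",
        "children": [
          {
            "type": "dataSource",
            "ingestionSpec": {
              "dataSource": "shared_events",
              "intervals": ["2015-09-01/2015-09-02"],
              "filter": {
                "type": "not",
                "field": { "type": "selector", "dimension": "tenant_id", "value": "tenant_42" }
              }
            }
          },
          {
            "type": "static",
            "paths": "hdfs://namenode/data/tenant_42/corrected/"
          }
        ]
      }
    },
    "tuningConfig": {
      "type": "hadoop",
      "partitionsSpec": { "type": "hashed", "targetPartitionSize": 5000000 }
    }

The “not” filter keeps every existing row except the tenant being replayed, the “static” child pulls that tenant’s corrected data from HDFS, and the hashed partitionsSpec is the tenancy-agnostic sharding from point 1.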

We built a POC of this, and it seems to work fine. Do you see any concerns with this approach? It gives us a lot of flexibility in terms of what we can replay independently.