Multi-tenant deployment where each client has a different dimension set

Hey,

imagine you are ingesting impression logs of multiple clients, and the Kafka events of all clients share:

  1. userId, cookieID

  2. possibly a few common dimensions with corresponding metrics

But each client can have its own dimension set, because their businesses differ.

For simplicity, let’s imagine we would never have to scale horizontally beyond a single node; we are just interested in real-time interactivity.

For instance, this says that you can have some dimensions missing in some events, but what about this ^ schemaless case?

Should each client have its own Table DataSource, so that the Storm/Spark Streaming job feeding the real-time node “categorizes” events by client and stores them into different Druid DataSources?

Can this run on a simple Druid setup with a single real-time service, like the one in Druid’s Tutorial? Because from what I know, Druid is able to ingest only a single stream by default, right?

Thank you! Jakub

Hey Jakub,

You can have a Druid schema that doesn’t explicitly specify dimensions; in that case, any field not already specified as a timestamp or metric will be ingested as a dimension. This feature is often helpful in a case like yours.

When deciding whether to use a shared datasource or a datasource per tenant, the considerations are usually:

Pros of datasources per tenant:

  • Each datasource can have its own schema, its own backfills, its own partitioning rules, and its own data load rules

  • Queries can be faster since there will be fewer segments to examine for a typical tenant’s query

  • You get the most flexibility

Pros of shared datasources (i.e., downsides of having many separate datasources):

  • Each datasource requires its own JVMs for realtime indexing

  • Each datasource requires its own YARN resources for hadoop batch jobs

  • Each datasource requires its own segment files on disk

  • For these reasons it can be wasteful to have a very large number of small datasources

Hi Gian,

in that case a shared datasource would be sufficient.

Just one correction: I actually knew that having arbitrary dimensions is possible.

What I wanted to say is that each client can have arbitrary metrics. Is that possible too with a shared datasource?

Because metrics need to have a pre-defined aggregator, right? Or is it possible to use the Count aggregator, for instance, by default for these arbitrary metrics?

Thank you for the insights! Jakub

Hey Jakub,

Arbitrary metrics are not possible, but one way people get around that is to have a few predefined metrics like “met1_sum”, “met1_min”, “met1_max”, and so on.
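For example, a metricsSpec along these lines would do it (just a sketch; the metric names and the generic input field “met1” are placeholders, assuming each client maps whatever metric it has into that field):

  "metricsSpec" : [
    { "type" : "count",     "name" : "count" },
    { "type" : "doubleSum", "name" : "met1_sum", "fieldName" : "met1" },
    { "type" : "doubleMin", "name" : "met1_min", "fieldName" : "met1" },
    { "type" : "doubleMax", "name" : "met1_max", "fieldName" : "met1" }
  ]

Rows that don’t carry “met1” at all should simply roll up as zeroes for those columns.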

Hi Gian,

You wrote: “You can have a Druid schema that doesn’t explicitly specify dimensions; in that case, any field not already specified as a timestamp or metric will be ingested as a dimension.”

I cannot confirm that. Steps to reproduce:

  1. Index this task: https://gist.github.com/l15k4/1054f2a127b5050c33c0 with events like:

{"time": "2015-01-01T00:00:01.000", "gwid": "8f14e45f-ceea-367a-9a36-dedd4bea2543", "country": "nzl", "section": 0.08739450661637685, "purchase": "small", "kv_0": 0.07701255849946403, "kv_1": 0.019320348491121905, "kv_2": 0.2097968108921387, "kv_3": 0.09669132192026665}


Where kv_* are the dynamic dimensions

  2. A raw select query https://gist.github.com/l15k4/78e6474785308d2e8f61 then returns events with only the 3 dimensions: country, section and purchase.

Is there anything I need to do explicitly to enable this feature?

Hey Jakub,

You need to have a dimensionsSpec that does not have any dimensions. It can have some dimensionExclusions. So for example:

  "dimensionsSpec" : {
    "dimensions" : [],
    "dimensionExclusions" : [
      "timestamp",
      "value"
    ]
  }
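In the context of an index task, that dimensionsSpec sits inside the parser, roughly like this (a sketch only; the timestamp column “time” is taken from your example event, the rest is illustrative):

  "parser" : {
    "type" : "string",
    "parseSpec" : {
      "format" : "json",
      "timestampSpec" : {
        "column" : "time",
        "format" : "auto"
      },
      "dimensionsSpec" : {
        "dimensions" : [],
        "dimensionExclusions" : [ "time" ]
      }
    }
  }

With that in place, country, section, purchase and all of the kv_* fields should be picked up as dimensions automatically. If any of those input fields also feed an aggregator in your metricsSpec, you may want to list them in dimensionExclusions as well so they don’t additionally get ingested as dimensions.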

Hi Gian,

that seems to be working, although performance is a little unexpected.

I’ve done some benchmarks and Druid seems to perform quite well.

I tested it in your distribution-docker container even with 1000 custom dimensions named “kv_1” to “kv_1000”, each with its own normally distributed values (standard deviation 0.2), and even with that Druid ingests 4000 events/s (quad-core with 12 GB RAM).

But this performance applies to JSON data files (50 MB each) where the individual JSON event objects are not delimited by a newline (“\n”)… If I delimit them with newlines, performance drops radically (~8x) and the task always fails.

So the Overlord really likes one huge line of many JSON objects (a 50 MB line of ~1600 JSON objects) and cannot handle 50 MB files with ~1600 lines of JSON events/objects each.
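To make the two layouts concrete (a shortened, hypothetical two-event file; the “…” stands for the rest of each event):

  works (one line, objects concatenated):
    {"time": "…", "gwid": "…", …}{"time": "…", "gwid": "…", …}

  fails (one object per line):
    {"time": "…", "gwid": "…", …}
    {"time": "…", "gwid": "…", …}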

Would you please take a look at it? I’m now considering working with JSON files without EOLs…

Thank you for your insights, Gian!

GitHub issue https://github.com/druid-io/druid/issues/2389 with the benchmark included.