Designing for custom dimensions

We are looking to build a system that ingests data from our users while allowing users to send custom dimensions. Our goal is to support as many custom dimensions per user as possible. We see several potential approaches (some more reasonable than others) to this problem, and we are looking for feedback on the trade-offs.

  1. Shared datasource & dimensions

Limit each user to N custom dimensions and add N generic columns to the schema. This has the benefit of being simple, but the obvious disadvantages of mixing data sets and multiplicative cardinality. External to Druid, we would maintain a mapping from each user's custom dimension names to generic dimension names in the datasource, e.g. user123_productcolor -> custom_dim1. This is the naive approach.
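
For concreteness, here is a minimal sketch of what that external mapping layer might look like. The limit of 20 dimensions and the in-memory dict are illustrative assumptions; a real implementation would persist the mapping in a database.

```python
# Hypothetical mapping layer for approach #1: each user's custom dimension
# names are assigned to one of N generic columns (custom_dim1..custom_dimN).

MAX_CUSTOM_DIMS = 20  # the per-user limit N (assumed value)

# user_id -> {custom dimension name -> generic column name}
_mappings = {}

def generic_column_for(user_id, custom_dim):
    """Return the generic column assigned to a user's custom dimension,
    allocating the next free slot if this dimension is new."""
    user_map = _mappings.setdefault(user_id, {})
    if custom_dim in user_map:
        return user_map[custom_dim]
    if len(user_map) >= MAX_CUSTOM_DIMS:
        raise ValueError(f"{user_id} exceeded {MAX_CUSTOM_DIMS} custom dimensions")
    column = f"custom_dim{len(user_map) + 1}"
    user_map[custom_dim] = column
    return column

# First call for a user allocates the first slot:
# generic_column_for("user123", "productcolor") -> "custom_dim1"
```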

  2. Shared datasource & custom dimensions

Add each user's unique dimensions to a datasource's schema, potentially limiting each user to N custom dimensions. These additional columns would be sparsely populated in the input data. This would likely require an external mapping layer similar to approach #1 to name the columns, in order to avoid collisions and enforce a per-user dimension limit (a sketch follows below).

In some of our preliminary testing, a single segment remains reasonably sized as the number of columns grows. We have yet to extensively evaluate query performance with this approach.
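
As a rough illustration of the ingestion side of this approach, here is a minimal sketch. It assumes events arrive as dicts with a custom_dimensions map and that columns are namespaced by user id; both the event shape and the prefixing scheme are assumptions, not something we have built.

```python
# Sketch of approach #2: namespace each user's custom dimensions so they can
# coexist as sparse columns in one shared datasource.

MAX_CUSTOM_DIMS = 20  # assumed per-user limit N

def to_ingest_row(user_id, event):
    """Flatten an incoming event into a row for the shared datasource,
    prefixing custom dimensions with the user id to avoid collisions."""
    custom = event.get("custom_dimensions", {})
    if len(custom) > MAX_CUSTOM_DIMS:
        raise ValueError(f"{user_id} sent more than {MAX_CUSTOM_DIMS} custom dimensions")
    row = {"timestamp": event["timestamp"], "user_id": user_id}
    for name, value in custom.items():
        row[f"{user_id}_{name}"] = value  # e.g. user123_productcolor
    # Columns belonging to other users are simply absent, so the data stays sparse.
    return row
```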

  3. Dedicated datasource per user

Create a new datasource per user, allowing a high number of custom dimensions. Intuitively, this sounds like it would give the best query performance and efficient segment sizes. We are unsure how a Druid cluster would handle this approach at the scale of tens of thousands of datasources; one concern is overloading the coordinator with metadata work. We have yet to perform real-world benchmarks to measure the limitations of this approach.

  4. Hybrid

Find some mix of approaches #2 and #3 that strikes a balance between cluster performance, resource usage, and stability, at the cost of added complexity. For example, use approach #2 to hold K users per datasource, constraining the total number of columns per datasource while using far fewer datasources than #3.
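
A minimal sketch of how users might be routed to a fixed pool of shared datasources under the hybrid approach; the pool size and naming scheme are assumptions for illustration.

```python
# Sketch of the hybrid approach: hash each user onto one of a fixed pool of
# shared datasources so each datasource carries only ~K users' custom columns.

import hashlib

NUM_DATASOURCES = 100  # assumed pool size, chosen so each datasource holds ~K users

def datasource_for(user_id):
    """Deterministically assign a user to one of the shared datasources."""
    shard = int(hashlib.sha1(user_id.encode()).hexdigest(), 16) % NUM_DATASOURCES
    return f"events_shard_{shard:03d}"

# The same user always lands in the same datasource, so its columns (named as
# in approach #2) only appear in that one datasource's segments.
```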

We are interested in hearing others' experience and/or intuition on scaling Druid with these approaches.

Thanks!

Hi,

For the coordinator, what really matters is the number of segments, not the number of dataSources. I would try to benchmark #3; it would also be much more manageable.

FWIW, we currently have multiple Druid clusters dedicated to individual users, but in the future we might want to bring some of them onto a common cluster. Sometimes having a dedicated cluster is good because the query SLA differs between users, and appropriate hardware and tuning can be used for those specific cases.

– Himanshu

Thanks for the tip. We plan to benchmark a cluster with a lot of datasources. I will try to share the results.

Aside from the ordinary SLA-level external measurements, are there any internal metrics to pay special attention to as we benchmark?

-Logan

Some thoughts inline.

  3. Dedicated datasource per user

Create a new datasource per user, allowing a high number of custom dimensions. Intuitively, this sounds like it would give the best query performance and efficient segment sizes. We are unsure how a Druid cluster would handle this approach at the scale of tens of thousands of datasources; one concern is overloading the coordinator with metadata work. We have yet to perform real-world benchmarks to measure the limitations of this approach.

This has nice properties in that you can customize retention rules and resource allocations on a per-user basis. I agree with Himanshu that the number of segments matters more than the number of datasources.
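
For example, retention can be set per datasource through the coordinator's rules endpoint. A rough sketch, assuming a per-user datasource name and made-up periods; double-check the rule types and endpoint against the Druid docs for your version.

```python
# Rough sketch: per-datasource (i.e. per-user) retention via the coordinator
# rules API. The host, datasource name, and periods are illustrative only.

import requests

COORDINATOR = "http://coordinator.example.com:8081"  # assumed coordinator address
datasource = "user123_events"                        # hypothetical per-user datasource

rules = [
    # keep the most recent 30 days loaded on the default tier
    {"type": "loadByPeriod", "period": "P30D",
     "tieredReplicants": {"_default_tier": 2}},
    # drop everything older than that
    {"type": "dropForever"},
]

resp = requests.post(f"{COORDINATOR}/druid/coordinator/v1/rules/{datasource}",
                     json=rules)
resp.raise_for_status()
```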