We are looking to build a system that ingests data from our users while allowing each user to send custom dimensions. Our goal is to support as many custom dimensions per user as possible. We see several potential approaches to this problem (some more reasonable than others) and are looking for feedback on the trade-offs.
- Approach #1: Shared datasource & generic dimensions
The naive approach: limit each user to N custom dimensions and add N generic columns to the schema. This has the benefit of being simple, but the obvious disadvantages of mixing data sets and multiplicative cardinality. External to Druid, we would maintain a mapping from each user's custom dimension to a generic dimension name in the datasource, e.g. user123_productcolor -> custom_dim1.
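To make that mapping layer concrete, here is a minimal sketch in Python, assuming an in-memory map and a hypothetical limit of N generic columns named custom_dim1..custom_dimN (in practice the mapping would live in a shared store, not process memory):

```python
# Sketch of the external mapping layer for approach #1. Assumptions: generic
# columns are named custom_dim1..custom_dimN and the map is held in memory.
N_CUSTOM_DIMS = 20  # hypothetical per-user limit

# (user_id, user dimension name) -> generic column name,
# e.g. ("user123", "productcolor") -> "custom_dim1"
_mapping: dict[tuple[str, str], str] = {}

def resolve_dimension(user_id: str, dim_name: str) -> str:
    """Return the generic column assigned to this user's custom dimension,
    allocating the next free slot if the dimension is new."""
    key = (user_id, dim_name)
    if key in _mapping:
        return _mapping[key]
    used = sum(1 for (uid, _) in _mapping if uid == user_id)
    if used >= N_CUSTOM_DIMS:
        raise ValueError(f"{user_id} exceeded {N_CUSTOM_DIMS} custom dimensions")
    generic = f"custom_dim{used + 1}"
    _mapping[key] = generic
    return generic

def remap_event(user_id: str, custom_dims: dict[str, str]) -> dict[str, str]:
    """Rewrite a user's custom dimensions onto the shared generic columns
    before handing the event to ingestion."""
    return {resolve_dimension(user_id, k): v for k, v in custom_dims.items()}

# e.g. remap_event("user123", {"productcolor": "red"}) -> {"custom_dim1": "red"}
```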
- Approach #2: Shared datasource & per-user custom dimensions
Add each user's unique dimensions to the datasource's schema, potentially limiting each user to N custom dimensions. These additional columns would be sparsely populated in the input data. This would likely require an external mapping layer similar to approach #1's in order to name columns, avoid collisions, and enforce a per-user dimension limit.
In some preliminary testing, the size of a single segment remains reasonable as the number of columns grows. We have yet to evaluate query performance with this approach extensively.
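One possible shape for the mapping layer in this approach (a sketch under assumptions, not a settled design) is to namespace each user's dimension names with the user id before they reach the shared schema; the names and limit below are illustrative only:

```python
# Sketch of collision avoidance for approach #2: namespace each user's custom
# dimension with the user id, so "color" from two different users becomes two
# distinct, sparsely populated columns in the shared datasource.
N_CUSTOM_DIMS = 20  # hypothetical per-user limit

def namespaced_dims(user_id: str, custom_dims: dict[str, str]) -> dict[str, str]:
    # A real implementation would track distinct dimension names per user over
    # time; this only checks a single event for simplicity.
    if len(custom_dims) > N_CUSTOM_DIMS:
        raise ValueError(f"{user_id} sent more than {N_CUSTOM_DIMS} custom dimensions")
    return {f"{user_id}_{name}": value for name, value in custom_dims.items()}

# e.g. namespaced_dims("user123", {"productcolor": "red"})
#   -> {"user123_productcolor": "red"}
# Every other user's rows leave that column null, which is where the sparse
# columns come from.
```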
- Approach #3: Dedicated datasource per user
Create a new datasource per user, allowing a large number of custom dimensions each. Intuitively, this sounds like it would have the best query performance and efficient segment sizes. We are unsure how a Druid cluster would handle this approach at the scale of tens of thousands of datasources; one concern is overloading the coordinator with metadata work. We have yet to run real-world benchmarks to measure the limitations of this approach.
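On the coordinator concern, a back-of-envelope count of the segments it would have to track may help frame the question; every number below is a hypothetical placeholder, not a measurement:

```python
# Rough segment count for approach #3. Druid creates at least one segment per
# datasource per time chunk that contains data, so the datasource count
# multiplies directly into coordinator metadata. All figures are placeholders.
datasources = 20_000          # one per user (hypothetical)
days_retained = 365           # retention window (hypothetical)
chunks_per_day = 1            # DAY segmentGranularity (hypothetical)
shards_per_chunk = 1          # best case for tiny per-user volume

segments = datasources * days_retained * chunks_per_day * shards_per_chunk
print(f"{segments:,} segments for the coordinator to manage")  # 7,300,000
```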
- Approach #4: Hybrid of #2 & #3
Find some mix between approaches #2 & #3 that balances cluster performance, resource usage, and stability against the added complexity. For example, use approach #2 to hold K users per datasource, constraining the total number of columns in any one datasource while using far fewer datasources than #3.
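The user-to-datasource assignment for this hybrid could be as simple as hashing users into a fixed number of shared datasources. A minimal sketch, where the datasource count and naming scheme are purely illustrative:

```python
import hashlib

NUM_DATASOURCES = 200  # hypothetical: roughly K users per datasource instead of 1

def datasource_for(user_id: str) -> str:
    """Deterministically bucket a user into one of NUM_DATASOURCES shared
    datasources, bounding both the datasource count (vs. approach #3) and the
    number of sparse columns any single datasource accumulates (vs. #2)."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % NUM_DATASOURCES
    return f"events_{bucket:03d}"

# e.g. datasource_for("user123") -> "events_042" (some stable bucket)
```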
We are interested to hear others’ experience and/or intuition on scaling Druid with any of these approaches.