I would like to know your opinions on the best way to shard or partition data by a dimension. Our concern is that, while most queries will cover only a few days of data, some queries that we need to answer with fast response times will span years of data. These queries will, however, always be constrained to a single value of one dimension ("bundle_id"). We are looking at ways to split the data by bundle_id so that these queries run over a much smaller data set. The two reasonable approaches seem to be:
- Defining the dataSource as "dataSource_<bundle_id>", or perhaps "dataSource_<hash(bundle_id) % buckets>". This would of course create many dataSources and feels like a bit of a hack, which raises the concern that Druid is not set up to be used in this fashion. Is that the case, or is this actually a reasonable approach? (See the first sketch after this list.)
- Implementing our own shardSpec that assigns shards by a hash of the bundle_id. I see there is already a HashBasedNumberedShardSpec that does something similar; the difference is that it hashes the entire row, whereas we would like to hash only the bundle_id. Same question as above: is this a reasonable approach? (See the second sketch after this list.)
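To make the first option concrete, here is a minimal sketch of the routing logic that the ingestion side and the query side would both have to agree on. The class name DataSourceRouter, the "events_" prefix, and the bucket count are all made up for illustration; none of this is a Druid API, just the hashing convention we have in mind.

```java
import java.nio.charset.StandardCharsets;

/**
 * Sketch of option 1: route each bundle_id to one of a fixed number of
 * dataSources by hashing the bundle_id. Names and bucket count are
 * illustrative only.
 */
public class DataSourceRouter {
    private static final int BUCKETS = 16; // hypothetical bucket count

    /** Returns e.g. "events_7" for a bundle_id that hashes to bucket 7. */
    public static String dataSourceFor(String bundleId) {
        byte[] bytes = bundleId.getBytes(StandardCharsets.UTF_8);
        // Any hash works as long as it is stable, since ingestion and
        // querying must compute the same bucket for the same bundle_id.
        int hash = 0;
        for (byte b : bytes) {
            hash = 31 * hash + b;
        }
        int bucket = Math.floorMod(hash, BUCKETS);
        return "events_" + bucket;
    }

    public static void main(String[] args) {
        // The caller would use this name in the ingestion spec and again
        // when choosing which dataSource to query.
        System.out.println(dataSourceFor("com.example.app"));
    }
}
```

The operational cost we see is that whatever issues the queries must apply the same hash to pick the dataSource, and changing the bucket count later would mean reindexing everything.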
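For the second option, here is a sketch of the core partitioning logic such a shardSpec would encode. It is only the routing function, not a real implementation: an actual shardSpec would have to implement Druid's ShardSpec interface, whose exact methods I won't guess at here. The class name BundleIdPartitioner and the null-handling choice are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;

/**
 * Sketch of option 2: partition on just the bundle_id dimension instead
 * of a hash of the whole row. Illustration of the routing logic only.
 */
public class BundleIdPartitioner {
    private final int numShards;

    public BundleIdPartitioner(int numShards) {
        this.numShards = numShards;
    }

    /**
     * Pick a shard from the bundle_id alone, so every row for a given
     * bundle_id lands in the same shard.
     */
    public int shardFor(Map<String, Object> row) {
        Object bundleId = row.get("bundle_id");
        if (bundleId == null) {
            return 0; // hypothetical choice: rows missing the dimension go to shard 0
        }
        byte[] bytes = bundleId.toString().getBytes(StandardCharsets.UTF_8);
        int hash = 0;
        for (byte b : bytes) {
            hash = 31 * hash + b;
        }
        return Math.floorMod(hash, numShards);
    }

    public static void main(String[] args) {
        BundleIdPartitioner partitioner = new BundleIdPartitioner(8);
        System.out.println(partitioner.shardFor(Map.of("bundle_id", "com.example.app")));
    }
}
```

The appeal of hashing only bundle_id is that all rows for a given bundle_id land in the same partition, so a query filtered on bundle_id could in principle be pruned to a single segment per time chunk instead of fanning out to all of them.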
What are your thoughts on these, and are we overlooking something more straightforward?
Thanks in advance, and of course thank you for all the work you all have put into Druid.