In some use cases a high-cardinality dimension seems unavoidable, e.g. a DB table with a primary key, or tracing events with a trace id. It is very common to query such datasources with a filter on that pk/id, and in theory the performance should be great. But indexing and merging are quite slow: in a merge of roughly 5M rows, the cardinality of "id" reaches about 4.9M, and merging that one dimension takes about an hour (see the log below).
io.druid.segment.IndexMerger - Starting dimension[id] with cardinality[4,888,029]
io.druid.segment.IndexMerger - Completed dimension[id] in 3,790,755 millis.
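A quick sanity check of the numbers in that log (plain arithmetic, nothing Druid-specific):

millis = 3_790_755        # merge time for dimension "id", from the log
cardinality = 4_888_029   # distinct "id" values, from the log

print(millis / 1000 / 60)    # ~63.2 minutes, i.e. about an hour
print(millis / cardinality)  # ~0.78 ms spent per distinct value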
I think maybe it is because of the bitmap index: its merge cost seems to be almost O(N*N) in the dimension cardinality.
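To illustrate what I mean (a toy Python model of my guess, NOT Druid's actual IndexMerger logic): if the merge walks all N distinct values, and each value needs a lookup that also scales with N, the total work grows like N*N:

# Toy model of the suspected behavior, not actual Druid code.
# segment_dicts are sorted value lists, one per segment being merged.
def naive_merge(segment_dicts):
    merged = sorted(set().union(*segment_dicts))  # N distinct values
    mappings = []
    for value in merged:                 # N iterations
        for values in segment_dicts:
            if value in values:          # O(N) linear scan per lookup
                mappings.append((value, values.index(value)))
    return mappings

# With a near-unique "id", N is close to the row count, so the
# quadratic term dominates and a single dimension can take an hour.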
We have tried some ways to reduce the row count per segment, such as reducing the segment granularity to 10 minutes and using a linear shardSpec with 4 partitions. But the incoming message rate is so high that this problem seems unavoidable.
Is there anything we can do to fix this? For example, a simple inverted index (instead of bitmaps) would be really helpful, something like the sketch below.
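To sketch what I mean (just a rough Python illustration; build_inverted_index is a made-up name, not a Druid API): a plain value -> row-list map is built in a single O(N) pass, with no per-value bitmap allocation or merging:

from collections import defaultdict

def build_inverted_index(id_column):
    index = defaultdict(list)         # value -> list of row numbers
    for row_num, value in enumerate(id_column):
        index[value].append(row_num)  # one O(N) pass overall
    return index

index = build_inverted_index(["a1", "b2", "a1", "c3"])
print(index["a1"])  # [0, 2] -- exactly what a pk/id filter needs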