We have a Druid datasource consuming from Kafka. We use thetaSketch metrics for a few approximate-count use cases. We started off with 4 thetaSketch metrics and now have around 10 in a single datasource. We also updated the schema with a few more dimensions (not really high cardinality), and of course segment sizes have grown with these changes. The compaction tasks have started failing lately (peon heap is 12 GB and MaxDirectMemory is 2.5 GB).
With this situation in mind, I have the following questions:
- Is there any guideline on the number of thetaSketch metrics in a Druid datasource, such as a maximum or recommended count per datasource? Are there any other tips/optimizations for thetaSketch metrics?
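For context, one knob I'm aware of is the sketch `size` parameter on the aggregator, which trades accuracy for memory/storage. A sketch of what our metric specs look like (the names and the `size` value here are illustrative, not our actual schema):

```json
{
  "type": "thetaSketch",
  "name": "unique_users_sketch",
  "fieldName": "user_id",
  "size": 16384
}
```

My understanding is that `size` must be a power of 2 and defaults to 16384; larger values improve accuracy but increase segment size and the memory needed at ingestion/compaction time, which seems relevant given our OOMs.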
- The Druid docs recommend aiming for a segment size of 300-700 MB. Is this a hard limit? The reason I ask: with the new dimensions and thetaSketch metrics, the datasource has grown, and if I target 300-700 MB per segment post-compaction, I may end up with 20-30 segments. If the per-segment size is not a hard limit, I could instead target roughly 1 GB per segment and end up with only 10-15 segments. Which of these options is preferred/recommended? I used to think fewer segments is better, since fewer segments need to be accessed per query, but then a single thread has to read 1 GB of data.
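In case it matters, this is roughly how I'd control the post-compaction segment size via the coordinator auto-compaction config (the datasource name and row count here are placeholders; `maxRowsPerSegment` is the usual lever since byte size can't be set directly):

```json
{
  "dataSource": "my_datasource",
  "tuningConfig": {
    "partitionsSpec": {
      "type": "dynamic",
      "maxRowsPerSegment": 5000000
    }
  }
}
```

With wide rows (many sketch columns), fewer rows per segment would presumably be needed to stay in the 300-700 MB range.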
- The compaction tasks are failing with a memory-related error (`OutOfMemoryError: Direct buffer memory`). Is increasing MaxDirectMemory the only way to fix this?
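For reference, my understanding from the Druid docs is that a peon needs direct memory of at least `druid.processing.buffer.sizeBytes * (druid.processing.numThreads + druid.processing.numMergeBuffers + 1)`. A quick sanity-check of that formula (the buffer size and thread counts below are illustrative assumptions, not our actual config):

```python
# Sizing check for peon direct memory, based on the formula from the Druid docs:
# MaxDirectMemorySize >= buffer.sizeBytes * (numThreads + numMergeBuffers + 1).

def required_direct_memory_bytes(buffer_size_bytes: int,
                                 num_threads: int,
                                 num_merge_buffers: int) -> int:
    """Minimum -XX:MaxDirectMemorySize a peon needs for processing buffers."""
    return buffer_size_bytes * (num_threads + num_merge_buffers + 1)

MiB = 1024 ** 2
GiB = 1024 ** 3

# Hypothetical example: 500 MiB buffers, 2 processing threads, 2 merge buffers.
needed = required_direct_memory_bytes(500 * MiB, 2, 2)
print(round(needed / GiB, 2))  # 2500 MiB total, i.e. ~2.44 GiB
```

If a config like that is in play, a 2.5 GB MaxDirectMemory leaves almost no headroom, so shrinking `druid.processing.buffer.sizeBytes` or the thread/merge-buffer counts for compaction tasks might be an alternative to simply raising the direct memory cap.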