We receive a stream of data for past dates, which adds a new partition to existing segments every day. This causes huge bloat in the number of segments. We started compacting segments periodically. However, we noticed that the compaction task reads all the partition data and rewrites all of it just to reduce the number of segments. Is there a configuration that lets us read only the new, smaller partitions instead of rewriting partitions that are already compacted? Our segments are generally around 500 MB, and rewriting already compacted segments is very costly in computing resources and compaction time. Please suggest how to proceed.
We are using Druid 0.15.1 with 16 GB of memory for both the historical and middle manager nodes.
Happy Tuesday and Druiding…
Have you configured automatic compaction? It is configurable in the web console. Also, I believe this version of Druid will support minor compaction: https://github.com/apache/incubator-druid/issues/8369
In Druid, segments in the same time chunk should have the same version, so that they can be queried together. Before 0.16, Druid had only one type of compaction, i.e., major compaction, which reads all segments in each time chunk and compacts them all together. This can be inefficient, as you noted.
In 0.16, we now have a new compaction type, minor compaction, which can compact only a portion of the segments in a time chunk. The segments created by minor compaction have the same major version as the original segments but a higher minor version, so they overshadow only the old segments that were compacted. I guess this could be useful for your use case, but auto compaction isn’t smart enough to use minor compaction yet. You can set up auto compaction to use minor compaction, but it will still try to compact the whole time chunk, just as major auto compaction does now.
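To make the overshadowing rule a bit more concrete, here is a toy model in Python. It is only a sketch of the idea described above, not Druid's actual implementation: the `Segment` class, its `covers` field, and the `visible` function are names made up for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Segment:
    name: str
    major: str               # time-chunk version, e.g. an ISO timestamp string
    minor: int               # 0 for segments that were never minor-compacted
    covers: FrozenSet[int]   # ids of the original partitions whose data it holds


def visible(segments: List[Segment]) -> List[Segment]:
    """Return the segments that no other segment overshadows (toy rule)."""
    # Only the newest major version is queryable in a time chunk.
    latest_major = max(s.major for s in segments)
    candidates = [s for s in segments if s.major == latest_major]
    result = []
    for s in candidates:
        # A segment is hidden by one with a higher minor version that
        # holds the data of every partition this segment covers.
        overshadowed = any(
            o.minor > s.minor and s.covers <= o.covers for o in candidates
        )
        if not overshadowed:
            result.append(s)
    return result


version = "2019-08-20T00:00:00.000Z"
chunk = [
    Segment("big_compacted", version, 0, frozenset({0})),
    Segment("small_1", version, 0, frozenset({1})),
    Segment("small_2", version, 0, frozenset({2})),
    # Result of a minor compaction over small_1 and small_2 only.
    Segment("merged_small", version, 1, frozenset({1, 2})),
]
print([s.name for s in visible(chunk)])
```

In this model the already compacted segment is left alone and stays visible, while the two small segments are hidden by the merged one.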
I think one possible solution is to set up an external auto compaction system that uses minor compaction, but you would need to build it yourself.
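A rough sketch of what the selection part of such an external system might look like, in Python. Everything here is an assumption to check against the docs for your version: the shape of the segment metadata dicts, the 100 MB threshold, and the helper names are hypothetical, and the task payload reflects my understanding of the 0.16 compaction task (a `segments` input spec, plus `forceTimeChunkLock` set to false so that segment locking is used). Druid may also place extra conditions on which segments can be compacted together.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical cutoff: anything smaller counts as a "new small partition".
SMALL_SEGMENT_BYTES = 100 * 1024 * 1024


def pick_small_segments(
    segments: List[dict], threshold: int = SMALL_SEGMENT_BYTES
) -> Dict[str, List[str]]:
    """Group the ids of small segments by time chunk.

    Each input dict is assumed to have "id", "interval" and "size" keys.
    Chunks with fewer than two small segments are skipped, since
    compacting a single segment would not reduce the segment count.
    """
    by_chunk = defaultdict(list)
    for seg in segments:
        if seg["size"] < threshold:
            by_chunk[seg["interval"]].append(seg["id"])
    return {c: sorted(ids) for c, ids in by_chunk.items() if len(ids) >= 2}


def build_compaction_task(datasource: str, segment_ids: List[str]) -> dict:
    """Build a compaction task payload (assumed 0.16-style spec, unverified)."""
    return {
        "type": "compact",
        "dataSource": datasource,
        "ioConfig": {
            "type": "compact",
            "inputSpec": {"type": "segments", "segments": segment_ids},
        },
        # Assumption: minor compaction requires segment locking.
        "context": {"forceTimeChunkLock": False},
    }


MB = 1024 * 1024
metadata = [
    {"id": "a", "interval": "2019-08-01/2019-08-02", "size": 500 * MB},
    {"id": "b", "interval": "2019-08-01/2019-08-02", "size": 5 * MB},
    {"id": "c", "interval": "2019-08-01/2019-08-02", "size": 7 * MB},
    {"id": "d", "interval": "2019-08-02/2019-08-03", "size": 3 * MB},
]
tasks = [
    build_compaction_task("my_datasource", ids)
    for ids in pick_small_segments(metadata).values()
]
```

Each resulting payload would then be submitted to the Overlord as a normal task on whatever schedule suits you. The 500 MB segment is never listed, so it would not be rewritten.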