Could someone please help me understand what impact compaction has on the size of segments? To what extent is segment size reduced, and what factors affect the size reduction after compaction?
Suppose data is ingested into Druid at HOUR segment granularity for a day, and the total segment size is 50 GB. If I run compaction at DAY segment granularity (target compaction size 700 MB), by how much should the segment size shrink for that day? Is there a formula to calculate this?
Compaction is used to reduce the number of segments created for an interval. If 10 segments were created within a given interval/segment granularity, you can compact them into one segment, based on the configuration you have set for the datasource.
There are two ways to run compaction.
Manual compaction
You create this task yourself and submit it to the Overlord. Here you can change the segment granularity and compact the segments of different intervals into one.
Automatic compaction
You set a compaction configuration per datasource, and the Coordinator periodically schedules compaction tasks for you.
Check the Druid documentation to set either one up. Both can be done via the API or from the Druid console.
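As a rough sketch, a minimal manual compaction task spec for compacting one day of hourly segments down to DAY granularity could look like this (the datasource name and interval are placeholders, not from the original post; check the compaction task reference in the Druid docs for the full set of fields):

```json
{
  "type": "compact",
  "dataSource": "my_datasource",
  "ioConfig": {
    "type": "compact",
    "inputSpec": {
      "type": "interval",
      "interval": "2023-01-01/2023-01-02"
    }
  },
  "granularitySpec": {
    "segmentGranularity": "DAY"
  }
}
```

You would POST this to the task endpoint on the Overlord (or submit it from the console's task view).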
Compaction won't reduce the size of the segments. It is done to create fewer segments, which helps with query performance, among other things. So if you have multiple segments created at hour granularity, you can compact all of them to day granularity. If your target compaction size is 700 MB, it will create multiple segments of roughly that size.
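To answer the "is there a formula" part: since compaction mostly repartitions the data rather than shrinking it, a back-of-the-envelope estimate is simply the total size divided by the target segment size. A minimal sketch (this is my own approximation, not an official Druid formula; the real output also depends on rollup and re-compression):

```python
import math

GB = 1024 ** 3
MB = 1024 ** 2


def segments_after_compaction(total_bytes: int, target_bytes: int) -> int:
    """Rough estimate of how many segments one compacted interval produces,
    assuming the total data size stays approximately the same."""
    return max(1, math.ceil(total_bytes / target_bytes))


# Example from the question: 50 GB of hourly segments compacted at DAY
# granularity with a 700 MB target segment size.
print(segments_after_compaction(50 * GB, 700 * MB))  # -> 74
```

So for the scenario in the question, you would expect the day to still hold roughly 50 GB of data, just split into on the order of 70-odd segments of about 700 MB each instead of many small hourly segments.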
Compaction is basically used to get rid of small segments; I don't think there would be much impact on overall size. I mean, if you have 18 segments of approximately 100 MB each, then compaction could give you 2 segments of around 900 MB each (based on the compaction task configuration). https://druid.apache.org/docs/latest/ingestion/data-management.html#compact