Setting up Minor Compaction

We have been looking into setting up Minor Compaction. Unfortunately, we were not able to find much documentation on it, so we have some questions and are wondering whether anyone who has worked with it could give some advice on the two questions below.

1. Suggestion for running compaction and ingestion in parallel?

Running ingestion and minor compaction in parallel: In initial tests we found that Minor Compaction only starts once data is no longer being ingested on the middle manager, even if some segments have already been moved to historicals. We were able to resolve this by setting "forceTimeChunkLock": false in the ingestion spec (as well as the compaction spec). With this setting, the compaction task is able to get the lock on the segments that are already on the historicals.

Is this the correct way of doing this? Has anyone solved it another way, or does anyone have concerns about solving it this way? Currently there doesn't seem to be much detail available on forceTimeChunkLock.

2. How can we speed up the compaction task?

When the compaction task runs on fewer segments, it is quick and finishes in a few minutes. However, we have tested on sample data of ~500 segments and ~50GB in an hour, and in that case the compaction task runs for about 9 to 11 hours. This is much too long to be useful in this type of scenario.

Our compaction spec looks something like this:
{
  "type": "compact",
  "dataSource": "test_datasource",
  "interval": "2021-06-22T16:00:00.000Z/2021-06-22T17:00:00.000Z",
  "dimensionsSpec": {
    "dimensions": […]
  },
  "metricsSpec": [
    { "type": "count", "name": "ct" }
  ],
  "tuningConfig": {
    "type": "index",
    "appendToExisting": true
  },
  "keepSegmentGranularity": true,
  "context": {
    "forceTimeChunkLock": false
  }
}

  1. Setting forceTimeChunkLock to false is the correct approach. The only drawback is that segments that are currently being written in a time chunk will not get compacted, so you won't get as good a compaction as you would if all the segments in the time chunk were compacted.

  2. You can speed up compaction by running more concurrent subtasks for the compaction. For auto compaction, this is capped at 10% of the available worker slots, with a minimum of 1. You can change the maximum using the Coordinator API (see "Coordinator Process" in the Apache Druid documentation).
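
To make both answers concrete, here is a minimal Python sketch that adjusts a compaction spec like the one in the question. It rests on several assumptions that you should check against your Druid version: that a compact task accepts a tuningConfig of type "index_parallel" with a "maxNumConcurrentSubTasks" field, and that the auto compaction slot count follows the "10% of worker slots, at least 1" rule described above. The helper names (with_segment_lock, with_parallel_subtasks, auto_compaction_slots) are made up for illustration and are not part of any Druid client.

```python
import copy
import math


def with_segment_lock(spec):
    """Return a copy of a task spec with forceTimeChunkLock set to false in its context."""
    out = copy.deepcopy(spec)
    out.setdefault("context", {})["forceTimeChunkLock"] = False
    return out


def with_parallel_subtasks(spec, max_subtasks):
    """Return a copy of a compaction spec that asks for concurrent subtasks.

    Assumption: the compact task accepts an "index_parallel" tuningConfig
    with "maxNumConcurrentSubTasks"; verify this for your Druid version.
    """
    out = copy.deepcopy(spec)
    tuning = out.setdefault("tuningConfig", {})
    tuning["type"] = "index_parallel"
    tuning["maxNumConcurrentSubTasks"] = max_subtasks
    return out


def auto_compaction_slots(total_worker_slots, ratio=0.1, max_slots=None):
    """Estimate the slots auto compaction may use.

    Assumption: slots = ratio * total worker slots (rounded down),
    capped by an optional maximum, and never fewer than 1.
    """
    slots = math.floor(total_worker_slots * ratio)
    if max_slots is not None:
        slots = min(slots, max_slots)
    return max(1, slots)


# Usage: start from a trimmed version of the spec in the question.
spec = {
    "type": "compact",
    "dataSource": "test_datasource",
    "interval": "2021-06-22T16:00:00.000Z/2021-06-22T17:00:00.000Z",
    "tuningConfig": {"type": "index"},
}
tuned = with_parallel_subtasks(with_segment_lock(spec), max_subtasks=4)
```

Both helpers copy the spec instead of mutating it, so the original can still be submitted unchanged for comparison. With a small cluster the slot estimate quickly bottoms out at 1, which may explain long-running auto compaction on few workers.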