Scheduled compaction tasks stay in 'WAITING' state for a long time before failing

I have a Kafka ingestion source with HOUR granularity. The indexing tasks are created every hour and succeed just fine, but the Kafka topic is fairly slow and receives late-arriving events for D-1 (current day minus one), so I end up with a lot of segments with relatively few rows (~60K rows each). To address this, I configured compaction for the data source through the unified console UI. A compact task gets created every time the coordinator runs. However, the task stays in the “WAITING” state for a long time and eventually fails. I don’t see any logs though; when I click the failed task and go to the “logs” section, I see the message “Request failed with status code 404”.

In the coordinator log, I see the following snippet getting repeated often:

```
2020-03-26T05:45:02,889 INFO [TaskQueue-Manager] org.apache.druid.indexing.common.actions.LocalTaskActionClient - Performing action for task[compact_test_ingestion_hourly_fnjgapdm_2020-03-26T05:30:18.760Z]: RetrieveUsedSegmentsAction{dataSource='test_ingestion_hourly', intervals=[2020-03-25T04:00:00.000Z/2020-03-25T05:00:00.000Z], visibility=ONLY_VISIBLE}

2020-03-26T05:45:02,892 INFO [TaskQueue-Manager] org.apache.druid.indexing.common.task.AbstractBatchIndexTask - [forceTimeChunkLock] is set to true in task context. Use timeChunk lock

2020-03-26T05:45:02,892 INFO [TaskQueue-Manager] org.apache.druid.indexing.common.actions.LocalTaskActionClient - Performing action for task[compact_test_ingestion_hourly_fnjgapdm_2020-03-26T05:30:18.760Z]: TimeChunkLockTryAcquireAction{, type=EXCLUSIVE, interval=2020-03-25T04:00:00.000Z/2020-03-25T05:00:00.000Z}

2020-03-26T05:45:02,892 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.TaskLockbox - Cannot create a new taskLockPosse for request[TimeChunkLockRequest{lockType=EXCLUSIVE, groupId='compact_test_ingestion_hourly_fnjgapdm_2020-03-26T05:30:18.760Z', dataSource='test_ingestion_hourly', interval=2020-03-25T04:00:00.000Z/2020-03-25T05:00:00.000Z, preferredVersion='null', priority=25, revoked=false}] because existing locks[[TaskLockPosse{taskLock=TimeChunkLock{type=EXCLUSIVE, groupId='index_kafka_test_ingestion_hourly', dataSource='test_ingestion_hourly', interval=2020-03-25T04:00:00.000Z/2020-03-25T05:00:00.000Z, version='2020-03-25T04:00:05.239Z', priority=75, revoked=false}, taskIds=[index_kafka_test_ingestion_hourly_525a80d4c51d1b9_jdanjdjj]}]] have same or higher priorities
```

If I submit a manual compaction task with the failed task’s interval, it succeeds. Only the auto-triggered compaction tasks fail this way, so I don’t understand what the issue is. Does it have something to do with the task priority of the compact task?


Hi Siva,

Based on the log snippet you posted, when auto compaction was triggered for the highlighted interval, the Kafka indexing task [index_kafka_test_ingestion_hourly_525a80d4c51d1b9_jdanjdjj] was already in progress and was holding an exclusive lock for the same time period. That is why the compact task could not acquire a lock for that interval and failed after some attempts.

Cannot create a new taskLockPosse for request[TimeChunkLockRequest{lockType=EXCLUSIVE, groupId='compact_test_ingestion_hourly_fnjgapdm_2020-03-26T05:30:18.760Z', dataSource='test_ingestion_hourly', **interval=2020-03-25T04:00:00.000Z/2020-03-25T05:00:00.000Z**, preferredVersion='null', priority=25, revoked=false}]

because existing locks[[TaskLockPosse{taskLock=TimeChunkLock{type=EXCLUSIVE, groupId='index_kafka_test_ingestion_hourly', dataSource='test_ingestion_hourly', **interval=2020-03-25T04:00:00.000Z/2020-03-25T05:00:00.000Z**, version='2020-03-25T04:00:05.239Z', priority=75, revoked=false}, taskIds=[index_kafka_test_ingestion_hourly_525a80d4c51d1b9_jdanjdjj]}]] have same or higher priorities
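If you want to confirm which task is holding the lock while a compaction task is waiting, one option is to list the running tasks on the Overlord. Here is a minimal sketch, assuming the default Overlord port and the data source name from this thread; the exact fields returned can vary by Druid version:

```python
import requests

# Assumed Overlord endpoint; adjust host/port for your cluster.
OVERLORD_URL = "http://localhost:8090/druid/indexer/v1/runningTasks"

response = requests.get(OVERLORD_URL)
response.raise_for_status()

# Each entry describes one running task; filter to the data source in question
# to see which Kafka indexing task overlaps the compaction interval.
for task in response.json():
    if task.get("dataSource") == "test_ingestion_hourly":
        print(task["id"], task.get("type"))
```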

Compaction tasks might fail for the following reasons:

  1. If the input segments of a compaction task are removed or overshadowed before it starts, that compaction task fails immediately.
  2. If a task of a higher priority acquires a time chunk lock for an interval overlapping with the interval of a compaction task, the compaction task fails.

Once a compaction task fails, the Coordinator simply checks the segments in the interval of the failed task again and issues another compaction task in the next run.
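For reference, here is a minimal sketch of submitting a manual compaction task to the Overlord, mirroring what you did when the auto-triggered task failed. The Overlord URL and the context priority are illustrative assumptions; the top-level interval field matches the compaction task spec of Druid versions current at the time of this thread, while newer versions express the interval through an ioConfig inputSpec instead.

```python
import requests

# Assumed Overlord task endpoint; adjust host/port for your cluster.
OVERLORD_URL = "http://localhost:8090/druid/indexer/v1/task"

# Minimal manual compaction task spec (illustrative values).
# "interval" is the time chunk that the auto-triggered compaction task failed on.
compaction_task = {
    "type": "compact",
    "dataSource": "test_ingestion_hourly",
    "interval": "2020-03-25T04:00:00.000Z/2020-03-25T05:00:00.000Z",
    # Compaction tasks default to priority 25; raising it is optional and only
    # useful once the real-time task no longer holds the lock on this interval.
    "context": {"priority": 50},
}

response = requests.post(OVERLORD_URL, json=compaction_task)
response.raise_for_status()
print("Submitted task:", response.json())
```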

Thanks and Regards,

Vaibhav


In addition to the previous response:

By default, at every Coordinator run, the segment search policy for compaction looks up time chunks in order of newest-to-oldest and checks whether the segments in those time chunks need compaction.

– The search start point can be changed by setting skipOffsetFromLatest. If this is set, the policy ignores segments falling within the time chunk of (the end time of the most recent segment - skipOffsetFromLatest); see the sketch after this list.

– This is to avoid conflicts between compaction tasks and real-time tasks (in your case, the Kafka indexing tasks). Note that real-time tasks have a higher priority than compaction tasks by default.

– Real-time tasks will revoke the locks of compaction tasks if their intervals overlap, resulting in the termination of the compaction task.
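As a concrete illustration, here is a hedged sketch of setting skipOffsetFromLatest in the data source's auto-compaction config via the Coordinator API. The Coordinator URL, the data source name, and the P1D offset are assumptions for illustration; the same setting can also be changed through the unified console UI.

```python
import requests

# Assumed Coordinator endpoint for per-datasource auto-compaction config.
COORDINATOR_URL = "http://localhost:8081/druid/coordinator/v1/config/compaction"

# Illustrative config: skip time chunks within P1D of the latest segment so
# auto-compaction does not contend with the hourly Kafka indexing tasks.
compaction_config = {
    "dataSource": "test_ingestion_hourly",
    "skipOffsetFromLatest": "P1D",
}

response = requests.post(COORDINATOR_URL, json=compaction_config)
response.raise_for_status()
print("Auto-compaction config updated for", compaction_config["dataSource"])
```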

Hope this helps.

Thanks and Regards,

Vaibhav


Hi Vaibhav, that helps. I was getting a lot of late-arriving data for different time chunks (it was from a mock event ingestion source used for testing), so it makes sense why a lot of the compaction tasks were in ‘WAITING’ state and eventually failing. BTW, if we set ‘skipOffsetFromLatest’ to P1D, compaction for segments that fall within the last 24 hours (current time minus 24 hours) will not be picked up. Correct?

And I’m just curious: what’s the rationale behind compacting segments in ‘newest-to-oldest’ order? Can it be configured to go from oldest to newest?

Thanks.
