Kafka ingestion tasks are failing with error No task in pending completion taskGroup[1] succeeded before completion timeout elapsed

Hello

I am working on Kafka Ingestion POC with Apache Druid 0.22.1 using AWS S3 as deep storage and PostgreSQL as the metadata store.
We have 4 historicals each with 800G disk space and 8 middle managers. Occasionally, ingestion tasks are failing with the below error

Here is the task status and log snippet. Looks like the shutdown was triggered due to completion timeout (1hr).

Task status

{
“id”: “index_kafka_abc_ab478e9ab6c680c_odjegddb”,
“groupId”: “index_kafka_abc”,
“type”: “index_kafka”,
“createdTime”: “2022-03-15T13:49:07.113Z”,
“queueInsertionTime”: “1970-01-01T00:00:00.000Z”,
“statusCode”: “FAILED”,
“status”: “FAILED”,
“runnerStatusCode”: “WAITING”,
“duration”: -1,
“location”: {
“host”: “ip-10-18-118-166.us-west-2.compute.internal”,
“port”: 8100,
“tlsPort”: -1
},
“dataSource”: “abc”,
“errorMsg”: “No task in pending completion taskGroup[1] succeeded before completion timeout elapsed”
}

Task logs

2022-03-15T14:57:11,595 INFO [coordinator_handoff_scheduled_0] org.apache.druid.segment.handoff.CoordinatorBasedSegmentHandoffNotifier - Still waiting for Handoff for [2] Segments
2022-03-15T14:58:11,616 INFO [coordinator_handoff_scheduled_0] org.apache.druid.segment.handoff.CoordinatorBasedSegmentHandoffNotifier - Still waiting for Handoff for [2] Segments
2022-03-15T14:59:11,591 INFO [coordinator_handoff_scheduled_0] org.apache.druid.segment.handoff.CoordinatorBasedSegmentHandoffNotifier - Still waiting for Handoff for [2] Segments
2022-03-15T15:00:11,590 INFO [coordinator_handoff_scheduled_0] org.apache.druid.segment.handoff.CoordinatorBasedSegmentHandoffNotifier - Still waiting for Handoff for [2] Segments
2022-03-15T15:01:11,593 INFO [coordinator_handoff_scheduled_0] org.apache.druid.segment.handoff.CoordinatorBasedSegmentHandoffNotifier - Still waiting for Handoff for [2] Segments
2022-03-15T15:02:11,600 INFO [coordinator_handoff_scheduled_0] org.apache.druid.segment.handoff.CoordinatorBasedSegmentHandoffNotifier - Still waiting for Handoff for [2] Segments
2022-03-15T15:03:11,591 INFO [coordinator_handoff_scheduled_0] org.apache.druid.segment.handoff.CoordinatorBasedSegmentHandoffNotifier - Still waiting for Handoff for [2] Segments
2022-03-15T15:04:11,591 INFO [coordinator_handoff_scheduled_0] org.apache.druid.segment.handoff.CoordinatorBasedSegmentHandoffNotifier - Still waiting for Handoff for [2] Segments
2022-03-15T15:05:11,592 INFO [coordinator_handoff_scheduled_0] org.apache.druid.segment.handoff.CoordinatorBasedSegmentHandoffNotifier - Still waiting for Handoff for [2] Segments
2022-03-15T15:06:11,591 INFO [coordinator_handoff_scheduled_0] org.apache.druid.segment.handoff.CoordinatorBasedSegmentHandoffNotifier - Still waiting for Handoff for [2] Segments
2022-03-15T15:07:11,596 INFO [coordinator_handoff_scheduled_0] org.apache.druid.segment.handoff.CoordinatorBasedSegmentHandoffNotifier - Still waiting for Handoff for [2] Segments
2022-03-15T15:08:11,591 INFO [coordinator_handoff_scheduled_0] org.apache.druid.segment.handoff.CoordinatorBasedSegmentHandoffNotifier - Still waiting for Handoff for [2] Segments
2022-03-15T15:09:11,591 INFO [coordinator_handoff_scheduled_0] org.apache.druid.segment.handoff.CoordinatorBasedSegmentHandoffNotifier - Still waiting for Handoff for [2] Segments
2022-03-15T15:10:11,591 INFO [coordinator_handoff_scheduled_0] org.apache.druid.segment.handoff.CoordinatorBasedSegmentHandoffNotifier - Still waiting for Handoff for [2] Segments
2022-03-15T15:11:11,591 INFO [coordinator_handoff_scheduled_0] org.apache.druid.segment.handoff.CoordinatorBasedSegmentHandoffNotifier - Still waiting for Handoff for [2] Segments
2022-03-15T15:12:11,594 INFO [[index_kafka_abc_ab478e9ab6c680c_odjegddb]-appenderator-persist] org.apache.druid.server.coordination.BatchDataSegmentAnnouncer - Unannouncing segment[abc_2022-03-15T13:00:00.000Z_2022-03-15T14:00:00.000Z_2022-03-15T13:00:01.826Z_3] at path[/druid/segments/ip-10-18-118-166.us-west-2.compute.internal:8100/ip-10-18-118-166.us-west-2.compute.internal:8100_indexer-executor__default_tier_2022-03-15T13:49:12.419Z_dd63b02531de4719a211fc2091495b810]
2022-03-15T15:12:11,598 INFO [coordinator_handoff_scheduled_0] org.apache.druid.segment.handoff.CoordinatorBasedSegmentHandoffNotifier - Still waiting for Handoff for [1] Segments
2022-03-15T15:12:11,602 INFO [[index_kafka_abc_ab478e9ab6c680c_odjegddb]-appenderator-persist] org.apache.druid.segment.realtime.appenderator.StreamAppenderator - Dropped segment[abc_2022-03-15T13:00:00.000Z_2022-03-15T14:00:00.000Z_2022-03-15T13:00:01.826Z_3].
2022-03-15T15:13:11,588 INFO [coordinator_handoff_scheduled_0] org.apache.druid.segment.handoff.CoordinatorBasedSegmentHandoffNotifier - Still waiting for Handoff for [1] Segments
2022-03-15T15:14:11,589 INFO [coordinator_handoff_scheduled_0] org.apache.druid.segment.handoff.CoordinatorBasedSegmentHandoffNotifier - Still waiting for Handoff for [1] Segments
2022-03-15T15:15:11,588 INFO [coordinator_handoff_scheduled_0] org.apache.druid.segment.handoff.CoordinatorBasedSegmentHandoffNotifier - Still waiting for Handoff for [1] Segments
2022-03-15T15:16:11,592 INFO [coordinator_handoff_scheduled_0] org.apache.druid.segment.handoff.CoordinatorBasedSegmentHandoffNotifier - Still waiting for Handoff for [1] Segments
2022-03-15T15:17:11,587 INFO [coordinator_handoff_scheduled_0] org.apache.druid.segment.handoff.CoordinatorBasedSegmentHandoffNotifier - Still waiting for Handoff for [1] Segments
2022-03-15T15:18:11,588 INFO [coordinator_handoff_scheduled_0] org.apache.druid.segment.handoff.CoordinatorBasedSegmentHandoffNotifier - Still waiting for Handoff for [1] Segments
2022-03-15T15:19:11,587 INFO [coordinator_handoff_scheduled_0] org.apache.druid.segment.handoff.CoordinatorBasedSegmentHandoffNotifier - Still waiting for Handoff for [1] Segments
2022-03-15T15:20:11,587 INFO [coordinator_handoff_scheduled_0] org.apache.druid.segment.handoff.CoordinatorBasedSegmentHandoffNotifier - Still waiting for Handoff for [1] Segments
2022-03-15T15:21:11,590 INFO [coordinator_handoff_scheduled_0] org.apache.druid.segment.handoff.CoordinatorBasedSegmentHandoffNotifier - Still waiting for Handoff for [1] Segments
2022-03-15T15:22:11,589 INFO [coordinator_handoff_scheduled_0] org.apache.druid.segment.handoff.CoordinatorBasedSegmentHandoffNotifier - Still waiting for Handoff for [1] Segments
2022-03-15T15:23:11,587 INFO [coordinator_handoff_scheduled_0] org.apache.druid.segment.handoff.CoordinatorBasedSegmentHandoffNotifier - Still waiting for Handoff for [1] Segments
2022-03-15T15:24:11,588 INFO [coordinator_handoff_scheduled_0] org.apache.druid.segment.handoff.CoordinatorBasedSegmentHandoffNotifier - Still waiting for Handoff for [1] Segments
2022-03-15T15:25:11,588 INFO [coordinator_handoff_scheduled_0] org.apache.druid.segment.handoff.CoordinatorBasedSegmentHandoffNotifier - Still waiting for Handoff for [1] Segments
2022-03-15T15:26:11,589 INFO [coordinator_handoff_scheduled_0] org.apache.druid.segment.handoff.CoordinatorBasedSegmentHandoffNotifier - Still waiting for Handoff for [1] Segments
2022-03-15T15:27:11,588 INFO [coordinator_handoff_scheduled_0] org.apache.druid.segment.handoff.CoordinatorBasedSegmentHandoffNotifier - Still waiting for Handoff for [1] Segments
2022-03-15T15:28:11,588 INFO [coordinator_handoff_scheduled_0] org.apache.druid.segment.handoff.CoordinatorBasedSegmentHandoffNotifier - Still waiting for Handoff for [1] Segments
2022-03-15T15:29:11,588 INFO [coordinator_handoff_scheduled_0] org.apache.druid.segment.handoff.CoordinatorBasedSegmentHandoffNotifier - Still waiting for Handoff for [1] Segments
2022-03-15T15:30:11,594 INFO [coordinator_handoff_scheduled_0] org.apache.druid.segment.handoff.CoordinatorBasedSegmentHandoffNotifier - Still waiting for Handoff for [1] Segments
2022-03-15T15:31:11,588 INFO [coordinator_handoff_scheduled_0] org.apache.druid.segment.handoff.CoordinatorBasedSegmentHandoffNotifier - Still waiting for Handoff for [1] Segments
2022-03-15T15:32:11,588 INFO [coordinator_handoff_scheduled_0] org.apache.druid.segment.handoff.CoordinatorBasedSegmentHandoffNotifier - Still waiting for Handoff for [1] Segments
2022-03-15T15:33:11,588 INFO [coordinator_handoff_scheduled_0] org.apache.druid.segment.handoff.CoordinatorBasedSegmentHandoffNotifier - Still waiting for Handoff for [1] Segments
2022-03-15T15:34:11,588 INFO [coordinator_handoff_scheduled_0] org.apache.druid.segment.handoff.CoordinatorBasedSegmentHandoffNotifier - Still waiting for Handoff for [1] Segments
2022-03-15T15:35:11,589 INFO [coordinator_handoff_scheduled_0] org.apache.druid.segment.handoff.CoordinatorBasedSegmentHandoffNotifier - Still waiting for Handoff for [1] Segments
2022-03-15T15:36:11,588 INFO [coordinator_handoff_scheduled_0] org.apache.druid.segment.handoff.CoordinatorBasedSegmentHandoffNotifier - Still waiting for Handoff for [1] Segments
2022-03-15T15:37:11,587 INFO [coordinator_handoff_scheduled_0] org.apache.druid.segment.handoff.CoordinatorBasedSegmentHandoffNotifier - Still waiting for Handoff for [1] Segments
2022-03-15T15:38:11,588 INFO [coordinator_handoff_scheduled_0] org.apache.druid.segment.handoff.CoordinatorBasedSegmentHandoffNotifier - Still waiting for Handoff for [1] Segments
2022-03-15T15:39:11,604 INFO [coordinator_handoff_scheduled_0] org.apache.druid.segment.handoff.CoordinatorBasedSegmentHandoffNotifier - Still waiting for Handoff for [1] Segments
2022-03-15T15:40:11,587 INFO [coordinator_handoff_scheduled_0] org.apache.druid.segment.handoff.CoordinatorBasedSegmentHandoffNotifier - Still waiting for Handoff for [1] Segments
2022-03-15T15:41:11,589 INFO [coordinator_handoff_scheduled_0] org.apache.druid.segment.handoff.CoordinatorBasedSegmentHandoffNotifier - Still waiting for Handoff for [1] Segments
2022-03-15T15:42:11,588 INFO [coordinator_handoff_scheduled_0] org.apache.druid.segment.handoff.CoordinatorBasedSegmentHandoffNotifier - Still waiting for Handoff for [1] Segments
2022-03-15T15:43:11,587 INFO [coordinator_handoff_scheduled_0] org.apache.druid.segment.handoff.CoordinatorBasedSegmentHandoffNotifier - Still waiting for Handoff for [1] Segments
2022-03-15T15:44:11,588 INFO [coordinator_handoff_scheduled_0] org.apache.druid.segment.handoff.CoordinatorBasedSegmentHandoffNotifier - Still waiting for Handoff for [1] Segments
2022-03-15T15:45:11,587 INFO [coordinator_handoff_scheduled_0] org.apache.druid.segment.handoff.CoordinatorBasedSegmentHandoffNotifier - Still waiting for Handoff for [1] Segments
2022-03-15T15:46:11,587 INFO [coordinator_handoff_scheduled_0] org.apache.druid.segment.handoff.CoordinatorBasedSegmentHandoffNotifier - Still waiting for Handoff for [1] Segments
2022-03-15T15:47:11,587 INFO [coordinator_handoff_scheduled_0] org.apache.druid.segment.handoff.CoordinatorBasedSegmentHandoffNotifier - Still waiting for Handoff for [1] Segments
2022-03-15T15:48:11,589 INFO [coordinator_handoff_scheduled_0] org.apache.druid.segment.handoff.CoordinatorBasedSegmentHandoffNotifier - Still waiting for Handoff for [1] Segments
2022-03-15T15:49:11,589 INFO [coordinator_handoff_scheduled_0] org.apache.druid.segment.handoff.CoordinatorBasedSegmentHandoffNotifier - Still waiting for Handoff for [1] Segments
2022-03-15T15:49:18,079 INFO [parent-monitor-0] org.apache.druid.indexing.worker.executor.ExecutorLifecycle - Triggering JVM shutdown.
2022-03-15T15:49:18,079 INFO [Thread-70] org.apache.druid.cli.CliPeon - Running shutdown hook
2022-03-15T15:49:18,080 INFO [Thread-70] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [module] stage [ANNOUNCEMENTS]
2022-03-15T15:49:18,085 INFO [Thread-70] org.apache.druid.curator.announcement.Announcer - Unannouncing [/druid/announcements/ip-10-18-118-166.us-west-2.compute.internal:8100]
2022-03-15T15:49:18,106 INFO [Thread-70] org.apache.druid.curator.announcement.Announcer - Unannouncing [/druid/segments/ip-10-18-118-166.us-west-2.compute.internal:8100/ip-10-18-118-166.us-west-2.compute.internal:8100_indexer-executor__default_tier_2022-03-15T13:49:12.419Z_dd63b02531de4719a211fc2091495b810]
2022-03-15T15:49:18,124 INFO [Thread-70] org.apache.druid.curator.announcement.Announcer - Unannouncing [/druid/internal-discovery/PEON/ip-10-18-118-166.us-west-2.compute.internal:8100]
2022-03-15T15:49:18,144 INFO [Thread-70] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [module] stage [SERVER]
2022-03-15T15:49:18,150 INFO [Thread-70] org.eclipse.jetty.server.AbstractConnector - Stopped ServerConnector@563bd6a4{HTTP/1.1, (http/1.1)}{0.0.0.0:8100}
2022-03-15T15:49:18,150 INFO [Thread-70] org.eclipse.jetty.server.session - node0 Stopped scavenging
2022-03-15T15:49:18,151 INFO [Thread-70] org.eclipse.jetty.server.handler.ContextHandler - Stopped o.e.j.s.ServletContextHandler@543a7605{/,null,STOPPED}
2022-03-15T15:49:18,153 INFO [Thread-70] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [module] stage [NORMAL]
2022-03-15T15:49:18,154 INFO [Thread-70] org.apache.druid.server.coordination.ZkCoordinator - Stopping ZkCoordinator for [DruidServerMetadata{name=‘ip-10-18-118-166.us-west-2.compute.internal:8100’, hostAndPort=‘ip-10-18-118-166.us-west-2.compute.internal:8100’, hostAndTlsPort=‘null’, maxSize=0, tier=’_default_tier’, type=indexer-executor, priority=0}]
2022-03-15T15:49:18,154 INFO [Thread-70] org.apache.druid.server.coordination.SegmentLoadDropHandler - Stopping…
2022-03-15T15:49:18,154 INFO [Thread-70] org.apache.druid.server.coordination.SegmentLoadDropHandler - Stopped.
2022-03-15T15:49:18,154 INFO [Thread-70] org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner - Starting graceful shutdown of task[index_kafka_abc_ab478e9ab6c680c_odjegddb].
2022-03-15T15:49:18,154 INFO [Thread-70] org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner - Stopping forcefully (status: [PUBLISHING])
2022-03-15T15:49:18,155 INFO [task-runner-0-priority-0] org.apache.kafka.common.metrics.Metrics - Metrics scheduler closed
2022-03-15T15:49:18,155 INFO [task-runner-0-priority-0] org.apache.kafka.common.metrics.Metrics - Closing reporter org.apache.kafka.common.metrics.JmxReporter
2022-03-15T15:49:18,155 INFO [task-runner-0-priority-0] org.apache.kafka.common.metrics.Metrics - Metrics reporters closed
2022-03-15T15:49:18,159 INFO [task-runner-0-priority-0] org.apache.kafka.common.utils.AppInfoParser - App info kafka.consumer for consumer-zpa-druid-ingester1-1 unregistered
2022-03-15T15:49:18,159 INFO [task-runner-0-priority-0] org.apache.druid.server.coordination.BatchDataSegmentAnnouncer - Unannouncing segment[abc_2022-03-15T14:00:00.000Z_2022-03-15T15:00:00.000Z_2022-03-15T14:00:03.137Z] at path[/druid/segments/ip-10-18-118-166.us-west-2.compute.internal:8100/ip-10-18-118-166.us-west-2.compute.internal:8100_indexer-executor__default_tier_2022-03-15T13:49:12.419Z_dd63b02531de4719a211fc2091495b810]
2022-03-15T15:49:18,161 INFO [task-runner-0-priority-0] org.apache.druid.curator.discovery.CuratorDruidNodeAnnouncer - Unannounced self [{“druidNode”:{“service”:“druid/middleManager”,“host”:“ip-10-18-118-166.us-west-2.compute.internal”,“bindOnHost”:false,“plaintextPort”:8100,“port”:-1,“tlsPort”:-1,“enablePlaintextPort”:true,“enableTlsPort”:false},“nodeType”:“peon”,“services”:{“dataNodeService”:{“type”:“dataNodeService”,“tier”:"_default_tier",“maxSize”:0,“type”:“indexer-executor”,“priority”:0},“lookupNodeService”:{“type”:“lookupNodeService”,“lookupTier”:"__default"}}}].
2022-03-15T15:49:18,165 INFO [LookupExtractorFactoryContainerProvider-MainThread] org.apache.druid.query.lookup.LookupReferencesManager - Lookup Management loop exited. Lookup notices are not handled anymore.
2022-03-15T15:49:18,165 INFO [task-runner-0-priority-0] org.apache.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status: {
“id” : “index_kafka_abc_ab478e9ab6c680c_odjegddb”,
“status” : “SUCCESS”,
“duration” : 7206951,
“errorMsg” : null,
“location” : {
“host” : null,
“port” : -1,
“tlsPort” : -1
}
}
2022-03-15T15:49:18,167 INFO [Curator-Framework-0] org.apache.curator.framework.imps.CuratorFrameworkImpl - backgroundOperationsLoop exiting
2022-03-15T15:49:18,167 INFO [main] org.apache.druid.cli.CliPeon - Thread [Thread[Thread-70,5,main]] is non daemon.
2022-03-15T15:49:18,167 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle - Lifecycle [module] already stopped and stop was called. Silently skipping
Cannot remove shutdown hook, already shutting down!
Finished peon task
2022-03-15T15:49:18,307 INFO [Thread-70] org.apache.zookeeper.ZooKeeper - Session: 0x3000001215705af closed
2022-03-15T15:49:18,307 INFO [main-EventThread] org.apache.zookeeper.ClientCnxn - EventThread shut down for session: 0x3000001215705af
2022-03-15T15:49:18,313 INFO [Thread-70] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [module] stage [INIT]

Any idea why segment handoff is taking time ?

Hi Dileep,

Is this issue is resolved??

This makes me wonder if the ingestion spec or topic details for a supervisor are changing?

Please increase the completion timeout and see if that helps

Yes I had the same error and it was due to a too small value of the completion_timeout. the PT1800s was not enough