[druid-user] Supervisor tasks failing during ingestion

Hi All ,

I see most of the supervisor tasks are failing with the below error during ingestion. Could someone help me out with this .

I also saw a couple of other posts with same stack trace, however I didnt see any conclusion in those posts

Please let me know if any other logs are needed

Below are the logs .

2022-04-07T15:51:07,742 WARN [NodeRoleWatcher[COORDINATOR]] org.apache.druid.curator.discovery.CuratorDruidNodeDiscoveryProvider$NodeRoleWatcher - Ignored event type[CONNECTION_RECONNECTED] for node watcher of role[coordinator].
2022-04-07T15:51:07,745 WARN [NodeRoleWatcher[OVERLORD]] org.apache.druid.curator.discovery.CuratorDruidNodeDiscoveryProvider$NodeRoleWatcher - Ignored event type[CONNECTION_RECONNECTED] for node watcher of role[overlord].
2022-04-07T15:51:07,759 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle - Starting lifecycle [module] stage [SERVER]
2022-04-07T15:51:07,768 INFO [main] org.eclipse.jetty.server.Server - jetty-9.4.40.v20210413; built: 2021-04-13T20:42:42.668Z; git: b881a572662e1943a14ae12e7e1207989f218b74; jvm 1.8.0_275-8u275-b01-1~deb9u1-b01
2022-04-07T15:51:07,776 INFO [NodeRoleWatcher[COORDINATOR]] org.apache.druid.discovery.BaseNodeRoleWatcher - Node[http://10.131.8.30:8081] of role[coordinator] detected.
2022-04-07T15:51:07,776 INFO [NodeRoleWatcher[COORDINATOR]] org.apache.druid.discovery.BaseNodeRoleWatcher - Node watcher of role[coordinator] is now initialized.
2022-04-07T15:51:07,792 INFO [NodeRoleWatcher[OVERLORD]] org.apache.druid.discovery.BaseNodeRoleWatcher - Node[http://10.131.8.30:8081] of role[overlord] detected.
2022-04-07T15:51:07,792 INFO [NodeRoleWatcher[OVERLORD]] org.apache.druid.discovery.BaseNodeRoleWatcher - Node watcher of role[overlord] is now initialized.
2022-04-07T15:51:07,989 INFO [task-runner-0-priority-0] org.apache.kafka.clients.Metadata - [Consumer clientId=consumer-kafka-supervisor-nhnknjli-1, groupId=kafka-supervisor-nhnknjli] Cluster ID: 0hT5DWx-S-GS5j0KvhTNCg
2022-04-07T15:51:08,046 INFO [main] org.eclipse.jetty.server.session - DefaultSessionIdManager workerName=node0
2022-04-07T15:51:08,046 INFO [main] org.eclipse.jetty.server.session - No SessionScavenger set, using defaults
2022-04-07T15:51:08,047 INFO [main] org.eclipse.jetty.server.session - node0 Scavenging every 600000ms
2022-04-07T15:51:08,144 INFO [main] com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory - Registering com.fasterxml.jackson.jaxrs.json.JacksonJsonProvider as a provider class
2022-04-07T15:51:08,145 INFO [main] com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory - Registering com.fasterxml.jackson.jaxrs.smile.JacksonSmileProvider as a provider class
2022-04-07T15:51:08,145 INFO [main] com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory - Registering org.apache.druid.server.initialization.jetty.CustomExceptionMapper as a provider class
2022-04-07T15:51:08,145 INFO [main] com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory - Registering org.apache.druid.server.initialization.jetty.ForbiddenExceptionMapper as a provider class
2022-04-07T15:51:08,145 INFO [main] com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory - Registering org.apache.druid.server.initialization.jetty.BadRequestExceptionMapper as a provider class
2022-04-07T15:51:08,145 INFO [main] com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory - Registering org.apache.druid.server.initialization.jetty.ServiceUnavailableExceptionMapper as a provider class
2022-04-07T15:51:08,145 INFO [main] com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory - Registering org.apache.druid.server.StatusResource as a root resource class
2022-04-07T15:51:08,148 INFO [main] com.sun.jersey.server.impl.application.WebApplicationImpl - Initiating Jersey application, version ‘Jersey: 1.19.3 10/24/2016 03:43 PM’
2022-04-07T15:51:08,258 INFO [main] com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory - Binding org.apache.druid.server.initialization.jetty.CustomExceptionMapper to GuiceManagedComponentProvider with the scope “Singleton”
2022-04-07T15:51:08,260 INFO [main] com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory - Binding org.apache.druid.server.initialization.jetty.ForbiddenExceptionMapper to GuiceManagedComponentProvider with the scope “Singleton”
2022-04-07T15:51:08,260 INFO [main] com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory - Binding org.apache.druid.server.initialization.jetty.BadRequestExceptionMapper to GuiceManagedComponentProvider with the scope “Singleton”
2022-04-07T15:51:08,261 INFO [main] com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory - Binding org.apache.druid.server.initialization.jetty.ServiceUnavailableExceptionMapper to GuiceManagedComponentProvider with the scope “Singleton”
2022-04-07T15:51:08,261 INFO [main] com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory - Binding com.fasterxml.jackson.jaxrs.json.JacksonJsonProvider to GuiceManagedComponentProvider with the scope “Singleton”
2022-04-07T15:51:08,268 INFO [main] com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory - Binding com.fasterxml.jackson.jaxrs.smile.JacksonSmileProvider to GuiceManagedComponentProvider with the scope “Singleton”
2022-04-07T15:51:08,763 INFO [main] com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory - Binding org.apache.druid.server.http.security.StateResourceFilter to GuiceInstantiatedComponentProvider
2022-04-07T15:51:08,786 INFO [main] com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory - Binding org.apache.druid.server.http.SegmentListerResource to GuiceInstantiatedComponentProvider
2022-04-07T15:51:08,794 INFO [main] com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory - Binding org.apache.druid.server.http.HistoricalResource to GuiceInstantiatedComponentProvider
2022-04-07T15:51:08,797 INFO [main] com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory - Binding org.apache.druid.server.QueryResource to GuiceInstantiatedComponentProvider
2022-04-07T15:51:08,799 INFO [main] com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory - Binding org.apache.druid.segment.realtime.firehose.ChatHandlerResource to GuiceInstantiatedComponentProvider
2022-04-07T15:51:08,835 INFO [main] com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory - Binding org.apache.druid.server.http.security.ConfigResourceFilter to GuiceInstantiatedComponentProvider
2022-04-07T15:51:08,839 INFO [main] com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory - Binding org.apache.druid.query.lookup.LookupListeningResource to GuiceInstantiatedComponentProvider
2022-04-07T15:51:08,850 INFO [main] com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory - Binding org.apache.druid.query.lookup.LookupIntrospectionResource to GuiceInstantiatedComponentProvider
2022-04-07T15:51:08,852 INFO [main] com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory - Binding org.apache.druid.server.StatusResource to GuiceManagedComponentProvider with the scope “Undefined”
2022-04-07T15:51:08,977 INFO [main] org.eclipse.jetty.server.handler.ContextHandler - Started o.e.j.s.ServletContextHandler@18887028{/,null,AVAILABLE}
2022-04-07T15:51:09,338 INFO [main] org.eclipse.jetty.server.AbstractConnector - Started ServerConnector@75d7297d{HTTP/1.1, (http/1.1)}{0.0.0.0:8104}
2022-04-07T15:51:09,338 INFO [main] org.eclipse.jetty.server.Server - Started @15073ms
2022-04-07T15:51:09,338 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle - Starting lifecycle [module] stage [ANNOUNCEMENTS]
2022-04-07T15:51:09,630 INFO [task-runner-0-priority-0] org.apache.druid.server.coordination.BatchDataSegmentAnnouncer - Announcing segment[pmdata_2022-04-07T15:00:00.000Z_2022-04-07T16:00:00.000Z_2022-04-07T15:00:11.806Z_73] at new path[/druid/segments/10.129.5.98:8104/10.129.5.98:8104_indexer-executor__default_tier_2022-04-07T15:51:09.628Z_e9456b5cc39d4dd585b23e93274b33330]
2022-04-07T15:51:09,702 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle - Successfully started lifecycle [module]
2022-04-07T15:51:12,884 DEBUG [qtp1316735289-156] org.apache.druid.jetty.RequestLog - 10.131.8.30 GET //10.129.5.98:8104/druid/worker/v1/chat/index_kafka_pmdata_0273fa187fa63af_eiibppae/time/start HTTP/1.1
2022-04-07T15:51:17,297 DEBUG [qtp1316735289-146] org.apache.druid.jetty.RequestLog - 10.131.8.30 GET //10.129.5.98:8104/druid/worker/v1/chat/index_kafka_pmdata_0273fa187fa63af_eiibppae/offsets/current HTTP/1.1
2022-04-07T15:51:31,231 INFO [task-runner-0-priority-0] org.apache.druid.segment.realtime.appenderator.StreamAppenderator - Persisted rows[18,652] and (estimated) bytes[300,005,736]
2022-04-07T15:51:52,274 DEBUG [qtp1316735289-147] org.apache.druid.jetty.RequestLog - 10.129.8.38 POST //10.129.5.98:8104/druid/v2/ HTTP/1.1
2022-04-07T15:51:54,649 DEBUG [qtp1316735289-148] org.apache.druid.jetty.RequestLog - 10.131.8.30 GET //10.129.5.98:8104/druid/worker/v1/chat/index_kafka_pmdata_0273fa187fa63af_eiibppae/offsets/current HTTP/1.1
2022-04-07T15:52:02,040 INFO [task-runner-0-priority-0] org.apache.druid.segment.realtime.appenderator.StreamAppenderator - Persisted rows[18,622] and (estimated) bytes[299,998,444]
2022-04-07T15:52:41,356 DEBUG [qtp1316735289-143] org.apache.druid.jetty.RequestLog - 10.131.8.30 GET //10.129.5.98:8104/druid/worker/v1/chat/index_kafka_pmdata_0273fa187fa63af_eiibppae/offsets/current HTTP/1.1
2022-04-07T15:53:10,679 DEBUG [qtp1316735289-158] org.apache.druid.jetty.RequestLog - 10.129.8.38 POST //10.129.5.98:8104/druid/v2/ HTTP/1.1
2022-04-07T15:53:26,808 DEBUG [qtp1316735289-157] org.apache.druid.jetty.RequestLog - 10.131.8.30 GET //10.129.5.98:8104/druid/worker/v1/chat/index_kafka_pmdata_0273fa187fa63af_eiibppae/offsets/current HTTP/1.1
2022-04-07T15:53:45,454 INFO [parent-monitor-0] org.apache.druid.indexing.worker.executor.ExecutorLifecycle - Triggering JVM shutdown.
2022-04-07T15:53:45,499 INFO [Thread-513] org.apache.druid.cli.CliPeon - Running shutdown hook
2022-04-07T15:53:45,500 INFO [Thread-513] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [module] stage [ANNOUNCEMENTS]
2022-04-07T15:53:45,532 INFO [Thread-513] org.apache.druid.curator.announcement.Announcer - Unannouncing [/druid/announcements/10.129.5.98:8104]
2022-04-07T15:53:46,050 INFO [Thread-513] org.apache.druid.curator.announcement.Announcer - Unannouncing [/druid/segments/10.129.5.98:8104/10.129.5.98:8104_indexer-executor__default_tier_2022-04-07T15:51:09.628Z_e9456b5cc39d4dd585b23e93274b33330]
2022-04-07T15:53:46,084 INFO [Thread-513] org.apache.druid.curator.announcement.Announcer - Unannouncing [/druid/internal-discovery/PEON/10.129.5.98:8104]
2022-04-07T15:53:46,103 INFO [Thread-513] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [module] stage [SERVER]
2022-04-07T15:53:46,259 INFO [Thread-513] org.eclipse.jetty.server.AbstractConnector - Stopped ServerConnector@75d7297d{HTTP/1.1, (http/1.1)}{0.0.0.0:8104}
2022-04-07T15:53:46,269 INFO [Thread-513] org.eclipse.jetty.server.session - node0 Stopped scavenging
2022-04-07T15:53:46,273 INFO [Thread-513] org.eclipse.jetty.server.handler.ContextHandler - Stopped o.e.j.s.ServletContextHandler@18887028{/,null,STOPPED}
2022-04-07T15:53:47,236 INFO [Thread-513] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [module] stage [NORMAL]
2022-04-07T15:53:47,236 INFO [Thread-513] org.apache.druid.server.coordination.ZkCoordinator - Stopping ZkCoordinator for [DruidServerMetadata{name=‘10.129.5.98:8104’, hostAndPort=‘10.129.5.98:8104’, hostAndTlsPort=‘null’, maxSize=0, tier=’_default_tier’, type=indexer-executor, priority=0}]
2022-04-07T15:53:47,237 INFO [Thread-513] org.apache.druid.server.coordination.SegmentLoadDropHandler - Stopping…
2022-04-07T15:53:47,237 INFO [Thread-513] org.apache.druid.server.coordination.SegmentLoadDropHandler - Stopped.
2022-04-07T15:53:47,243 INFO [Thread-513] org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner - Starting graceful shutdown of task[index_kafka_pmdata_0273fa187fa63af_eiibppae].
2022-04-07T15:53:47,243 INFO [Thread-513] org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner - Stopping forcefully (status: [READING])
2022-04-07T15:53:47,283 ERROR [task-runner-0-priority-0] org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner - Encountered exception in run() before persisting.
java.util.concurrent.RejectedExecutionException: Got Interrupted while adding to the Queue
at org.apache.druid.java.util.common.concurrent.Execs$1.rejectedExecution(Execs.java:152) ~[druid-core-0.22.1.jar:0.22.1]
at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830) ~[?:1.8.0_275]
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379) ~[?:1.8.0_275]
at com.google.common.util.concurrent.MoreExecutors$ListeningDecorator.execute(MoreExecutors.java:484) ~[guava-16.0.1.jar:?]
at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:134) ~[?:1.8.0_275]
at com.google.common.util.concurrent.AbstractListeningExecutorService.submit(AbstractListeningExecutorService.java:58) ~[guava-16.0.1.jar:?]
at org.apache.druid.segment.realtime.appenderator.StreamAppenderator.persistAll(StreamAppenderator.java:605) ~[druid-server-0.22.1.jar:0.22.1]
at org.apache.druid.segment.realtime.appenderator.StreamAppenderatorDriver.persistAsync(StreamAppenderatorDriver.java:254) ~[druid-server-0.22.1.jar:0.22.1]
at org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner.runInternal(SeekableStreamIndexTaskRunner.java:670) [druid-indexing-service-0.22.1.jar:0.22.1]
at org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner.run(SeekableStreamIndexTaskRunner.java:263) [druid-indexing-service-0.22.1.jar:0.22.1]
at org.apache.druid.indexing.seekablestream.SeekableStreamIndexTask.run(SeekableStreamIndexTask.java:146) [druid-indexing-service-0.22.1.jar:0.22.1]
at org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:471) [druid-indexing-service-0.22.1.jar:0.22.1]
at org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:443) [druid-indexing-service-0.22.1.jar:0.22.1]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_275]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_275]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_275]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_275]
Caused by: java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014) ~[?:1.8.0_275]
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2048) ~[?:1.8.0_275]
at java.util.concurrent.ArrayBlockingQueue.put(ArrayBlockingQueue.java:353) ~[?:1.8.0_275]
at org.apache.druid.java.util.common.concurrent.Execs$1.rejectedExecution(Execs.java:149) ~[druid-core-0.22.1.jar:0.22.1]
… 16 more
2022-04-07T15:53:47,393 INFO [LookupExtractorFactoryContainerProvider-MainThread] org.apache.druid.query.lookup.LookupReferencesManager - Lookup Management loop exited. Lookup notices are not handled anymore.
2022-04-07T15:53:47,397 INFO [Curator-Framework-0] org.apache.curator.framework.imps.CuratorFrameworkImpl - backgroundOperationsLoop exiting
2022-04-07T15:53:47,763 INFO [Thread-513] org.apache.zookeeper.ZooKeeper - Session: 0x300eb2d52400027 closed
2022-04-07T15:53:47,764 INFO [main-EventThread] org.apache.zookeeper.ClientCnxn - EventThread shut down for session: 0x300eb2d52400027
2022-04-07T15:53:48,930 INFO [Thread-513] org.apache.druid.java.util.common.lifecycle.Lifecycle$CloseableHandler - Closing object[org.asynchttpclient.DefaultAsyncHttpClient@5dddc1f9]
2022-04-07T15:53:49,424 INFO [Thread-513] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [module] stage [INIT]

Regards,
Chaitanya

Hi Chaitanya,

I wonder if you hit this issue?

Best,

Mark

Hi Mark ,
Thanks for the reply , it seems like the issue is more same as this (seen in my task status) and also this
Let me know if there was a solution found for this ?

Meanwhile, I have also set useEarliestOffset as true. Prior to this change , most of the tasks were hitting that exception.
However, now I saw a series of tasks passing but some failing with OOM ( now increased the Middle manager task memory form 2Gb to 2.5Gb and I have to check this) and few more failing with the earlier exception too.

Regards,
Chaitanya

Hi Chaitanya,

Thanks for the clarification. Another thing that comes to mind: is affinityConfig enabled?

Best,

Mark