Task stays in status Running forever

Hi All,

after a couple of days, I ve got a task that stays in Running status for ever. Here is what I’ve found in the log


2019-09-08T18:43:42,477 INFO [task-runner-0-priority-0] org.apache.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status: {
“id” : “index_hadoop_datasource_2019-09-08T18:42:42.831Z”,
"status" : “SUCCESS”,
“duration” : 54326,
“errorMsg” : null
}
2019-09-08T18:43:42,482 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [module] stage [ANNOUNCEMENTS]
2019-09-08T18:43:42,483 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void org.apache.druid.curator.announcement.Announcer.stop()] on object[org.apache.druid.curator.announcement.Announcer@75bf9e67].
2019-09-08T18:43:42,484 INFO [main] org.apache.druid.curator.announcement.Announcer - Stopping announcer
2019-09-08T18:43:42,484 INFO [main] org.apache.druid.curator.announcement.Announcer - unannouncing [/druid_iods/listeners/lookups/__default/http:ip-XXX.eu-west-1.compute.internal:8100]
2019-09-08T18:43:42,497 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [module] stage [SERVER]
2019-09-08T18:43:42,497 INFO [main] org.apache.druid.server.initialization.jetty.JettyServerModule - Stopping Jetty Server…
2019-09-08T18:43:42,501 INFO [main] org.eclipse.jetty.server.AbstractConnector - Stopped ServerConnector@1d39a880{HTTP/1.1,[http/1.1]}{0.0.0.0:8100}
2019-09-08T18:43:42,501 INFO [main] org.eclipse.jetty.server.session - node0 Stopped scavenging
2019-09-08T18:43:42,503 INFO [main] org.eclipse.jetty.server.handler.ContextHandler - Stopped o.e.j.s.ServletContextHandler@751686aa{/,null,UNAVAILABLE}
2019-09-08T18:43:42,506 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [module] stage [NORMAL]
2019-09-08T18:43:42,506 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void org.apache.druid.server.listener.announcer.ListenerResourceAnnouncer.stop()] on object[org.apache.druid.query.lookup.LookupResourceListenerAnnouncer@54bb1194].
2019-09-08T18:43:42,506 INFO [main] org.apache.druid.server.listener.announcer.ListenerResourceAnnouncer - Unannouncing start time on [/druid_iods/listeners/lookups/__default/http:ip-XXX.eu-west-1.compute.internal:8100]
2019-09-08T18:43:42,506 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void org.apache.druid.indexing.worker.executor.ExecutorLifecycle.stop() throws java.lang.Exception] on object[org.apache.druid.indexing.worker.executor.ExecutorLifecycle@46702c26].
2019-09-08T18:43:42,507 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner.stop()] on object[org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner@12c76d6e].
2019-09-08T18:43:42,507 INFO [main] org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner - Starting graceful shutdown of task[index_hadoop_datasource_2019-09-08T18:42:42.831Z].
2019-09-08T18:43:42,507 INFO [main] org.apache.druid.indexing.overlord.TaskRunnerUtils - Task [index_hadoop_datasource_2019-09-08T18:42:42.831Z] status changed to [FAILED].
2019-09-08T18:43:42,519 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void org.apache.druid.client.cache.CaffeineCache.close()] on object[org.apache.druid.client.cache.CaffeineCache@f08d676].
2019-09-08T18:43:42,520 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void org.apache.druid.discovery.DruidLeaderClient.stop()] on object[org.apache.druid.discovery.DruidLeaderClient@5d27b4c1].
2019-09-08T18:43:42,520 INFO [main] org.apache.druid.discovery.DruidLeaderClient - Stopped.
2019-09-08T18:43:42,520 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void org.apache.druid.curator.discovery.ServerDiscoverySelector.stop() throws java.io.IOException] on object[org.apache.druid.curator.discovery.ServerDiscoverySelector@23396fc0].
2019-09-08T18:43:42,521 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void org.apache.druid.security.basic.authentication.db.cache.CoordinatorPollingBasicAuthenticatorCacheManager.stop()] on object[org.apache.druid.security.basic.authentication.db.cache.CoordinatorPollingBasicAuthenticatorCacheManager@32b112a1].
2019-09-08T18:43:42,522 INFO [main] org.apache.druid.security.basic.authentication.db.cache.CoordinatorPollingBasicAuthenticatorCacheManager - DefaultBasicAuthenticatorCacheManager is stopping.
2019-09-08T18:43:42,522 INFO [main] org.apache.druid.security.basic.authentication.db.cache.CoordinatorPollingBasicAuthenticatorCacheManager - DefaultBasicAuthenticatorCacheManager is stopped.
2019-09-08T18:43:42,522 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void org.apache.druid.java.util.metrics.MonitorScheduler.stop()] on object[org.apache.druid.java.util.metrics.MonitorScheduler@34549979].
2019-09-08T18:43:42,522 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void org.apache.druid.security.basic.authorization.db.cache.CoordinatorPollingBasicAuthorizerCacheManager.stop()] on object[org.apache.druid.security.basic.authorization.db.cache.CoordinatorPollingBasicAuthorizerCacheManager@ce655b9].
2019-09-08T18:43:42,522 INFO [main] org.apache.druid.security.basic.authorization.db.cache.CoordinatorPollingBasicAuthorizerCacheManager - CoordinatorPollingBasicAuthorizerCacheManager is stopping.
2019-09-08T18:43:42,522 INFO [main] org.apache.druid.security.basic.authorization.db.cache.CoordinatorPollingBasicAuthorizerCacheManager - CoordinatorPollingBasicAuthorizerCacheManager is stopped.
2019-09-08T18:43:42,522 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void org.apache.druid.query.lookup.LookupReferencesManager.stop()] on object[org.apache.druid.query.lookup.LookupReferencesManager@171aa66].
2019-09-08T18:43:42,522 INFO [main] org.apache.druid.query.lookup.LookupReferencesManager - LookupExtractorFactoryContainerProvider is stopping.
2019-09-08T18:43:42,522 INFO [LookupExtractorFactoryContainerProvider-MainThread] org.apache.druid.query.lookup.LookupReferencesManager - Lookup Management loop exited, Lookup notices are not handled anymore.
2019-09-08T18:43:42,522 INFO [main] org.apache.druid.query.lookup.LookupReferencesManager - LookupExtractorFactoryContainerProvider is stopped.
2019-09-08T18:43:42,522 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void org.apache.druid.discovery.DruidLeaderClient.stop()] on object[org.apache.druid.discovery.DruidLeaderClient@faec277].
2019-09-08T18:43:42,522 INFO [main] org.apache.druid.discovery.DruidLeaderClient - Stopped.
2019-09-08T18:43:42,522 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void org.apache.druid.curator.discovery.ServerDiscoverySelector.stop() throws java.io.IOException] on object[org.apache.druid.curator.discovery.ServerDiscoverySelector@1a536164].
2019-09-08T18:43:42,523 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void org.apache.druid.curator.discovery.CuratorDruidNodeDiscoveryProvider.stop()] on object[org.apache.druid.curator.discovery.CuratorDruidNodeDiscoveryProvider@3df77dfa].
2019-09-08T18:43:42,523 INFO [main] org.apache.druid.curator.discovery.CuratorDruidNodeDiscoveryProvider - stopping
2019-09-08T18:43:42,523 INFO [main] org.apache.druid.curator.discovery.CuratorDruidNodeDiscoveryProvider - stopped
2019-09-08T18:43:42,523 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void org.apache.druid.java.util.http.client.NettyHttpClient.stop()] on object[org.apache.druid.java.util.http.client.NettyHttpClient@64f3ca6].
2019-09-08T18:43:42,538 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void org.apache.druid.java.util.emitter.service.ServiceEmitter.close() throws java.io.IOException] on object[ServiceEmitter{serviceDimensions={service=druid/middleManager, host=ip-XXX.eu-west-1.compute.internal:8100, version=0.15.0-incubating}, emitter=HttpPostEmitter{config=HttpEmitterConfig{flushMillis=10000, flushCount=500, flushTimeOut=9223372036854775807, basicAuthentication=‘null’, batchingStrategy=ARRAY, maxBatchSize=5191680, contentEncoding=null, batchQueueSizeLimit=9, httpTimeoutAllowanceFactor=2.0, minHttpTimeoutMillis=0, recipientBaseUrl=‘https://fluentd.common.pps.io/druid_iods’}}}].
2019-09-08T18:43:49,802 WARN [BasicAuthenticatorCacheManager-Exec–0] org.apache.druid.java.util.common.RetryUtils - Retrying (1 of 9) in 783ms.
java.lang.IllegalStateException
at com.google.common.base.Preconditions.checkState(Preconditions.java:161) ~[guava-16.0.1.jar:?]
at org.apache.druid.discovery.DruidLeaderClient.makeRequest(DruidLeaderClient.java:130) ~[druid-server-0.15.0-incubating.jar:0.15.0-incubating]
at org.apache.druid.discovery.DruidLeaderClient.makeRequest(DruidLeaderClient.java:139) ~[druid-server-0.15.0-incubating.jar:0.15.0-incubating]
at org.apache.druid.security.basic.authentication.db.cache.CoordinatorPollingBasicAuthenticatorCacheManager.tryFetchUserMapFromCoordinator(CoordinatorPollingBasicAuthenticatorCacheManager.java:250) ~[?:?]
at org.apache.druid.security.basic.authentication.db.cache.CoordinatorPollingBasicAuthenticatorCacheManager.lambda$fetchUserMapFromCoordinator$1(CoordinatorPollingBasicAuthenticatorCacheManager.java:191) ~[?:?]
at org.apache.druid.java.util.common.RetryUtils.retry(RetryUtils.java:86) ~[druid-core-0.15.0-incubating.jar:0.15.0-incubating]
at org.apache.druid.java.util.common.RetryUtils.retry(RetryUtils.java:114) ~[druid-core-0.15.0-incubating.jar:0.15.0-incubating]
at org.apache.druid.java.util.common.RetryUtils.retry(RetryUtils.java:104) ~[druid-core-0.15.0-incubating.jar:0.15.0-incubating]
at org.apache.druid.security.basic.authentication.db.cache.CoordinatorPollingBasicAuthenticatorCacheManager.fetchUserMapFromCoordinator(CoordinatorPollingBasicAuthenticatorCacheManager.java:189) ~[?:?]
at org.apache.druid.security.basic.authentication.db.cache.CoordinatorPollingBasicAuthenticatorCacheManager.lambda$start$0(CoordinatorPollingBasicAuthenticatorCacheManager.java:121) ~[?:?]
at org.apache.druid.java.util.common.concurrent.ScheduledExecutors$1.call(ScheduledExecutors.java:55) [druid-core-0.15.0-incubating.jar:0.15.0-incubating]
at org.apache.druid.java.util.common.concurrent.ScheduledExecutors$1.call(ScheduledExecutors.java:51) [druid-core-0.15.0-incubating.jar:0.15.0-incubating]
at org.apache.druid.java.util.common.concurrent.ScheduledExecutors$2.run(ScheduledExecutors.java:92) [druid-core-0.15.0-incubating.jar:0.15.0-incubating]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_222-ea]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_222-ea]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_222-ea]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [?:1.8.0_222-ea]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_222-ea]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_222-ea]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_222-ea]

. . .

2019-09-08T18:47:45,004 ERROR [CoordinatorPollingBasicAuthorizerCacheManager-Exec--0] org.apache.druid.java.util.common.concurrent.ScheduledExecutors - Uncaught exception.
java.util.concurrent.RejectedExecutionException: Service is closed.
**	at org.apache.druid.java.util.emitter.core.HttpPostEmitter.awaitStarted(HttpPostEmitter.java:208) ~[druid-core-0.15.0-incubating.jar:0.15.0-incubating]
	at org.apache.druid.java.util.emitter.core.HttpPostEmitter.emitAndReturnBatch(HttpPostEmitter.java:233) ~[druid-core-0.15.0-incubating.jar:0.15.0-incubating]
	at org.apache.druid.java.util.emitter.core.HttpPostEmitter.emit(HttpPostEmitter.java:226)** ~[druid-core-0.15.0-incubating.jar:0.15.0-incubating]
	at org.apache.druid.java.util.emitter.service.ServiceEmitter.emit(ServiceEmitter.java:67) ~[druid-core-0.15.0-incubating.jar:0.15.0-incubating]
	at org.apache.druid.java.util.emitter.service.ServiceEmitter.emit(ServiceEmitter.java:72) ~[druid-core-0.15.0-incubating.jar:0.15.0-incubating]
	at org.apache.druid.java.util.emitter.service.AlertBuilder.emit(AlertBuilder.java:88) ~[druid-core-0.15.0-incubating.jar:0.15.0-incubating]
	at org.apache.druid.java.util.emitter.EmittingLogger$EmittingAlertBuilder.emit(EmittingLogger.java:109) ~[druid-core-0.15.0-incubating.jar:0.15.0-incubating]
	at org.apache.druid.security.basic.authorization.db.cache.CoordinatorPollingBasicAuthorizerCacheManager.lambda$start$0(CoordinatorPollingBasicAuthorizerCacheManager.java:132) ~[?:?]
	at org.apache.druid.java.util.common.concurrent.ScheduledExecutors$1.call(ScheduledExecutors.java:55) ~[druid-core-0.15.0-incubating.jar:0.15.0-incubating]
	at org.apache.druid.java.util.common.concurrent.ScheduledExecutors$1.call(ScheduledExecutors.java:51) ~[druid-core-0.15.0-incubating.jar:0.15.0-incubating]
	at org.apache.druid.java.util.common.concurrent.ScheduledExecutors$2.run(ScheduledExecutors.java:92) [druid-core-0.15.0-incubating.jar:0.15.0-incubating]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_222-ea]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_222-ea]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_222-ea]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [?:1.8.0_222-ea]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_222-ea]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_222-ea]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_222-ea]

It says:

- that the task has finished with status Success

- that the task changed to fail (for no visible reason)

After that, it starts to stop resources and the exception seems to occur on a thread that need a resource ("ServiceEmitter") that has been stopped.

Is there a chance that threads are not stopped in the right order or an issue in shared resources ?

Thanks for your answer.

Erwan

It could be security related. This is what I see from the log.

BasicAuthenticatorCacheManager-Exec--0
CoordinatorPollingBasicAuthorizerCacheManager-Exec--0

Your Druid emitter is configured to have basicAuthentication=null but the services above are expecting a basic authentication mechanism.

Rommel Garcia

Hi,

What do you mean by service above ? The metrics recipient ?

Because you’re right, no authent has been set on the emitter as our fluentd (metrics recipient) does not require it.

Unless the property druid.emitter.http.basicAuthentication

Is related to the internal basic authentication mechanism ? It’s unclear to me (for now, I thini it is related to the recipient)

This failed task is one among a lot (but it prevent all the others to run ^^)

Hi,

For the record, after discussion with Jon Wei on slack, the following issue has been created : https://github.com/apache/incubator-druid/issues/8534

Here is a transcript of the discussion :

guillaume Hier à 12 h
hi there,
Does anyone use the http emitter service ?
I’d
like to know to what corresponds the
druid.emitter.http.basicAuthentication property ? Is it related to the
metric recipient ? Or to the Druid basic authentication ?
We are
encountering an issue on some tasks that may be related to the emitter
(more details here :
(https://groups.google.com/forum/#!topic/druid-user/kXYByLOnwu0 if
needed) (modifié)

Jon Wei il y a 9 heures
druid.emitter.http.basicAuthentication is for authenticating with the metric recipient

Jon Wei il y a 9 heures
based on that task log, I don’t think you’re hitting anything related to auth issues

Jon Wei il y a 9 heures
it appears to be a bug in the shutdown sequence if that part is not terminating eventually

Jon Wei il y a 9 heures
Untitled
2019-09-08T18:43:42,507
INFO [main]
org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner - Starting
graceful shutdown of
task[index_hadoop_datasource_2019-09-08T18:42:42.831Z].
2019-09-08T18:43:42,507
INFO [main] org.apache.druid.indexing.overlord.TaskRunnerUtils - Task
[index_hadoop_datasource_2019-09-08T18:42:42.831Z] status changed to
[FAILED]

Jon Wei il y a 9 heures
in the web console task view, does that task status show as SUCCESS or FAILED (modifié)

guillaume il y a 9 heures
Actually, neither of them, it stays running

Jon Wei il y a 9 heures
hmm, that sounds like a bug, can you a file a report for tracking?

Jon Wei il y a 9 heures
since the task completed with success

Jon Wei il y a 9 heures
and only got stuck during the shutdown

Jon Wei il y a 9 heures
it should be safe to just kill it

guillaume il y a 9 heures
Yes, we will do that. Killing it is what we are doing when it happens.
We
are also waiting for the next one to see if there is pattern (specific
hour, number of tasks or anything else) that could help solve this issue

Jon Wei il y a 9 heures
my
earlier question about whether it showed SUCCESS or FAILED -> I
think I recently observed this, and that extraneous FAILED status change
at the end doesn’t affect the reported task status at the overlord if the task exits successfully (modifié)

Jon Wei il y a 9 heures
for hadoop tasks

Jon Wei il y a 9 heures
i think there’s another bug there

Jon Wei il y a 9 heures
but not same cause as whatever interaction is causing this task to get stuck

guillaume il y a 2 minutes
For the record : https://github.com/apache/incubator-druid/issues/8534

``