Druid services going down without any specific error

I see this error message before services go down.

2020-07-26T18:48:25,466 INFO [Thread-131] org.apache.druid.java.util.common.lifecycle.Lifecycle - Lifecycle [module] running shutdown hook
2020-07-26T18:48:25,468 INFO [Thread-131] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [module] stage [ANNOUNCEMENTS]
2020-07-26T18:48:25,469 INFO [Thread-131] org.apache.druid.curator.announcement.Announcer - Unannouncing

This happens for all the services; they go down after 2-3 hours. I have set the log level to debug and still do not see any specific errors.
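In case it helps, this is roughly the log4j2.xml I am using in conf/druid/cluster/_common/ (the stock file that ships with Druid, with the root level raised from info to debug):

<?xml version="1.0" encoding="UTF-8" ?>
<Configuration status="WARN">
  <Appenders>
    <Console name="Console" target="SYSTEM_OUT">
      <PatternLayout pattern="%d{ISO8601} %p [%t] %c - %m%n"/>
    </Console>
  </Appenders>
  <Loggers>
    <!-- Root raised to debug so shutdown causes are not filtered out -->
    <Root level="debug">
      <AppenderRef ref="Console"/>
    </Root>
  </Loggers>
</Configuration>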

My services are configured like this:

The ZooKeeper cluster runs on 3 nodes, separate from the Druid cluster nodes.

Druid 0.18.1 runs on a 6-node cluster:

  • 2 data nodes - i3.4xlarge
  • 2 master nodes - m5.2xlarge
  • 2 query nodes - m5.2xlarge

The Overlord runs within the Coordinator.
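For reference, the ZooKeeper connection in common.runtime.properties looks roughly like this (hostnames are placeholders):

# ZooKeeper ensemble: 3 nodes, separate from the Druid hosts
druid.zk.service.host=zk1.internal:2181,zk2.internal:2181,zk3.internal:2181
druid.zk.paths.base=/druid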

Any ideas?

Where is this error? Have you looked in all the logs?

Hi Rachel,

Thanks for replying. I looked at all the logs and everything goes down with the same message.

2020-07-31T20:00:18,614 INFO [Curator-PathChildrenCache-0] org.apache.druid.indexing.overlord.RemoteTaskRunner - Kaboom! Worker[localhost:8091] removed!
2020-07-31T20:00:18,615 INFO [Curator-PathChildrenCache-0] org.apache.druid.indexing.overlord.RemoteTaskRunner - [localhost:8091]: Found 0 tasks assigned
2020-07-31T20:00:18,636 INFO [Thread-53] org.apache.druid.java.util.common.lifecycle.Lifecycle - Lifecycle [module] running shutdown hook
2020-07-31T20:00:18,638 INFO [Thread-53] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [module] stage [ANNOUNCEMENTS]
2020-07-31T20:00:18,639 INFO [Thread-53] org.apache.druid.curator.announcement.Announcer - Unannouncing [/druid/internal-discovery/OVERLORD/localhost:8081]
2020-07-31T20:00:18,649 INFO [NodeRoleWatcher[OVERLORD]] org.apache.druid.curator.discovery.CuratorDruidNodeDiscoveryProvider$NodeRoleWatcher - Node[http://localhost:8081] of role[overlord] went offline.
2020-07-31T20:00:18,649 INFO [Thread-53] org.apache.druid.curator.discovery.CuratorDruidNodeAnnouncer - Unannounced self [{"druidNode":{"service":"druid/coordinator","host":"localhost","bindOnHost":false,"pl$
2020-07-31T20:00:18,650 INFO [Thread-53] org.apache.druid.curator.announcement.Announcer - Unannouncing [/druid/internal-discovery/COORDINATOR/localhost:8081]
2020-07-31T20:00:18,652 INFO [Thread-53] org.apache.druid.curator.discovery.CuratorDruidNodeAnnouncer - Unannounced self [{"druidNode":{"service":"druid/coordinator","host":"localhost","bindOnHost":false,"pl$
2020-07-31T20:00:18,653 INFO [Thread-53] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [module] stage [SERVER]
2020-07-31T20:00:18,654 INFO [NodeRoleWatcher[BROKER]] org.apache.druid.curator.discovery.CuratorDruidNodeDiscoveryProvider$NodeRoleWatcher - Node[http://localhost:8082] of role[broker] went offline.
2020-07-31T20:00:18,656 INFO [NodeRoleWatcher[HISTORICAL]] org.apache.druid.curator.discovery.CuratorDruidNodeDiscoveryProvider$NodeRoleWatcher - Node[http://localhost:8083] of role[historical] went offline.
2020-07-31T20:00:18,657 INFO [Thread-53] org.eclipse.jetty.server.AbstractConnector - Stopped ServerConnector@58e02359{HTTP/1.1,[http/1.1]}{0.0.0.0:8081}
2020-07-31T20:00:18,660 INFO [Thread-53] org.eclipse.jetty.server.session - node0 Stopped scavenging
2020-07-31T20:00:18,662 INFO [ServerInventoryView-0] org.apache.druid.client.BatchServerInventoryView - Server Disappeared[DruidServerMetadata{name='localhost:8083', hostAndPort='localhost:8083', hostAndTlsP$
2020-07-31T20:00:18,662 INFO [Thread-53] org.eclipse.jetty.server.handler.ContextHandler - Stopped o.e.j.s.ServletContextHandler@41167ded{/,jar:file:/opt/druid/lib/druid-console-0.18.1.jar!/org/apache/druid/
2020-07-31T20:00:18,667 INFO [Thread-53] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [module] stage [NORMAL]
2020-07-31T20:00:18,667 INFO [Thread-53] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [task-master] stage [ANNOUNCEMENTS]
2020-07-31T20:00:18,667 INFO [Thread-53] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [task-master] stage [SERVER]
2020-07-31T20:00:18,667 INFO [Thread-53] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [task-master] stage [NORMAL]
2020-07-31T20:00:18,667 INFO [Thread-53] org.apache.druid.curator.discovery.CuratorServiceAnnouncer - Unannouncing service[DruidNode{serviceName='druid/overlord', host='localhost', bindOnHost=false, port=-1,
2020-07-31T20:00:18,670 INFO [Thread-53] org.apache.druid.indexing.overlord.helpers.OverlordHelperManager - OverlordHelperManager is stopping.
2020-07-31T20:00:18,670 INFO [Thread-53] org.apache.druid.indexing.overlord.helpers.OverlordHelperManager - OverlordHelperManager is stopped.
2020-07-31T20:00:18,670 INFO [Thread-53] org.apache.druid.indexing.overlord.supervisor.SupervisorManager - SupervisorManager stopped.
2020-07-31T20:00:18,670 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.TaskQueue - Interrupted, exiting!
2020-07-31T20:00:18,670 INFO [Thread-53] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [task-master] stage [INIT]
2020-07-31T20:00:18,689 INFO [Thread-53] org.apache.druid.metadata.storage.derby.DerbyConnector - Stopping DerbyConnector…

2020-07-31T20:00:18,796 INFO [Curator-Framework-0] org.apache.curator.framework.imps.CuratorFrameworkImpl - backgroundOperationsLoop exiting
2020-07-31T20:00:18,797 INFO [Thread-53] org.apache.zookeeper.ZooKeeper - Session: 0x10000047bfd0002 closed
2020-07-31T20:00:18,797 INFO [Thread-53] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [module] stage [INIT]

One thing I noticed: when I spin up the cluster locally with the same config, everything runs fine, but on the EC2 instances it does not. In addition to the default security policy, I have opened up the following ports for communication:

2888, 3888, 2181, 3306, 8200, 8888, and 8080 through 8100.
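To rule out the network path, I ran checks along these lines from the Druid nodes (hostnames are placeholders):

# Confirm the ZooKeeper client port is reachable from each Druid node
nc -zv zk1.internal 2181
nc -zv zk2.internal 2181
nc -zv zk3.internal 2181

# ZooKeeper four-letter-word health check; a healthy server answers "imok"
echo ruok | nc zk1.internal 2181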

The cluster stays up for 4-5 hours and then suddenly goes down. Something prompts the services to shut down.

The above logs are from the coordinator-overlord.

Feels like you have an error in your ZooKeeper config.

If your session gets closed by ZK, I'd wager you do not have quorum on the ZK side.
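You can verify that on each ZK node with something like:

# "Mode: leader" or "Mode: follower" means quorum is formed;
# "Mode: standalone" or a connection error points at an ensemble problem
echo srvr | nc localhost 2181 | grep Mode

# Or use the script bundled with ZooKeeper
zkServer.sh status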

Hi Marc,

Thanks for your reply. It doesn't seem like a ZooKeeper issue. I spun up a single-node cluster on an EC2 instance and it didn't work, while the same single-node cluster on my machine worked fine. The same ZooKeeper configuration was used in both cases.

EC2 doesn't seem to like something about the Druid services and kills them. So I decided to spin up the services in Docker containers on the EC2 instances; so far they seem to be running fine. I am not sure what is wrong with running the services natively on EC2.

How do you spin up your Druid locally? Docker, native Java?

What OS do you use on your EC2 instance?

Without the complete logs of everything, I can only speculate on a couple of things:

  • Do you use local storage on Amazon, or S3 for logs and/or deep storage?
  • Oracle JDK vs. OpenJDK?

I once had a crash after 1-2 hours when I was unable to publish segments while using Kafka. It wasn't silent, but since I was pushing the logs to a misconfigured S3 location, I lost all traces: every service trying to log the bubbling exception crashed, and the whole cluster ended up in a sorry state.
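If you are on S3, it's worth double-checking properties along these lines (bucket names here are placeholders); a typo in the log settings is exactly what ate my traces:

# common.runtime.properties: deep storage on S3
druid.storage.type=s3
druid.storage.bucket=my-druid-segments
druid.storage.baseKey=segments

# Task logs on S3; if this bucket or prefix is wrong,
# task logs vanish instead of surfacing the real error
druid.indexer.logs.type=s3
druid.indexer.logs.s3Bucket=my-druid-logs
druid.indexer.logs.s3Prefix=indexing-logs

druid.extensions.loadList=["druid-s3-extensions"]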

Yes, native Java.

The OS on the EC2 instances is "centos rhel fedora", with openjdk version "1.8.0_252",

while my local machine runs macOS with openjdk version "1.8.0_242".

Initially I tried S3 for deep storage and indexing logs. Since the cluster kept going down, I set up a simple cluster with the default configurations that come with Druid; I just ran bin/start-cluster-single-server-medium, with the same type of configuration on both local and EC2. EC2 doesn't seem to like something about Druid: even with the default configurations it kept going down, and I could not figure out a way to debug it. The logs don't tell me the entire story.
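For what it's worth, the single-server scripts write each service's output under var/sv/, and that is where I have been looking:

# Per-service logs written by the supervise-based quickstart scripts
ls var/sv/
tail -f var/sv/coordinator-overlord.log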

Also, I am not ingesting anything in real time as of now; I loaded a very small amount of data, around 100 MB, from S3 using the console.

Now I am using Docker containers, with S3 configured for deep storage as well as indexing logs. Somehow, running in Docker works.
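What I run now is roughly one container per role, along these lines (the environment file is adapted from the docker-compose setup in the Druid distribution):

# The official image takes the Druid role to start as its command
docker run -d --name coordinator \
  --env-file environment \
  -p 8081:8081 \
  apache/druid:0.18.1 coordinator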

I do want to figure out what is wrong on EC2, because this is very strange and I am sure there must be an explanation for it.
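Since the shutdown hook runs cleanly, the JVM seems to be getting an orderly stop rather than crashing. If anyone else hits this, something like the following (assuming auditd is available on the instance) should show where the signal comes from:

# Log every kill() syscall so we can see who signals the Druid JVMs (run as root, x86_64)
auditctl -a always,exit -F arch=b64 -S kill -k druid_kill

# After the next shutdown, inspect which process issued the signal
ausearch -k druid_kill -i

# Also rule out the kernel OOM killer
dmesg -T | grep -iE 'killed process|out of memory'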