io.druid.curator.discovery.ServerDiscoverySelector - No server instance found for [coordinator]

Hello,

We have been seeing this error recently and are not sure what is causing it. I assumed that the middleManager cannot find the coordinator instance, but our common.runtime.properties looks fine and shows druid.selectors.coordinator.serviceName=coordinator.

I have tried restarting the coordinator; it works for a couple of minutes and then breaks down with the same issue. Is there a long-term solution for this, and what causes it?

Thanks!

2019-02-01T15:42:56,711 INFO [main] io.druid.query.lookup.LookupReferencesManager - LookupReferencesManager is starting.

2019-02-01T15:42:56,717 ERROR [main] io.druid.curator.discovery.ServerDiscoverySelector - No server instance found for [coordinator]

2019-02-01T15:42:56,718 INFO [NodeTypeWatcher[coordinator]] io.druid.curator.discovery.CuratorDruidNodeDiscoveryProvider$NodeTypeWatcher - Received INITIALIZED in node watcher for type [coordinator].

2019-02-01T15:42:56,719 WARN [main] io.druid.java.util.common.RetryUtils - Failed on try 1, retrying in 1,006ms.

io.druid.java.util.common.IOE: No known server

at io.druid.discovery.DruidLeaderClient.getCurrentKnownLeader(DruidLeaderClient.java:276) ~[druid-server-0.12.3.jar:0.12.3]

at io.druid.discovery.DruidLeaderClient.makeRequest(DruidLeaderClient.java:128) ~[druid-server-0.12.3.jar:0.12.3]

at io.druid.query.lookup.LookupReferencesManager.fetchLookupsForTier(LookupReferencesManager.java:569) ~[druid-server-0.12.3.jar:0.12.3]

at io.druid.query.lookup.LookupReferencesManager.tryGetLookupListFromCoordinator(LookupReferencesManager.java:420) ~[druid-server-0.12.3.jar:0.12.3]

at io.druid.query.lookup.LookupReferencesManager.lambda$getLookupListFromCoordinator$4(LookupReferencesManager.java:398) ~[druid-server-0.12.3.jar:0.12.3]

at io.druid.java.util.common.RetryUtils.retry(RetryUtils.java:63) [java-util-0.12.3.jar:0.12.3]

at io.druid.java.util.common.RetryUtils.retry(RetryUtils.java:81) [java-util-0.12.3.jar:0.12.3]

at io.druid.query.lookup.LookupReferencesManager.getLookupListFromCoordinator(LookupReferencesManager.java:388) [druid-server-0.12.3.jar:0.12.3]

at io.druid.query.lookup.LookupReferencesManager.getLookupsList(LookupReferencesManager.java:365) [druid-server-0.12.3.jar:0.12.3]

at io.druid.query.lookup.LookupReferencesManager.loadAllLookupsAndInitStateRef(LookupReferencesManager.java:348) [druid-server-0.12.3.jar:0.12.3]

at io.druid.query.lookup.LookupReferencesManager.start(LookupReferencesManager.java:153) [druid-server-0.12.3.jar:0.12.3]

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_144]

at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_144]

at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_144]

at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_144]

at io.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler.start(Lifecycle.java:413) [java-util-0.12.3.jar:0.12.3]

at io.druid.java.util.common.lifecycle.Lifecycle.start(Lifecycle.java:311) [java-util-0.12.3.jar:0.12.3]

at io.druid.guice.LifecycleModule$2.start(LifecycleModule.java:134) [druid-api-0.12.3.jar:0.12.3]

at io.druid.cli.GuiceRunnable.initLifecycle(GuiceRunnable.java:101) [druid-services-0.12.3.jar:0.12.3]

at io.druid.cli.CliPeon.run(CliPeon.java:301) [druid-services-0.12.3.jar:0.12.3]

at io.druid.cli.Main.main(Main.java:116) [druid-services-0.12.3.jar:0.12.3]

Hi Naveen,

Can you check whether the value of druid.service in the coordinator's runtime.properties is the same as druid.selectors.coordinator.serviceName on the middle manager?
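For reference, the two values that need to line up look roughly like this (file locations assumed from a standard deployment layout):

# On the coordinator host: conf/druid/coordinator/runtime.properties
druid.service=coordinator

# On the middle manager / peon hosts: conf/druid/_common/common.runtime.properties
druid.selectors.coordinator.serviceName=coordinator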

Thanks,

Prathamesh

Hey Prathamesh,

I did verify the properties in the coordinator's runtime.properties, and the service names in common.runtime.properties match.

Coordinator runtime.properties:

druid.host=localhost

druid.port=8081

druid.service=coordinator

common.runtime.properties:

# Indexing service discovery

druid.selectors.indexing.serviceName=overlord

druid.selectors.coordinator.serviceName=coordinator

I am not sure if I need to specify the property in the middleManager's runtime.properties as well. Any leads on how to debug this are appreciated.
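If it helps, a basic connectivity check we can run from a middle manager host, with the coordinator port taken from the config above (the host is a placeholder):

# Should return the coordinator's version and module info if it is reachable
curl http://<coordinator-host>:8081/status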

Thanks!

Hi Naveen,

What do you mean by "I have tried restarting the coordinator; it works for a couple of minutes and then breaks down with the same issue"? It seems to be just a warning. Any node that contacts the coordinator will show that warning if the coordinator goes down for some time.

How are you running druid?

Are you using docker or are you running druid by starting each druid node individually?

Is this on premise or cloud?

Thanks,

Prathamesh

Hi Naveen,
Also, are you running multiple masters? Try to view the overlord UI to verify if all the MM nodes are accounted for.

Regards,

Robert

We actually run the Druid nodes as Linux services in the background. When I tried restarting the service on the Linux host, ingestion succeeded, but after a couple of ingestion tasks went through, the tasks started failing with the same error.

I am pretty sure it is an error rather than a warning, because the tasks are failing and we don't see the data. We have a monitoring system in place that notifies us if any service or node goes down, but we did not receive any alerts about downtime on the coordinator.

We run Druid on different host machines, starting each node individually.

Hey Robert,

The overlord UI shows all the middle managers as available and running. We only have one instance of the coordinator running currently.

Thanks!

Hi Naveen,
So is the actual issue that the ingestion tasks are failing? And when you restart the coordinator, the ingestion tasks work again for the first few ingestions? Where are you ingesting from? Have you tried other ingestion specs to see if you get similar issues? Posting the spec here might help.

Regards,

Robert

Hey Robert,

The main issue is that the ingestion tasks are failing, but I wanted to post the underlying error so that it would be easier to respond.

So yes, when the ingestion tasks were failing I dug through the logs and found this issue. I assumed that the middle managers could not find the coordinator server and checked the runtime properties and common properties. The service names looked good, but the issue persisted. I went ahead and restarted the coordinator node to see if that helps; it might have helped for a little while, but the issue persisted. We are trying to ingest data using supervisors. We have around 8 datasources currently, but none of them are receiving any data, with different specs (but all using supervisors).

Spec:

{
  "type": "kafka",
  "dataSchema": {
    "dataSource": "test-requests",
    "parser": {
      "type": "string",
      "parseSpec": {
        "format": "json",
        "timestampSpec": {
          "column": "timestamp",
          "format": "millis"
        },
        "dimensionsSpec": {
          "dimensions": [
            "site",
            "env",
            "host",
            "method",
            "statuscode",
            "bytes",
            "duration",
            "resource_type",
            "repo",
            "clientip",
            "timestamp",
            "username"
          ],
          "dimensionExclusions": [],
          "spatialDimensions": []
        }
      }
    },
    "metricsSpec": [
      {
        "type": "count",
        "name": "count"
      },
      {
        "type": "longSum",
        "name": "bytesSum",
        "fieldName": "bytes",
        "expression": null
      },
      {
        "type": "longSum",
        "name": "durationSum",
        "fieldName": "duration",
        "expression": null
      }
    ],
    "granularitySpec": {
      "type": "uniform",
      "segmentGranularity": "HOUR",
      "queryGranularity": {
        "type": "none"
      },
      "rollup": true,
      "intervals": null
    },
    "transformSpec": {
      "filter": null,
      "transforms": []
    }
  },
  "tuningConfig": {
    "type": "kafka",
    "maxRowsInMemory": 500000,
    "maxRowsPerSegment": 5000000,
    "intermediatePersistPeriod": "PT15M",
    "basePersistDirectory": "/druid/tmp/1544455097459-0",
    "maxPendingPersists": 0,
    "indexSpec": {
      "bitmap": {
        "type": "concise"
      },
      "dimensionCompression": "lz4",
      "metricCompression": "lz4",
      "longEncoding": "longs"
    },
    "buildV9Directly": true,
    "reportParseExceptions": false,
    "handoffConditionTimeout": 0,
    "resetOffsetAutomatically": false,
    "segmentWriteOutMediumFactory": null,
    "workerThreads": null,
    "chatThreads": null,
    "chatRetries": 8,
    "httpTimeout": "PT10S",
    "shutdownTimeout": "PT80S",
    "offsetFetchPeriod": "PT30S"
  },
  "ioConfig": {
    "topic": "log.test.json.request_log",
    "replicas": 1,
    "taskCount": 1,
    "taskDuration": "PT900S",
    "consumerProperties": {
      "bootstrap.servers": "localhost:9092"
    },
    "startDelay": "PT5S",
    "period": "PT30S",
    "useEarliestOffset": false,
    "completionTimeout": "PT1800S",
    "lateMessageRejectionPeriod": null,
    "earlyMessageRejectionPeriod": null,
    "skipOffsetGaps": false
  },
  "context": null
}
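If it helps, the supervisor's own view of the failures can be pulled from the overlord API (assuming the default overlord port and that the supervisor id matches the dataSource):

# Returns the supervisor state and recent task status for this dataSource
curl "http://<overlord-host>:8090/druid/indexer/v1/supervisor/test-requests/status"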

Thanks!

Hi Naveen,

It looks like the task failed to read the ZooKeeper path for the coordinator.

Do you see any errors or exceptions in the coordinator log?
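One way to inspect that path directly, assuming the default druid.discovery.curator.path of /druid/discovery and ZooKeeper on localhost:2181:

# List the coordinator instances registered for service discovery
./zkCli.sh -server localhost:2181 ls /druid/discovery/coordinator

If that node is missing or empty while the coordinator is up, discovery (rather than the coordinator itself) is likely the problem.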

Jihoon

Hi Naveen,

Can you also check if your historical nodes are at capacity? In the coordinator console look for the section where storage is specified.

(screenshot attached: historical.PNG)
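The same numbers can also be pulled from the coordinator API, assuming the coordinator host and port posted earlier in the thread:

# Lists each data server with its current and maximum segment storage size
curl "http://<coordinator-host>:8081/druid/coordinator/v1/servers?simple"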

Thanks,

Prathamesh