High Availability of coordinator/overlord nodes

Hi ,

We are doing a PoC on Druid, and as part of that we have set up the cluster as follows:

2 coordinator/overlord/ZooKeeper nodes
3 historical nodes
2 broker nodes
Deep storage: S3
Metadata store: Derby

We would like to test some failover scenarios:

1. Send a data ingestion task to the overlord. If that node is down, is there any way we can route the request to the backup node, or does it require an external load balancer of some kind?
2. How does the coordinator work in the above scenario? Does the second coordinator node assume the leader role automatically?

Basically, we would like to know how we can test the failover scenarios for the coordinator and overlord nodes.

Sorry for asking some basic questions; we are pretty new to Druid.

Sunil

Coordinators and overlords support high availability, but it isn't required in all cases.

In the absence of an overlord or coordinator things will continue running in the last state they knew to run in, so it is possible to run with only one of each (which is what we do in our internal-only cluster).

But for truly high availability and uninterrupted service, running more than one of each is desirable. The instances use ZooKeeper to perform leader election among themselves (both the coordinator and the overlord do this).

We test failover regularly during the course of upgrades and other maintenance.

For sending data, using a discovery service would be required. You can either look at the ZooKeeper announcements directly, or hit the nodes; the non-leader nodes will redirect to the leader.
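
A minimal sketch of that "hit the nodes" approach for task submission, in Python with the requests library; the overlord host names and the 8090 port are placeholder assumptions for a two-overlord setup like the one above, and the task endpoint is the one mentioned later in this thread:

# Failover sketch: try each overlord in turn and submit the ingestion task to
# the first one that responds. Per the note above, a non-leader overlord
# redirects the request to the current leader.
import json
import requests

OVERLORDS = ["http://overlord1:8090", "http://overlord2:8090"]  # placeholder hosts/port

def submit_task(task_spec):
    for base in OVERLORDS:
        try:
            # Basic liveness probe before submitting the task.
            requests.get(base + "/status", timeout=5).raise_for_status()
            resp = requests.post(
                base + "/druid/indexer/v1/task",
                data=json.dumps(task_spec),
                headers={"Content-Type": "application/json"},
                timeout=30,
            )
            resp.raise_for_status()
            return resp.json()  # contains the task id on success
        except requests.RequestException:
            continue  # node down or unreachable; try the next one
    raise RuntimeError("no overlord accepted the task")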

Thank you Charles.

When you say discovery service, are you referring to the druid.selectors.coordinator.serviceName property which we mention in the common runtime properties file? Also, when we submit a task we need to submit it to a running instance of the overlord node, right (http://<OVERLORD_IP>:/druid/indexer/v1/task)? Is there any way we can determine which node is up before sending the request?

For the broker, we have done something programmatic to check if the service is down and then route the request to the next node. Since this is for a PoC we didn't want to set up other dependencies such as an external load balancer.

Sunil
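
A stripped-down sketch of that kind of check, again in Python with requests; the broker addresses and the 8082 port are placeholder assumptions:

# Probe each broker's /status endpoint and send the query to the first broker
# that answers; /druid/v2/ is the broker's native query endpoint.
import requests

BROKERS = ["http://broker1:8082", "http://broker2:8082"]  # placeholder hosts/port

def query_first_healthy_broker(query):
    for base in BROKERS:
        try:
            requests.get(base + "/status", timeout=2).raise_for_status()
            return requests.post(base + "/druid/v2/", json=query, timeout=60).json()
        except requests.RequestException:
            continue  # broker down; fall through to the next one
    raise RuntimeError("no broker is reachable")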


I think I figured out where the issue is. It looks like the Derby metadata storage doesn't allow a high-availability setup, so we are in the process of setting up a Postgres metadata store.

We have set up a PostgreSQL store on the EC2 instance, modified common.runtime.properties to include the details of the remote DB, and included postgresql-metadata-storage in the extension load list. However, while trying to start the coordinator I get the following error.

2016-09-06T05:55:35,815 ERROR [main] io.druid.cli.CliCoordinator - Error when starting up. Failing.
com.google.inject.ProvisionException: Guice provision errors:

  1. Unknown provider[postgresql] of Key[type=io.druid.metadata.MetadataStorageConnector, annotation=[none]], known options[[derby]]
    at io.druid.guice.PolyBind.createChoiceWithDefault(PolyBind.java:86)
    while locating io.druid.metadata.MetadataStorageConnector
    at io.druid.guice.JacksonConfigManagerModule.getConfigManager(JacksonConfigManagerModule.java:52)
    at io.druid.guice.JacksonConfigManagerModule.getConfigManager(JacksonConfigManagerModule.java:52)
    while locating io.druid.common.config.ConfigManager
    for parameter 0 at io.druid.common.config.JacksonConfigManager.(JacksonConfigManager.java:48)
    at io.druid.guice.JacksonConfigManagerModule.configure(JacksonConfigManagerModule.java:41)
    while locating io.druid.common.config.JacksonConfigManager
    for parameter 2 at io.druid.server.coordinator.DruidCoordinator.(DruidCoordinator.java:152)
    at io.druid.cli.CliCoordinator$1.configure(CliCoordinator.java:149)
    while locating io.druid.server.coordinator.DruidCoordinator

  2. Unknown provider[postgresql] of Key[type=io.druid.metadata.SQLMetadataConnector, annotation=[none]], known options[[derby]]
    at io.druid.guice.PolyBind.createChoiceWithDefault(PolyBind.java:86)
    while locating io.druid.metadata.SQLMetadataConnector
    for parameter 1 at io.druid.server.audit.SQLAuditManagerProvider.(SQLAuditManagerProvider.java:50)
    while locating io.druid.server.audit.SQLAuditManagerProvider
    at io.druid.guice.PolyBind.createChoiceWithDefault(PolyBind.java:86)
    while locating io.druid.server.audit.AuditManagerProvider
    at io.druid.cli.CliCoordinator$1.configure(CliCoordinator.java:142)
    while locating io.druid.audit.AuditManager
    for parameter 2 at io.druid.common.config.JacksonConfigManager.(JacksonConfigManager.java:48)
    at io.druid.guice.JacksonConfigManagerModule.configure(JacksonConfigManagerModule.java:41)
    while locating io.druid.common.config.JacksonConfigManager
    for parameter 2 at io.druid.server.coordinator.DruidCoordinator.(DruidCoordinator.java:152)
    at io.druid.cli.CliCoordinator$1.configure(CliCoordinator.java:149)
    while locating io.druid.server.coordinator.DruidCoordinator

  3. Unknown provider[postgresql] of Key[type=io.druid.metadata.SQLMetadataConnector, annotation=[none]], known options[[derby]]
    at io.druid.guice.PolyBind.createChoiceWithDefault(PolyBind.java:86)
    while locating io.druid.metadata.SQLMetadataConnector
    for parameter 3 at io.druid.metadata.SQLMetadataSegmentManagerProvider.(SQLMetadataSegmentManagerProvider.java:45)
    while locating io.druid.metadata.SQLMetadataSegmentManagerProvider
    at io.druid.guice.PolyBind.createChoiceWithDefault(PolyBind.java:86)
    while locating io.druid.metadata.MetadataSegmentManagerProvider
    at io.druid.cli.CliCoordinator$1.configure(CliCoordinator.java:134)
    while locating io.druid.metadata.MetadataSegmentManager
    for parameter 3 at io.druid.server.coordinator.DruidCoordinator.(DruidCoordinator.java:152)
    at io.druid.cli.CliCoordinator$1.configure(CliCoordinator.java:149)
    while locating io.druid.server.coordinator.DruidCoordinator

  4. Unknown provider[postgresql] of Key[type=io.druid.metadata.SQLMetadataConnector, annotation=[none]], known options[[derby]]
    at io.druid.guice.PolyBind.createChoiceWithDefault(PolyBind.java:86)
    while locating io.druid.metadata.SQLMetadataConnector
    for parameter 3 at io.druid.metadata.SQLMetadataRuleManagerProvider.(SQLMetadataRuleManagerProvider.java:52)
    while locating io.druid.metadata.SQLMetadataRuleManagerProvider
    at io.druid.guice.PolyBind.createChoiceWithDefault(PolyBind.java:86)
    while locating io.druid.metadata.MetadataRuleManagerProvider
    at io.druid.cli.CliCoordinator$1.configure(CliCoordinator.java:138)
    while locating io.druid.metadata.MetadataRuleManager
    for parameter 5 at io.druid.server.coordinator.DruidCoordinator.(DruidCoordinator.java:152)
    at io.druid.cli.CliCoordinator$1.configure(CliCoordinator.java:149)
    while locating io.druid.server.coordinator.DruidCoordinator

  5. Unknown provider[postgresql] of Key[type=io.druid.metadata.SQLMetadataConnector, annotation=[none]], known options[[derby]]
    at io.druid.guice.PolyBind.createChoiceWithDefault(PolyBind.java:86)
    while locating io.druid.metadata.SQLMetadataConnector
    for parameter 0 at io.druid.server.audit.SQLAuditManager.(SQLAuditManager.java:67)
    at io.druid.server.audit.SQLAuditManager.class(SQLAuditManager.java:51)
    while locating io.druid.server.audit.SQLAuditManager
    for parameter 5 at io.druid.metadata.SQLMetadataRuleManagerProvider.(SQLMetadataRuleManagerProvider.java:52)
    while locating io.druid.metadata.SQLMetadataRuleManagerProvider
    at io.druid.guice.PolyBind.createChoiceWithDefault(PolyBind.java:86)
    while locating io.druid.metadata.MetadataRuleManagerProvider
    at io.druid.cli.CliCoordinator$1.configure(CliCoordinator.java:138)
    while locating io.druid.metadata.MetadataRuleManager
    for parameter 5 at io.druid.server.coordinator.DruidCoordinator.(DruidCoordinator.java:152)
    at io.druid.cli.CliCoordinator$1.configure(CliCoordinator.java:149)
    while locating io.druid.server.coordinator.DruidCoordinator

5 errors
at com.google.inject.internal.InjectorImpl$3.get(InjectorImpl.java:1014) ~[guice-4.0-beta.jar:?]
at com.google.inject.internal.InjectorImpl.getInstance(InjectorImpl.java:1036) ~[guice-4.0-beta.jar:?]
at io.druid.guice.LifecycleModule$2.start(LifecycleModule.java:153) ~[druid-api-0.9.1.1.jar:0.9.1.1]
at io.druid.cli.GuiceRunnable.initLifecycle(GuiceRunnable.java:91) [druid-services-0.9.1.1.jar:0.9.1.1]
at io.druid.cli.ServerRunnable.run(ServerRunnable.java:40) [druid-services-0.9.1.1.jar:0.9.1.1]
at io.druid.cli.Main.main(Main.java:105) [druid-services-0.9.1.1.jar:0.9.1.1]

I have a couple of questions with respect to the metadata store.

1. When the cluster was set up initially, I used the default Derby as the metadata store. Is it okay to switch the metadata store in between, or do I need to set up everything from the beginning?

2. One thing I suspect is that a version mismatch between the postgresql extension and the PostgreSQL server installed on the EC2 instance is causing problems. The extension directory shows postgresql-9.4.1208.jre7.jar, whereas the server version is 9.2. If that's the case, should I include custom extensions to solve this issue?

Regards
Sunil

The error you are seeing generally means the Postgres extension was not properly included. How did you include the extension?

Hi Fangjin,

This is how I included the extension in the common runtime properties file:

druid.extensions.loadList=["druid-kafka-eight", "druid-s3-extensions", "druid-histogram", "druid-datasketches", "druid-lookups-cached-global", "mysql-metadata-storage", "druid-hdfs-storage", "postgresql-metadata-storage"]

However, I don't need all of them; only druid-s3-extensions and postgresql-metadata-storage would be required.

Sunil
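
For reference, the load list only pulls in the extension; the connector itself also has to be pointed at Postgres in common.runtime.properties, roughly along these lines (host, database name, and credentials are placeholders):

druid.metadata.storage.type=postgresql
druid.metadata.storage.connector.connectURI=jdbc:postgresql://<POSTGRES_HOST>:5432/druid
druid.metadata.storage.connector.user=<DB_USER>
druid.metadata.storage.connector.password=<DB_PASSWORD>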

Hi Fangjin,

Is this the right way to include the extension, or am I missing anything here?

Regards
Sunil

Hmm, the way you included the extension seems correct. Can you include the logs of your node when it first starts up? It will print what extensions it has included as well as the runtime.properties it was configured with.

Hi Fangjin,

Sorry for the delay in response. I think the issue was that the extensions were included in the common runtime properties and not in the historical runtime properties (it had only the s3 and hdfs extensions explicitly defined). We don't have the logs right now, since the Druid setup was terminated after the PoC work.
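
As a closing note for anyone hitting the same Guice error: the node-specific runtime.properties is read after common.runtime.properties, so a druid.extensions.loadList defined there takes precedence over the common one. A node that keeps its own load list therefore has to repeat the metadata-store extension, for example:

# A node-level loadList overrides the common one, so the metadata-store
# extension must be listed here too (illustrative example):
druid.extensions.loadList=["druid-s3-extensions", "postgresql-metadata-storage"]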