Kill task is failing with HA HDFS deep storage

Hi,

I am currently running Druid 0.11.0 with an HA HDFS setup for deep storage. I am trying to delete segments through the kill task, but it is failing with the response below.
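For reference, the kill task spec I submit looks roughly like this (datasource name redacted), POSTed to the overlord's /druid/indexer/v1/task endpoint:

{
  "type": "kill",
  "dataSource": "<ds-name>",
  "interval": "2018-12-19T00:00:00.000Z/2018-12-20T00:00:00.000Z"
}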

{"task": "kill_2018-12-19T00:00:00.000Z_2018-12-20T00:00:00.000Z_2019-02-08T09:00:33.858Z", "status": {"id": "kill_2018-12-19T00:00:00.000Z_2018-12-20T00:00:00.000Z_2019-02-08T09:00:33.858Z", "status": "FAILED", "duration": 6058}}

When I checked the logs, I saw that the client connected to the standby HDFS namenode. The standby namenode does not serve these operations (the error below says operation category READ is not supported in standby state), so the task failed right there.

Logs for reference

org.apache.hadoop.ipc.Client - Connecting to / of standby server.

DEBUG [IPC Parameter Sending Thread #0] org.apache.hadoop.ipc.Client - IPC Client (332998175) connection to / from root sending #0

2019-02-08T09:00:38,979 ERROR [task-runner-0-priority-0] io.druid.indexing.overlord.ThreadPoolTaskRunner - Exception while running task[KillTask{id=kill_<ds-name>_2018-12-19T00:00:00.000Z_2018-12-20T00:00:00.000Z_2019-02-08T09:00:33.858Z, type=kill, dataSource=<dsname>}]
io.druid.segment.loading.SegmentLoadingException: Unable to kill segment
	at io.druid.storage.hdfs.HdfsDataSegmentKiller.kill(HdfsDataSegmentKiller.java:122) ~[?:?]
	at io.druid.segment.loading.OmniDataSegmentKiller.kill(OmniDataSegmentKiller.java:46) ~[druid-server-0.11.0.jar:0.11.0]
	at io.druid.indexing.common.task.KillTask.run(KillTask.java:103) ~[druid-indexing-service-0.11.0.jar:0.11.0]
	at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:436) [druid-indexing-service-0.11.0.jar:0.11.0]
	at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:408) [druid-indexing-service-0.11.0.jar:0.11.0]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_191]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_191]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_191]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_191]
Caused by: org.apache.hadoop.ipc.RemoteException: Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error
	at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:88)
	at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1983)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1382)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:2934)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:1124)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:873)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:503)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:868)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:814)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1886)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2603)

	at org.apache.hadoop.ipc.Client.call(Client.java:1475) ~[?:?]
	at org.apache.hadoop.ipc.Client.call(Client.java:1412) ~[?:?]
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229) ~[?:?]
	at com.sun.proxy.$Proxy67.getFileInfo(Unknown Source) ~[?:?]
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:771) ~[?:?]
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_191]
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_191]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_191]
	at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_191]
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191) ~[?:?]
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) ~[?:?]
	at com.sun.proxy.$Proxy68.getFileInfo(Unknown Source) ~[?:?]
	at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2108) ~[?:?]
	at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1305) ~[?:?]
	at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301) ~[?:?]
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) ~[?:?]
	at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1317) ~[?:?]
	at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1426) ~[?:?]
	at io.druid.storage.hdfs.HdfsDataSegmentKiller.kill(HdfsDataSegmentKiller.java:69) ~[?:?]
	... 8 more
2019-02-08T09:00:38,987 INFO [task-runner-0-priority-0] io.druid.indexing.overlord.TaskRunnerUtils - Task [kill_<ds-name>_2018-12-19T00:00:00.000Z_2018-12-20T00:00:00.000Z_2019-02-08T09:00:33.858Z] status changed to [FAILED].
2019-02-08T09:00:38,987 DEBUG [main] com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory - getComponentProvider(com.fasterxml.jackson.jaxrs.smile.JacksonSmileProvider)
2019-02-08T09:00:38,987 INFO [main] com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory - Binding com.fasterxml.jackson.jaxrs.smile.JacksonSmileProvider to GuiceManagedComponentProvider with the scope "Singleton"
2019-02-08T09:00:38,989 INFO [task-runner-0-priority-0] io.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status: {
  "id" : "kill_<ds-name>_2018-12-19T00:00:00.000Z_2018-12-20T00:00:00.000Z_2019-02-08T09:00:33.858Z",
  "status" : "FAILED",
  "duration" : 202
}


What confuses me is why it does not try to connect to the other IP (i.e. the active namenode), which is also mentioned in the config file, instead of failing. Can someone please help me resolve this?

Regards,
Manish!!

Hi,
Did you mention your nameservice in your hdfs-site.xml?

If you hard-code the active namenode, it will work only as long as that namenode stays active.

Once it becomes the passive/standby namenode, your job will fail. To get around this, you might have to reference your nameservice (instead of hard-coding the active namenode value) both in hdfs-site.xml and in common.runtime.properties where you set your deep storage values, as sketched below.
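A minimal sketch of what I mean, using a placeholder nameservice name (mycluster) and segment path; substitute your own values:

# common.runtime.properties -- deep storage points at the nameservice, not a namenode host
druid.storage.type=hdfs
druid.storage.storageDirectory=hdfs://mycluster/druid/segments

<!-- hdfs-site.xml -- define the nameservice so the HDFS client knows how to fail over -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<!-- dfs.namenode.rpc-address.mycluster.nn1 / .nn2 must also be set -->
<property>
  <name>dfs.client.failover.proxy.provider.mycluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>

Also make sure these Hadoop XMLs (core-site.xml and hdfs-site.xml) are on the classpath of your Druid nodes, otherwise the HDFS client cannot resolve the nameservice.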

Hope this helps.

Thanks,

–siva

Hi siva,

Below are my configs for core-site.xml and hdfs-site.xml.

  1. core-site.xml

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://prodhdfscluster</value>
</property>
<property>
  <name>dfs.permissions</name>
  <value>false</value>
</property>

  2. hdfs-site.xml

<property>
  <name>dfs.data.dir</name>
  <value>/opt/hadoop/hadoop/dfs/name/data</value>
  <final>true</final>
</property>
<property>
  <name>dfs.name.dir</name>
  <value>/opt/hadoop/hadoop/dfs/name</value>
  <final>true</final>
</property>
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
<property>
  <name>dfs.nameservices</name>
  <value>prodhdfscluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.prodhdfscluster</name>
  <value>nn-1,nn-2</value>
</property>
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal:///prodhdfscluster</value>
</property>
<property>
  <name>dfs.journalnode.edits.dir</name>
  <value>/opt/jounralnode/data/</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.prodhdfscluster.nn-1</name>
  <value><namenode-1-IP></value>
</property>
<property>
  <name>dfs.namenode.rpc-address.prodhdfscluster.nn-2</name>
  <value><namenode-2-IP></value>
</property>
<property>
  <name>dfs.namenode.http-address.prodhdfscluster.nn-1</name>
  <value><namenode-1-IP></value>
</property>
<property>
  <name>dfs.namenode.http-address.prodhdfscluster.nn-2</name>
  <value><namenode-2-IP></value>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.prodhdfscluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>shell(/bin/true)</value>
</property>
<property>
  <name>dfs.ha.fencing.ssh.private-key-files</name>
  <value>/home/hadoop/.ssh/id_rsa</value>
</property>
<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>

Below is my common.runtime.properties config.

druid.storage.type=hdfs

druid.storage.storageDirectory=hdfs://prodhdfscluster/opt/hadoop/dfs/name/data