Multi-master Hadoop connections: 0.0.0.0:8032 failed on connection exception

Hello,

I’m required to hook Druid 0.9.1.1 up to a multi-master Hadoop cluster on Google’s Cloud infrastructure (https://cloud.google.com/dataproc/).

When the Hadoop/Dataproc cluster is multi-master, ResourceManager failover does not work correctly; the client just cycles through the RMs:

org.apache.hadoop.hdfs.BlockReaderLocal - The short-circuit local reads feature cannot
org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider - Failing over to rm1
org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider - Failing over to rm2
org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider - Failing over to rm0
org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider - Failing over to rm1
org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider - Failing over to rm2
org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider - Failing over to rm0
org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider - Failing over to rm1

2018-10-12T19:24:53,591 ERROR [task-runner-0-priority-0] io.druid.indexing.overlord.ThreadPoolTaskRunner - Exception while running task[HadoopIndexTask{id=index_hadoop_wikiticker_2018-10-12T19:09:52.394Z, type=index_hadoop, dataSource=wikiticker}]

Caused by: java.net.ConnectException: Call From instance-1/10.11.64.6 to 0.0.0.0:8032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
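For context, 0.0.0.0:8032 is Hadoop’s default yarn.resourcemanager.address, so it looks like the client eventually gives up on the HA settings and falls back to the defaults. The HA side of yarn-site.xml looks roughly like this (a sketch with placeholder hostnames, not my exact Dataproc config):

  <property>
    <name>yarn.resourcemanager.ha.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.resourcemanager.ha.rm-ids</name>
    <value>rm0,rm1,rm2</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname.rm0</name>
    <value>cluster-m-0</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname.rm1</name>
    <value>cluster-m-1</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname.rm2</name>
    <value>cluster-m-2</value>
  </property>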

Versions 0.9.1.1 and 0.9.2 have the above issue.

But when I tested Druid versions 0.10.1, 0.11.0, and 0.12.3 in the same scenario, failover to rm2 works immediately after rm1 fails, and the job/task completes successfully.

I’m trying to isolate the problem and see whether it comes down to jars I can swap to make this work for the Druid 0.9.1.1 cluster I’m responsible for implementing.
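For reference, the kind of override I’m looking at is pointing the indexing task at newer Hadoop jars via hadoopDependencyCoordinates (a sketch, not a confirmed fix; the 2.7.1 coordinate is just an example, and the matching jars have to exist under hadoop-dependencies/hadoop-client/ on the indexing nodes):

  {
    "type": "index_hadoop",
    "hadoopDependencyCoordinates": ["org.apache.hadoop:hadoop-client:2.7.1"],
    "spec": { ... }
  }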

Any hints, advice, config changes, jars responsible, etc. would be greatly appreciated.

Thanks,

Paul

Hi Paul,

What version of Hadoop does that cluster have?

0.9.2 was the last Druid version built against Hadoop 2.3.0; starting with 0.10.1, Druid builds against Hadoop 2.7.3 instead. Version incompatibilities between Druid’s Hadoop jars and your cluster could explain what you’re seeing.

If that is the issue and you must use Druid 0.9.1.1 rather than a newer version, you could try replacing the bundled Hadoop 2.3.0 jars with ones that match your cluster’s version, or rebuilding Druid against your Hadoop version.
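As a rough sketch (untested on Dataproc; substitute the Hadoop version your cluster actually runs), the pull-deps tool can fetch a matching hadoop-client into Druid’s hadoop-dependencies directory:

  # run from the Druid distribution root; 2.7.3 is an example version
  java -classpath "lib/*" io.druid.cli.Main tools pull-deps \
      --no-default-hadoop \
      -h "org.apache.hadoop:hadoop-client:2.7.3"

You’d then reference that version from the task’s hadoopDependencyCoordinates. If you go the rebuild route instead, my understanding is it comes down to changing the hadoop.compile.version property in Druid’s pom.xml and repackaging the distribution.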

Thanks,

Jon