hadoop 2.0.0-cdh4.4.0

Hi - we currently have a Hadoop (HDFS/MapReduce) cluster on version 2.0.0-cdh4.4.0. Druid out of the box, which uses Hadoop 2.3.0, doesn’t seem to work with our HDFS: I got the Overlord to run a simple batch index task, and it did create the segment, but when it tried to push the segment to HDFS deep storage I saw this error in the task logs:

com.google.protobuf.InvalidProtocolBufferException: Protocol message contained an invalid tag (zero).

Based on a Google search, this seems to indicate a protobuf version incompatibility between client and cluster: hadoop-client 2.3.0 uses protobuf 2.5.0, but Hadoop 2.0.0-cdh4.4.0 uses protobuf 2.4.0a.

So I tried compiling Druid against hadoop-client 2.0.0-cdh4.4.0, but compilation fails with:

[ERROR] /Users/zcox/code/druid/indexing-hadoop/src/main/java/io/druid/indexer/DetermineHashedPartitionsJob.java:[49,45] cannot find symbol

[ERROR] symbol: class CombineTextInputFormat

Indeed, org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat does not appear to exist in 2.0.0-cdh4.4.0.
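A quick way to double-check this against whatever Hadoop jars are on a given classpath is a small reflection probe. This is only a sketch: HadoopClassProbe and isPresent are made-up names for illustration, not part of Druid or Hadoop, and the result depends entirely on which jars are on the classpath when it runs.

```java
public class HadoopClassProbe {
    // Hypothetical helper: returns true if the named class can be loaded
    // from the classpath this probe runs with.
    public static boolean isPresent(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, HadoopClassProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String name = "org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat";
        System.out.println(name + " present: " + isPresent(name));
    }
}
```

Run it with the cluster's hadoop-client jars on the classpath; without any Hadoop jars it will simply report the class as absent.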

Is 2.0.0-cdh4.4.0 just too old a Hadoop version to use with Druid? I saw several other threads on the mailing list where versions very close to this were being used, but those were from 2014.

Just wanted to do a sanity check before stating to the rest of my team that we can’t use Druid with our existing Hadoop cluster.

Thanks,

Zach

Is 2.0.0-cdh4.4.0 based on Apache Hadoop 2.0.0? If so, Hadoop 2.0.0 doesn’t include CombineTextInputFormat (it was introduced in 2.1.0), which would explain the error you’re seeing. So, stock Druid won’t work with that version of Hadoop, but you could build one that does work by replacing the reference to CombineTextInputFormat.class here with something that just throws an exception: https://github.com/druid-io/druid/blob/master/indexing-hadoop/src/main/java/io/druid/indexer/JobHelper.java#L166

Things will still work because by default that code path does not execute (isCombineText defaults to false).
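To illustrate the shape of that change, here is a rough sketch, not the actual Druid code. InputFormatSelector and chooseInputFormat are hypothetical stand-ins, the method returns a class name string so the example compiles without Hadoop on the classpath, and the fallback to TextInputFormat in the non-combine branch is an assumption about what the linked method does.

```java
public class InputFormatSelector {
    // Hypothetical stand-in for the input format selection in the linked JobHelper code.
    // The isCombineText parameter mirrors the setting of the same name, which defaults to false.
    public static String chooseInputFormat(boolean isCombineText) {
        if (isCombineText) {
            // Hadoop 2.0.0 has no CombineTextInputFormat, so fail loudly here
            // instead of referencing a class that does not exist at compile time.
            throw new UnsupportedOperationException(
                "CombineTextInputFormat is not available in this Hadoop version");
        }
        // Assumed default branch: plain text input.
        return "org.apache.hadoop.mapreduce.lib.input.TextInputFormat";
    }

    public static void main(String[] args) {
        // The default path works; only an explicit isCombineText=true would throw.
        System.out.println(chooseInputFormat(false));
    }
}
```

With a change along these lines the build no longer needs the missing class, and only a task that explicitly enables isCombineText would hit the exception.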