Using a bytes-based parser with Hadoop indexing

Hi,
I’m using my own parser implementation for my custom protobuf messages. It is modeled on the protobuf parser, so it implements ByteBufferInputRowParser and accepts ByteBuffer messages. I use it successfully on a real-time node pulling from Kafka, where the Kafka firehose passes ByteBuffers to it.
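For context, the parser looks roughly like the sketch below (heavily simplified; MyProtos.Event and its fields are made up for this example, and the Jackson annotations used to register the parser under a “type” name are left out):

import java.nio.ByteBuffer;
import java.util.Map;

import com.google.common.collect.ImmutableList;
import com.google.common.collect.ImmutableMap;
import com.google.protobuf.ByteString;
import com.google.protobuf.InvalidProtocolBufferException;

import io.druid.data.input.ByteBufferInputRowParser;
import io.druid.data.input.InputRow;
import io.druid.data.input.MapBasedInputRow;
import io.druid.data.input.impl.ParseSpec;

public class MyProtobufInputRowParser implements ByteBufferInputRowParser
{
  private final ParseSpec parseSpec;

  public MyProtobufInputRowParser(ParseSpec parseSpec)
  {
    this.parseSpec = parseSpec;
  }

  @Override
  public InputRow parse(ByteBuffer input)
  {
    try {
      // Decode the protobuf payload handed over by the Kafka firehose.
      MyProtos.Event event = MyProtos.Event.parseFrom(ByteString.copyFrom(input));

      // Flatten the message into a timestamp plus a dimension/metric map.
      Map<String, Object> row = ImmutableMap.<String, Object>of(
          "host", event.getHost(),
          "bytesSent", event.getBytesSent()
      );
      return new MapBasedInputRow(event.getTimestampMillis(), ImmutableList.of("host"), row);
    }
    catch (InvalidProtocolBufferException e) {
      throw new RuntimeException("could not decode protobuf message", e);
    }
  }

  @Override
  public ParseSpec getParseSpec()
  {
    return parseSpec;
  }

  @Override
  public ByteBufferInputRowParser withParseSpec(ParseSpec parseSpec)
  {
    return new MyProtobufInputRowParser(parseSpec);
  }
}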

But now I want to use the same parser for Hadoop indexing. As far as I can see from the class/method the indexer uses to parse inputs, io.druid.indexer.HadoopDruidIndexerMapper#parseInputRow(Writable value, InputRowParser parser),

the mapper passes a Writable to the parser. Of course, Writable and ByteBuffer are incompatible, so how can I use the same parser? Do I need to have my parser implement a separate interface, say BytesWritableInputRowParser, so that it accepts both BytesWritable and ByteBuffer?
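Something like the following is what I have in mind, purely hypothetical, extending the sketch above with an extra overload for the Hadoop value type:

import java.nio.ByteBuffer;

import org.apache.hadoop.io.BytesWritable;

import io.druid.data.input.InputRow;
import io.druid.data.input.impl.ParseSpec;

public class MyProtobufInputRowParserWithWritable extends MyProtobufInputRowParser
{
  public MyProtobufInputRowParserWithWritable(ParseSpec parseSpec)
  {
    super(parseSpec);
  }

  // Extra overload accepting the Hadoop value type; BytesWritable's backing
  // array may be padded, hence the getLength() limit on the wrap.
  public InputRow parse(BytesWritable value)
  {
    return parse(ByteBuffer.wrap(value.getBytes(), 0, value.getLength()));
  }
}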

And how can you use the io.druid.data.input.ProtoBufInputRowParser that ships with Druid for Hadoop indexing? As I see it, you cannot.

Am I right, or am I missing something here?

Thanks for your comments!

Krzysiek

Hi,

There were a few bugs in past versions that wouldn’t allow you to use arbitrary parsers in Hadoop-based ingestion. Please give druid-0.8.2-rc1 a try.
In general, you configure a Hadoop InputFormat (TextInputFormat is the default) and an InputRowParser. For every record returned by the InputFormat, the “value” is passed to the configured InputRowParser, which should then be able to parse it into an InputRow.
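So the configured parser just has to accept whatever value type your InputFormat emits. For example, if your InputFormat produces BytesWritable values (say, a SequenceFile of serialized protobuf messages), one way to reuse your existing ByteBuffer-based parser is a thin wrapper that delegates to it, roughly like the sketch below. This is untested, the names are illustrative, and the exact InputRowParser interface may differ slightly between versions:

import java.nio.ByteBuffer;

import org.apache.hadoop.io.BytesWritable;

import io.druid.data.input.ByteBufferInputRowParser;
import io.druid.data.input.InputRow;
import io.druid.data.input.impl.InputRowParser;
import io.druid.data.input.impl.ParseSpec;

public class BytesWritableInputRowParser implements InputRowParser<BytesWritable>
{
  private final ByteBufferInputRowParser delegate;

  public BytesWritableInputRowParser(ByteBufferInputRowParser delegate)
  {
    this.delegate = delegate;
  }

  @Override
  public InputRow parse(BytesWritable value)
  {
    // Wrap only the valid portion of the backing array; BytesWritable pads it.
    return delegate.parse(ByteBuffer.wrap(value.getBytes(), 0, value.getLength()));
  }

  @Override
  public ParseSpec getParseSpec()
  {
    return delegate.getParseSpec();
  }

  @Override
  public InputRowParser withParseSpec(ParseSpec parseSpec)
  {
    return new BytesWritableInputRowParser(delegate.withParseSpec(parseSpec));
  }
}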

– Himanshu