Hadoop Indexer and non-text batch ingestion (parquet)

Hello,

I am out of ideas.

I made Druid work with Hadoop 2.6, then with Jackson 2.3.5, so I finally have no classpath issues anymore. OK.

Then I use AvroParquetInputFormat, which returns the objects I have stored in the Parquet files on HDFS.

That object is not a Writable, but Druid complains that it must be one, and I get a ClassCastException.

OK, I thought I could work around that by creating an extra InputFormat that wraps this object in a Writable. With that in place, my parser is able to convert it into an InputRow.
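
Concretely, the workaround is roughly something like the sketch below: a small Writable that wraps the Avro record coming out of AvroParquetInputFormat. The class name is made up, and the extra InputFormat simply delegates to AvroParquetInputFormat and hands back each value wrapped in this class; this is only an illustration, not Druid code.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;
import org.apache.hadoop.io.Writable;

// Hypothetical Writable wrapper around the Avro record returned by AvroParquetInputFormat,
// so the value type satisfies code paths that insist on Writable.
public class GenericRecordWritable implements Writable {

  private GenericRecord record;

  public GenericRecordWritable() {}  // no-arg constructor required by Hadoop

  public GenericRecordWritable(GenericRecord record) {
    this.record = record;
  }

  public GenericRecord get() {
    return record;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    // Write the schema (as JSON), then the Avro-binary-encoded record.
    byte[] schemaBytes = record.getSchema().toString().getBytes(StandardCharsets.UTF_8);
    out.writeInt(schemaBytes.length);
    out.write(schemaBytes);

    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(baos, null);
    new GenericDatumWriter<GenericRecord>(record.getSchema()).write(record, encoder);
    encoder.flush();

    byte[] recordBytes = baos.toByteArray();
    out.writeInt(recordBytes.length);
    out.write(recordBytes);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    byte[] schemaBytes = new byte[in.readInt()];
    in.readFully(schemaBytes);
    Schema schema = new Schema.Parser().parse(new String(schemaBytes, StandardCharsets.UTF_8));

    byte[] recordBytes = new byte[in.readInt()];
    in.readFully(recordBytes);
    BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(recordBytes, null);
    record = new GenericDatumReader<GenericRecord>(schema).read(null, decoder);
  }
}
```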

Finally, Druid tries to figure out the segments by doing its sorting magic. But now I get another ClassCastException: it expects my Writable to be a Text.

And here I am lost.

I get the feeling that the whole HadoopIndexTask only works with Text.

Is this true?

If so, one has to write one's own Hadoop indexer. That would probably not be a big deal if it weren't for all the packaging and uploading to Hadoop. :slight_smile:

What is everyone else's experience?

Hi,

Currently, Hadoop ingestion expects the InputFormat to return records of type Text. That will change with
https://github.com/druid-io/druid/pull/1472

Once the above PR is merged, you should be able to use your new InputFormat to ingest Parquet/Avro data.

– Himanshu

This looks good. But is having a BytesWritable still necessary? When I read objects via AvroParquetInputFormat, I don't get a Writable at all. Hm.

BytesWritable is just used as the serialization format to transfer data from mappers to reducers. It puts no restriction on what your InputFormat returns. The only condition is that the value type of the records returned by your InputFormat is understood by the configured InputRowParser, which should be able to create an InputRow from it.
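
For illustration only, the parser side boils down to a mapping like the sketch below from whatever your InputFormat returns (here an Avro GenericRecord) to a Druid InputRow. The class name, the timestamp field, and the assumption that the timestamp is epoch milliseconds are all made up; in a real extension this logic would live inside your InputRowParser implementation.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;

// io.druid.data.input.* comes from druid-api (0.8.x-era package names).
import io.druid.data.input.InputRow;
import io.druid.data.input.MapBasedInputRow;

// Hypothetical helper: builds a Druid InputRow from the value type your InputFormat emits.
public class GenericRecordToInputRow {

  public static InputRow toInputRow(GenericRecord record, String timestampField, List<String> dimensions) {
    Map<String, Object> event = new HashMap<>();
    for (Schema.Field field : record.getSchema().getFields()) {
      Object value = record.get(field.name());
      // Avro strings come back as Utf8; convert to java.lang.String for Druid.
      event.put(field.name(), value instanceof CharSequence ? value.toString() : value);
    }
    // Assumption: the timestamp field holds epoch milliseconds as a numeric value.
    long timestampMillis = ((Number) record.get(timestampField)).longValue();
    return new MapBasedInputRow(timestampMillis, dimensions, event);
  }
}
```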

– Himanshu

I'm a noob with Druid, and I need to run Hadoop batch ingestion (or the indexing service) on Parquet files.
I'm not sure how to do it, even after reading this discussion.

Would it be possible for someone to just show me a command-line run of the Hadoop indexer against Parquet files (with the accompanying spec)?

Briche, Druid doesn’t support Parquet files. It has its own custom column and file format.

Yes, I know Druid can't query Parquet files directly and has its own storage format.
I wanted to know whether it's possible to run the Hadoop indexing task against Parquet files, the same way you can with other formats, such as files with one JSON object per line.

Hi Briche, ah, yeah, Druid doesn't support that out of the box right now, but it should be possible to extend the code to do this.

Hi,

Please see the Avro support in https://github.com/druid-io/druid/pull/1858. You can do the same for Parquet; you would basically have to write a Druid InputRowParser and a Hadoop InputFormat.
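
As a sketch of the InputFormat half (class names are made up, and the parquet-avro package and exact Avro record type depend on the Parquet version you use), something as thin as a delegation to parquet-avro's AvroParquetInputFormat can work, with the parser half being a GenericRecord-to-InputRow mapping like the one sketched above.

```java
import java.io.IOException;
import java.util.List;

import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

import parquet.avro.AvroParquetInputFormat; // org.apache.parquet.avro.* in newer Parquet releases

// Hypothetical thin InputFormat: parquet-avro does the heavy lifting; this class only fixes
// the key type and exposes values as Avro GenericRecords for a matching InputRowParser.
public class ParquetGenericRecordInputFormat extends InputFormat<NullWritable, GenericRecord> {

  // Raw type on purpose: AvroParquetInputFormat's genericity differs between Parquet versions.
  private final InputFormat delegate = new AvroParquetInputFormat();

  @Override
  public List<InputSplit> getSplits(JobContext context) throws IOException, InterruptedException {
    return delegate.getSplits(context);
  }

  @Override
  public RecordReader<NullWritable, GenericRecord> createRecordReader(
      InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
    final RecordReader<?, ?> inner = delegate.createRecordReader(split, context);
    return new RecordReader<NullWritable, GenericRecord>() {
      @Override public void initialize(InputSplit s, TaskAttemptContext c) throws IOException, InterruptedException {
        inner.initialize(s, c);
      }
      @Override public boolean nextKeyValue() throws IOException, InterruptedException {
        return inner.nextKeyValue();
      }
      @Override public NullWritable getCurrentKey() {
        return NullWritable.get();
      }
      @Override public GenericRecord getCurrentValue() throws IOException, InterruptedException {
        // Avro-backed Parquet reads produce records that implement GenericRecord.
        return (GenericRecord) inner.getCurrentValue();
      }
      @Override public float getProgress() throws IOException, InterruptedException {
        return inner.getProgress();
      }
      @Override public void close() throws IOException {
        inner.close();
      }
    };
  }
}
```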

– Himanshu