Proposal: Feature to allow users to run Hadoop jobs on Druid segments

This can be enabled by bringing DruidInputFormat ( https://github.com/himanshug/druid-hadoop-utils/blob/master/druid-mr/src/main/java/com/yahoo/druid/hadoop/DruidInputFormat.java ) from the druid-hadoop-utils project into Druid. DruidInputFormat is intended to be a very simple wrapper around the existing DatasourceInputFormat.

Currently, druid-hadoop-utils depends on druid-indexing-hadoop and druid-server. For users who just want to run Hadoop jobs on Druid segments, this pulls in a lot of unnecessary dependencies. To avoid that, I am going to create a new module “common-hadoop”, move the necessary classes into it from “indexing-hadoop/src/main/java/io/druid/indexer/hadoop/”, and make “indexing-hadoop” depend on “common-hadoop”. We would also need to move some simple code to druid-processing.
This will allow users to run Hadoop jobs on Druid segments by including just druid-common-hadoop.jar (and its dependencies: druid-processing, druid-common, etc.).
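
To make the dependency argument concrete, here is a rough sketch that models only the module relationships named above and computes what each entry point pulls in. The class and method names (ModuleDeps, transitiveDeps) are made up for illustration, and the graphs are deliberately incomplete: they contain just the edges stated in this proposal, not the real Maven dependency trees.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of Druid: models only the edges named in the proposal.
class ModuleDeps {
    // Today: druid-hadoop-utils depends on druid-indexing-hadoop and druid-server.
    static final Map<String, List<String>> CURRENT = Map.of(
        "druid-hadoop-utils", List.of("druid-indexing-hadoop", "druid-server"));

    // Proposed: indexing-hadoop depends on the new common-hadoop module,
    // which itself needs only druid-processing and druid-common.
    static final Map<String, List<String>> PROPOSED = Map.of(
        "druid-indexing-hadoop", List.of("druid-common-hadoop"),
        "druid-common-hadoop", List.of("druid-processing", "druid-common"));

    // Collects every module reachable from root (excluding root itself)
    // with an iterative depth-first walk.
    static Set<String> transitiveDeps(Map<String, List<String>> graph, String root) {
        Set<String> seen = new TreeSet<>();
        Deque<String> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            String module = stack.pop();
            for (String dep : graph.getOrDefault(module, List.of())) {
                if (seen.add(dep)) {
                    stack.push(dep);
                }
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(transitiveDeps(CURRENT, "druid-hadoop-utils"));
        System.out.println(transitiveDeps(PROPOSED, "druid-common-hadoop"));
    }
}
```

In this simplified model, a user starting from druid-common-hadoop never reaches druid-server or druid-indexing-hadoop, which is the point of the split.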

If this sounds OK, I will send the PR soon.

– Himanshu

+1, several people have asked for this.

I meant to post this to the druid-dev group. Moving.