Exporting data from Spark to Druid

Hello.

I’m using an Azure HDInsight cluster (Spark) to batch process a large number of log files, and I’m storing the results in Hive tables. For visualization I want to use Druid as the backend. To transfer the data, I’m exporting the Hive data to JSON files in HDFS, downloading those files to the machine running Druid, and then running indexer tasks on the local JSON files. I’m sure there is a more efficient way to move the data between Spark and Druid; does anyone have a better suggestion?
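For reference, the export step I’m using is roughly the following (table, column, and path names are placeholders for my actual ones; depending on the setup, the JsonSerDe may need an ADD JAR for hive-hcatalog-core first):

-- external table whose files become the JSON export in HDFS
CREATE EXTERNAL TABLE logs_json (timecolumn STRING, dimension1 STRING, metric1 BIGINT)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '/export/logs_json';

-- materialize the Hive results as JSON files under /export/logs_json
INSERT OVERWRITE TABLE logs_json
SELECT timecolumn, dimension1, metric1
FROM test_table;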

I don’t want to keep the HDInsight cluster running after the processing, so the data needs to reside on the Druid cluster in the end.

I’ve looked at https://cwiki.apache.org/confluence/display/Hive/Druid+Integration, and it seems that starting with Hive 2.2.0 I will be able to create a Druid-backed Hive table. Will I be able to store my results in such a table so that they end up in Druid?

There’s also the option of having Druid’s index task fetch the JSON files directly from HDFS, but that would mean connecting Druid to the HDInsight Hadoop cluster, which I haven’t done yet. At least it would save me the hassle of transferring the large files. A sketch of what I have in mind follows below.
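If I go that route, I gather the batch ingestion would be a Hadoop index task whose inputSpec points at the HDFS paths, roughly like this (the datasource name, column names, paths, and interval are placeholders, and the exact spec layout may differ between Druid versions):

{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": {
      "dataSource": "datasource_1",
      "parser": {
        "type": "hadoopyString",
        "parseSpec": {
          "format": "json",
          "timestampSpec": {"column": "timecolumn", "format": "auto"},
          "dimensionsSpec": {"dimensions": ["dimension1"]}
        }
      },
      "metricsSpec": [{"type": "longSum", "name": "metric1", "fieldName": "metric1"}],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "NONE",
        "intervals": ["2016-01-01/2017-01-01"]
      }
    },
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {"type": "static", "paths": "hdfs://namenode:8020/export/logs_json/*"}
    },
    "tuningConfig": {"type": "hadoop"}
  }
}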

What do you guys think?

Best regards,

André

Hi André,

The Hive integration should be able to do that; we are working on it precisely to automate the kind of workflow that you are describing.

Basically, you will be able to execute a statement from Hive such as:

CREATE TABLE druid_table_1
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.datasource" = "datasource_1")
AS
SELECT timecolumn, dimension1, metric1
FROM test_table;

This will create a Druid datasource called “datasource_1”. The Hive query will create the Druid segments and register them with Druid for you. Then you can choose whether to query the datasource directly through Druid or to continue using Hive to express your queries on top of the newly created table. A sketch of the setup and a follow-up query is below.
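Before running the CTAS statement, Hive needs to know where your Druid cluster lives; the setup looks roughly like this (the host names are placeholders, and the exact property names may differ between releases):

SET hive.druid.broker.address.default=broker-host:8082;
SET hive.druid.coordinator.address.default=coordinator-host:8081;

-- afterwards, queries on the Hive table are pushed down to Druid where possible
SELECT dimension1, SUM(metric1)
FROM druid_table_1
GROUP BY dimension1;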

Hello Jesús.

Indeed, this should be included in the next Apache Hive release (2.2.0). It is not a stand-alone module, so it will require upgrading to that version.

Hi André,

Were you able to try this out?

We are looking for similar functionality where we can insert from Spark into Druid.

Thanks

Hello Gaurav.

No, I don’t think I did.

André

Thanks, André