Druid with Spark SQL

Hi there,

I am looking for some examples of how to integrate Druid with Spark SQL. I have analytic jobs running on Spark SQL, but I would like to try using DataFrames with Druid to see if it improves performance.

I am not clear on how to set up a Druid index that will turn Spark SQL DataFrames into Druid datasources.

Any help is really appreciated.



Hi, have you looked at

Thanks for the quick response, Fangjin. Yes, I have looked at the example, but I am not clear on how to do it.

I have sample data in JSON format. I pass that data to Spark SQL and convert it to a DataFrame, using the Java API.

The example shows that, next, I need to create a Druid datasource by creating a temporary table.

I am not sure what values to set for druidDatasource and starSchema when creating the temporary table. Also, do I need to set up Druid indexes before doing anything here?
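For context, the example being discussed here is the spark-druid-olap (Sparkline) project, which registers a Druid-backed temporary table roughly as sketched below. The table name, column names, datasource name, host, and port are placeholders taken loosely from that project's TPC-H example and must match your own environment; druidDatasource must name a datasource that has already been indexed into Druid:

```sql
-- Sketch based on the spark-druid-olap example; all names/values are placeholders.
CREATE TEMPORARY TABLE orderLineItemPartSupplier
USING org.sparklinedata.druid
OPTIONS (
  sourceDataframe "orderLineItemPartSupplierBase",
  timeDimensionColumn "l_shipdate",
  druidDatasource "tpch",
  druidHost "localhost",
  druidPort "8082",
  starSchema '{"factTable": "orderLineItemPartSupplier", "relations": []}'
)
```

Here sourceDataframe points at the raw data already registered in Spark, timeDimensionColumn names Druid's time column, and starSchema describes how the fact table relates to any dimension tables (empty relations for a single flat table).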



Hi Purvi,

  1. Yes, you need to index your dataset. Currently you have to do this outside of Spark, using one of Druid's indexing methods (the indexing service, the Hadoop indexer, etc.).

  2. You need to set up a DruidDataSource in Spark to expose the Druid datasource there.
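For step 1, a batch ingestion spec submitted to the Druid indexing service looks roughly like the sketch below. This is an illustrative outline only: the datasource name, dimensions, interval, and file locations are placeholders you would replace with your own JSON data:

```json
{
  "type": "index",
  "spec": {
    "dataSchema": {
      "dataSource": "myDatasource",
      "parser": {
        "type": "string",
        "parseSpec": {
          "format": "json",
          "timestampSpec": { "column": "timestamp", "format": "auto" },
          "dimensionsSpec": { "dimensions": ["page", "user"] }
        }
      },
      "metricsSpec": [ { "type": "count", "name": "count" } ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "NONE",
        "intervals": ["2016-01-01/2016-02-01"]
      }
    },
    "ioConfig": {
      "type": "index",
      "firehose": { "type": "local", "baseDir": "data/", "filter": "*.json" }
    }
  }
}
```

The spec is POSTed to the overlord's task endpoint; once the task succeeds, the datasource becomes queryable and can then be referenced from Spark in step 2.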

Can you send me a separate email and we can help you get started.



Hi Harish, how are you?

I’m also trying to connect Spark to our current Druid environment, but I’m having trouble starting the thrift server. Is there any more detailed documentation?

As you said, I need to create a DruidDataSource to be able to query structured data that is already loaded into Druid, with Hadoop as deep storage. Any help will be much appreciated!


I tried Sparkline's spark-druid integration for a month and it works well. It makes queries faster because parts of the execution plan, such as filters and sums, are pushed down to Druid.
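To illustrate the push-down described above: against a table registered through the Druid datasource (here hypothetically named orderLineItemPartSupplier, with placeholder column names), a query like the following can have its filter and aggregation rewritten into a native Druid query instead of being scanned row-by-row in Spark:

```sql
-- Filter and SUM are candidates for push-down into a Druid groupBy/timeseries query.
SELECT l_returnflag, SUM(l_extendedprice) AS revenue
FROM orderLineItemPartSupplier
WHERE l_shipdate >= '1995-01-01'
GROUP BY l_returnflag
```

Operations Druid cannot evaluate (e.g. arbitrary joins) still run in Spark, so the speedup depends on how much of the plan the rewrite can move into Druid.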

Hello,
We are doing a POC using Druid. For that, we are looking into how to create a Druid datasource from a Java Spark program, and we also need to set up Druid indexes before querying.

Any help or pointers will be really appreciated.