Spark-batch-indexer task

Hi all,

Currently I’m evaluating Druid for use as an analytics/serving layer in a lambda architecture, and I wanted to see if anyone has an idea of how it can be used with Apache Spark.

I’m particularly interested in whether there is an equivalent of Druid’s Hadoop-based batch indexer for Apache Spark, or how hard it would be to create a Spark-batch-indexer task.

Thanks,

Artur

Hi Artur,

Would you mind sharing a few more details about your plans:

What versions are you planning on running for Spark, Hadoop, and Scala?

Why do you want Spark instead of Hadoop?

Are you planning on using Spark for the initial batch indexing only (e.g., no delta ingestion)?

Cheers,

Charles Allen

Also, what cluster resource manager are you using for Spark? [Spark standalone, YARN, Mesos]

Hi Charles,

The plan is to use Spark on top of Mesos. Spark will be used for streaming and for batch processing.

We basically want to use Spark only, so that we do not have a dependency on Hadoop/HDFS.

Streaming will consume events from Kafka, and the batch jobs will work with files stored in S3.

We will be writing our Spark apps in Scala.

I just want to know how we can make the batch indexer work with Spark as well. Would it be easy to extend Druid to support Spark batch indexing? A rough sketch of the batch side we have in mind follows below.
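To make that concrete, here is a minimal sketch of the kind of Spark batch job we are thinking of: reading event files from S3 on a Mesos-managed cluster, written in Scala. The Mesos master URL, S3 bucket/path, and field layout are hypothetical placeholders, and the job does not produce Druid segments; it only illustrates the Spark side of the pipeline.

import org.apache.spark.{SparkConf, SparkContext}

object S3BatchExample {
  def main(args: Array[String]): Unit = {
    // Mesos master URL is a hypothetical placeholder.
    val conf = new SparkConf()
      .setAppName("s3-batch-example")
      .setMaster("mesos://zk://zk1:2181,zk2:2181/mesos")
    val sc = new SparkContext(conf)

    // Read newline-delimited event files from S3 (bucket and path are made up).
    val events = sc.textFile("s3n://example-bucket/events/2015-10-01/")

    // Toy aggregation: count events per value of the first comma-separated field.
    val counts = events
      .map(line => line.split(",", 2)(0))
      .map(key => (key, 1L))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println)
    sc.stop()
  }
}

Getting the output of a job like this into Druid segments is the part we are unsure about, hence the question about a Spark-based batch indexer.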

Thanks,

Artur

Hi Artur, given the way Spark works, I’m not 100% sure a Spark-based indexer would improve things. I know there are folks who have succeeded with Spark/Druid integration, and you may want to take a look at their work:

https://issues.apache.org/jira/browse/SPARK-11016 is a blocker for Spark-Druid integration.