Send batch data from external spark cluster to druid cluster

Hi, I want to ask a question about sending batch data to Apache Druid. I have a Druid cluster (Druid on EMR, with HDFS etc.) and an external Spark cluster. Druid can ingest batch data via MapReduce, but only using its own resources and deep storage (YARN, HDFS, etc.).
However, I want to send analyzed batch data from the external Spark cluster to Druid, and use Druid only for real-time ingestion and queries. Does Druid support that? If it does not, I would like to implement it. How can I solve this problem?
Example flow:
Spark Cluster (Spark only) => <Batch Data> => Druid Cluster (with HDFS, YARN)

Where does the data produced by the Spark application reside?

You mentioned EMR, so I’m assuming you’re working with AWS.

If that’s the case, you can write the processed data from Spark to S3, and then load it into Druid (Druid can use S3 as its deep storage).
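As a rough illustration of that route, a native batch ingestion spec reading Spark output from S3 could look something like the sketch below. All names (bucket, prefix, datasource, columns) are placeholders, and the exact fields depend on your Druid version, so treat this as a shape to check against the docs rather than a working spec:

```json
{
  "type": "index_parallel",
  "spec": {
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "s3",
        "prefixes": ["s3://my-bucket/spark-output/"]
      },
      "inputFormat": { "type": "json" }
    },
    "dataSchema": {
      "dataSource": "my_datasource",
      "timestampSpec": { "column": "timestamp", "format": "iso" },
      "dimensionsSpec": { "dimensions": ["dim1", "dim2"] },
      "granularitySpec": { "segmentGranularity": "day", "queryGranularity": "hour" }
    },
    "tuningConfig": { "type": "index_parallel" }
  }
}
```

Submitting a spec like this to the Overlord (e.g. via `POST /druid/indexer/v1/task`) would load the S3 files into a Druid datasource.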


Hi Itai,

Thanks for your answer. Yes, I can write the data to S3, but then I have to trigger a Hadoop ingestion spec to load the data from S3 into a datasource, right? I do not want to trigger an extra ingestion spec after writing the data. Can I write data from Spark to S3 in Druid's segment format? If so, does Druid automatically recognize these segments (and create the datasource automatically) without my triggering an ingestion spec?

On Tue, Feb 11, 2020, at 15:43, itai yaffe wrote:

Yes, if you write the data to S3, you’ll have to trigger a Hadoop ingestion task to load the data into your data source.
There’s a third-party project (druid-spark-batch) for loading data from Spark into Druid rather than using Hadoop MapReduce, but it’s outdated AFAIK.

As for directly writing data from Spark to S3 in a Druid format that Druid will automatically recognize - unfortunately, I’m not familiar with such an option (I guess there might be ways to achieve it, but they won’t be straightforward).

Having said that, if your processed data (the data written by Spark) is already aggregated, the Hadoop ingestion task should be quite fast.

We solved this by building direct writers from Spark to Druid (e.g. write segment files directly from Spark, update the segment metadata in MySQL from Spark, etc.). You can adapt the druid-spark-batch code Itai shared above to do something similar as well.
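To make the "update the segment metadata from Spark" step a bit more concrete, here is a rough Python sketch of the kind of segment-descriptor JSON Druid keeps in its metadata store (the `druid_segments` table). Every name and field below is illustrative; the real payload layout is defined by Druid's `DataSegment` class, so verify it against your Druid version before building on this:

```python
import json


def make_segment_descriptor(datasource, start, end, version, bucket, key, size_bytes):
    """Build a segment-descriptor dict roughly matching the shape of Druid's
    DataSegment JSON. Field names are assumptions based on the general layout
    of Druid's segment metadata payload; check them against your Druid version.
    """
    identifier = f"{datasource}_{start}_{end}_{version}"
    return {
        "dataSource": datasource,
        # Druid intervals are "start/end" in ISO-8601
        "interval": f"{start}/{end}",
        "version": version,
        # loadSpec tells historicals where to fetch the segment from deep storage
        "loadSpec": {"type": "s3_zip", "bucket": bucket, "key": key},
        "dimensions": "dim1,dim2",
        "metrics": "count",
        "shardSpec": {"type": "none"},
        "binaryVersion": 9,
        "size": size_bytes,
        "identifier": identifier,
    }


descriptor = make_segment_descriptor(
    "my_datasource",
    "2020-02-01T00:00:00.000Z",
    "2020-02-02T00:00:00.000Z",
    "2020-02-11T12:00:00.000Z",
    "my-bucket",
    "druid/segments/my_datasource/index.zip",
    123456,
)
# Serialized, this is roughly what would go into the payload column of druid_segments
payload = json.dumps(descriptor)
```

A custom Spark writer would push the segment files to deep storage first, then insert a row like this into the metadata store so the Coordinator picks the segment up.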

I have looked at the druid-spark-batch code from Metamarkets (metamx). It is a good example of this. I wanted to find out whether any existing extensions do this, and as far as I understand, there are none. I will implement a segment writer for Spark that sends data directly to Druid's deep storage (in segment format) and performs the other related operations, such as updating the metadata store, without triggering any extra ingestion spec.

On Tue, Feb 11, 2020, at 21:02, ‘Julian Jaffe’ via Druid User wrote:

Any chance there are materials you can share on what you did?

Most of the code is pretty heavily customized to our use case (I made some fairly extensive modifications to Druid ShardSpecs and segment identifiers to support some of our use cases), but a lot of the code that would be generally applicable is just updates to the druid-spark-batch code to pull out the actual segment-writing logic and metadata updates. It isn’t the most performant code, so I parallelized the index building and the like, but even then there will be tension between the size and number of partitions Druid wants to operate with and the size and number of output partitions Spark wants to work with.

If you’re building your own Druid distribution, I’d also highly recommend adding an OnHeapMemorySegmentWriteOutMediumFactory corresponding to OnHeapMemorySegmentWriteOutMedium (just follow the same pattern as the existing OffHeapMemorySegmentWriteOutMediumFactory) and using that in Spark to make resource allocation easier on your cluster management. Sorry for the vagueness here, and hopefully I’ll have something shareable in the next month or so!

Thanks for the details!
Now that I think of it, some of this work has been described in a meetup your company held, am I right?

Also, looks like there’s interest from the community to have a functioning Spark batch ingestion (and not just a Hadoop batch ingestion), so I wonder if that’s something we can all work together to contribute to Druid.

Since you’re using AWS, have you considered Spark -> Amazon Kinesis (or Apache Kafka) -> Druid? Druid has built-in capabilities to ingest in real time from Kinesis/Kafka. Your Spark jobs could then be either batch- or streaming-based, with your data published to Kinesis/Kafka.
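For that route, a Kafka supervisor spec would let Druid ingest the Spark output continuously. A sketch is below; the topic, broker, datasource, and column names are placeholders, and the exact fields (e.g. whether `inputFormat` is supported) vary by Druid version, so check it against the Kafka ingestion docs for your release:

```json
{
  "type": "kafka",
  "dataSchema": {
    "dataSource": "my_datasource",
    "timestampSpec": { "column": "timestamp", "format": "iso" },
    "dimensionsSpec": { "dimensions": ["dim1", "dim2"] },
    "granularitySpec": { "segmentGranularity": "hour", "queryGranularity": "minute" }
  },
  "ioConfig": {
    "topic": "spark-output-topic",
    "inputFormat": { "type": "json" },
    "consumerProperties": { "bootstrap.servers": "kafka-broker:9092" }
  },
  "tuningConfig": { "type": "kafka" }
}
```

Submitted to the Overlord's supervisor endpoint, this keeps ingesting whatever the Spark jobs publish to the topic, with no per-batch ingestion spec to trigger.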

For those following this thread who are interested in the option of using Spark batch to ingest data into Druid: we started a parallel conversation in Druid’s Slack channel - you’re welcome to join!

This conversation was moved to Druid’s dev mailing list, see