Druid and Spark

Hello All,

I have started playing with Druid lately. I am still very much a novice here, so please bear with me if my questions sound funny to you guys :slight_smile:

As I understand it, Druid is useful for OLAP queries where users can slice, dice, and visualize across different dimensions. Also, Druid uses HDFS or a similar technology as deep storage. My question is: can we use Spark, or another friend from the Hadoop ecosystem like Mahout, to read the data directly and run ML or other iterative algorithms? Is there a reader written for Druid segments?


Hi Ankush,
There has been a similar discussion, and a proposal out there to use Spark as an execution engine for Druid queries -


You might want to follow up on that.

I am more interested in running Spark queries over Druid data; that conversation is mostly about using Spark as an execution engine.


Druid does have a Hadoop input format ( io.druid.indexer.hadoop.DatasourceInputFormat ) that could, in theory, serve as the input format for a hadoopRDD, but I haven’t ever actually tried it.
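For anyone who wants to experiment, here is a rough, untested sketch of what wiring that input format into Spark might look like. The configuration property name and the key/value classes are assumptions on my part (the input format is distributed with Druid's indexing code, so check the Druid source for the exact contract before relying on this):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.NullWritable
import org.apache.spark.{SparkConf, SparkContext}

// Untested sketch: read Druid segments into an RDD via Druid's Hadoop
// input format. The value class and the configuration key for the
// datasource ingestion spec are assumptions -- verify against the
// io.druid.indexer.hadoop sources for your Druid version.
object DruidSegmentsRdd {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("druid-segments-rdd"))

    val conf = new Configuration()
    // DatasourceInputFormat reads its datasource spec from the job
    // configuration; the property name and spec shape here are placeholders.
    conf.set("druid.datasource.input.spec",
      """{"dataSource": "wikipedia"}""")

    val rows = sc.newAPIHadoopRDD(
      conf,
      classOf[io.druid.indexer.hadoop.DatasourceInputFormat],
      classOf[NullWritable],                 // assumed key class
      classOf[io.druid.data.input.InputRow]  // assumed value class
    )

    rows.take(5).foreach { case (_, row) => println(row) }
    sc.stop()
  }
}
```

Treat this purely as a starting point; as noted below, anything reading segments this way also has to respect Druid's atomic segment-replacement semantics.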

Druid has a lot of internals to handle the atomic data replacement required by its lambda architecture. As such, any execution over Druid data that does NOT take that into account needs to be very explicit about what it is hoping to accomplish.

Being able to use Druid segments as an RDD or datasources as a DataFrame is on the “cool things to do” list, but on our side we’re taking baby steps, tackling the most pressing pain points first.