Druid and Spark

Hello All,

I have started playing with Druid lately. I am still very much a novice here, so please bear with me if my questions sound funny to you guys :slight_smile:

As I understand, Druid is useful for OLAP queries where a user can slice, dice, and visualize data across different dimensions. Druid also uses HDFS or a similar technology as deep storage. My question is: can we use Spark, or another friend from the Hadoop ecosystem like Mahout, to read that data directly and run ML or other iterative algorithms? Is there any reader written for Druid segments?

Thanks
Ankush

Hi Ankush,
There has been a similar discussion, and a proposal to use Spark as an execution engine for Druid queries:

https://groups.google.com/forum/#!topic/druid-development/ULdKYZeven4

You might want to follow up on that.

I am more interested in running Spark queries over Druid data, whereas that conversation is mostly about using Spark as an execution engine.

Thanks
Ankush

Druid does have a Hadoop input format ( io.druid.indexer.hadoop.DatasourceInputFormat ) that could, in theory, serve as the input format for a hadoopRDD, but I haven’t ever actually tried it.
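To make the idea concrete, wiring that input format into Spark might look roughly like the sketch below. This is untested and comes with assumptions: that DatasourceInputFormat yields NullWritable keys and Druid InputRow values, and that the list of segments to read must be supplied through the Hadoop Configuration (the exact configuration keys are internal to Druid's indexing code, so check the DatasourceInputFormat source for your Druid version before relying on any of this).

```scala
// Rough, untested sketch: exposing Druid segments as a Spark RDD via
// Druid's Hadoop InputFormat. Key/value types are an assumption --
// verify against the DatasourceInputFormat signature in your version.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.NullWritable
import org.apache.spark.{SparkConf, SparkContext}
import io.druid.data.input.InputRow
import io.druid.indexer.hadoop.DatasourceInputFormat

object DruidSegmentsToRdd {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("druid-segments-rdd"))

    // DatasourceInputFormat needs to know which segments to read;
    // Druid's own indexing jobs populate this Configuration. Those
    // keys are internal, so they are deliberately left unset here.
    val hadoopConf = new Configuration()

    val rows = sc.newAPIHadoopRDD(
      hadoopConf,
      classOf[DatasourceInputFormat],
      classOf[NullWritable],
      classOf[InputRow])

    // Each value is a Druid InputRow; pull out whatever dimensions you
    // need for downstream ML ("page" here is just a placeholder name).
    rows.map { case (_, row) => row.getDimension("page") }
        .take(10)
        .foreach(println)
  }
}
```

The appeal of this route is that Spark would read the segment files straight out of deep storage, bypassing the Druid query layer entirely.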

Druid has a lot of internals to handle the atomic data replacement required by its lambda architecture. As such, any execution over Druid data that does NOT take that into account needs to be very explicit about what it is hoping to accomplish.

Being able to use Druid segments as an RDD, or datasources as a DataFrame, is on the “cool things to do” list, but on our side we’re taking baby steps, tackling the most applicable pain points first.