Are there any plans to integrate Spark SQL DataFrames/Datasets on top of Druid?

This would allow running custom aggregations on a Druid query result set across the cluster in a distributed fashion. For example, exact percentile calculation would become possible, as would any other custom aggregation that is not supported by SQL or Druid.
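As a rough illustration of the kind of thing this would enable, here is a minimal Spark (Scala) sketch. It assumes the Druid query results have already been landed in a DataFrame by some means (the JSON file path and the country/latencyMs column names are made up for the example) and then computes an exact percentile, which Druid itself can only approximate.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

// Minimal sketch: assumes Druid query results have already been exported or
// loaded into Spark somehow; the file path and column names are illustrative only.
val spark = SparkSession.builder().appName("druid-custom-aggregation").getOrCreate()

// Placeholder for whatever mechanism delivers Druid query results to Spark.
val druidResults = spark.read.json("/tmp/druid-query-results.json")

// Exact percentile per group -- the kind of custom aggregation Druid itself
// does not offer -- computed in a distributed fashion by Spark SQL.
druidResults
  .groupBy("country")
  .agg(expr("percentile(latencyMs, 0.95)").as("p95_latency"))
  .show()
```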

I second this. Having the ability to run real-time complex calculations on the data would be immensely useful.

AWS QuickSight just introduced the ability to run queries against EMR, which made my team very interested in making that available through a custom router in our infrastructure, or in running Spark queries atop Druid.

If Spark is not a requirement, there’s also been some work on Calcite and Hive to allow them to run on top of Druid query results. The most recent releases of both do have that functionality.

And finally, Druid does support approximate percentiles, although not exact ones.

Do you mean on top of Druid results or on top of Druid segment files from deep storage? There was some work done on the former at https://github.com/SparklineData/spark-druid-olap, though I’m not sure what the current status is. For the latter you can get an RDD, at least, by using https://github.com/implydata/druid-hadoop-inputformat. You might be able to wrap that in a DataFrame somehow.
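To make the "wrap that in a DataFrame somehow" idea concrete, the sketch below shows the generic newAPIHadoopRDD → Row → createDataFrame pattern. The DruidInputFormat and DruidRowValue class names, their getters, and the column names are placeholders that I have not checked against the druid-hadoop-inputformat code, so treat them as assumptions; only the Spark plumbing around them is standard.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.NullWritable
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

val spark = SparkSession.builder().appName("druid-segments-to-df").getOrCreate()

val hadoopConf = new Configuration()
// ... configure the input format here (datasource, interval, dimensions, etc.) ...

// DruidInputFormat and DruidRowValue are hypothetical stand-ins for whatever
// key/value classes druid-hadoop-inputformat actually exposes.
val rawRdd = spark.sparkContext.newAPIHadoopRDD(
  hadoopConf,
  classOf[DruidInputFormat],
  classOf[NullWritable],
  classOf[DruidRowValue]
)

// Explicit schema for the columns we want to surface in the DataFrame.
val schema = StructType(Seq(
  StructField("timestamp", LongType),
  StructField("country", StringType),
  StructField("latencyMs", LongType)
))

// Project each Druid row into a Spark Row (the getter names are assumptions too).
val rowRdd = rawRdd.map { case (_, value) =>
  Row(value.getLong("timestamp"), value.getString("country"), value.getLong("latencyMs"))
}

val druidDf = spark.createDataFrame(rowRdd, schema)
druidDf.createOrReplaceTempView("druid_segments")
```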

Both approaches work, since both (segments and Druid results) operate on distributed segments. Is the performance of DruidHadoopInputFormat, querying Druid data and returning a Spark RDD, comparable to querying Druid data and returning Druid results?

We have accurate percentile requirements at the moment, GLM coefficient calculation in real time, etc. But none of those are show stoppers.

Is the performance of DruidHadoopInputFormat, querying Druid data and returning a Spark RDD, comparable to querying Druid data and returning Druid results?

Running against Druid query results should be faster, since Druid already has the data files loaded up and the InputFormat would have to pull them from HDFS.

There’s also this one: https://de.hortonworks.com/blog/apache-hive-druid-part-1-3/
(I’m not using it though, so I cannot say if it’s any good. Still holding out until there is, hopefully, OLAP cubing support in Druid itself. Poke poke… :wink:)

Thanks, Sascha.

This is actually the working link: https://hortonworks.com/blog/apache-hive-druid-part-1-3/

Totally agree with you that the cubing part is still missing; that’s why for the moment we are taking the fast route (using AtScale), although it is not a 100% open source solution :(.

Would love to see plans for an open source alternative.

What would “OLAP cubing support in Druid itself” look like to you and how would you want to use it if it existed?

Hi,

I’m testing druid-hadoop-inputformat and have some questions.

Would someone kindly answer this post?

https://groups.google.com/forum/#!topic/druid-user/cNP5_lprajs

Regards,

Jason

On Friday, May 12, 2017 at 2:11:31 AM UTC+9, Gian Merlino wrote: