Benchmarking Apache Druid against Presto and Hive

Hello all,

I recently published a paper, "Challenging SQL-on-Hadoop Performance with Apache Druid", whose main goal is to benchmark Druid against Hive and Presto. I also studied and benchmarked the integration between Druid and Hive in single-user and multi-user environments.

My motivation was that we could not find such a comparison in the literature. The available works that benchmark Druid only compare it against MySQL, which is not a typical database for Big Data use cases.

I hope you find the paper interesting and that it helps fill this gap in the literature.

Link: https://www.researchgate.net/publication/333831332_Challenging_SQL-on-Hadoop_Performance_with_Apache_Druid

Springer link: https://link.springer.com/chapter/10.1007/978-3-030-20485-3_12

Please feel free to consult it and to give your feedback.

Best regards,

José Correia

Hi José,

This is a great paper. Any chance you can host it somewhere so we can more broadly share it (without a paywall)?

– FJ

Hi Fangjin,

Thank you so much for your feedback. I really appreciate it.

Before answering your question, I will talk to my colleagues and then I will get back to you.

In the meantime, I will provide the paper to anyone who requests it through ResearchGate.

Best regards,

José Correia

Hi again Fangjin,

I added a public version of the paper, so you can share the ResearchGate link and anyone will be able to read it.

If you share it somewhere else (which I would like), please let me know where.

Thanks.

Best regards,

José Correia

Thanks Jose! We’ll look into promoting this through Druid’s public channels.

Thanks for the paper. I gave it a quick read, so I may have missed this: in the infrastructure section I see that you used a 5-node cluster for Hadoop. What was the cluster configuration for Druid? All I could find was that version 0.10.1 was used. Could you share the cluster configuration you had for Druid?

Hi Karthik,

You are right.

We only mention the Druid version and state that all configurations were left at their defaults. Druid was installed via Ambari.

We did this because it shows that you do not need to be too concerned about Druid's configuration. You can just pick the default configuration to start using Druid and the performance will be good.

The only thing we changed was to enable Druid SQL support.
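For reference, in Druid 0.10.x SQL support is, to the best of my knowledge, switched on by setting `druid.sql.enable=true` in the Broker's runtime properties, after which SQL can be sent as JSON to the Broker's `/druid/v2/sql/` HTTP endpoint. The sketch below is only an illustration of that usage: the broker address and the datasource name `my_datasource` are assumptions and are not taken from the paper.

```python
import json
import urllib.request


def build_sql_request(broker_url, sql):
    """Build a POST request for the Broker's SQL-over-HTTP endpoint."""
    body = json.dumps({"query": sql}).encode("utf-8")
    return urllib.request.Request(
        broker_url.rstrip("/") + "/druid/v2/sql/",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_sql(broker_url, sql):
    """Send the query and return the decoded JSON rows."""
    with urllib.request.urlopen(build_sql_request(broker_url, sql)) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Assumed values: a Broker on its usual port 8082 and a
    # hypothetical datasource called my_datasource.
    rows = run_sql(
        "http://localhost:8082",
        "SELECT COUNT(*) AS cnt FROM my_datasource",
    )
    print(rows)
```

Running the script needs a reachable Broker with SQL enabled; `build_sql_request` on its own can be inspected without any cluster.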

Sorry if you wanted a different answer.

Best regards,

José Correia

Hi, this is a great paper. I have a question: are the Presto worker nodes deployed on the HDFS DataNodes? As far as I know, when Presto workers are deployed on the same machines as the HDFS DataNodes, performance is much better than when they are deployed separately.

On Friday, June 28, 2019 at 1:27:46 AM UTC+8, José Correia wrote:

Hi,

Yes, they are.

100% data locality.

Thanks for the feedback.

Best regards,

José Correia