My Bachelor's degree thesis: Apache Hive and Apache Druid performance testing for MIND Foods HUB Data Lake

I want to share my Bachelor’s degree thesis work for the “Sicurezza dei Sistemi e delle Reti Informatiche” course, in which I compare the performance of Apache Hive and Apache Druid for the MIND Foods HUB Data Lake.

MIND Foods HUB is an interdisciplinary project to "implement a computational infrastructure to model, engineer and distribute data about plant phenotyping".

To store the phenotyping data and support various scientific analyses, MIND employs an Apache Hive instance, which unfortunately suffers from low maintainability and poor aggregation performance.

My research goal was to test Apache Druid as a more performant alternative to Hive. For this purpose, I performance tested both platforms with Apache JMeter, following an approach similar to the one Imply applied in its price-performance benchmark of Druid against Google BigQuery. However, instead of relying on the Star Schema Benchmark, I preferred to design a workload representative of how the platform is used in the field: I generated a dataset that follows the data model of the MIND Foods HUB table and used the actual SQL queries of the analysis processes.
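To give a rough idea of what this kind of test measures, here is a minimal Python sketch of a query-latency harness. It is not the JMeter setup used in the thesis, only an illustration under stated assumptions: `benchmark`, `fake_run_query`, and the sample query (including its table and column names) are hypothetical, and a real run would replace the stand-in with an actual Hive or Druid client call.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark(run_query: Callable[[str], object],
              queries: Dict[str, str],
              repetitions: int = 5) -> Dict[str, Dict[str, float]]:
    """Run each query `repetitions` times and summarize latency in milliseconds."""
    results = {}
    for name, sql in queries.items():
        latencies = []
        for _ in range(repetitions):
            start = time.perf_counter()
            run_query(sql)  # the call whose latency is being measured
            latencies.append((time.perf_counter() - start) * 1000.0)
        results[name] = {
            "min_ms": min(latencies),
            "median_ms": statistics.median(latencies),
            "max_ms": max(latencies),
        }
    return results


def fake_run_query(sql: str) -> list:
    """Hypothetical stand-in for a real Hive or Druid client call."""
    time.sleep(0.001)  # simulate a short round trip
    return []


if __name__ == "__main__":
    # Placeholder query: table and column names are assumptions, not the real schema.
    workload = {
        "avg_by_sensor": "SELECT sensor, AVG(reading) FROM measurements GROUP BY sensor",
    }
    for name, stats in benchmark(fake_run_query, workload).items():
        print(name, stats)
```

Passing the same `workload` to two different `run_query` functions, one per platform, would give directly comparable numbers for each query.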

If you are interested, here’s the GitHub repository with all the work that I did for the research:

And here you can find the PDF of my thesis:

The thesis discussion is on Tuesday, so wish me luck, guys!

