Is TPC-DS suitable to test Druid?

Hi,

I would like to run a benchmark test against Druid. There are two test suites, TPC-H and TPC-DS. Which one is more suitable for testing Druid?

Furthermore, I know Druid stores its data in HDFS. What hardware would you recommend for the nodes? If I want to test Druid on 100TB of data, how many nodes should I prepare?

Thanks a lot.

Josh

Hey Josh,

At Imply we are currently working on some query benchmarks for publication, likely using a denormalized version of SSB (https://www.cs.umb.edu/~poneil/StarSchemaB.PDF), which is derived from TPC-H. A couple of interesting points that will come out in our eventual publication:

  1. The importance of vectorization — today (in Druid 0.18) not 100% of the SSB queries can be vectorized (see https://druid.apache.org/docs/latest/querying/query-context.html#vectorization-parameters) but for the ones that can, it makes a huge difference.

  2. The importance of making sure Druid’s built-in time filtering features get activated (see https://druid.apache.org/docs/latest/querying/sql.html#time-filters). In some cases this involves modifying the query slightly; a sketch of both points is shown after this list.
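
If it helps make those two points concrete, here is a minimal sketch of submitting an SSB Q1.1-style query over Druid's SQL HTTP API with vectorization requested and the date filter rewritten against __time. It assumes a denormalized SSB datasource named ssb_denormalized and a Router SQL endpoint at localhost:8888; the datasource and column names are illustrative, not from our actual benchmark setup.

    import json
    import requests

    # Hypothetical endpoint and datasource; adjust for your cluster.
    SQL_URL = "http://localhost:8888/druid/v2/sql"

    # SSB Q1.1-style query on a denormalized table, with the year filter
    # rewritten as literal timestamps on __time so Druid can prune segments.
    query = """
    SELECT SUM(lo_extendedprice * lo_discount) AS revenue
    FROM ssb_denormalized
    WHERE __time >= TIMESTAMP '1993-01-01 00:00:00'
      AND __time <  TIMESTAMP '1994-01-01 00:00:00'
      AND lo_discount BETWEEN 1 AND 3
      AND lo_quantity < 25
    """

    payload = {
        "query": query,
        "context": {
            # "true" vectorizes when possible and falls back otherwise;
            # "force" fails instead of falling back, which is handy when
            # benchmarking because it tells you which queries can't vectorize.
            "vectorize": "true",
        },
    }

    response = requests.post(SQL_URL, json=payload)
    response.raise_for_status()
    print(json.dumps(response.json(), indent=2))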

In terms of scale, I’d suggest loading a small set of data to see what sort of compression you get (i.e. 1B rows = X GB in Druid) and then extrapolating from there to figure out how many nodes you need. For reference, in our experience, 12 billion rows of denormalized SSB data ended up requiring about 3.6TB of Druid storage when replicated. You should consider whether you want to do an in-memory benchmark (in which case you need that much memory) or not (in which case you need that much disk). Either way you should ideally be using SSD disks, such as i3-class instances in EC2, and since these workloads tend to be CPU-bound, you should also make sure you have plenty of CPU cores.
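
To make the extrapolation concrete, here is a back-of-the-envelope sketch using the numbers above. The target row count and per-node disk capacity are placeholders you would swap for your own measurements:

    # Back-of-the-envelope sizing from a small sample load.
    # The sample figures below come from the SSB experience described above
    # (12 billion rows ~= 3.6 TB of replicated Druid storage);
    # replace them with what you measure on your own data.
    sample_rows = 12_000_000_000
    sample_storage_tb = 3.6

    target_rows = 100_000_000_000        # placeholder target dataset size
    per_node_disk_tb = 1.9               # e.g. NVMe on an i3.2xlarge (assumption)

    storage_per_row_tb = sample_storage_tb / sample_rows
    target_storage_tb = target_rows * storage_per_row_tb
    nodes_for_disk = -(-target_storage_tb // per_node_disk_tb)   # ceiling division

    print(f"Estimated Druid storage: {target_storage_tb:.1f} TB")
    print(f"Data nodes needed (disk-bound estimate): {int(nodes_for_disk)}")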

Gian

Hi Josh,

I’m a developer evangelist at Imply and I’m currently running SSB, which is based on TPC-H. The SSB paper https://www.cs.umb.edu/~poneil/StarSchemaB.PDF has a little discussion of why they felt that TPC-H needed to be updated. In these tests, I’m attempting to demonstrate Druid’s suitability for EDW workloads. When I’m finished with SSB, I’m going to benchmark using streaming data.

What are you trying to demonstrate in your environment?

Are you saying you want to read in 100TB of data? You’re going to need quite a cluster to ingest that. Based on what I learned running SSB, you could need 4x that in disk space to create the Druid segments.
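
As a rough illustration (the 4x factor is just what I observed in my SSB runs, not a guarantee; measure on a small sample first):

    # Rough working-disk estimate for building segments from 100 TB of raw data,
    # using the ~4x factor observed with SSB (an assumption, not a rule).
    raw_data_tb = 100
    segment_build_factor = 4
    print(f"Rough disk needed for segment creation: {raw_data_tb * segment_build_factor} TB")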

Thanks,

Matt

Hi Gian

Thank you so much. The information you provided is very useful. I will try to deploy Druid on nodes like those.

Thanks.

Josh

On Friday, May 8, 2020 at 12:57:23 AM UTC+8, Gian Merlino wrote:

Hi Matt,

Actually, our company has just begun doing OLAP computing, so we’d like to run benchmark tests on both Druid and Doris. I have read the SSB PDF; it is a great document.

If it is OK, could you let me know the steps you followed to run SSB, such as how to generate enough data and how to load it into Druid? If this material cannot be shared yet, that is all right.

I need to complete the Druid test in five days, including deploying Druid, generating test data (at three scales: 500GB, 10TB, and 100TB), loading the data, executing queries, etc. In your experience, is five days enough to do all of this? By the way, I have experience with big data computing, but I am new to OLAP and Druid.

Thanks a lot.

Josh

On Friday, May 8, 2020 at 1:07:01 AM UTC+8, Matt Sarrel wrote: