Druid: huge difference in response time within the same data source

Hi guys,

I would like to ask for help with our Druid or data configuration. We use Druid to store denormalized raw data and compute cardinality over selected dimensions. The issue is that response times (the time to calculate cardinality) vary widely within the same data source: for one dimension it takes up to 1 second to get the cardinality, while for another it takes more than 30 seconds. Let me describe the environment and the difference in the data.

Environment: 16 nodes (amazon i2.2xlarge)

Dates: 01/01/2015 - 10/31/2016

An important detail: all of this data is taken from a number of tables. We took 40 columns from that set of tables and put them all into one row, so the 40 dimensions correspond to the names of those columns. Since many of the tables are not joined, it is quite common for a single row of raw data to have only a few dimensions filled with values. This is what the data we ingest into Druid looks like:

table1.value1|table1.value2|table1.value3|table1.value4|table1.value5|table1.value6||||||||||||||||||||||||date|user - we have ~90% of such rows

table2.value1|table2.value2|table2.value3|||||||date|user - we have ~10% of such rows

As you can see, the first row contains data only from table1 (6 dimensions), plus date and user (these two are present in every row).

The second row contains data only from table2 (3 dimensions), plus date and user; all other dimensions are empty.

In this way we denormalized the data from a number of tables and keep all of it in the same structure (40 dimensions). As of now we only have data from two tables, but the row structure already contains dimensions for all the tables we plan to use. For those other tables the data is absent, but the dimensions are there.
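To make the row layout concrete, here is a minimal Python sketch of how such a sparse, pipe-delimited row could be assembled. It assumes 40 table dimensions followed by date and user; the function names, the slot offsets and the sample values are hypothetical and only illustrate the shape of the data, not our actual ingestion code.

```python
NUM_DIMENSIONS = 40  # assumption: 40 table columns, with date and user appended after them


def build_row(values, offset, date, user):
    """Place `values` at `offset` among the 40 dimension slots; all other slots stay empty."""
    dims = [""] * NUM_DIMENSIONS
    dims[offset:offset + len(values)] = values
    return "|".join(dims + [date, user])


def filled_dimensions(row):
    """Count how many of the 40 dimension slots actually hold a value."""
    return sum(1 for field in row.split("|")[:NUM_DIMENSIONS] if field)


# Hypothetical sample rows: table1 fills the first 6 slots, table2 the next 3.
table1_row = build_row(["a", "b", "c", "d", "e", "f"], 0, "2015-01-01", "user1")
table2_row = build_row(["x", "y", "z"], 6, "2015-01-01", "user2")

print(filled_dimensions(table1_row))  # 6 of 40 slots filled
print(filled_dimensions(table2_row))  # 3 of 40 slots filled
```

Every row has the same number of fields; only the position of the non-empty values differs between table1 rows and table2 rows.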

The issue: we tried to calculate cardinality and hyperUnique only for rows that have data from table1 (~90% of the data). We set one dimension (table1.value1) in our query and got results in less than a second: 0.895 seconds.

We did the same for the rows related to table2 (~10% of the data): we set one dimension (table2.value1) in our query, and the results took more than 30 seconds: 32.416 seconds.
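For reference, the two queries have roughly the following shape. This is only a sketch built in Python: the data source name `denormalized_events` and the hyperUnique metric name `user_unique` are placeholders, and I am assuming a timeseries query with granularity `all` over the whole date range. Depending on the Druid version, the cardinality aggregator takes its dimension list as `fields` or as `fieldNames`, so please check that against the docs for your release.

```python
import json


def cardinality_query(dimension,
                      data_source="denormalized_events",   # placeholder name
                      interval="2015-01-01/2016-11-01",     # covers 01/01/2015 - 10/31/2016
                      hyper_metric="user_unique"):          # placeholder hyperUnique metric
    """Build a native Druid timeseries query that counts distinct values of one dimension."""
    return {
        "queryType": "timeseries",
        "dataSource": data_source,
        "granularity": "all",
        "intervals": [interval],
        "aggregations": [
            # Cardinality computed at query time over the chosen dimension.
            {"type": "cardinality", "name": "distinct_values", "fields": [dimension]},
            # hyperUnique reads a sketch that was built at ingestion time.
            {"type": "hyperUnique", "name": "unique_users", "fieldName": hyper_metric},
        ],
    }


fast_query = cardinality_query("table1.value1")  # the ~0.9 second case
slow_query = cardinality_query("table2.value1")  # the ~32 second case

print(json.dumps(slow_query, indent=2))
```

The two queries differ only in the dimension name, which is why the difference in response time is surprising to us.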

Could the Druid experts help us understand this behavior and how to make it faster?

Thank you in advance!!!

Hi Andriy, there can be many different explanations for why queries perform differently.

My best suggestion is to post the metrics associated with these two different queries at the broker and historical level to understand where the bottleneck is.

The relevant query metrics are:

http://druid.io/docs/latest/operations/metrics.html

I think it would also be good to make sure no other queries are running in your cluster besides the two queries you want to compare. The per-segment scan times of both queries would be the most interesting. You can reduce your intervals to get the scan times at a per-segment level.
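As a rough illustration of reducing the intervals, the sketch below splits a date range into smaller ISO-8601 intervals that can be queried one at a time, so the emitted scan times (for example `query/segment/time` on the historicals) can be compared between the two dimensions for the same slice of data. The helper name and the chunk size are just assumptions; pick a chunk size that matches your segment granularity.

```python
from datetime import date, timedelta


def split_interval(start, end, days=1):
    """Yield 'start/end' interval strings covering [start, end) in chunks of `days` days."""
    step = timedelta(days=days)
    current = start
    while current < end:
        upper = min(current + step, end)
        yield f"{current.isoformat()}/{upper.isoformat()}"
        current = upper


# Example: the first three days of the data set, one interval per day.
for interval in split_interval(date(2015, 1, 1), date(2015, 1, 4)):
    print(interval)
# 2015-01-01/2015-01-02
# 2015-01-02/2015-01-03
# 2015-01-03/2015-01-04
```

Running both queries against the same small interval, with nothing else hitting the cluster, should make it easier to see which segments account for the difference.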

Hi Fangjin,

Thank you for the advice. We will work closely with the Druid metrics to understand the root cause of this behavior.

Thank you,

Andriy