I would like to ask for help with our Druid or data configuration. We use Druid to store denormalized raw data and to count cardinality based on selected dimensions. The issue is that response times (the time to calculate cardinality) vary widely within the same data source: for one dimension it takes up to 1 second to get the cardinality, while for another it takes more than 30 seconds. Here are the environment details and the differences in the data.
Environment: 16 nodes (Amazon i2.2xlarge)
Dates: 01/01/2015 - 10/31/2016
Important: all of this data comes from a number of tables. We took 40 columns from that set of tables and put them all into one row, so the 40 dimensions correspond to the names of those columns. Because many of the tables are not joined, it is quite common for a single row of raw data to have only a few dimensions filled with values. This is what the data we ingest into Druid looks like:
table1.value1|table1.value2|table1.value3|table1.value4|table1.value5|table1.value6||||||||||||||||||||||||date|user - we have ~90% of such rows
table2.value1|table2.value2|table2.value3|||||||date|user - we have ~10% of such rows
As you can see, the first row contains data only from table1 (6 dimensions), plus date and user (these two are present in every row).
The second row contains data only from table2 (3 dimensions), plus date and user; all other dimensions are empty.
In this way we denormalized the data from a number of tables and keep all of it in the same structure (40 dimensions). As of now we have data from only two tables, but the row structure already contains dimensions for all the tables we plan to use. For those other tables the data is absent, but the dimensions are there.
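To make the row layout concrete, here is a minimal Python sketch of how such a sparse row could be assembled. It uses a shortened schema rather than the full 40 dimensions, and the table sizes, the placeholder `table3`, the sample values, and the helper `build_row` are all hypothetical names chosen for illustration, not our actual pipeline.

```python
# Hypothetical, shortened schema: the real one has 40 dimensions.
TABLE_COLUMNS = {
    "table1": 6,
    "table2": 3,
    "table3": 4,  # placeholder for a planned table that has no data yet
}

# Every row carries every dimension, followed by date and user.
SCHEMA = [
    f"{table}.value{i}"
    for table, count in TABLE_COLUMNS.items()
    for i in range(1, count + 1)
]
SCHEMA += ["date", "user"]


def build_row(table, values, date, user):
    """Return one pipe-delimited row with only `table`'s dimensions filled."""
    row = dict.fromkeys(SCHEMA, "")  # start with every dimension empty
    for i, value in enumerate(values, start=1):
        row[f"{table}.value{i}"] = value
    row["date"] = date
    row["user"] = user
    return "|".join(row[name] for name in SCHEMA)


# A table2 row: the table1 and table3 dimensions stay empty.
print(build_row("table2", ["a", "b", "c"], "2015-01-01", "user42"))
```

Each row has the same number of fields regardless of which table it came from, so most fields in any given row are empty.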
The issue: we tried to calculate cardinality and hyperUnique only for rows that have data from table1 (~90% of the data). We set one dimension (table1.value1) in our query and got results in less than a second: 0.895 seconds.
We did the same for the rows related to table2 (~10% of the data): we set one dimension (table2.value1) in our query, and it took more than 30 seconds to get results: 32.416 seconds.
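For anyone who wants to reproduce the comparison, a query of the kind described above might look like the following Python sketch, which builds the JSON body of a Druid native timeseries query with a `cardinality` aggregator. This is an assumption about the query shape, not a copy of the exact queries: the data source name `denormalized_events`, the output name `distinct_values`, the absence of a filter, and the helper `cardinality_query` are hypothetical. The aggregator key `fields` follows recent Druid documentation; older releases may expect `fieldNames` instead. A `hyperUnique` variant is not shown because it depends on a metric defined at ingestion time.

```python
import json


def cardinality_query(dimension, data_source="denormalized_events"):
    """Build a Druid timeseries query counting distinct values of one dimension."""
    return {
        "queryType": "timeseries",
        "dataSource": data_source,  # hypothetical data source name
        "granularity": "all",
        # 01/01/2015 - 10/31/2016, written as an ISO-8601 interval (end exclusive)
        "intervals": ["2015-01-01/2016-11-01"],
        "aggregations": [
            {
                "type": "cardinality",
                "name": "distinct_values",  # hypothetical output name
                "fields": [dimension],
                "byRow": False,
            }
        ],
    }


# The two queries differ only in the dimension being counted.
for dim in ("table1.value1", "table2.value1"):
    print(json.dumps(cardinality_query(dim), indent=2))
```

The printed JSON can be posted to the broker's query endpoint; timing both requests the same way keeps the comparison fair.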
Could Druid experts help us understand this behavior and how to make it faster?
Thank you in advance!!!