Count(distinct) performance

We have many BI solutions struggle with exact count(distinct) on our datasets.
How well Druid’s exact count distinct performs on highly cardinal data?

One of most common use cases would be up to 0.6 billion unique values over tens of billions of records.

We can have multiple exact count distincts over different similar columns in one single SELECT statement.

Thank you!

Ruslan

Hey Ruslan,

The performance of exact count distinct will generally scale with how many unique values you have. For 0.6 billion unique values, properly tuned, the query times will probably be in the 5-60 second range. However, Druid can do approximate counts of 0.6 billion unique values in less than a second. When you have that many values, I’d really consider whether approximations are acceptable: they make a big difference in terms of performance and memory use.

Gian

Hey Gian,

Thank you! That’s very helpful. We will get Druid (and imply.io) a try - hope they happen to be compatible with CDH 5.14 Hive, Hadoop etc versions there.

Ruslan

Getting Druid to run against different Hadoop distros can feel like wrestling a pig, due to potential dependency conflicts, but is generally possible in the end. Check out http://druid.io/docs/latest/operations/other-hadoop.html for some tips. Another ‘ultimate’ tip that seems to work when all else fails is to rebuild Druid against the target version of Hadoop. That almost always works. (Or, equivalently, replace all the Hadoop jars in Druid’s hadoop-dependencies/hadoop-client and extensions/druid-hdfs-storage directories with the equivalent jar versions from your Hadoop vendor.)