Accuracy Loss


Both the teacher's explanation and the online documentation mention one problem: loss of precision. I don't know much about the internals of Druid. Could someone explain why this happens? Thanks!

Druid is a very fast analytics database, and its accuracy is tunable. Why would we trade accuracy for performance?

Let's look at this in more detail:

  • We want to build a system that can do real-time analytics: ingest data from streams and run queries on the fly.

  • The cost of computing COUNT(*), SUM(Column1), or AVG(MyColumn) is very high because data keeps arriving through real-time streams, e.g. Kafka or Kinesis.

  • We are talking about petabytes of data with billions and billions of records. No limit is set on how much data we can store in one cluster. People have run clusters of thousands of nodes, and it works fine.

  • Imagine counting the number of rows, e.g. SELECT COUNT(*) FROM MYDATA. The system has to scan the entire data source and count every row. Meanwhile, more data keeps arriving from the streams, so the system has to keep counting and the query will NEVER FINISH.

  • Suppose we found a solution for the scenario above. If it answers count() with response time RT using resources R for U users, then RT will increase when we add more users, e.g. U2. So we need to get the best response time we can out of the given resources.

These are the optimization challenges for extremely large-scale real-time analytics systems. Druid offers tunable accuracy when users run data-intensive queries. You can tell the system to use sketches for a faster response, which will be 100% accurate within a dataset of 1000 rows (1000 is the default, but you can change this value). As the dataset gets bigger, the accuracy is around 95%. If you increase the sketch accuracy, the sketch will sample more data and your storage cost will increase.
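To make the trade-off concrete, here is a minimal sketch of a K-minimum-values (KMV) distinct counter in Python. It is not Druid's implementation, only an illustration of the general idea behind this family of sketches; the names `kmv_distinct_count` and `_hash01` and the choice of SHA-256 are assumptions made for this example. The estimator keeps at most `k` hash values: below `k` distinct items the answer is exact, above it the answer is an estimate, and a larger `k` buys accuracy at the cost of memory.

```python
import hashlib
import heapq


def _hash01(value):
    """Map a value to a pseudo-random float in (0, 1] (hypothetical helper)."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return (int.from_bytes(digest[:8], "big") + 1) / 2**64


def kmv_distinct_count(values, k=1000):
    """Estimate the number of distinct values while storing at most k hashes."""
    smallest = set()  # the k smallest distinct hashes seen so far
    heap = []         # the same hashes, negated, so heap[0] is the largest

    for value in values:
        h = _hash01(value)
        if h in smallest:
            continue  # duplicate of a value we already track
        if len(smallest) < k:
            smallest.add(h)
            heapq.heappush(heap, -h)
        elif h < -heap[0]:
            # New hash is smaller than the largest one kept: swap them.
            evicted = -heapq.heapreplace(heap, -h)
            smallest.discard(evicted)
            smallest.add(h)

    if len(smallest) < k:
        return float(len(smallest))  # fewer than k distinct values: exact
    # Otherwise estimate from the k-th smallest hash.
    return (k - 1) / (-heap[0])


if __name__ == "__main__":
    print(kmv_distinct_count(range(500), k=1000))      # small input: exact
    print(kmv_distinct_count(range(50_000), k=1000))   # large input: estimate
```

With `k=1000` the relative error on large inputs is typically a few percent. Raising `k` narrows it, but every sketch then stores more hashes.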

You can also tell Druid within the query not to use sketches. The query will then take more time and resources but will give you a 100% accurate value. Exact queries like this are best suited to cases where the data to process is small. Here are a few examples to help connect the dots:

  • I want to find the top 10 countries using Netflix. (Performs aggregations and doesn't need 100% accuracy.)

  • I want to calculate the top 10 products by highest gross sales. (Use sketches and you will get a response really fast.)

  • I want to calculate the order total for a customer. (Needs accuracy, but the dataset is limited to one customer, so you can run this without sketches.)
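As a rough illustration of how these cases could be sent to Druid, the sketch below builds request bodies for the Druid SQL HTTP API (`/druid/v2/sql`) and switches approximation on or off through the query context. To my knowledge `useApproximateCountDistinct` and `useApproximateTopN` are the relevant context parameters, but check the documentation for your Druid version. The helper `druid_sql_payload` and the table and column names (`views`, `orders`, `user_id`, `customer_id`, `amount`) are hypothetical.

```python
import json


def druid_sql_payload(sql, approximate=True):
    """Build the JSON body for a Druid SQL request (hypothetical helper)."""
    return {
        "query": sql,
        "context": {
            # Sketch-based COUNT(DISTINCT ...) when True, exact when False.
            "useApproximateCountDistinct": approximate,
            # Approximate TopN planning when True, exact GroupBy when False.
            "useApproximateTopN": approximate,
        },
    }


# Top 10 countries: approximate is fine, so keep sketches on.
top_countries = druid_sql_payload(
    "SELECT country, COUNT(DISTINCT user_id) AS viewers "
    "FROM views GROUP BY country ORDER BY viewers DESC LIMIT 10",
    approximate=True,
)

# Order total for one customer: approximation switched off. SUM itself is
# not sketch-based; the narrow filter is what keeps this query cheap.
order_total = druid_sql_payload(
    "SELECT SUM(amount) AS order_total FROM orders "
    "WHERE customer_id = 'C123'",
    approximate=False,
)

if __name__ == "__main__":
    # These bodies would be POSTed as JSON to the SQL endpoint.
    print(json.dumps(top_countries, indent=2))
    print(json.dumps(order_total, indent=2))
```

Keeping the flag in one helper makes it easy to run the same query both ways and compare response time against the difference in results.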

Hope this was helpful.