Working of Druid Histogram Extension

I have a few doubts about the Druid histogram extension. I wrote a Python script (generator.py) to create test data of 1M rows.
Using the same data I created 3 new datasources with varying resolution as follows:

  1. histogram_test_50 - resolution 50
  2. histogram_test_100 - resolution 100
  3. histogram_test_300 - resolution 300
The number of buckets was 7 for all of them.
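For reference, a minimal sketch of what a generator script like the attached generator.py might look like (the timestamp, value range, and output format here are assumptions, not taken from the actual attachment):

```python
import random

def generate_rows(n, seed=42, lo=0.0, hi=1000.0):
    """Yield (timestamp, value) rows of synthetic test data.

    A hypothetical stand-in for the attached generator.py; the
    seed, value range, and fixed timestamp are assumptions.
    """
    rng = random.Random(seed)
    for _ in range(n):
        yield "2024-01-01T00:00:00Z", rng.uniform(lo, hi)

if __name__ == "__main__":
    # Write 1M rows to the path referenced by the ingestion spec.
    with open("/tmp/test_data", "w") as f:
        for ts, v in generate_rows(1_000_000):
            f.write(f"{ts},{v:.4f}\n")
```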

No roll-up was enabled on the data.
I ran a query to get the quantiles 0.1, 0.2, …, 0.9 on the above datasources and compared them with the actual values.
I didn't see any difference between the results returned for the three datasources.
This was contrary to the expectation that accuracy increases with resolution.
I also increased the number of buckets from 7 to 30 (histogram_test_300), but there was no change in the results.
Is there anything that I'm missing here?
I observed that in the post-aggregation output the length of the "breaks" field is a constant (8). Increasing the number of buckets doesn't change its length either. Is there any specific reason for that?
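For comparison, the "actual" quantiles can be computed exactly in a few lines. This is only a sketch of the idea behind the attached calculate_quantile.py (which may use a different interpolation method); it uses a simple rank-based cut on the sorted data:

```python
def exact_quantile(values, q):
    """Exact q-quantile of a dataset by sorting.

    A hypothetical stand-in for calculate_quantile.py, using a
    nearest-rank style index rather than interpolation.
    """
    s = sorted(values)
    idx = min(int(q * len(s)), len(s) - 1)
    return s[idx]

# Toy data instead of the 1M-row test set: values 1..100.
data = list(range(1, 101))
print([exact_quantile(data, q / 10) for q in range(1, 10)])
# -> [11, 21, 31, 41, 51, 61, 71, 81, 91]
```

Comparing these exact values against the approximate quantiles returned by each datasource is what should expose the accuracy difference between resolutions.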

The attached ingestion_spec.json has the configuration I used to ingest data for histogram_test_300. It ingests from /tmp/test_data. For the other datasources, I just varied the resolution and buckets.
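For readers without the attachment, the relevant part of such a spec is the approximate-histogram aggregator in the metricsSpec. A sketch of what it might look like for histogram_test_300 (the metric and column names are assumptions):

```json
{
  "type": "approxHistogram",
  "name": "value_histogram",
  "fieldName": "value",
  "resolution": 300,
  "numBuckets": 7
}
```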

generator.py (584 Bytes)

ingestion_spec.json (1.41 KB)

histogram_test_100_result.json (1.55 KB)

query.json (940 Bytes)

calculate_quantile.py (381 Bytes)

Hi Sharath,

The ApproximateHistogram has significant accuracy issues, which are described here: https://datasketches.github.io/docs/Quantiles/DruidApproxHistogramStudy.html

Can you try the Datasketches quantiles agg instead? http://druid.io/docs/latest/development/extensions-core/datasketches-quantiles.html
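A sketch of what switching over might look like: a quantilesDoublesSketch aggregator at ingestion (or query) time, paired with a quantilesDoublesSketchToQuantiles post-aggregator to pull out the same fractions. The names and the column fieldName below are assumptions:

```json
{
  "aggregations": [
    {
      "type": "quantilesDoublesSketch",
      "name": "value_sketch",
      "fieldName": "value",
      "k": 128
    }
  ],
  "postAggregations": [
    {
      "type": "quantilesDoublesSketchToQuantiles",
      "name": "quantiles",
      "field": { "type": "fieldAccess", "fieldName": "value_sketch" },
      "fractions": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
    }
  ]
}
```

Unlike the ApproximateHistogram, the Datasketches quantiles sketch comes with published error bounds that shrink as the `k` parameter grows.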

Thanks,

Jon

Thanks for the suggestion. It gives reliable results compared to the histogram extension.