I have a few questions about the Druid histogram extension. I wrote a Python script (generator.py) to create test data of 1M rows.
Using the same data I created 3 new datasources with varying resolution as follows:
- histogram_test_50 - resolution 50
- histogram_test_100 - resolution 100
- histogram_test_300 - resolution 300
Number of buckets was 7 for all of them.
No roll up was enabled on the data.
I ran a query to get the quantiles for 0.1, 0.2, …, 0.9 on the above datasources and compared them with the actual values.
I didn’t see any difference between the results returned for the three datasources.
This was contrary to the expectation that accuracy increases with resolution.
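For reference, the comparison can be sketched roughly as below. This is a minimal illustration and not the attached calculate_quantile.py: the function names are hypothetical, and the linear-interpolation definition of a quantile is an assumption (other definitions give slightly different exact values).

```python
import math

# Probabilities 0.1, 0.2, ..., 0.9, matching the quantiles requested in the query.
PROBABILITIES = [i / 10 for i in range(1, 10)]


def exact_quantiles(values, probabilities):
    """Exact quantiles of the raw values, using linear interpolation
    between the two nearest sorted values (an assumed definition)."""
    ordered = sorted(values)
    n = len(ordered)
    result = []
    for p in probabilities:
        pos = p * (n - 1)            # fractional index into the sorted data
        lo = math.floor(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        result.append(ordered[lo] + frac * (ordered[hi] - ordered[lo]))
    return result


def relative_errors(approx, exact):
    """Relative error per quantile; falls back to absolute error when
    the exact value is zero."""
    return [
        abs(a - e) / abs(e) if e != 0 else abs(a - e)
        for a, e in zip(approx, exact)
    ]
```

Running `relative_errors` once per datasource, with the quantiles returned by Druid as `approx`, would make any difference between resolutions 50, 100 and 300 directly visible.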
I also increased the number of buckets from 7 to 30 (for histogram_test_300), but there was no change in the results.
Is there anything I’m missing here?
I also observed that, in the post-aggregation output, the length of the “breaks” field is constant (8). Increasing the number of buckets doesn’t change its length either. Is there a specific reason for that?
The ingestion_spec.json has the configuration I used to ingest data for histogram_test_300. It ingests from /tmp/test_data. For the other datasources, I just varied the resolution and the number of buckets.
generator.py (584 Bytes)
ingestion_spec.json (1.41 KB)
histogram_test_100_result.json (1.55 KB)
query.json (940 Bytes)
calculate_quantile.py (381 Bytes)