[druid-user] Data sketch rollup gives huge segments

Hello!

Every 5 minutes I receive around 20 million data points with 1 metric and 2 dimensions. I get the data in real time from Kafka. The metric itself is a delta value (a traffic counter delta, for example).
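
For context, the data comes in through a Kafka supervisor; its ioConfig is nothing special and looks roughly like this (topic and broker names are placeholders):

"ioConfig": {
  "type": "kafka",
  "topic": "traffic-deltas",
  "consumerProperties": { "bootstrap.servers": "kafka:9092" },
  "useEarliestOffset": false,
  "taskDuration": "PT1H"
},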

The end goal is to have 1 data point per 6 hours with the ability to calculate p95 on the values, along with min and max.

I have created a datasource (full spec) with queryGranularity set to 6 hours and the following metricsSpec and granularitySpec:
"metricsSpec": [
  {
    "type": "count",
    "name": "count"
  },
  {
    "type": "longMax",
    "name": "max_value",
    "fieldName": "value"
  },
  {
    "type": "longMin",
    "name": "min_value",
    "fieldName": "value"
  },
  {
    "type": "longSum",
    "name": "sum_value",
    "fieldName": "value"
  },
  {
    "type": "quantilesDoublesSketch",
    "name": "qds",
    "fieldName": "value",
    "k": 128
  }
],
"granularitySpec": {
  "type": "uniform",
  "segmentGranularity": "SIX_HOUR",
  "queryGranularity": "SIX_HOUR",
  "rollup": true,
  "intervals": null
},
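
For the p95 part, the idea is to query the rolled-up "qds" column with the quantiles sketch post-aggregator, roughly like this (dataSource name and interval are placeholders):

{
  "queryType": "timeseries",
  "dataSource": "traffic_rollup",
  "granularity": "six_hour",
  "intervals": ["2023-01-01/2023-01-02"],
  "aggregations": [
    { "type": "longMin", "name": "min_value", "fieldName": "min_value" },
    { "type": "longMax", "name": "max_value", "fieldName": "max_value" },
    { "type": "quantilesDoublesSketch", "name": "qds", "fieldName": "qds", "k": 128 }
  ],
  "postAggregations": [
    {
      "type": "quantilesDoublesSketchToQuantile",
      "name": "p95",
      "field": { "type": "fieldAccess", "fieldName": "qds" },
      "fraction": 0.95
    }
  ]
}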

I have a compaction rule set as well that uses the hashed partition type with 1 shard, so that in the end I really get 1 segment with 1 data point per 6 hours.
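
The relevant part of the compaction tuningConfig is roughly this:

"tuningConfig": {
  "partitionsSpec": {
    "type": "hashed",
    "numShards": 1
  }
},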

And in the end I am indeed getting 1 data point per 6 hours, and "count" is equal to 288 as it should be. The only problem is the segment size: it's about 12 GB!

While looking further, I found that the "qds" value for a single event differs a lot in size from the ones found after compaction and rollup. At ingestion time "qds" is a string of 57 characters, but after all the compaction and rollups, for a single row with "count" equal to 288, "qds" is a string of 813 characters!

So I basically have 3 questions:

  1. What am I doing wrong?
  2. How does "qds" actually roll up? I have a feeling that 813 characters instead of 57 indicates that something is wrong.
  3. How can I achieve my end goal of 1 data point per 6 hours with the ability to calculate min, max, and p95 on the supplied metric?

I would also like to know if there is an option to achieve this using real-time ingestion instead of the "reindex from Druid" option.

Thank you!