Histogram aggregator doubts

Hi all,

I have recently started working with histogram aggregators in Druid.

This is my task payload:

```
{
  "task": "index_realtime_rb_location_2016-03-18T09:00:00.000Z_0_0",
  "payload": {
    "id": "index_realtime_rb_location_2016-03-18T09:00:00.000Z_0_0",
    "resource": {
      "availabilityGroup": "rb_location-09-0000",
      "requiredCapacity": 1
    },
    "spec": {
      "dataSchema": {
        "dataSource": "rb_location",
        "parser": {
          "type": "map",
          "parseSpec": {
            "format": "json",
            "timestampSpec": {
              "column": "timestamp",
              "format": "posix",
              "missingValue": null
            },
            "dimensionsSpec": {
              "dimensions": [
                "building_uuid",
                "campus_uuid",
                "client_latlong",
                "deployment_uuid",
                "dot11_status",
                "floor_uuid",
                "market_uuid",
                "namespace_uuid",
                "new",
                "old",
                "organization_uuid",
                "service_provider_uuid",
                "transition",
                "type",
                "zone_uuid"
              ],
              "spatialDimensions": []
            }
          }
        },
        "metricsSpec": [
          {
            "type": "count",
            "name": "events"
          },
          {
            "type": "hyperUnique",
            "name": "clients",
            "fieldName": "client_mac"
          },
          {
            "type": "hyperUnique",
            "name": "sessions",
            "fieldName": "session"
          },
          {
            "type": "approxHistogramFold",
            "name": "hist_dwell",
            "fieldName": "dwell_time",
            "resolution": 10000,
            "numBuckets": 288,
            "lowerLimit": 0,
            "upperLimit": 1440
          }
        ],
        "granularitySpec": {
          "type": "uniform",
          "segmentGranularity": "HOUR",
          "queryGranularity": {
            "type": "duration",
            "duration": 60000,
            "origin": "1970-01-01T00:00:00.000Z"
          },
          "intervals": null
        }
      },
      "ioConfig": {
        "type": "realtime",
        "firehose": {
          "type": "clipped",
          "delegate": {
            "type": "timed",
            "delegate": {
              "type": "receiver",
              "serviceName": "druid:local:firehose:rb_location-09-0000-0000",
              "bufferSize": 100000
            },
            "shutoffTime": "2016-03-18T10:15:00.000Z"
          },
          "interval": "2016-03-18T09:00:00.000Z/2016-03-18T10:00:00.000Z"
        },
        "firehoseV2": null
      },
      "tuningConfig": {
        "type": "realtime",
        "maxRowsInMemory": 60000,
        "intermediatePersistPeriod": "PT20M",
        "windowPeriod": "PT10M",
        "basePersistDirectory": "/tmp/1458288320968-0",
        "versioningPolicy": {
          "type": "intervalStart"
        },
        "rejectionPolicy": {
          "type": "none"
        },
        "maxPendingPersists": 0,
        "shardSpec": {
          "type": "linear",
          "partitionNum": 0
        },
        "indexSpec": {
          "bitmap": {
            "type": "concise"
          },
          "dimensionCompression": null,
          "metricCompression": null
        },
        "persistInHeap": false,
        "ingestOffheap": false,
        "aggregationBufferRatio": 0.5,
        "bufferSize": 134217728
      }
    },
    "context": null,
    "groupId": "index_realtime_rb_location",
    "dataSource": "rb_location"
  }
}
```

This is my ingest aggregator:

```
{
  "type": "approxHistogramFold",
  "name": "hist_dwell",
  "fieldName": "dwell_time",
  "resolution": 10000,
  "numBuckets": 288,
  "lowerLimit": 0,
  "upperLimit": 1440
}
```

I chose these values because I am trying to show a time histogram based on the minutes of one day (upperLimit = 1440), split into 5-minute intervals, so numBuckets = 1440 / 5 = 288. I am not sure whether the resolution value is correct; I have tried different values but I get the same results.
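To spell out the arithmetic behind those parameters, here is a minimal sketch assuming the values from the spec above (the Python is only illustrative, not part of my setup):

```python
# Bucket arithmetic for the ingest-time approxHistogramFold settings above.
lower_limit = 0      # minutes (lowerLimit in the spec)
upper_limit = 1440   # minutes in one day (upperLimit in the spec)
bucket_width = 5     # desired bucket size in minutes

num_buckets = (upper_limit - lower_limit) // bucket_width
print(num_buckets)   # 288, matching "numBuckets": 288 in the aggregator
```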

The problem is that my input data has many different values for the dwell_time dimension, for example:

```
"dwell_time": 219
"dwell_time": 350
"dwell_time": 20
"dwell_time": 2
"dwell_time": 1837
```

but when I query the histogram with this query:

```
{
  "queryType": "groupBy",
  "dataSource": "rb_location",
  "granularity": "all",
  "dimensions": ["campus_uuid"],
  "filter": {
    "type": "and",
    "fields": [
      { "type": "selector", "dimension": "transition", "value": 0 },
      { "type": "selector", "dimension": "type", "value": "campus" }
    ]
  },
  "aggregations": [
    {
      "type": "approxHistogramFold",
      "name": "hist_dwell",
      "fieldName": "hist_dwell"
    }
  ],
  "postAggregations": [
    {
      "type": "customBuckets",
      "name": "histogram_dwell",
      "fieldName": "hist_dwell",
      "breaks": [0,5,10,15,20,25,30,45,50,55,60,90,120,180,240,300,360,420,480,540,600]
    }
  ],
  "intervals": ["2016-03-18T09:00:00/2016-03-18T10:00:00"]
}
```

The result is always between 0 and 30; I can't see any values above that, even though I can see dwell_time values above 30 in my events.

This is the result:

```
[ {
  "version": "v1",
  "timestamp": "2016-03-18T09:00:00.000Z",
  "event": {
    "histogram_dwell": {
      "breaks": [ 0.0, 5.0, 10.0, 15.0, 20.0, 25.0, 30.0, 45.0, 50.0, 55.0, 60.0, 90.0, 120.0, 180.0, 240.0, 300.0, 360.0, 420.0, 480.0, 540.0, 600.0 ],
      "counts": [ 289.0, 137.5, 89.0, 66.5, 37.0, 6.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 ]
    },
    "campus_uuid": "1781710229480876419",
    "hist_dwell": {
      "breaks": [ -3.5, 1.0, 5.5, 10.0, 14.5, 19.0, 23.5, 28.0 ],
      "counts": [ 35.0, 273.375, 118.125, 80.625, 62.875, 41.25, 13.75 ]
    }
  }
} ]
```

I think I should see some counts different from 0.0 after the 30.0 break too, shouldn't I?
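As a rough sanity check (a hypothetical sketch, not something I ran against Druid), bucketing the sample dwell_time values quoted above against the same breaks gives non-zero counts well past the 30.0 break:

```python
import bisect

# The customBuckets breaks from the query above.
breaks = [0, 5, 10, 15, 20, 25, 30, 45, 50, 55, 60, 90, 120, 180, 240, 300, 360, 420, 480, 540, 600]
# The sample dwell_time values quoted earlier in this post.
samples = [219, 350, 20, 2, 1837]

counts = [0] * (len(breaks) - 1)
for v in samples:
    i = bisect.bisect_right(breaks, v) - 1
    if 0 <= i < len(counts):
        counts[i] += 1   # 1837 is above the last break, so it is not counted

print(counts)  # shows non-zero counts in the [180, 240) and [300, 360) buckets, i.e. past 30
```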

I hope that someone can help me :slight_smile:

Regards,

Andres

Approximate histograms require a lot of tuning and I’ll let someone else who knows them better comment on how to use them. FWIW, you might want to take a look at the new approximate histograms and quantiles coming in https://github.com/druid-io/druid/pull/2660

OK, thanks! I'll take a look at the new approximate histograms, and I'll also try some other configurations on my current histograms.

Regards,

Andres