Managing epochs in Druid (basic data types in Druid)

Hi,

One of the fields we want to include in our data is an epoch time. However, when retrieving values for that field, we get incorrect results. For instance, from the epoch 1432912477 we get back 1432912512.

Although the “longSum” aggregator returns 64-bit integers, it seems that the “basic” data types (i.e. the data types of the original data) do not include longs.

Has anyone worked with epochs and encountered similar problems?

What are the “basic” data types in Druid? String, float and integer? String and float only?

Thanks in advance,

Miquel

Hi Miquel,
Druid supports String for dimensions, and Float or Long for aggregators, depending on which aggregator is specified during ingestion.
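
For example (a minimal sketch, with made-up column names), the aggregator you declare in the metricsSpec at ingestion time determines whether the metric column is stored as a long or as a double:

"metricsSpec": [
  { "type": "longSum",   "name": "bytes_total",   "fieldName": "bytes" },
  { "type": "doubleSum", "name": "latency_total", "fieldName": "latency" }
]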

Can you provide more details on which version of Druid you are using and how you are ingesting and retrieving the data?

Hi Nishant,

Thank you for your reply. Currently, we are using Druid version 0.7.3. We are ingesting data through CSV files (we are still testing if Druid is suitable for our needs). The data looks like this:

2014,12,30,0,REST,CUS,BCEN,2014-12-30T00:00:00Z,b3909acf0af5509b247b23d1,a3909acf0af5509b247b23d3,33909acf0af5509b247b23d2,23909acf0af5509b247b23d1,Android,4.1,1432912477,1

2014,12,29,0,REST,CUS,BCEN,2014-12-29T00:00:00Z,b3909acf0af5509b247b23d1,a3909acf0af5509b247b23d3,33909acf0af5509b247b23d2,23909acf0af5509b247b23d2,Android,4.1,1432912477,2

2014,12,30,0,REST,CUS,BCEN,2014-12-30T00:00:00Z,b3909acf0af5509b247b23d1,a3909acf0af5509b247b23d3,33909acf0af5509b247b23d2,23909acf0af5509b247b23d3,Android,4.1,1432912477,3

2014,12,30,0,REST,CUS,BCEN,2014-12-30T00:00:00Z,b3909acf0af5509b247b23d1,a3909acf0af5509b247b23d3,33909acf0af5509b247b23d2,23909acf0af5509b247b23d4,iOS,8.1,1432912477,4

2014,12,28,0,REST,CUS,BCEN,2014-12-28T00:00:00Z,b3909acf0af5509b247b23d1,a3909acf0af5509b247b23d3,33909acf0af5509b247b23d2,23909acf0af5509b247b23d5,iOS,7.3,1432912477,5

The ingestion is done via the indexing task, using the following JSON:

{
  "type" : "index",
  "spec" : {
    "dataSchema" : {
      "dataSource" : "events",
      "parser" : {
        "type" : "string",
        "parseSpec" : {
          "format" : "csv",
          "timestampSpec" : {
            "column" : "timestamp",
            "format" : "auto"
          },
          "columns": ["year", "month", "day", "hour", "source", "type", "code", "timestamp", "clientid", "applicationid", "campaignid", "venueid", "state", "userid", "ostype", "osversion", "epoch", "num"],
          "dimensionsSpec" : {
            "dimensions": ["source", "type", "code", "clientid", "applicationid", "campaignid", "venueid", "state", "userid", "ostype", "osversion"],
            "dimensionExclusions" : [],
            "spatialDimensions" : []
          }
        }
      },
      "metricsSpec": [
        {
          "type": "count",
          "name": "count"
        },
        {
          "type": "longSum",
          "name": "mepoch",
          "fieldName": "epoch"
        },
        {
          "type": "longSum",
          "name": "mnum",
          "fieldName": "num"
        },
        {
          "type": "max",
          "name": "maxepoch",
          "fieldName": "epoch"
        },
        {
          "type": "min",
          "name": "minepoch",
          "fieldName": "epoch"
        }
      ],
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "DAY",
        "queryGranularity" : "NONE",
        "intervals" : [ "2013-01-01/2016-01-01" ]
      }
    },
    "ioConfig" : {
      "type" : "index",
      "firehose" : {
        "type" : "local",
        "baseDir" : "PoC/",
        "filter" : "events_data.csv"
      }
    },
    "tuningConfig" : {
      "type" : "index",
      "targetPartitionSize" : 0,
      "rowFlushBoundary" : 0
    }
  }
}

And the query we issue is:

{
  "queryType": "groupBy",
  "dataSource": "events",
  "granularity": "day",
  "dimensions": ["userid"],
  "aggregations": [
    { "type": "count", "fieldName": "userid", "name": "user" },
    { "type": "longSum", "fieldName": "mnum", "name": "sum" },
    { "type": "min", "fieldName": "mepoch", "name": "minnum" },
    { "type": "max", "fieldName": "mepoch", "name": "maxnum" },
    { "type": "min", "fieldName": "maxepoch", "name": "minnumepoch" },
    { "type": "max", "fieldName": "minepoch", "name": "maxnumepoch" },
    { "type": "count", "fieldName": "epoch", "name": "countepoch" }
  ],
  "postAggregations": [ {
    "type" : "arithmetic",
    "name" : "dwell_time_user",
    "fn" : "-",
    "fields" : [
      { "type" : "fieldAccess", "fieldName" : "maxnum" },
      { "type" : "fieldAccess", "fieldName" : "minnum" }
    ]
  } ],
  "intervals": [ "2010-01-01/2020-01-01" ]
}

Which gives results such as:

{
  "version" : "v1",
  "timestamp" : "2014-12-29T00:00:00.000Z",
  "event" : {
    "minnumepoch" : 1.432912512E9,
    "dwell_time_user" : 0.0,
    "maxnum" : 1.432912512E9,
    "minnum" : 1.432912512E9,
    "userid" : "23909acf0af5509b247b23d2",
    "sum" : 2,
    "countepoch" : 1,
    "maxnumepoch" : 1.432912512E9,
    "user" : 1
  }
}

Here, for instance, you can see that for that day and userid the retrieved epoch value (1432912512) differs from the value in the data (1432912477).

Miquel

On Wednesday, June 3, 2015 at 13:06:44 UTC+2, Nishant Bangarwa wrote:

I think this is happening because the max/min aggregators use the “double” type during aggregation. We have introduced longMax/longMin aggregators to correct this, and they will be available in 0.8.0.

If you are just experimenting, you can build Druid from master and see if using the longMax/longMin aggregators gives you the correct behavior.
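
For example, on 0.8.0 (or a build from master) the epoch metrics in your ingestion spec could be declared with the long variants instead of "max"/"min"; a rough sketch, reusing the metric names from your spec:

"metricsSpec": [
  { "type": "count",   "name": "count" },
  { "type": "longSum", "name": "mnum",     "fieldName": "num" },
  { "type": "longMax", "name": "maxepoch", "fieldName": "epoch" },
  { "type": "longMin", "name": "minepoch", "fieldName": "epoch" }
]

and then queried with the matching types, e.g. { "type": "longMin", "fieldName": "minepoch", "name": "minnumepoch" } and { "type": "longMax", "fieldName": "maxepoch", "name": "maxnumepoch" }, so the epoch values stay 64-bit longs end to end instead of passing through floating point.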

– Himanshu