difference between interval in GranularitySpec and interval in Query

Hi All,

Today with help from response i got from this group we were able to load and index batch data and also were able to retrieve it thru query. However i am bit confused with couple of things.

  1. Interval ( GranularitySpec) :
  • Is this a where clause while loading data into druid, meaning does it bring data only between that dates/time stamps mentioned in the interval?
  • Or is it segment identifier ? meaning if i have 90 rows of data and i mention interval as 2015-01-01/2015-01-31,then appx 3 rows of data gets distributed into each day of the interval and stored as a segment and each date in the interval acts as an identifier to the segment.
  • Interval ( Query)
  • Does the range mentioned in interval acts as where clause against the timestamp or
  • where clause against segment dates.

Please let me know. if the questions seems too confusing i can send the examples of my index json and query.

Karteek

Heya,

  1. Yes, you can say it like this. So if you have an “hourly” interval, then you will create “hourly” segments, with only containing the data for each hour

  2. Yep, its a range. First it finds segments matching this range, and then within each segment it filters rows by that range.

ok i have couple more question on query output i got. in need to understand the significance of version and timestamp in query output.

before that i want to list out the steps we did for data ingestion and query output.

Data:

3/7/2015^20^1366318122374311707 ^2216341272195416704 ^0^N^1514^Y^USD^Unbranded^87.9900^1^1^1.8300^0.0000^0.0000^0.0000^0.0000^0.0000^1.8300^0^0.483^0^0^0^87.99

3/27/2015^13^1936640405267163739 ^2293377933990536992 ^0^N^20032^Y^USD^Branded^15.0000^1^1^0.5300^0.0000^0.0000^0.0000^0.0000^0.0000^0.5300^0^0.2771^0^0^0^15

Indexing: ( i am putting in only granularitySpec and not the whole index.json file)

granularitySpec": {

“type”: “uniform”,

“segmentGranularity”: “DAY”,

“queryGranularity”: “DAY”,

“intervals”: [

“2015-01-01/2015-04-01”

]

}

Segment_id from druid-segment table

QBRData_2015-03-07T00:00:00.000Z_2015-03-08T00:00:00.000Z_2015-07-10T22:50:50.890Z

QBRData_2015-03-27T00:00:00.000Z_2015-03-28T00:00:00.000Z_2015-07-10T22:50:50.890Z

Query:

{

“queryType”: “groupBy”,

“dataSource”: “QBRData”,

“granularity”: “all”,

“dimensions”: [

“transaction_reference_date”,

“time_of_day”,

“rcvr_id”,

“sndr_id”,

“pkey”,

“is_pmt_xbdr_y_n”,

“pmt_flow_key”,

“is_pmt_official_y_n”,

“transaction_currency”,

“Transaction_Type”

],

“aggregations”: [

{

“type”: “doubleSum”,

“fieldName”: “NTPV_USD_AMT”,

“name”: “NTPV_USD_AMT”

},

{

“type”: “doubleSum”,

“fieldName”: “NET_CNT”,

“name”: “NET_CNT”

},

{

“type”: “doubleSum”,

“fieldName”: “SUCCESS_CNT”,

“name”: “SUCCESS_CNT”

}

],

“intervals”: [

“2015-03-01/2015-04-01”

]

}

Output:

[ {

“version” : “v1”,

“timestamp” : “2015-03-01T00:00:00.000Z”,

“event” : {

“transaction_currency” : “USD”,

“rcvr_id” : "1936640405267163739 ",

“Transaction_Type” : “Branded”,

“SUCCESS_CNT” : 1.0,

“pkey” : “0”,

“NET_CNT” : 1.0,

“time_of_day” : “13”,

“sndr_id” : "2293377933990536992 ",

“NTPV_USD_AMT” : 15.0,

“is_pmt_official_y_n” : “Y”,

“pmt_flow_key” : “20032”,

“is_pmt_xbdr_y_n” : “N”,

“transaction_reference_date” : “3/27/2015”

}

}, {

“version” : “v1”,

“timestamp” : “2015-03-01T00:00:00.000Z”,

“event” : {

“transaction_currency” : “USD”,

“rcvr_id” : "1366318122374311707 ",

“Transaction_Type” : “Unbranded”,

“SUCCESS_CNT” : 1.0,

“pkey” : “0”,

“NET_CNT” : 1.0,

“time_of_day” : “20”,

“sndr_id” : "2216341272195416704 ",

“NTPV_USD_AMT” : 87.98999786376953,

“is_pmt_official_y_n” : “Y”,

“pmt_flow_key” : “1514”,

“is_pmt_xbdr_y_n” : “N”,

“transaction_reference_date” : “3/7/2015”

}

} ]

Now the question i have is

  1. “version” : “v1” : what does this version stand for.
  2. Why is time stamp always the 1st date in query interval., in this case

Karteek

Hi Karteek, comments inline.

ok i have couple more question on query output i got. in need to understand the significance of version and timestamp in query output.

before that i want to list out the steps we did for data ingestion and query output.

Data:

3/7/2015^20^1366318122374311707 ^2216341272195416704 ^0^N^1514^Y^USD^Unbranded^87.9900^1^1^1.8300^0.0000^0.0000^0.0000^0.0000^0.0000^1.8300^0^0.483^0^0^0^87.99

3/27/2015^13^1936640405267163739 ^2293377933990536992 ^0^N^20032^Y^USD^Branded^15.0000^1^1^0.5300^0.0000^0.0000^0.0000^0.0000^0.0000^0.5300^0^0.2771^0^0^0^15

Indexing: ( i am putting in only granularitySpec and not the whole index.json file)

granularitySpec": {

“type”: “uniform”,

“segmentGranularity”: “DAY”,

“queryGranularity”: “DAY”,

“intervals”: [

“2015-01-01/2015-04-01”

]

}

The minimum granularity you will be able to get results back will be daily results. Is this what you want or do you want more fine grained queries?

Segment_id from druid-segment table

QBRData_2015-03-07T00:00:00.000Z_2015-03-08T00:00:00.000Z_2015-07-10T22:50:50.890Z

QBRData_2015-03-27T00:00:00.000Z_2015-03-28T00:00:00.000Z_2015-07-10T22:50:50.890Z

Query:

{

“queryType”: “groupBy”,

“dataSource”: “QBRData”,

“granularity”: “all”,

“dimensions”: [

“transaction_reference_date”,

“time_of_day”,

“rcvr_id”,

“sndr_id”,

“pkey”,

“is_pmt_xbdr_y_n”,

“pmt_flow_key”,

“is_pmt_official_y_n”,

“transaction_currency”,

“Transaction_Type”

],

“aggregations”: [

{

“type”: “doubleSum”,

“fieldName”: “NTPV_USD_AMT”,

“name”: “NTPV_USD_AMT”

},

{

“type”: “doubleSum”,

“fieldName”: “NET_CNT”,

“name”: “NET_CNT”

},

{

“type”: “doubleSum”,

“fieldName”: “SUCCESS_CNT”,

“name”: “SUCCESS_CNT”

}

],

“intervals”: [

“2015-03-01/2015-04-01”

]

}

Output:

[ {

“version” : “v1”,

“timestamp” : “2015-03-01T00:00:00.000Z”,

“event” : {

“transaction_currency” : “USD”,

“rcvr_id” : "1936640405267163739 ",

“Transaction_Type” : “Branded”,

“SUCCESS_CNT” : 1.0,

“pkey” : “0”,

“NET_CNT” : 1.0,

“time_of_day” : “13”,

“sndr_id” : "2293377933990536992 ",

“NTPV_USD_AMT” : 15.0,

“is_pmt_official_y_n” : “Y”,

“pmt_flow_key” : “20032”,

“is_pmt_xbdr_y_n” : “N”,

“transaction_reference_date” : “3/27/2015”

}

}, {

“version” : “v1”,

“timestamp” : “2015-03-01T00:00:00.000Z”,

“event” : {

“transaction_currency” : “USD”,

“rcvr_id” : "1366318122374311707 ",

“Transaction_Type” : “Unbranded”,

“SUCCESS_CNT” : 1.0,

“pkey” : “0”,

“NET_CNT” : 1.0,

“time_of_day” : “20”,

“sndr_id” : "2216341272195416704 ",

“NTPV_USD_AMT” : 87.98999786376953,

“is_pmt_official_y_n” : “Y”,

“pmt_flow_key” : “1514”,

“is_pmt_xbdr_y_n” : “N”,

“transaction_reference_date” : “3/7/2015”

}

} ]

Now the question i have is

  1. “version” : “v1” : what does this version stand for.

You don’t need to worry about the version as a user. It is something in place to indicate the format of the results.

  1. Why is time stamp always the 1st date in query interval., in this case

When you issue a query with granularity “all”, all results are bucketed together and the timestamp is a reflection of where the data begins.