Doubt in forming Query

Hi,
I am learning Druid. I have ingested data into Druid using Kafka and now want to query it.

This is my spec file.

[
  {
    "dataSchema" : {
      "dataSource" : "test",
      "parser" : {
        "type" : "string",
        "parseSpec" : {
          "format" : "json",
          "timestampSpec" : {
            "column" : "timestamp",
            "format" : "auto"
          },
          "dimensionsSpec" : {
            "dimensions" : ["query"],
            "dimensionExclusions" : [],
            "spatialDimensions" : []
          }
        }
      },
      "metricsSpec" : [{
        "type" : "count",
        "name" : "count"
      }],
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "DAY",
        "queryGranularity" : "NONE"
      }
    },
    "ioConfig" : {
      "type" : "realtime",
      "firehose" : {
        "type" : "kafka-0.8",
        "consumerProps" : {
          "zookeeper.connect" : "localhost:2181",
          "zookeeper.connection.timeout.ms" : "15000",
          "zookeeper.session.timeout.ms" : "15000",
          "zookeeper.sync.time.ms" : "5000",
          "group.id" : "druid-example",
          "fetch.message.max.bytes" : "1048586",
          "auto.offset.reset" : "largest",
          "auto.commit.enable" : "false"
        },
        "feed" : "test"
      },
      "plumber" : {
        "type" : "realtime"
      }
    },
    "tuningConfig" : {
      "type" : "realtime",
      "maxRowsInMemory" : 500000,
      "intermediatePersistPeriod" : "PT10m",
      "windowPeriod" : "PT10m",
      "basePersistDirectory" : "/tmp/realtime/basePersist",
      "rejectionPolicy" : {
        "type" : "messageTime"
      }
    }
  }
]

And this is the format in which I have inserted the data:

{"timestamp": "2014-01-07T10:44:58Z", "query": "Gypsy Danger"}
{"timestamp": "2014-03-07T10:45:58Z", "query": "gold"}
{"timestamp": "2014-09-07T10:43:58Z", "query": "gold"}
{"timestamp": "2015-02-07T10:46:58Z", "query": "Gypsy Danger"}
{"timestamp": "2015-04-07T10:44:48Z", "query": "silver"}
{"timestamp": "2015-11-07T10:44:38Z", "query": "Gypsy Danger"}

I want to write a query that returns data in the following manner: when I pass a timestamp interval (for example, "2014-01-01T00:00/2015-12-31T00"), it should return the aggregated queries sorted by their volume, something like this:

{
  "query" : "Gypsy Danger",
  "volume" : 3
},
{
  "query" : "gold",
  "volume" : 2
},
{
  "query" : "silver",
  "volume" : 1
}

This is the query I formed, but it does not return the results as intended.

{
  "queryType" : "topN",
  "dataSource" : "test",
  "granularity" : "day",
  "dimension" : "query",
  "metric" : "volume",
  "threshold" : 100,
  "aggregations" : [
    { "type" : "longSum", "fieldName" : "count", "name" : "volume" }
  ],
  "intervals" : ["2014-01-01T00:00/2015-12-31T00"]
}

What changes should I make?

Thanks

What is the SQL query you are trying to write?

You may want to take a look at: https://github.com/implydata/plyql

I am trying something like this:

SELECT query, SUM(count) AS volume
FROM table
GROUP BY query;

The data looks like this:

{"timestamp": "2014-01-07T10:44:58Z", "query": "Gypsy Danger"}
{"timestamp": "2014-03-07T10:45:58Z", "query": "gold"}

Ritesh,
Can you let us know what the output of the query you fired was?

You also said it was not the same as expected, so do let us know the expected output as well.

Thanks

Rohit

Hi Rohit,

My query looks like this:

{
  "queryType" : "groupBy",
  "dataSource" : "test",
  "granularity" : { "type" : "period", "period" : "P10Y", "timeZone" : "America/Los_Angeles" },
  "dimensions" : ["query"],
  "limitSpec" : { "type" : "default", "limit" : 10, "columns" : ["query"] },
  "aggregations" : [
    { "type" : "longSum", "fieldName" : "count", "name" : "volume" }
  ],
  "intervals" : ["2014-01-01T00:00:00.000/2015-12-31T00:00:00.000"]
}

Problems:

  1. The results do not come back sorted. I want to sort them by volume, but that does not happen. I included this in my query: "orderBy" : {"type" : "default", "limit" : 10, "columns" : [{"dimension" : "volume", "direction" : "descending"}]}, but it has no effect. I also tried adding "orderByColumnSpecs" : {"dimension" : "volume", "direction" : "descending"}. Again, no effect. I am adding these lines just after the aggregations.
  2. I am not sure what granularity to pass, as these queries will be dynamic: the time range can differ from five years to a few months or even a day. For now I have fixed the granularity at ten years, which I don't feel is the right way to do this.

For your reference, my output looks like this:

[ {
  "version" : "v1",
  "timestamp" : "2010-01-01T00:00:00.000-08:00",
  "event" : { "query" : "Gypsy Danger", "volume" : 3 }
}, {
  "version" : "v1",
  "timestamp" : "2010-01-01T00:00:00.000-08:00",
  "event" : { "query" : "class rings", "volume" : 4 }
}, {
  "version" : "v1",
  "timestamp" : "2010-01-01T00:00:00.000-08:00",
  "event" : { "query" : "gold", "volume" : 2 }
}, {
  "version" : "v1",
  "timestamp" : "2010-01-01T00:00:00.000-08:00",
  "event" : { "query" : "medic alert bracelets", "volume" : 4 }
} ]

You can find my spec file as well as the input messages in the previous mail.

Hi Ritesh,

If you don't just want to issue SQL queries directly with https://github.com/implydata/plyql, here are a few thoughts.

Hi Rohit,

My query looks like this:

{
  "queryType" : "groupBy",
  "dataSource" : "test",
  "granularity" : { "type" : "period", "period" : "P10Y", "timeZone" : "America/Los_Angeles" },
  "dimensions" : ["query"],
  "limitSpec" : { "type" : "default", "limit" : 10, "columns" : ["query"] },
  "aggregations" : [
    { "type" : "longSum", "fieldName" : "count", "name" : "volume" }
  ],
  "intervals" : ["2014-01-01T00:00:00.000/2015-12-31T00:00:00.000"]
}

This should really be a topN query. GroupBy is needlessly expensive here: http://druid.io/docs/latest/operations/recommendations.html

Problems :

  1. The results do not come back sorted. I want to sort them by volume, but that does not happen. I included this in my query: "orderBy" : {"type" : "default", "limit" : 10, "columns" : [{"dimension" : "volume", "direction" : "descending"}]}, but it has no effect. I also tried adding "orderByColumnSpecs" : {"dimension" : "volume", "direction" : "descending"}. Again, no effect. I am adding these lines just after the aggregations.

Where did you get this query syntax from? "orderBy" is not a supported key in groupBy queries:

http://druid.io/docs/latest/querying/groupbyquery.html

You should be using the limitSpec to sort results.
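As a sketch against the groupBy query in this thread, a limitSpec that sorts descending by the aggregated volume would look like this; the point is that "columns" accepts OrderByColumnSpec objects, not just bare dimension names:

```json
"limitSpec" : {
  "type" : "default",
  "limit" : 10,
  "columns" : [
    { "dimension" : "volume", "direction" : "descending" }
  ]
}
```

This replaces the existing "columns" : ["query"] entry, which sorts lexicographically by the query dimension rather than by volume.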

  2. I am not sure what granularity to pass, as these queries will be dynamic: the time range can differ from five years to a few months or even a day. For now I have fixed the granularity at ten years, which I don't feel is the right way to do this.

Druid supports arbitrary granularities. You can just change them.
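In particular, when the goal is a single aggregated result over whatever interval the user passes in, the "all" granularity collapses the entire interval into one bucket, so there is no need to guess a period that covers the time range. A sketch combining that with a sorting limitSpec (query names and values taken from this thread):

```json
{
  "queryType" : "groupBy",
  "dataSource" : "test",
  "granularity" : "all",
  "dimensions" : ["query"],
  "limitSpec" : {
    "type" : "default",
    "limit" : 10,
    "columns" : [{ "dimension" : "volume", "direction" : "descending" }]
  },
  "aggregations" : [
    { "type" : "longSum", "fieldName" : "count", "name" : "volume" }
  ],
  "intervals" : ["2014-01-01T00:00:00.000/2015-12-31T00:00:00.000"]
}
```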

Hi,

Thanks Fangjin.

I changed my query to a topN. It now looks like this:

{
  "queryType" : "topN",
  "dataSource" : "test",
  "granularity" : "day",
  "dimension" : "query",
  "metric" : { "type" : "numeric", "metric" : "volume" },
  "threshold" : 10,
  "aggregations" : [
    { "type" : "longSum", "fieldName" : "count", "name" : "volume" }
  ],
  "intervals" : ["2014-01-01T00:00/2015-12-31T00"]
}
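For anyone issuing this kind of query programmatically, here is a minimal Python sketch that builds the topN query from this thread and posts it to a broker. The helper names (`build_topn_query`, `post_query`) are hypothetical, and the broker URL assumes the common default port 8082; it also uses "all" granularity so the whole interval collapses into one sorted result set.

```python
import json
import urllib.request

def build_topn_query(datasource, dimension, metric_name, interval, threshold=10):
    """Build a Druid topN query that longSums the ingested 'count'
    metric into metric_name and ranks descending by it."""
    return {
        "queryType": "topN",
        "dataSource": datasource,
        "granularity": "all",  # one bucket spanning the whole interval
        "dimension": dimension,
        "metric": {"type": "numeric", "metric": metric_name},
        "threshold": threshold,
        "aggregations": [
            {"type": "longSum", "fieldName": "count", "name": metric_name}
        ],
        "intervals": [interval],
    }

def post_query(query, broker_url="http://localhost:8082/druid/v2/"):
    """POST the query JSON to the broker and return the parsed response."""
    req = urllib.request.Request(
        broker_url,
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

query = build_topn_query("test", "query", "volume",
                         "2014-01-01T00:00/2015-12-31T00")
print(json.dumps(query, indent=2))
```

The "numeric" metric spec sorts descending by the named aggregation, which is exactly the "sorted by volume" behavior asked for at the top of the thread.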

Thanks for the help.