Best practice for changing the realtime spec file

Hi all,

Currently, if I want to add a new “dataSchema” to the realtime spec file, I first have to kill the realtime node and restart it. But after every restart, the count at the time of the restart is much larger than normal. What is wrong here? Can we change the realtime spec file on the fly, without restarting the realtime node?

Thanks.

I have compared the results returned by the “select” and “timeseries” query types.
The “select” query returns the correct row count (13160), but the “timeseries” sum of count is roughly double (26541 ≈ 2 × 13160).

Can anyone tell me why?

Many thanks.

On Friday, August 28, 2015 at 2:05:16 PM UTC+8, andyt…@gmail.com wrote:

Hi Andy,

If you haven’t read this, it might be interesting: https://groups.google.com/forum/#!searchin/druid-development/fangjin$20yang$20"thoughts"/druid-development/aRMmNHQGdhI/muBGl0Xi_wgJ

What version of Druid is this? What are the queries you are issuing?

Hi Fangjin,

I am using Druid 0.8 and I have read the post you suggested. I will use Tranquility for realtime ingestion in the near future, but currently our production servers run the realtime node with a spec config file. If we want to add a new datasource or add a column, we have to restart the realtime node. As a side effect, during the restart there are one or two minutes of abnormally high counts compared with the normal situation. Can we avoid this with a proper restart procedure?

Thanks.

The select query is:

{
  "queryType": "select",
  "dataSource": "t_info",
  "dimensions": [],
  "metrics": [],
  "granularity": "all",
  "intervals": [
    "2015-08-28T13:44:00/2015-08-28T13:45:00"
  ],
  "filter": {
    "type": "and",
    "fields": [
      {
        "type": "selector",
        "dimension": "type",
        "value": 0
      },
      {
        "type": "javascript",
        "dimension": "channel",
        "function": "function(x) { return(x != 1) }"
      }
    ]
  },
  "pagingSpec": { "pagingIdentifiers": {}, "threshold": 20000 }
}

The timeseries query is:

{
  "queryType": "timeseries",
  "dataSource": "t_info",
  "granularity": {
    "type": "duration",
    "timeZone": "Asia/Shanghai",
    "duration": "60000",
    "origin": "2015-08-28T00:00:00"
  },
  "aggregations": [
    {
      "type": "longSum",
      "name": "count",
      "fieldName": "count"
    }
  ],
  "intervals": [
    "2015-08-28T13:44:00/2015-08-28T13:45:00"
  ],
  "filter": {
    "type": "and",
    "fields": [
      {
        "type": "selector",
        "dimension": "type",
        "value": 0
      },
      {
        "type": "javascript",
        "dimension": "channel",
        "function": "function(x) { return(x != 1) }"
      }
    ]
  }
}
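
To cross-check the two numbers, one option is to run a query-time “count” aggregator next to the longSum: “count” at query time counts rows as stored in Druid, while the longSum over the ingestion-time “count” metric sums the original events, so comparing the two shows whether the extra count comes from duplicate stored rows or from rolled-up rows carrying a count greater than one. A minimal sketch of the aggregations block (the output name “rows” is arbitrary):

  "aggregations": [
    { "type": "longSum", "name": "count", "fieldName": "count" },
    { "type": "count", "name": "rows" }
  ]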

On Tuesday, September 1, 2015 at 2:25:13 AM UTC+8, Fangjin Yang wrote:

The spec file is set up like this:

{
  "dataSchema": {
    "dataSource": "t_info",
    "parser": {
      "type": "string",
      "parseSpec": {
        "format": "json",
        "timestampSpec": {
          "column": "timestamp",
          "format": "auto"
        },
        "dimensionsSpec": {
          "dimensions": ["type", "channel"],
          "dimensionExclusions": [],
          "spatialDimensions": []
        }
      }
    },
    "metricsSpec": [
      {
        "type": "count",
        "name": "count"
      }
    ],
    "granularitySpec": {
      "type": "uniform",
      "segmentGranularity": "HOUR",
      "queryGranularity": "NONE"
    }
  },
  "ioConfig": {
    "type": "realtime",
    "firehose": {
      "type": "kafka-0.8",
      "consumerProps": {
        "zookeeper.connect": "test.test:2181",
        "zookeeper.connection.timeout.ms": "15000",
        "zookeeper.session.timeout.ms": "15000",
        "zookeeper.sync.time.ms": "5000",
        "group.id": "didi_monitor",
        "fetch.message.max.bytes": "1048586",
        "auto.offset.reset": "largest",
        "auto.commit.enable": "false"
      },
      "feed": "t_info"
    },
    "plumber": {
      "type": "realtime"
    }
  },
  "tuningConfig": {
    "type": "realtime",
    "maxRowsInMemory": 500000,
    "intermediatePersistPeriod": "PT1m",
    "windowPeriod": "PT10m",
    "basePersistDirectory": "/data/logs/druid/realtime/basePersist",
    "rejectionPolicy": {
      "type": "messageTime"
    },
    "shardSpec": {
      "type": "linear",
      "partitionNum": 0
    }
  }
}
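
For illustration, the kind of change being discussed, adding a new column, would mean extending the dimensionsSpec (and the metricsSpec, if the new column is a metric). A sketch of the edited fragments, where the dimension “city” and the metric “bytes” are hypothetical names:

  "dimensionsSpec": {
    "dimensions": ["type", "channel", "city"],
    "dimensionExclusions": [],
    "spatialDimensions": []
  },
  ...
  "metricsSpec": [
    { "type": "count", "name": "count" },
    { "type": "longSum", "name": "bytes", "fieldName": "bytes" }
  ]

With a standalone realtime node, an edit like this only takes effect after a restart, which is exactly the situation described above.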

On Tuesday, September 1, 2015 at 10:00:53 AM UTC+8, andyt…@gmail.com wrote:

I see this issue: “IngestSegmentFirehose may not handle overlapping segments properly” (#1678, https://github.com/druid-io/druid/issues/1678).

Maybe during the restart of the realtime node the segments overlap, causing double or triple counting?

On Friday, August 28, 2015 at 2:05:16 PM UTC+8, andyt…@gmail.com wrote:

Hi Fangjin,

If I upgrade Druid from 0.8 to 0.8.1 and use the “kafka 8 simple consumer firehose” (#1482), will that solve this problem?

On Tuesday, September 1, 2015 at 8:27:52 PM UTC+8, andyt…@gmail.com wrote:

Hi Andy,

  1. For the select query you are using ALL granularity, and for the timeseries query something else. Is that the reason you are seeing different row counts? (See the sketch after this list.)

  2. With a standalone realtime node, you can’t change the spec file on the fly. You have to restart the node.

  3. I believe the “kafka 8 simple consumer firehose” (#1482) needs more work and is not really ready just yet.
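
As a sketch of what point 1 suggests checking: run the timeseries query with "granularity": "all" and everything else unchanged, so it returns a single row whose longSum can be compared directly against the select query’s row count:

  {
    "queryType": "timeseries",
    "dataSource": "t_info",
    "granularity": "all",
    "aggregations": [
      { "type": "longSum", "name": "count", "fieldName": "count" }
    ],
    "intervals": ["2015-08-28T13:44:00/2015-08-28T13:45:00"],
    "filter": {
      "type": "and",
      "fields": [
        { "type": "selector", "dimension": "type", "value": 0 },
        { "type": "javascript", "dimension": "channel", "function": "function(x) { return(x != 1) }" }
      ]
    }
  }

If the numbers still disagree after that, granularity is not the cause, and something like the overlapping-segments issue mentioned earlier (#1678) becomes the more likely suspect.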

– Himanshu

PS: Kafka -> Druid ingestion is an area under active development, and we hope to have a better story in the future.