Questions about realtime nodes

Hi,
I have some questions about realtime nodes:

1. When will the segments of a realtime node be uploaded to deep storage?

Here is my json config:

"granularitySpec" : {
  "type": "uniform",
  "segmentGranularity": "FIVE_MINUTE",
  "queryGranularity": "MINUTE"
}
},

"tuningConfig": {
  "type" : "realtime",
  "maxRowsInMemory": 50000,
  "intermediatePersistPeriod": "PT5m",
  "windowPeriod": "PT5m",
  "basePersistDirectory": "/data/realtime/basePersist",
  "rejectionPolicy": {
    "type": "messageTime"
  },
  "shardSpec": {
    "type": "linear",
    "partitionNum": 0
  }
}

With the "messageTime" rejectionPolicy, I think the segment is persisted to disk every 5 minutes (intermediatePersistPeriod), and then, after segmentGranularity + windowPeriod, the segment on disk is uploaded to deep storage and deleted from the local disk. Is that right?

If a segment covers the interval 2015-06-28T00:00:00Z to 2015-06-28T00:05:00Z and the current time is 2015-06-28T01:00:00Z, that segment should have been uploaded to deep storage already, right?

But I haven't seen it uploaded to deep storage, and I have no idea why…

2. Realtime node redundancy.

If I want to make the realtime nodes redundant, should I use two different consumer groups?

For example, realtime nodes A1, B1, C1 are in consumer group 1 and are assigned different partitionNums (0, 1, 2),

and another set of realtime nodes A2, B2, C2 are in consumer group 2 for redundancy, also assigned partitionNums (0, 1, 2).

If realtime node A1 crashes, the broker will get the results from A2.

But we can't guarantee that A2's data is the same as A1's, because Kafka distributes messages across partitions randomly.

So how can I make the realtime nodes redundant?

Some thoughts inline.

Hi,
I have some questions about realtime nodes:

1. When will the segments of a realtime node be uploaded to deep storage?

The configuration for which deep storage to use is in your common configuration. You should include it as part of the classpath of your realtime node.

Here is my json config:

"granularitySpec" : {
  "type": "uniform",
  "segmentGranularity": "FIVE_MINUTE",
  "queryGranularity": "MINUTE"
}
},

“tuningConfig”: {

  "type" : "realtime",
  "maxRowsInMemory": 50000,
  "intermediatePersistPeriod": "PT5m",
  "windowPeriod": "PT5m",
  "basePersistDirectory": "\/data\/realtime\/basePersist",
  "rejectionPolicy": {
    "type": "messageTime"
  },
  "shardSpec": {
    "type": "linear",
    "partitionNum": 0
  }
}

With the "messageTime" rejectionPolicy, I think the segment is persisted to disk every 5 minutes (intermediatePersistPeriod), and then, after segmentGranularity + windowPeriod, the segment on disk is uploaded to deep storage and deleted from the local disk. Is that right?

If your stream always carries data timestamped at roughly the current time, I highly recommend using the serverTime rejectionPolicy instead. If you have historical data, I recommend using one of the batch ingestion processes.
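For concreteness (switching just means setting the rejectionPolicy "type" to "serverTime" in the tuningConfig above), here is a rough sketch of when your example segment becomes eligible for handoff under serverTime. This is just arithmetic with the values from your config, not Druid code, and the final merge and push to deep storage add a little extra time on top:

from datetime import datetime, timedelta

# Values taken from the posted config: FIVE_MINUTE segmentGranularity, PT5m windowPeriod.
segment_start = datetime(2015, 6, 28, 0, 0)          # 2015-06-28T00:00:00Z
segment_end = segment_start + timedelta(minutes=5)   # FIVE_MINUTE granularity
window_period = timedelta(minutes=5)                  # PT5m

# Under the serverTime rejection policy, handoff can begin once the server's
# wall clock passes the end of the segment interval plus the windowPeriod.
handoff_eligible_at = segment_end + window_period
print(handoff_eligible_at.isoformat() + "Z")          # 2015-06-28T00:10:00Z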

If a segment covers the interval 2015-06-28T00:00:00Z to 2015-06-28T00:05:00Z and the current time is 2015-06-28T01:00:00Z, that segment should have been uploaded to deep storage already, right?

But I haven't seen it uploaded to deep storage, and I have no idea why…

Unless you have a constant stream of data, the message time rejection policy will not hand off. I don’t know of anyone using this policy in production and we should remove it. The correct solution to load older data via a streaming mechanism is described here:

https://groups.google.com/forum/#!searchin/druid-development/windowperiod/druid-development/kHgHTgqKFlQ/fXvtsNxWzlMJ
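To see why the stream has to keep flowing before handoff happens with messageTime, here is a simplified model of the difference (an illustration of the behavior described above, not Druid's actual implementation). With messageTime, the notion of "now" used to decide whether a segment's window has closed is derived from the event timestamps themselves, so when events stop arriving that clock stops advancing:

from datetime import datetime, timedelta

window_period = timedelta(minutes=5)        # PT5m from the config above
segment_end = datetime(2015, 6, 28, 0, 5)   # end of the 00:00-00:05 segment

def can_hand_off(now):
    # A segment becomes eligible for handoff once "now" passes the end of
    # the segment interval plus the windowPeriod.
    return now >= segment_end + window_period

# Simplified messageTime model: "now" is the latest event timestamp seen so far.
latest_event_timestamp = datetime(2015, 6, 28, 0, 4)

# If no newer events arrive, "now" never advances and the segment is never
# handed off, no matter how much wall-clock time passes.
print(can_hand_off(latest_event_timestamp))   # False

# Only when the stream keeps flowing and a late-enough event shows up does
# the segment become eligible for handoff.
latest_event_timestamp = datetime(2015, 6, 28, 0, 11)
print(can_hand_off(latest_event_timestamp))   # True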

2. Realtime node redundancy.

If I want to make the realtime nodes redundant, should I use two different consumer groups?

For example, realtime nodes A1, B1, C1 are in consumer group 1 and are assigned different partitionNums (0, 1, 2),

and another set of realtime nodes A2, B2, C2 are in consumer group 2 for redundancy, also assigned partitionNums (0, 1, 2).

If realtime node A1 crashes, the broker will get the results from A2.

But we can't guarantee that A2's data is the same as A1's, because Kafka distributes messages across partitions randomly.

So how can I make the realtime nodes redundant?

Please read: https://groups.google.com/forum/#!searchin/druid-development/fangjin$20yang$20"thoughts"/druid-development/aRMmNHQGdhI/muBGl0Xi_wgJ
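To make the concern in the question concrete, here is a toy model (not Kafka code, just an illustration of the issue raised above): within one consumer group the topic's partitions are split across the nodes, and a second consumer group may end up with a different partition-to-node assignment, so a node and its supposed replica can hold different rows.

import random

kafka_partitions = list(range(6))   # a topic with 6 partitions
nodes = ["A", "B", "C"]             # three realtime nodes per consumer group

def assign(partitions, nodes, seed):
    # Toy stand-in for the high-level consumer's rebalance: each consumer
    # group may end up with a different partition-to-node mapping.
    rng = random.Random(seed)
    shuffled = partitions[:]
    rng.shuffle(shuffled)
    return {node: sorted(shuffled[i::len(nodes)]) for i, node in enumerate(nodes)}

group1 = assign(kafka_partitions, nodes, seed=1)
group2 = assign(kafka_partitions, nodes, seed=2)

# A1 and A2 are meant to be replicas (same partitionNum in their shardSpecs),
# but they may be reading different Kafka partitions, hence holding different data.
print("A1 reads partitions:", group1["A"])
print("A2 reads partitions:", group2["A"])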

Thanks very much, Fangjin. Could you help explain what "a constant stream of data" means?

On Monday, June 29, 2015 at 8:04:29 AM UTC+8, Fangjin Yang wrote:

If you have servers or other machines constantly creating events and sending those events to Druid, you have a constant stream of events.