Error on Realtime when traffic increases

Hi,

I’ve a couple of Realtime nodes running with Kafka ingestion.

They manage 9 datasources with different behaviors: one has a constant volume of around 2500 messages/sec, while another is bursty and sends around 6,000-7,000 messages/sec for 20 minutes at a time.

What I'm noticing is that when the bursty ingestion starts, the other datasources slow down and their lag on Kafka increases.

Moreover, the Realtime logs show these errors:

2017-02-28 12:19:31,159 ERROR o.I.z.ZkEventThread [ZkClient-EventThread-240-10.80.4.1:2181,10.80.4.2:2181,10.80.4.3:2181] Error handling event ZkEvent[New session event sent to kafka.consumer.ZookeeperConsumerConnector$ZKSessionExpireListener@393eee5e]
kafka.common.ConsumerRebalanceFailedException: druidaws6_aws-druid-realtime4-1487714441368-784ef5e3 can't rebalance after 4 retries
    at kafka.consumer.ZookeeperConsumerConnector$ZKRebalancerListener.syncedRebalance(ZookeeperConsumerConnector.scala:633) ~[kafka_2.10-0.8.2.1.jar:?]
    at kafka.consumer.ZookeeperConsumerConnector$ZKSessionExpireListener.handleNewSession(ZookeeperConsumerConnector.scala:487) ~[kafka_2.10-0.8.2.1.jar:?]
    at org.I0Itec.zkclient.ZkClient$4.run(ZkClient.java:472) ~[zkclient-0.3.jar:0.3]
    at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71) [zkclient-0.3.jar:0.3]

2017-02-28 12:19:31,166 ERROR o.I.z.ZkEventThread [ZkClient-EventThread-178-10.80.4.1:2181,10.80.4.2:2181,10.80.4.3:2181] Error handling event ZkEvent[New session event sent to kafka.consumer.ZookeeperConsumerConnector$ZKSessionExpireListener@58336391]
kafka.common.ConsumerRebalanceFailedException: druidaws6_aws-druid-realtime4-1487714433590-2145d199 can't rebalance after 4 retries
    at kafka.consumer.ZookeeperConsumerConnector$ZKRebalancerListener.syncedRebalance(ZookeeperConsumerConnector.scala:633) ~[kafka_2.10-0.8.2.1.jar:?]
    at kafka.consumer.ZookeeperConsumerConnector$ZKSessionExpireListener.handleNewSession(ZookeeperConsumerConnector.scala:487) ~[kafka_2.10-0.8.2.1.jar:?]
    at org.I0Itec.zkclient.ZkClient$4.run(ZkClient.java:472) ~[zkclient-0.3.jar:0.3]
    at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71) [zkclient-0.3.jar:0.3]


All the datasources have the same ioConfig in their spec files:

    "ioConfig": {
      "type": "realtime",
      "firehose": {
        "type": "kafka-0.8",
        "consumerProps": {
          "zookeeper.connect": "10.80.4.1:2181,10.80.4.2:2181,10.80.4.3:2181",
          "zookeeper.connection.timeout.ms": "15000",
          "zookeeper.session.timeout.ms": "15000",
          "zookeeper.sync.time.ms": "5000",
          "group.id": "druidaws6",
          "fetch.message.max.bytes": "1048586",
          "auto.offset.reset": "largest",
          "auto.commit.enable": "false"
        },
        "feed": "buck_XXXXX"
      },
      "plumber": {
        "type": "realtime"
      }
    }

A couple of questions:

  • Could using the same group.id for all datasources be a problem? Would it be better to have something like druidaws6_ds1, druidaws6_ds2, etc.?

  • I use a couple of Realtime nodes for fault tolerance, with "shardSpec": {"type": "linear", "partitionNum": 0}. Would it be better to have different Realtime nodes for different datasources?
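
To make the first question concrete, here is a sketch of the per-datasource variant I have in mind. Only group.id differs from the ioConfig above; druidaws6_ds1 is just the hypothetical name from the question, and I am assuming every other consumer property stays as it is today. I have not tested this.

```json
{
  "consumerProps": {
    "zookeeper.connect": "10.80.4.1:2181,10.80.4.2:2181,10.80.4.3:2181",
    "zookeeper.connection.timeout.ms": "15000",
    "zookeeper.session.timeout.ms": "15000",
    "zookeeper.sync.time.ms": "5000",
    "group.id": "druidaws6_ds1",
    "fetch.message.max.bytes": "1048586",
    "auto.offset.reset": "largest",
    "auto.commit.enable": "false"
  }
}
```

Each of the 9 datasources would get its own suffix (druidaws6_ds2, and so on), while the two Realtime nodes serving the same datasource would still share that datasource's group.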

Any other ideas about the error message?

Thanks for your help,

Maurizio