Segments don't hand off && segment sizes doubled

Hi all,

This problem is very urgent.

Our Druid cluster has been running for almost a year. The problems are as follows.

  1. Segments don’t hand off. After the task duration elapsed, the segments began to hand off, but the handoff had still not finished by the time the task stopped gracefully, because completionTimeout (PT30M) had elapsed.

  2. The coordinator only asks the historical nodes to load a few segments.

  3. HDFS and MySQL both have the segments, but the sizes do not all match: some segments’ sizes differ, and the sizes recorded in MySQL are doubled.

Coordinator’s log:

2016-11-10T18:19:00,966 INFO [ServerInventoryView-0] io.druid.curator.inventory.CuratorInventoryManager - Created new InventoryCacheListener for /druid/production/segments/bigdata:8106

2016-11-10T18:19:00,966 INFO [ServerInventoryView-0] io.druid.client.BatchServerInventoryView - New Server[DruidServerMetadata{name='bigdata:8106', host='bigdata:8106', maxSize=0, tier='_default_tier', type='indexer-executor', priority='0'}]

2016-11-10T19:49:19,671 WARN [ServerInventoryView-0] io.druid.curator.inventory.CuratorInventoryManager - Exception while getting data for node /druid/production/segments/bigdata:8106/bigdata:8106_indexer-executor__default_tier_2016-11-10T10:56:11.432Z_06ca6fba082442778d7457a85de5fabb1

org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /druid/production/segments/bigdata:8106/bigdata:8106_indexer-executor__default_tier_2016-11-10T10:56:11.432Z_06ca6fba082442778d7457a85de5fabb1

2016-11-10T19:49:19,672 INFO [ServerInventoryView-0] io.druid.curator.inventory.CuratorInventoryManager - Ignoring event: Type - CHILD_UPDATED , Path - /druid/production/segments/bigdata:8106/bigdata:8106_indexer-executor__default_tier_2016-11-10T10:56:11.432Z_06ca6fba082442778d7457a85de5fabb1 , Version - 70

2016-11-10T19:49:19,715 INFO [ServerInventoryView-0] io.druid.client.BatchServerInventoryView - Server Disappeared[DruidServerMetadata{name='bigdata:8106', host='bigdata:8106', maxSize=0, tier='_default_tier', type='indexer-executor', priority='0'}]

The data is as follows.

Thanks a lot,

Xinxin

Hi

Multiple issues can cause handoff to stop.

First, check that both the realtime and historical nodes can read from and write to HDFS.

Second, check that the historical nodes have enough capacity to load new segments (you can see this in the coordinator console).
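
A minimal sketch of that capacity check in Python, in case the console is not handy. It assumes the coordinator's `/druid/coordinator/v1/servers?simple` endpoint returns a JSON list with one object per server carrying `host`, `type`, `currSize` and `maxSize` fields; please verify that against your Druid version. The helper name `historicals_low_on_space`, the 10% headroom threshold and the sample hosts are illustrative only, not part of Druid.

```python
import json


def historicals_low_on_space(servers_json, min_free_fraction=0.10):
    """Return (host, free_fraction) for historicals with little free space.

    servers_json: response body as a JSON string, assumed to be a list of
    objects with 'host', 'type', 'currSize' and 'maxSize' fields.
    """
    low = []
    for server in json.loads(servers_json):
        if server.get("type") != "historical":
            continue  # skip realtime / indexer-executor entries
        max_size = server.get("maxSize", 0)
        if max_size <= 0:
            # No capacity configured at all, so nothing can be loaded.
            low.append((server["host"], 0.0))
            continue
        free = (max_size - server.get("currSize", 0)) / max_size
        if free < min_free_fraction:
            low.append((server["host"], free))
    return low


# Hypothetical sample response; in practice fetch it from the coordinator.
sample = json.dumps([
    {"host": "hist1:8083", "type": "historical", "currSize": 95, "maxSize": 100},
    {"host": "hist2:8083", "type": "historical", "currSize": 10, "maxSize": 100},
    {"host": "bigdata:8106", "type": "indexer-executor", "currSize": 0, "maxSize": 0},
])
print(historicals_low_on_space(sample))
```

Any historical this reports is a candidate explanation for the coordinator assigning only a few segments.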

Third, check that the realtime nodes are writing the segment descriptors to the metadata storage.
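
One way to check the descriptors, sketched in Python under stated assumptions: the metadata store has a `druid_segments` table whose `payload` column holds each segment's JSON descriptor with a `size` field, and you have separately collected the file sizes from HDFS (for example with `hdfs dfs -ls`). The function name and the sample rows are hypothetical, and fetching the rows from MySQL is left out. Note that the descriptor's `size` and the size of the file on HDFS may not be measured the same way (for example, uncompressed versus zipped), so a mismatch is a lead to investigate rather than proof of a problem.

```python
import json


def compare_segment_sizes(metadata_rows, hdfs_sizes):
    """Compare descriptor sizes with deep-storage sizes.

    metadata_rows: iterable of (segment_id, payload) pairs, where payload is
        the JSON descriptor (str or bytes) read from the metadata store.
    hdfs_sizes: dict mapping segment_id -> file size in bytes on HDFS.
    Returns a dict with 'missing' ids and 'mismatched' tuples of
    (segment_id, metadata_size, hdfs_size, ratio).
    """
    report = {"missing": [], "mismatched": []}
    for segment_id, payload in metadata_rows:
        meta_size = json.loads(payload)["size"]
        if segment_id not in hdfs_sizes:
            report["missing"].append(segment_id)
            continue
        hdfs_size = hdfs_sizes[segment_id]
        if meta_size != hdfs_size:
            ratio = meta_size / hdfs_size if hdfs_size else float("inf")
            report["mismatched"].append((segment_id, meta_size, hdfs_size, ratio))
    return report


# Hypothetical rows and sizes, for illustration only.
rows = [
    ("seg_a", '{"dataSource": "events", "size": 2000}'),
    ("seg_b", '{"dataSource": "events", "size": 500}'),
    ("seg_c", '{"dataSource": "events", "size": 700}'),
]
sizes_on_hdfs = {"seg_a": 1000, "seg_b": 500}
report = compare_segment_sizes(rows, sizes_on_hdfs)
print(report)
```

A ratio close to 2 across many segments would match the "doubled" sizes described above and is worth including when you share more details.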

On the other hand, what changed recently that could have caused this?

Finally, sharing more logs from the realtime and historical nodes would help.

Hi Slim,

Our Kafka indexing tasks could write to HDFS, and there was plenty of space. The historical nodes could also read from HDFS.

So we suspect the coordinator and ZooKeeper. Perhaps there were so many segments to add and drop that the coordinator and ZooKeeper became a bottleneck.

Thanks a lot,

Xinxin

On Friday, November 11, 2016 at 1:39:02 AM UTC+8, Slim Bouguerra wrote:

Does this help?
http://druid.io/docs/0.9.2/ingestion/faq.html