Historicals not able to load shards of some segments

Hello,

Currently I am facing an issue where certain historicals are not able to load particular shards of a given segment. This happens only with a certain set of segments; the others load fine. The corresponding shards of the affected segments are present in my deep storage.

The logs on the coordinator read:

[ERROR] 2017-02-16 03:22:11.014 [Master-PeonExec--0] LoadQueuePeon - Server[/druid/loadQueue/ccg22history035623.ccg22.com:8083], throwable caught when submitting [SegmentChangeRequestLoad{segment=DataSegment{size=530025378, shardSpec=HashBasedNumberedShardSpec{partitionNum=1, partitions=5, partitionDimensions=}, metrics=[records, pageviews, visits, entrypage, exitpage, clicks, bounces, tpage_sum], dimensions=[page_group, page_name, pagegroup_link_name, page_link_name], version='2017-02-08T04:45:54.230Z', loadSpec={type=hdfs, path=hdfs://druid/deepstorage/druid_ingest/20170207T200000.000Z_20170207T210000.000Z/2017-02-08T04_45_54.230Z/1/index.zip}, interval=2017-02-07T20:00:00.000Z/2017-02-07T21:00:00.000Z, dataSource='druid_ingest', binaryVersion='9'}}].

On the historical, the error message is as follows:

[ERROR] 2017-02-16 07:06:35.179 [ZkCoordinator-0] ZkCoordinator - Failed to load segment for dataSource: {class=io.druid.server.coordination.ZkCoordinator, exceptionType=class io.druid.segment.loading.SegmentLoadingException, exceptionMessage=Exception loading segment[druid_ingest_2017-02-07T20:00:00.000Z_2017-02-07T21:00:00.000Z_2017-02-08T04:45:54.230Z_1], segment=DataSegment{size=530025378, shardSpec=HashBasedNumberedShardSpec{partitionNum=1, partitions=5, partitionDimensions=}, metrics=[records, pageviews, visits, entrypage, exitpage, clicks, bounces, tpage_sum], dimensions=[page_group, page_name, pagegroup_link_name, page_link_name], version='2017-02-08T04:45:54.230Z', loadSpec={type=hdfs, path=hdfs://druid_ingest/20170207T200000.000Z_20170207T210000.000Z/2017-02-08T04_45_54.230Z/1/index.zip}, interval=2017-02-07T20:00:00.000Z/2017-02-07T21:00:00.000Z, dataSource='druid_ingest', binaryVersion='9'}}

io.druid.segment.loading.SegmentLoadingException: Exception loading segment[druid_ingest_2017-02-07T20:00:00.000Z_2017-02-07T21:00:00.000Z_2017-02-08T04:45:54.230Z_1]
    at io.druid.server.coordination.ZkCoordinator.loadSegment(ZkCoordinator.java:310) ~[druid-server-0.9.2.jar:0.9.2]
    at io.druid.server.coordination.ZkCoordinator.addSegment(ZkCoordinator.java:351) [druid-server-0.9.2.jar:0.9.2]
    at io.druid.server.coordination.SegmentChangeRequestLoad.go(SegmentChangeRequestLoad.java:44) [druid-server-0.9.2.jar:0.9.2]
    at io.druid.server.coordination.ZkCoordinator$1.childEvent(ZkCoordinator.java:153) [druid-server-0.9.2.jar:0.9.2]
    at org.apache.curator.framework.recipes.cache.PathChildrenCache$5.apply(PathChildrenCache.java:522) [curator-recipes-2.11.0.jar:?]
    at org.apache.curator.framework.recipes.cache.PathChildrenCache$5.apply(PathChildrenCache.java:516) [curator-recipes-2.11.0.jar:?]
    at org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:93) [curator-framework-2.11.0.jar:?]
    at com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297) [guava-16.0.1.jar:?]
    at org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:84) [curator-framework-2.11.0.jar:?]
    at org.apache.curator.framework.recipes.cache.PathChildrenCache.callListeners(PathChildrenCache.java:513) [curator-recipes-2.11.0.jar:?]
    at org.apache.curator.framework.recipes.cache.EventOperation.invoke(EventOperation.java:35) [curator-recipes-2.11.0.jar:?]
    at org.apache.curator.framework.recipes.cache.PathChildrenCache$9.run(PathChildrenCache.java:773) [curator-recipes-2.11.0.jar:?]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_73]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_73]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_73]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_73]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_73]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_73]
    at java.lang.Thread.run(Thread.java:745) [?:1.8.0_73]
Caused by: java.lang.IllegalStateException
    at com.google.common.base.Preconditions.checkState(Preconditions.java:161) ~[guava-16.0.1.jar:?]
    at org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:636) ~[?:?]
    at org.apache.hadoop.hdfs.protocolPB.PBHelper.convertLocatedBlock(PBHelper.java:1062) ~[?:?]
    at org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:1080) ~[?:?]
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:208) ~[?:?]
    at sun.reflect.GeneratedMethodAccessor39.invoke(Unknown Source) ~[?:?]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_73]
    at java.lang.reflect.Method.invoke(Method.java:497) ~[?:1.8.0_73]
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186) ~[?:?]
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) ~[?:?]
    at com.sun.proxy.$Proxy60.getBlockLocations(Unknown Source) ~[?:?]
    at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1131) ~[?:?]
    at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1121) ~[?:?]
    at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1111) ~[?:?]
    at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:272) ~[?:?]
    at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:239) ~[?:?]
    at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:232) ~[?:?]
    at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1279) ~[?:?]
    at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:296) ~[?:?]
    at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:292) ~[?:?]
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) ~[?:?]
    at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:292) ~[?:?]
    at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:765) ~[?:?]
    at io.druid.storage.hdfs.HdfsDataSegmentPuller$1.openInputStream(HdfsDataSegmentPuller.java:107) ~[?:?]
    at io.druid.storage.hdfs.HdfsDataSegmentPuller.getInputStream(HdfsDataSegmentPuller.java:298) ~[?:?]
    at io.druid.storage.hdfs.HdfsDataSegmentPuller$3.openStream(HdfsDataSegmentPuller.java:241) ~[?:?]
    at com.metamx.common.CompressionUtils$1.call(CompressionUtils.java:138) ~[java-util-0.27.10.jar:?]
    at com.metamx.common.CompressionUtils$1.call(CompressionUtils.java:134) ~[java-util-0.27.10.jar:?]
    at com.metamx.common.RetryUtils.retry(RetryUtils.java:60) ~[java-util-0.27.10.jar:?]
    at com.metamx.common.RetryUtils.retry(RetryUtils.java:78) ~[java-util-0.27.10.jar:?]
    at com.metamx.common.CompressionUtils.unzip(CompressionUtils.java:132) ~[java-util-0.27.10.jar:?]
    at io.druid.storage.hdfs.HdfsDataSegmentPuller.getSegmentFiles(HdfsDataSegmentPuller.java:235) ~[?:?]
    at io.druid.storage.hdfs.HdfsLoadSpec.loadSegment(HdfsLoadSpec.java:62) ~[?:?]
    at io.druid.segment.loading.SegmentLoaderLocalCacheManager.getSegmentFiles(SegmentLoaderLocalCacheManager.java:143) ~[druid-server-0.9.2.jar:0.9.2]
    at io.druid.segment.loading.SegmentLoaderLocalCacheManager.getSegment(SegmentLoaderLocalCacheManager.java:95) ~[druid-server-0.9.2.jar:0.9.2]
    at io.druid.server.coordination.ServerManager.loadSegment(ServerManager.java:152) ~[druid-server-0.9.2.jar:0.9.2]
    at io.druid.server.coordination.ZkCoordinator.loadSegment(ZkCoordinator.java:306) ~[druid-server-0.9.2.jar:0.9.2]

Can you help me decipher the error message so I can figure out what is going wrong?

Regards,

Asra

This seems to be an issue on the Hadoop HDFS side. Which Hadoop version is this?
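
One quick data point: the Hadoop version that the historical actually loads (i.e. whatever client jars end up on Druid's classpath) can be printed with a tiny standalone class and compared against the cluster version. This is only a sketch, the class name is made up, and it assumes you run it with the same classpath the historical uses:

    import org.apache.hadoop.util.VersionInfo;

    // Hypothetical helper: prints the version of the Hadoop client jars on this classpath.
    public class ClientVersionCheck {
      public static void main(String[] args) {
        System.out.println("Hadoop client version: " + VersionInfo.getVersion());
        System.out.println("Built from revision: " + VersionInfo.getRevision());
      }
    }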

We are currently using Hadoop 2.6.0.2.2.9.0-3393

I am afraid this is a configuration issue; I am not sure how to reproduce it.
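
One way to try to reproduce it outside of Druid would be to read the same index.zip with a small standalone program on one of the failing historicals, using the same Hadoop client jars and config that Druid picks up. This is only a sketch (the class name is made up, and the default path is copied from the coordinator log above); if the same IllegalStateException from PBHelper shows up here, the problem is in the HDFS client or its config on that node rather than in Druid:

    import java.io.InputStream;
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Hypothetical helper: opens the segment zip directly through the HDFS client.
    public class HdfsReadCheck {
      public static void main(String[] args) throws Exception {
        // Path copied from the loadSpec in the coordinator log; adjust as needed.
        String segmentPath = args.length > 0
            ? args[0]
            : "hdfs://druid/deepstorage/druid_ingest/20170207T200000.000Z_20170207T210000.000Z/2017-02-08T04_45_54.230Z/1/index.zip";

        Configuration conf = new Configuration();  // picks up core-site.xml etc. from the classpath
        FileSystem fs = FileSystem.get(URI.create(segmentPath), conf);

        // If the HDFS client/config on this node is broken, the open/read below
        // should fail the same way the historical does.
        try (InputStream in = fs.open(new Path(segmentPath))) {
          byte[] buf = new byte[4096];
          int n = in.read(buf);
          System.out.println("Read " + n + " bytes from " + segmentPath);
        }
      }
    }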

Can you check what makes some nodes fail and not others? Maybe they have different Hadoop config files.
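
To compare what a failing and a working historical actually see, one option is to dump the effective client configuration (and which resource files it was loaded from) on each node and diff the output. Again just a sketch with a made-up class name, assuming the same classpath the historical uses:

    import java.util.Map;
    import org.apache.hadoop.conf.Configuration;

    // Hypothetical helper: prints the effective Hadoop client configuration on this node.
    public class DumpHdfsClientConf {
      public static void main(String[] args) {
        Configuration conf = new Configuration(); // loads core-default.xml and core-site.xml
        conf.addResource("hdfs-site.xml");        // pull in the HDFS client settings too

        // Which resource files were actually picked up on this node
        System.out.println(conf.toString());

        // Effective key=value pairs; diff this output between a failing and a working historical
        for (Map.Entry<String, String> entry : conf) {
          System.out.println(entry.getKey() + "=" + entry.getValue());
        }
      }
    }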