Historical node with less data than the rest

Hi there!!

Our cluster consists of 4 historical nodes. All servers have the same configuration, but one of them holds less data than the others (see the attached coordinator console screenshot).

On that server I checked the historical node's log files and found the following error:

```
2016-11-07T15:28:32,516 ERROR [ZkCoordinator-0] io.druid.server.coordination.ZkCoordinator - Failed to load segment for dataSource: {class=io.druid.server.coordination.ZkCoordinator, exceptionType=class io.druid.segment.loading.SegmentLoadingException, exceptionMessage=Exception loading segment[eventData_2016-11-07T11:00:00.000Z_2016-11-07T12:00:00.000Z_2016-11-07T11:00:30.211Z], segment=DataSegment{size=102894248, shardSpec=LinearShardSpec{partitionNum=0}, metrics=[count, user_unique], dimensions=[id_partner, id_partner_user, event_type, created, url, tags, url_path, tagged, country, url_qs, vertical, url_subdomain, url_domain, segments, share_data, category, title, nav_type, ip, referer_subdomain, browser, search_keyword, id_segment_source, version, referer_qs, referer_path, referer, referer_domain, sec, data_type, gt, track_type, track_code], version='2016-11-07T11:00:30.211Z', loadSpec={type=hdfs, path=/druid/eventData/20161107T110000.000Z_20161107T120000.000Z/2016-11-07T11_00_30.211Z/0/index.zip}, interval=2016-11-07T11:00:00.000Z/2016-11-07T12:00:00.000Z, dataSource='eventData', binaryVersion='9'}}
io.druid.segment.loading.SegmentLoadingException: Exception loading segment[eventData_2016-11-07T11:00:00.000Z_2016-11-07T12:00:00.000Z_2016-11-07T11:00:30.211Z]
at io.druid.server.coordination.ZkCoordinator.loadSegment(ZkCoordinator.java:309) ~[druid-server-0.9.1.1.jar:0.9.1.1]
at io.druid.server.coordination.ZkCoordinator.addSegment(ZkCoordinator.java:350) [druid-server-0.9.1.1.jar:0.9.1.1]
at io.druid.server.coordination.SegmentChangeRequestLoad.go(SegmentChangeRequestLoad.java:44) [druid-server-0.9.1.1.jar:0.9.1.1]
at io.druid.server.coordination.ZkCoordinator$1.childEvent(ZkCoordinator.java:152) [druid-server-0.9.1.1.jar:0.9.1.1]
at org.apache.curator.framework.recipes.cache.PathChildrenCache$5.apply(PathChildrenCache.java:522) [curator-recipes-2.10.0.jar:?]
at org.apache.curator.framework.recipes.cache.PathChildrenCache$5.apply(PathChildrenCache.java:516) [curator-recipes-2.10.0.jar:?]
at org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:93) [curator-framework-2.10.0.jar:?]
at com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297) [guava-16.0.1.jar:?]
at org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:85) [curator-framework-2.10.0.jar:?]
at org.apache.curator.framework.recipes.cache.PathChildrenCache.callListeners(PathChildrenCache.java:514) [curator-recipes-2.10.0.jar:?]
at org.apache.curator.framework.recipes.cache.EventOperation.invoke(EventOperation.java:35) [curator-recipes-2.10.0.jar:?]
at org.apache.curator.framework.recipes.cache.PathChildrenCache$9.run(PathChildrenCache.java:772) [curator-recipes-2.10.0.jar:?]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_101]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_101]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_101]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_101]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_101]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_101]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_101]
Caused by: java.lang.RuntimeException: java.io.FileNotFoundException: File /druid/eventData/20161107T110000.000Z_20161107T120000.000Z/2016-11-07T11_00_30.211Z/0/index.zip does not exist
at com.google.common.base.Throwables.propagate(Throwables.java:160) ~[guava-16.0.1.jar:?]
at com.metamx.common.CompressionUtils.unzip(CompressionUtils.java:146) ~[java-util-0.27.9.jar:?]
at io.druid.storage.hdfs.HdfsDataSegmentPuller.getSegmentFiles(HdfsDataSegmentPuller.java:235) ~[?:?]
at io.druid.storage.hdfs.HdfsLoadSpec.loadSegment(HdfsLoadSpec.java:62) ~[?:?]
at io.druid.segment.loading.SegmentLoaderLocalCacheManager.getSegmentFiles(SegmentLoaderLocalCacheManager.java:143) ~[druid-server-0.9.1.1.jar:0.9.1.1]
at io.druid.segment.loading.SegmentLoaderLocalCacheManager.getSegment(SegmentLoaderLocalCacheManager.java:95) ~[druid-server-0.9.1.1.jar:0.9.1.1]
at io.druid.server.coordination.ServerManager.loadSegment(ServerManager.java:152) ~[druid-server-0.9.1.1.jar:0.9.1.1]
at io.druid.server.coordination.ZkCoordinator.loadSegment(ZkCoordinator.java:305) ~[druid-server-0.9.1.1.jar:0.9.1.1]
… 18 more
Caused by: java.io.FileNotFoundException: File /druid/eventData/20161107T110000.000Z_20161107T120000.000Z/2016-11-07T11_00_30.211Z/0/index.zip does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511) ~[?:?]
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:722) ~[?:?]
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501) ~[?:?]
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:398) ~[?:?]
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:137) ~[?:?]
at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339) ~[?:?]
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:765) ~[?:?]
at io.druid.storage.hdfs.HdfsDataSegmentPuller$1.openInputStream(HdfsDataSegmentPuller.java:107) ~[?:?]
at io.druid.storage.hdfs.HdfsDataSegmentPuller.getInputStream(HdfsDataSegmentPuller.java:298) ~[?:?]
at io.druid.storage.hdfs.HdfsDataSegmentPuller$3.openStream(HdfsDataSegmentPuller.java:241) ~[?:?]
at com.metamx.common.CompressionUtils$1.call(CompressionUtils.java:138) ~[java-util-0.27.9.jar:?]
at com.metamx.common.CompressionUtils$1.call(CompressionUtils.java:134) ~[java-util-0.27.9.jar:?]
at com.metamx.common.RetryUtils.retry(RetryUtils.java:60) ~[java-util-0.27.9.jar:?]
at com.metamx.common.RetryUtils.retry(RetryUtils.java:78) ~[java-util-0.27.9.jar:?]
at com.metamx.common.CompressionUtils.unzip(CompressionUtils.java:132) ~[java-util-0.27.9.jar:?]
at io.druid.storage.hdfs.HdfsDataSegmentPuller.getSegmentFiles(HdfsDataSegmentPuller.java:235) ~[?:?]
at io.druid.storage.hdfs.HdfsLoadSpec.loadSegment(HdfsLoadSpec.java:62) ~[?:?]
at io.druid.segment.loading.SegmentLoaderLocalCacheManager.getSegmentFiles(SegmentLoaderLocalCacheManager.java:143) ~[druid-server-0.9.1.1.jar:0.9.1.1]
at io.druid.segment.loading.SegmentLoaderLocalCacheManager.getSegment(SegmentLoaderLocalCacheManager.java:95) ~[druid-server-0.9.1.1.jar:0.9.1.1]
at io.druid.server.coordination.ServerManager.loadSegment(ServerManager.java:152) ~[druid-server-0.9.1.1.jar:0.9.1.1]
at io.druid.server.coordination.ZkCoordinator.loadSegment(ZkCoordinator.java:305) ~[druid-server-0.9.1.1.jar:0.9.1.1]
… 18 more

```

That is really strange, because the error is a FileNotFoundException, but the file does exist in HDFS.
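
For anyone hitting the same symptom, a quick way to confirm the file really is in deep storage is to list the segment path with the HDFS CLI; the path below is copied from the error above, and the explicit namenode URI in the second command is only an illustration:

```
# List the segment archive through the HDFS client (path taken from the error message)
hdfs dfs -ls /druid/eventData/20161107T110000.000Z_20161107T120000.000Z/2016-11-07T11_00_30.211Z/0/index.zip

# Same check against an explicit namenode URI (host and port are illustrative)
hdfs dfs -ls hdfs://namenode:8020/druid/eventData/20161107T110000.000Z_20161107T120000.000Z/2016-11-07T11_00_30.211Z/0/index.zip
```

One detail worth noting in the trace above: the inner cause goes through org.apache.hadoop.fs.RawLocalFileSystem, which suggests the historical resolved the path against the local filesystem rather than HDFS; that would explain a FileNotFoundException for a file that does exist in deep storage.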

One thing I'd like to mention is that we have a kill task running every day that removes data older than 60 days.
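
For reference, a kill task like that is submitted to the overlord as a small JSON spec; a minimal sketch (the overlord host, port, and interval below are illustrative, not the actual values used here) looks like this:

```
# Submit a kill task that permanently deletes unused segments of the given
# datasource in the given interval (datasource and interval shown are illustrative)
curl -X POST -H 'Content-Type: application/json' \
  -d '{
        "type": "kill",
        "dataSource": "eventData",
        "interval": "2016-08-01T00:00:00.000Z/2016-09-01T00:00:00.000Z"
      }' \
  http://overlord-host:8090/druid/indexer/v1/task
```

A kill task only removes segments that are already marked unused in the metadata store, so it should not normally touch a segment from the current day.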

Any help will be much appreciated, thank you!

Hi Federico,

I ran into that error as well and I do not know what caused it. What I did to resolve it was to delete the segment-cache from that historical node and let the Coordinator re-assign the segments. You could probably just delete the single bad segment from the cache, but my cluster was in a bad spot and I was trying to reset all of my historicals.
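
A rough sketch of that procedure, assuming the cache lives wherever druid.segmentCache.locations points on the historical (the path and the service commands below are placeholders, not a specific deployment):

```
# Stop the historical so it is not loading or serving segments while the cache is cleared
sudo service druid-historical stop          # placeholder for however the service is managed

# Remove the cached segments and info_dir tracked under druid.segmentCache.locations
rm -rf /var/druid/segment-cache/*           # placeholder path

# Start the historical again; the coordinator notices the empty node and
# re-assigns segments to it over the following coordination cycles
sudo service druid-historical start
```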

Good luck,

–Ben

Hi Ben, thank you very much for your quick response.

Yes, I saw your post on this forum, and I deleted all the content inside these folders:

CACHE_FOLDER/historical/eventData

CACHE_FOLDER/historical/info_dir

This made the coordinator start pushing all the segments again. When it finished, the cluster looked like the image I posted earlier: all historicals at ~70% capacity except the failing one at ~40%.
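
While waiting for the reassignment to settle, the coordinator's load-status endpoint is a handy way to watch how much of each datasource is loaded on the historicals (host and port are illustrative):

```
# Percentage of each datasource's segments currently loaded on historicals
curl http://coordinator-host:8081/druid/coordinator/v1/loadstatus
```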

Any other thoughts?

Thanks again!

Never mind, I just found the issue: the Hadoop path was misconfigured in the service startup. We moved the Hadoop folders some time ago and it seems we forgot to update this value; because the new path was very similar to the old one, it was hard to spot. I'm not sure why it was partially working even though the path was not set correctly. Thanks for the help!
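
For anyone hitting the same symptom, the deep storage settings the historicals use to pull segments live in the common runtime properties; a quick sanity check might look like the following (the file location and expected values are illustrative of a typical HDFS deep-storage setup, not this cluster's exact configuration):

```
# Deep storage settings used by historicals to pull segments from HDFS
grep -E 'druid\.(storage|extensions)' conf/druid/_common/common.runtime.properties
# Expected (illustrative) values:
#   druid.extensions.loadList=["druid-hdfs-storage"]
#   druid.storage.type=hdfs
#   druid.storage.storageDirectory=/druid

# The Hadoop client config (core-site.xml / hdfs-site.xml with the correct fs.defaultFS)
# must also be on the historical's classpath; without it, scheme-less paths can resolve
# against the local filesystem instead of HDFS
ls "$HADOOP_CONF_DIR"/core-site.xml "$HADOOP_CONF_DIR"/hdfs-site.xml
```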