[druid-user] Deep Storage having issues during ingestion

Hi All,

I observed the logs below on a Historical during ingestion.
HDFS is used for deep storage.

{"timeMillis":1655199876468,"thread":"ZKCoordinator--0","level":"ERROR","loggerName":"org.apache.druid.server.coordination.SegmentLoadDropHandler","message":"Failed to load segment for dataSource: {class=org.apache.druid.server.coordination.SegmentLoadDropHandler, exceptionType=class org.apache.druid.segment.loading.SegmentLoadingException, exceptionMessage=Exception loading segment[pmdata_2022-06-13T05:00:00.000Z_2022-06-13T06:00:00.000Z_2022-06-13T05:01:04.786Z_78], segment=DataSegment{binaryVersion=9, id=pmdata_2022-06-13T05:00:00.000Z_2022-06-13T06:00:00.000Z_2022-06-13T05:01:04.786Z_78, loadSpec={type=>hdfs, path=>hdfs://druid-hdfs-altiplano-hdfs-namenode.nom-apps.svc.cluster.local:8020/druid/segments/pmdata/20220613T050000.000Z_20220613T060000.000Z/2022-06-13T05_01_04.786Z/78_b8ded05c-b6f0-48c6-a6d9-d21e67b32491_index.zip}

I also found the logs below:

{"timeMillis":1655201172368,"thread":"ZKCoordinator--0","level":"WARN","loggerName":"org.apache.hadoop.hdfs.DFSClient","message":"No live nodes contain block BP-1402020164-10.131.4.62-1654854729462:blk_1073866505_125754 after checking nodes = , ignoredNodes = null","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":88,"threadPriority":5}
{"timeMillis":1655201172368,"thread":"ZKCoordinator--0","level":"INFO","loggerName":"org.apache.hadoop.hdfs.DFSClient","message":"No node available for BP-1402020164-10.131.4.62-1654854729462:blk_1073866505_125754 file=/druid/segments/pmdata/20220613T050000.000Z_20220613T060000.000Z/2022-06-13T05_01_04.786Z/0_1f6d0597-3e39-4af3-9c41-f782fbb7c58a_index.zip","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":88,"threadPriority":5}

Has anyone seen this kind of issue before? Any suggestions would be much appreciated.

Hi,

Have you verified that the segments are in deep storage? The verification will help us understand if the issue is on the segment writing side or the segment loading side.

The only similar issue I’ve come across involved datasource names containing non-ASCII characters. Is that the case here?

Hi,

It's clear from the log that some of the HDFS blocks could not be read by the HDFS client in Druid.
{"timeMillis":1655201172368,"thread":"ZKCoordinator--0","level":"WARN","loggerName":"org.apache.hadoop.hdfs.DFSClient","message":"No live nodes contain block BP-1402020164-10.131.4.62-1654854729462:blk_1073866505_125754 after checking nodes = , ignoredNodes = null","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":88,"threadPriority":5}

If another replica of the HDFS block is available, the HDFS client will try it once the retries are exhausted. If not, the segment loading will fail.

A couple of reasons I could think of:

  1. Not all HDFS hostnames are present in the hosts file, or DNS is not resolving the hostnames.
  2. If DNS is resolving the hostnames, check that reverse DNS resolution is also configured.
  3. Check HDFS health/monitoring to confirm that all blocks are present, and whether any changes were made on the DataNodes recently.
  4. Check that the hdfs-site.xml in Druid is the correct one. You could copy it from the NameNode to Druid along with the other *-site.xml files (core-site, mapred-site, hdfs-site, yarn-site).
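For points 1 and 2, a small script can sanity-check forward and reverse resolution from the Druid hosts. This is only a sketch; the hostname list in the main block is a placeholder to be replaced with your actual DataNode hostnames:

```python
import socket

def check_resolution(hostname):
    """Check forward DNS resolution of `hostname` and, for each resolved
    IP, attempt the reverse (PTR) lookup. Returns a human-readable report."""
    try:
        # Collect the distinct addresses the name resolves to.
        ips = sorted({info[4][0] for info in socket.getaddrinfo(hostname, None)})
    except socket.gaierror as e:
        return f"{hostname}: forward lookup FAILED ({e})"
    lines = []
    for ip in ips:
        try:
            rev_name, _, _ = socket.gethostbyaddr(ip)
            lines.append(f"{hostname} -> {ip} -> {rev_name}")
        except OSError:
            lines.append(f"{hostname} -> {ip} -> reverse lookup FAILED")
    return "\n".join(lines)

if __name__ == "__main__":
    # Placeholder list: substitute your real DataNode hostnames here.
    for host in ["localhost"]:
        print(check_resolution(host))
```

If reverse lookup fails for a DataNode IP, that matches point 2 above and is worth fixing in DNS (or the hosts file) before digging further.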

Thanks

Hi Mark and Tijo,

Thanks for your valuable inputs, and sorry for the late reply. Since the segments in deep storage were corrupted, I had to clear the segments and re-ingest the data from scratch.
Is there an easier way / tool to back up and restore this data?

If we end up in a similar situation again, we might lose a lot of time. Having a backup that can be restored would help us a lot.

Also, we are currently executing ingestion and aggregation again with the newer data. Although the aggregation completes properly (with hardly 1% of the queries failing), we see similar problems in the Historicals' logs.
The Historicals go down constantly, while the aggregation eventually succeeds for most of the queries fired.

{"instant":{"epochSecond":1655455133,"nanoOfSecond":438000000},"thread":"ZKCoordinator--0","level":"ERROR","loggerName":"org.apache.druid.server.coordination.SegmentLoadDropHandler","message":"Failed to load segment for dataSource: {class=org.apache.druid.server.coordination.SegmentLoadDropHandler, exceptionType=class org.apache.druid.segment.loading.SegmentLoadingException, exceptionMessage=Exception loading segment[pmdata_2022-06-17T03:00:00.000Z_2022-06-17T04:00:00.000Z_2022-06-17T03:00:30.263Z_162]

and also the exception below while loading a Hadoop segment:

{"instant":{"epochSecond":1655446838,"nanoOfSecond":613000000},"thread":"ZKCoordinator--0","level":"ERROR","loggerName":"org.apache.druid.segment.loading.SegmentLocalCacheManager","message":"Failed to load segment in current location [/opt/druid/var/druid/segment-cache], try next location if any: {class=org.apache.druid.segment.loading.SegmentLocalCacheManager, exceptionType=class org.apache.druid.segment.loading.SegmentLoadingException, exceptionMessage=Error loading [hdfs://apache-hadoop-namenode.apps.svc.cluster.local:8020/druid/segments/pmdata/20220617T030000.000Z_20220617T040000.000Z/2022-06-17T03_00_30.263Z/179_b1445888-b60b-4916-a966-7aa22184fc23_index.zip], location=/opt/druid/var/druid/segment-cache}","thrown":{"commonElementCount":0,"localizedMessage":"Error loading [hdfs://apache-hadoop-namenode.apps.svc.cluster.local:8020/druid/segments/pmdata/20220617T030000.000Z_20220617T040000.000Z/2022-06-17T03_00_30.263Z/179_b1445888-b60b-4916-a966-7aa22184fc23_index.zip]","message":"Error loading [hdfs://apache-hadoop-namenode.apps.svc.cluster.local:8020/druid/segments/pmdata/20220617T030000.000Z_20220617T040000.000Z/2022-06-17T03_00_30.263Z/179_b1445888-b60b-4916-a966-7aa22184fc23_index.zip]","name":"org.apache.druid.segment.loading.SegmentLoadingException","cause":{"commonElementCount":18,"localizedMessage":"No space left on device","message":"No space left on device","name":"java.io.IOException","extendedStackTrace":[{"class":"java.io.RandomAccessFile","method":"writeBytes","file":"RandomAccessFile.java","line":-2,"exact":false,"location":"?","version":"1.8.0_332"},{"class":"java.io.RandomAccessFile","method":"write","file":"RandomAccessFile.java","line":525,"exact":false,"location":"?","version":"1.8.0_332"},{"class":"org.apache.druid.java.util.common.io.NativeIO","method":"chunkedCopy","file":"NativeIO.java","line":221,"exact":false,"location":"druid-core-0.22.1.jar","version":"0.22.1"},{"class":"org.apache.druid.utils.CompressionUtils","method":"unzip","file":"CompressionUtils.java","line":304,"exact":false,"location":"druid-core-0.22.1.jar","version":"0.22.1"},{"class":"org.apache.druid.utils.CompressionUtils","method":"lambda$unzip$1","file":"CompressionUtils.java","line":188,"exact":false,"location":"druid-core-0.22.1.jar","version":"0.22.1"},{"class":"org.apache.druid.java.util.common.RetryUtils","method":"retry","file":"RetryUtils.java","line":129,"exact":false,"location":"druid-core-0.22.1.jar","version":"0.22.1"},{"class":"org.apache.druid.java.util.common.RetryUtils","method":"retry","file":"RetryUtils.java","line":81,"exact":false,"location":"druid-core-0.22.1.jar","version":"0.22.1"},{"class":"org.apache.druid.java.util.common.RetryUtils","method":"retry","file":"RetryUtils.java","line":163,"exact":false,"location":"druid-core-0.22.1.jar","version":"0.22.1"},{"class":"org.apache.druid.java.util.common.RetryUtils","method":"retry","file":"RetryUtils.java","line":153,"exact":false,"location":"druid-core-0.22.1.jar","version":"0.22.1"},{"class":"org.apache.druid.utils.CompressionUtils","method":"unzip","file":"CompressionUtils.java","line":187,"exact":false,"location":"druid-core-0.22.1.jar","version":"0.22.1"},{"class":"org.apache.druid.storage.hdfs.HdfsDataSegmentPuller","method":"getSegmentFiles","file":"HdfsDataSegmentPuller.java","line":243,"exact":false,"location":"?","version":"?"}]},"extendedStackTrace":[{"class":"org.apache.druid.storage.hdfs.HdfsDataSegmentPuller","method":"getSegmentFiles","file":"HdfsDataSegmentPuller.java","line":292,"exact":false,"location":"?","version":"?"},{"class":"org.apache.druid.storage.hdfs.HdfsLoadSpec","method":"loadSegment","file":"HdfsLoadSpec.java","line":57,"exact":false,"location":"?","version":"?"},{"class":"org.apache.druid.segment.loading.SegmentLocalCacheManager","method":"loadInLocation","file":"SegmentLocalCacheManager.java","line":327,"exact":false,"location":"druid-server-0.22.1.jar","version":"0.22.1"},{"class":"org.apache.druid.segment.loading.SegmentLocalCacheManager","method":"loadInLocationWithStartMarker","file":"SegmentLocalCacheManager.java","line":315,"exact":false,"location":"druid-server-0.22.1.jar","version":"0.22.1"},{"class":"org.apache.druid.segment.loading.SegmentLocalCacheManager","method":"loadInLocationWithStartMarkerQuietly","file":"SegmentLocalCacheManager.java","line":276,"exact":false,"location":"druid-server-0.22.1.jar","version":"0.22.1"},{"class":"org.apache.druid.segment.loading.SegmentLocalCacheManager","method":"loadSegmentWithRetry","file":"SegmentLocalCacheManager.java","line":255,"exact":false,"location":"druid-server-0.22.1.jar","version":"0.22.1"},{"class":"org.apache.druid.segment.loading.SegmentLocalCacheManager","method":"getSegmentFiles","file":"SegmentLocalCacheManager.java","line":211,"exact":false,"location":"druid-server-0.22.1.jar","version":"0.22.1"},{"class":"org.apache.druid.segment.loading.SegmentLocalCacheLoader","method":"getSegment","file":"SegmentLocalCacheLoader.java","line":52,"exact":false,"location":"druid-server-0.22.1.jar","version":"0.22.1"},{"class":"org.apache.druid.server.SegmentManager","method":"getSegmentReference","file":"SegmentManager.java","line":272,"exact":false,"location":"druid-server-0.22.1.jar","version":"0.22.1"},{"class":"org.apache.druid.server.SegmentManager","method":"loadSegment","file":"SegmentManager.java","line":219,"exact":false,"location":"druid-server-0.22.1.jar","version":"0.22.1"},{"class":"org.apache.druid.server.coordination.SegmentLoadDropHandler","method":"loadSegment","file":"SegmentLoadDropHandler.java","line":278,"exact":false,"location":"druid-server-0.22.1.jar","version":"0.22.1"},{"class":"org.apache.druid.server.coordination.SegmentLoadDropHandler","method":"addSegment","file":"SegmentLoadDropHandler.java","line":329,"exact":false,"location":"druid-server-0.22.1.jar","version":"0.22.1"},{"class":"org.apache.druid.server.coordination.SegmentChangeRequestLoad","method":"go","file":"SegmentChangeRequestLoad.java","line":61,"exact":false,"location":"druid-server-0.22.1.jar","version":"0.22.1"},{"class":"org.apache.druid.server.coordination.ZkCoordinator","method":"lambda$childAdded$2","file":"ZkCoordinator.java","line":150,"exact":false,"location":"druid-server-0.22.1.jar","version":"0.22.1"},{"class":"java.util.concurrent.Executors$RunnableAdapter","method":"call","file":"Executors.java","line":511,"exact":true,"location":"?","version":"1.8.0_332"},{"class":"java.util.concurrent.FutureTask","method":"run","file":"FutureTask.java","line":266,"exact":true,"location":"?","version":"1.8.0_332"},{"class":"java.util.concurrent.ThreadPoolExecutor","method":"runWorker","file":"ThreadPoolExecutor.java","line":1149,"exact":true,"location":"?","version":"1.8.0_332"},{"class":"java.util.concurrent.ThreadPoolExecutor$Worker","method":"run","file":"ThreadPoolExecutor.java","line":624,"exact":true,"location":"?","version":"1.8.0_332"},{"class":"java.lang.Thread","method":"run","file":"Thread.java","line":750,"exact":true,"location":"?","version":"1.8.0_332"}]},"endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":85,"threadPriority":5,"timestamp":"2022-06-17T06:20:38.613+0000"}
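Note that the root cause in the trace above is java.io.IOException: No space left on device while unzipping the segment into /opt/druid/var/druid/segment-cache, so the free space on the Historical's segment-cache volume seems worth checking before anything else. A minimal sketch of such a check (the path and segment size in the main block are placeholders, not values from our cluster):

```python
import shutil

def cache_headroom(path, needed_bytes):
    """Report total/free bytes on the filesystem holding `path` and whether
    a download of `needed_bytes` would fit."""
    usage = shutil.disk_usage(path)
    return {"total": usage.total, "free": usage.free,
            "fits": usage.free >= needed_bytes}

if __name__ == "__main__":
    # Placeholder path: on a Historical this would be the segment-cache
    # location, e.g. /opt/druid/var/druid/segment-cache from the log above.
    info = cache_headroom("/", 500 * 1024 * 1024)
    print(f"free={info['free']} bytes; a 500 MiB segment fits: {info['fits']}")
```

The maxSize values in druid.segmentCache.locations should also stay within what the volume can actually hold, otherwise the Historical will keep trying to pull segments it has no room for.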

Regards,
Chaitanya

Hi Tijo,

I forgot to reply to your 4 points. Points 1 and 2 will be answered later. Point 3 was tried out, and it showed that a large number of segments (over 100k) were corrupted.
Regarding your last point, those *-site.xml files are the same across the NameNode and the Historicals too.

Regards,
Chaitanya