ZKCoordinator deletes segments and segment cache

I have a 3-node k8s cluster and have installed Druid on it via the Helm chart, so each component is running correctly as a pod.

I am also able to ingest application logs from Kafka: the datasource is created, I can see the segments, and I can query the live logs in Druid.

But after some time, the segments become unavailable and the ingestion tasks start failing.

Looking at the pod logs:

  • The historical pod says that ZkCoordinator is unable to unannounce a segment in the path /var/druid/segment//…

  • Then ZkCoordinator tells it to delete the segments and the segment cache.

  • Later the same ZkCoordinator complains that it can't find anything under /var/druid/segment-cache/…

If anyone can suggest anything, that would be helpful.

Also, the default PV it took for the historical is 4 GB (maybe it's a storage issue? I have to check that).

Hmm, it sounds like that could be it to me. Maybe check the file permissions?

Hi @petermarshallio, thanks for replying…

I increased the persistence for the historical component to 50 GB (from the 4 GB default), but it still fails with the same problem after running for 2 hours at most.

Here is the error block from historical pod:

2021-03-22T04:21:47,905 INFO [ZkCoordinator] org.apache.druid.server.coordination.ZkCoordinator - zNode[/druid/loadQueue/10.0.2.66:8083/-log_2021-03-19T09:00:00.000Z_2021-03-19T10:00:00.000Z_2021-03-19T09:06:30.395Z] was removed
2021-03-22T04:21:47,906 INFO [ZKCoordinator--0] org.apache.druid.server.coordination.ZkCoordinator - Completed request [LOAD: -log_2021-03-19T09:00:00.000Z_2021-03-19T10:00:00.000Z_2021-03-19T09:06:30.395Z]
2021-03-22T04:21:47,906 INFO [ZKCoordinator--0] org.apache.druid.server.coordination.SegmentLoadDropHandler - Loading segment -log_2021-03-19T08:00:00.000Z_2021-03-19T09:00:00.000Z_2021-03-19T08:00:34.114Z_1
2021-03-22T04:21:47,906 WARN [ZKCoordinator--0] org.apache.druid.server.coordination.BatchDataSegmentAnnouncer - No path to unannounce segment[-log_2021-03-19T08:00:00.000Z_2021-03-19T09:00:00.000Z_2021-03-19T08:00:34.114Z_1]
2021-03-22T04:21:47,906 INFO [ZKCoordinator--0] org.apache.druid.server.SegmentManager - Told to delete a queryable for a dataSource[-log] that doesn't exist.
2021-03-22T04:21:47,906 INFO [ZKCoordinator--0] org.apache.druid.segment.loading.SegmentLoaderLocalCacheManager - Deleting directory[var/druid/segment-cache/-log/2021-03-19T08:00:00.000Z_2021-03-19T09:00:00.000Z/2021-03-19T08:00:34.114Z/1]
2021-03-22T04:21:47,906 INFO [ZKCoordinator--0] org.apache.druid.segment.loading.SegmentLoaderLocalCacheManager - Deleting directory[var/druid/segment-cache/-log/2021-03-19T08:00:00.000Z_2021-03-19T09:00:00.000Z/2021-03-19T08:00:34.114Z]
2021-03-22T04:21:47,907 INFO [ZKCoordinator--0] org.apache.druid.segment.loading.SegmentLoaderLocalCacheManager - Deleting directory[var/druid/segment-cache/-log/2021-03-19T08:00:00.000Z_2021-03-19T09:00:00.000Z]
2021-03-22T04:21:47,907 INFO [ZKCoordinator--0] org.apache.druid.segment.loading.SegmentLoaderLocalCacheManager - Deleting directory[var/druid/segment-cache/-log]
2021-03-22T04:21:47,907 WARN [ZKCoordinator--0] org.apache.druid.server.coordination.SegmentLoadDropHandler - Unable to delete segmentInfoCacheFile[var/druid/segment-cache/info_dir/-log_2021-03-19T08:00:00.000Z_2021-03-19T09:00:00.000Z_2021-03-19T08:00:34.114Z_1]
2021-03-22T04:21:47,907 ERROR [ZKCoordinator--0] org.apache.druid.server.coordination.SegmentLoadDropHandler - Failed to load segment for dataSource: {class=org.apache.druid.server.coordination.SegmentLoadDropHandler, exceptionType=class org.apache.druid.segment.loading.SegmentLoadingException, exceptionMessage=Exception loading segment[-log_2021-03-19T08:00:00.000Z_2021-03-19T09:00:00.000Z_2021-03-19T08:00:34.114Z_1], segment=DataSegment{binaryVersion=9, id=-log_2021-03-19T08:00:00.000Z_2021-03-19T09:00:00.000Z_2021-03-19T08:00:34.114Z_1, loadSpec={type=>local, path=>/opt/apache-druid-0.19.0/var/druid/segments/-log/2021-03-19T08:00:00.000Z_2021-03-19T09:00:00.000Z/2021-03-19T08:00:34.114Z/1/377963ca-1867-4f9e-9917-7f4368ad9d53/index.zip}, dimensions=[], metrics=, shardSpec=NumberedShardSpec{partitionNum=1, partitions=0}, lastCompactionState=null, size=3326}}
org.apache.druid.segment.loading.SegmentLoadingException: Exception loading segment[-log_2021-03-19T08:00:00.000Z_2021-03-19T09:00:00.000Z_2021-03-19T08:00:34.114Z_1]
at org.apache.druid.server.coordination.SegmentLoadDropHandler.loadSegment(SegmentLoadDropHandler.java:269) ~[druid-server-0.19.0.jar:0.19.0]
at org.apache.druid.server.coordination.SegmentLoadDropHandler.addSegment(SegmentLoadDropHandler.java:313) ~[druid-server-0.19.0.jar:0.19.0]
at org.apache.druid.server.coordination.SegmentChangeRequestLoad.go(SegmentChangeRequestLoad.java:61) ~[druid-server-0.19.0.jar:0.19.0]
at org.apache.druid.server.coordination.ZkCoordinator.lambda$childAdded$2(ZkCoordinator.java:147) ~[druid-server-0.19.0.jar:0.19.0]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_252]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_252]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_252]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_252]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_252]
Caused by: java.lang.IllegalArgumentException: Cannot construct instance of org.apache.druid.segment.loading.LocalLoadSpec, problem: [/opt/apache-druid-0.19.0/var/druid/segments/-log/2021-03-19T08:00:00.000Z_2021-03-19T09:00:00.000Z/2021-03-19T08:00:34.114Z/1/377963ca-1867-4f9e-9917-7f4368ad9d53/index.zip] does not exist

Please take a look and help me out…

I also checked the file permissions inside the pod, and they look correct for the druid user.

Aha! Have you set up your deep storage? This might apply to you:
Please see Historical server fails to load segments in kubernetes · Issue #10523 · apache/druid · GitHub
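For anyone else landing here: the stack trace shows loadSpec={type=>local, …}, i.e. segments are being written to "local" deep storage on one pod's filesystem, which the other pods can't read. The fix is to point Druid at a shared deep storage such as S3. Here is a minimal sketch of the relevant common.runtime.properties (how you set these depends on your Helm chart's values; the bucket name and credentials below are placeholders, not values from this thread):

```properties
# Load the S3 extension alongside whatever you already use
druid.extensions.loadList=["druid-s3-extensions", "druid-kafka-indexing-service"]

# Deep storage: S3 instead of the default "local"
druid.storage.type=s3
druid.storage.bucket=my-druid-bucket        # placeholder bucket name
druid.storage.baseKey=druid/segments

# Credentials (or use an IAM role / IRSA instead of static keys)
druid.s3.accessKey=YOUR_ACCESS_KEY          # placeholder
druid.s3.secretKey=YOUR_SECRET_KEY          # placeholder

# Task logs should also go to shared storage in k8s
druid.indexer.logs.type=s3
druid.indexer.logs.s3Bucket=my-druid-bucket
druid.indexer.logs.s3Prefix=druid/indexing-logs
```

After changing this, newly published segments get an s3 loadSpec, so any historical pod can pull them regardless of which pod originally wrote them.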

I just updated it with S3 deep storage and everything is working fine :smiley: I was about to post the solution here as well…

Thanks @petermarshallio!