Broker going down after 4 days of ingestion

We have 1 broker. It was running fine for 3 days, but it is now going down and restarting every 2-3 minutes. There are no errors in the logs, and memory consumption is below the configured limit.

Hey Brakhar – if you grep for ERROR or WARN in the broker log, does it give you any hints?
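If it helps, something along these lines works on a quickstart-style install (assuming the service logs land under var/sv/ – adjust the path if your layout differs):

grep -E "ERROR|WARN" var/sv/broker.log | tail -n 50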

My immediate thought was that you are hitting out-of-memory on your JRE.

(The broker is only responsible for planning query execution and merges of final results – so as you have more segments for it to think about, it requires more memory over time. That would be the only correlation I’m aware of between ingestion and the broker.)

I cannot see any ERROR in the logs, but for WARN I can see “Unable to load native-hadoop library for your platform…”. I don’t think that is causing any issues, though.

Logs of broker:

{"instant":{"epochSecond":1656930386,"nanoOfSecond":404000000},"thread":"main","level":"WARN","loggerName":"org.apache.hadoop.util.NativeCodeLoader","message":"Unable to load native-hadoop library for your platform... using builtin-java classes where applicable","endOfBatch":false,"loggerFqcn":"org.apache.commons.logging.impl.SLF4JLocationAwareLog","threadId":1,"threadPriority":5,"timestamp":"2022-07-04T10:26:26.404+0000"}
{"instant":{"epochSecond":1656930389,"nanoOfSecond":424000000},"thread":"main","level":"WARN","loggerName":"org.eclipse.jetty.server.handler.gzip.GzipHandler","message":"minGzipSize of 0 is inefficient for short content, break even is size 23","endOfBatch":false,"loggerFqcn":"org.eclipse.jetty.util.log.Slf4jLog","threadId":1,"threadPriority":5,"timestamp":"2022-07-04T10:26:29.424+0000"}

Also, segments have failed to load. The log below is from the historicals.

{"instant":{"epochSecond":1656931844,"nanoOfSecond":606000000},"thread":"ZKCoordinator-0","level":"ERROR","loggerName":"org.apache.druid.server.coordination.SegmentLoadDropHandler","message":"Failed to load segment for dataSource: {class=org.apache.druid.server.coordination.SegmentLoadDropHandler, exceptionType=class org.apache.druid.segment.loading.SegmentLoadingException, exceptionMessage=Exception loading segment[pmdata_2022-07-03T01:00:00.000Z_2022-07-03T02:00:00.000Z_2022-07-03T01:01:18.397Z_68], segment=DataSegment{binaryVersion=9, id=pmdata_2022-07-03T01:00:00.000Z_2022-07-03T02:00:00.000Z_2022-07-03T01:01:18.397Z_68, loadSpec={type=>hdfs, path=>hdfs://apache-hadoop-namenode.nom-apps.svc.cluster.local:8020/druid/segments

Have any segments loaded OK at all from your deep storage? (Looks like it’s HDFS?)

I wonder if some log information is not being captured… Are you able to tail -f the broker log in var/sv to see what is emitted while the process runs?

Out of interest, have you tried increasing the heap size for the process?
The basic cluster tuning guide in the Druid docs gives some guidelines.
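For reference, the broker heap is set in its jvm.config – a minimal sketch assuming the clustered example layout (the path and sizes below are placeholders, size them to your host):

cat conf/druid/cluster/query/broker/jvm.config
# -server
# -Xms4g
# -Xmx4g
# -XX:MaxDirectMemorySize=6g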

Can you please check the health of the HDFS cluster? It seems the Hadoop client could not load the mentioned segment (hdfs://apache-hadoop-namenode.nom-apps.svc.cluster.local:8020/druid/segments/pmdata_2022-07-03T01:00:00.000Z_2022-07-03T02:00:00.000Z_2022-07-03T01:01:18.397Z_68).
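A few quick checks you could run from any host that has the Hadoop client configured (the paths below come straight from your historical log):

# Overall NameNode/DataNode health
hdfs dfsadmin -report

# Look for missing or corrupt blocks under the Druid segment root
hdfs fsck /druid/segments -files -blocks

# Confirm the segment files named in the error are present and readable
hdfs dfs -ls hdfs://apache-hadoop-namenode.nom-apps.svc.cluster.local:8020/druid/segments/pmdata/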


HDFS is healthy and no files are corrupted.

Hmm – maybe there’s some issue with Druid not being able to connect to HDFS properly.

Perhaps confirm that the HDFS extension is loading properly (you can see the loaded extensions in the console) and that every node in your cluster has network access to HDFS?
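As a quick reachability test you could run from each Druid node (or pod), assuming nc is available there – the host and port are taken from the loadSpec in your log:

nc -vz apache-hadoop-namenode.nom-apps.svc.cluster.local 8020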

Can you please elaborate a bit on how I can check whether the extensions are loaded?

Sure – in the console on the upper-left, you should see the number of extensions – and you can click on that box to see what’s loaded OK.

You can also ping each host individually using this API to get their own local list:
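For example, each process’s /status endpoint reports the modules it has loaded – something like the following, assuming the default ports of 8082 for the broker and 8083 for historicals:

curl http://<broker-host>:8082/status
curl http://<historical-host>:8083/status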


I can see the extension for HDFS is loading properly.

Also, I can now see that historical-0 shows no errors with respect to the segment loading issue, but historical-1 still does.

{"instant":{"epochSecond":1657686512,"nanoOfSecond":339000000},"thread":"ZKCoordinator-1","level":"ERROR","loggerName":"org.apache.druid.server.coordination.SegmentLoadDropHandler","message":"Failed to load segment for dataSource: {class=org.apache.druid.server.coordination.SegmentLoadDropHandler, exceptionType=class org.apache.druid.segment.loading.SegmentLoadingException, exceptionMessage=Exception loading segment[pmdata_2022-07-05T17:00:00.000Z_2022-07-05T18:00:00.000Z_2022-07-05T17:00:02.515Z_49], segment=DataSegment{binaryVersion=9, id=pmdata_2022-07-05T17:00:00.000Z_2022-07-05T18:00:00.000Z_2022-07-05T17:00:02.515Z_49, loadSpec={type=>hdfs, path=>hdfs://apache-hadoop-namenode.nom-apps.svc.cluster.local:8020/druid/segments/pmdata/20220705T170000.000Z_20220705T180000.000Z/2022-07-05T17_00_02.515Z/49_787da117-05ec-4040-9f3f-12ff7d23dd35_index.zip}

{"instant":{"epochSecond":1657688371,"nanoOfSecond":530000000},"thread":"Coordinator-Exec-0","level":"WARN","loggerName":"org.apache.druid.server.coordinator.rules.LoadRule","message":"No available [_default_tier] servers or node capacity to assign primary segment[pmdata_2022-07-05T04:00:00.000Z_2022-07-05T05:00:00.000Z_2022-07-05T04:00:01.961Z_53]! Expected Replicants[2]","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":127,"threadPriority":5,"timestamp":"2022-07-13T04:59:31.530+0000"}

This is from the coordinator log.

AHA! What is your total segmentCache locations size? Druid will use up to that amount of space on your server; it does not just consume all of the available disk space. Configuration reference · Apache Druid
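For reference, that cache is configured on each historical in runtime.properties – a sketch with a placeholder path and size (maxSize is in bytes):

grep -E "segmentCache.locations|server.maxSize" conf/druid/cluster/data/historical/runtime.properties
# druid.segmentCache.locations=[{"path":"/var/druid/segment-cache","maxSize":300000000000}]
# druid.server.maxSize=300000000000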

We have set it to 45 GiB, and it is around 93% used. What would be the ideal usage percentage?

I’m afraid I would not be able to answer that – it would be for you to calculate based on your business and what they want to spend haha

You will need some spare capacity for Druid to load the segments that you make available for query into the historicals, multiplied by the replication factor (see your Load Rules for what that is currently) – and then you will need extra capacity for incoming new data. Remember that Druid does not query deep storage directly; the segments are copied into the local segment cache on each historical.
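If you want to keep an eye on how full each historical’s cache is, the Coordinator reports used vs. maximum size per data server – something like this, assuming the default Coordinator port of 8081:

# currSize and maxSize are reported in bytes for each historical
curl "http://<coordinator-host>:8081/druid/coordinator/v1/servers?simple"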