Kinesis indexing - OutOfMemoryError

Hi team,

I am seeing this kind of error for most of my tasks after the ingestion has been running for a while: a loop of "ingestion was throttled" warnings until I get the OutOfMemoryError.

indexer-fail.log (98.4 KB)

Hi Adam,

You are running out of heap and hence failing with an OOM and dumping the heap.

You may want to increase the task heap to a higher value than what you have currently. You can either update the indexer heap under the MiddleManager's task launch parameters:

Task launch parameters

druid.indexer.runner.javaOpts=-server -Xms512m -Xmx512m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -XX:+ExitOnOutOfMemoryError -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager

OR, if you want to increase the indexer heap just for this supervisor, you can add a task context in the supervisor spec, updating the -Xmx under "druid.indexer.runner.javaOpts".

Example:

"context": {
    "druid.indexer.runner.javaOpts": "-server -d64 -Xms1g -Xmx10g -XX:MaxDirectMemorySize=12g"
}
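
For where that context block sits within the whole supervisor spec, a minimal sketch (the datasource name and the omitted sections are placeholders for your own spec, and the heap/direct-memory values are only examples, not a recommendation; I believe the context map goes at the top level of the spec):

{
  "type": "kinesis",
  "dataSchema": { "dataSource": "my-kinesis-datasource", ... },
  "ioConfig": { ... },
  "tuningConfig": { ... },
  "context": {
    "druid.indexer.runner.javaOpts": "-server -Xms1g -Xmx2g -XX:MaxDirectMemorySize=2g"
  }
}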
Thanks and regards,
Vaibhav

Thanks for the quick reply and the advice. I am already running with druid.indexer.runner.javaOpts=-server -Xms1g -Xmx1g -XX:MaxDirectMemorySize=1g [...] but I may try more.

However, I am not sure this addresses the root cause of the issue. The supervisor tasks run fine for several days and then suddenly they all start to fail with a similar message in the logs. So I think there might be something worth digging into in the fact that the ingestion keeps getting throttled (if you look at the full log). When I look at the logs of successful tasks, the ingestion rarely gets throttled.
Any idea?

Thanks,
Adam

Well, if you look at the heap dump size, it is 1435079636 bytes, which is ~1.4 GB, whereas you have set the max heap for the peon to 1 GB [-Xmx1g]. This clearly indicates that the peons are running out of heap, at least the peon whose log you attached. Due to heap exhaustion the peon could be struggling, engaged in more GC activity, etc., which may have led to slow segment persists.

I think you should increase the peon heap; that should help.

Thanks and regards,
Vaibhav

Hi,

Thank you Vaibhav. I followed your advice and set the heap size to 2 GB. The ingestion went fine for around a day before it started failing again.

The first task that failed got this particular error.

ERROR [task-runner-0-priority-0] org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner - Encountered exception in run() before persisting.
org.apache.druid.java.util.common.ISE: Starting sequenceNumber [49611094481906332557763756683273290297816080422088475122] is no longer available for partition [shardId-000000000031] (earliest: [null]) and resetOffsetAutomatically is not enabled

As you will see in the log files, we also get a lot of warnings (we got those in the previous logs as well).

WARN [task-runner-0-priority-0] org.apache.druid.segment.realtime.appenderator.AppenderatorImpl - Ingestion was throttled for [10,651] millis because persists were pending.

In the subsequent tasks launched by the supervisor after the first failed one, we again get a whole bunch of that warning, which is then replaced by the one below, for this and 5 other workers. The task then fails due to heap space.

WARN [KinesisRecordSupplier-Worker-2] org.apache.druid.indexing.kinesis.KinesisRecordSupplier - OrderedPartitionableRecord buffer full, storing iterator and retrying in [5,000ms]

Do you have an idea how we can go about this? Attached are the full logs of the first failed task and the second one (all the subsequent tasks have logs similar to the first one).

Many thanks,
Adam

poc-2-second-fail.log (104 KB)

poc-2-first-fail.log (344 KB)

ERROR [task-runner-0-priority-0] org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner - Encountered exception in run() before persisting.
org.apache.druid.java.util.common.ISE: Starting sequenceNumber [49611094481906332557763756683273290297816080422088475122] is no longer available for partition [shardId-000000000031] (earliest: [null]) and resetOffsetAutomatically is not enabled

This error indicates that Druid is trying to read from a sequence number that is no longer available in the Kinesis shard; maybe the data was deleted from the shard based on the stream's retention period. resetOffsetAutomatically is set to false (which is the default and recommended). Druid will stop the ingestion and the task will fail, to indicate to the user that something is wrong and needs to be addressed.
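
If you want to keep the recommended default and handle this manually, the supervisor can be reset through the Overlord API (POST /druid/indexer/v1/supervisor/<supervisorId>/reset), which clears the stored offsets so new tasks start again from offsets that still exist in the stream. Alternatively, if automatically skipping the missing data is acceptable for your use case, the flag named in the error message can be enabled in the supervisor's tuningConfig; a minimal sketch only, and note that it trades task failures for silently skipped data:

"tuningConfig": {
  "type": "kinesis",
  "resetOffsetAutomatically": true
}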

While looking at the second indexing log, I still see the OOM on the peon heap. I feel you may need to size the peon well, UNLESS there is a memory leak (in that case, heap dump analysis could help).
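
Besides the raw -Xmx, the persist-related settings in the tuningConfig also drive peon heap usage; as a rough rule of thumb from the docs, heap used for buffering rows scales with maxRowsInMemory * (2 + maxPendingPersists), so lowering those can ease both the throttling and the OOM. A sketch with purely illustrative values, not tuned for your workload:

"tuningConfig": {
  "type": "kinesis",
  "maxRowsInMemory": 100000,
  "maxRowsPerSegment": 5000000,
  "maxPendingPersists": 0,
  "intermediatePersistPeriod": "PT10M"
}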

I read that you have increased the number of shards for your Kinesis stream -

  1. Did you increase the number of shards due to a high incoming data volume to your Kinesis stream?
  2. What changes did you make after which you started getting this issue?
  3. What is the total number of shards in the Kinesis stream?
  4. What is the taskCount set for your Kinesis supervisor?

Maybe you would like to look at -
https://druid.apache.org/docs/latest/development/extensions-core/kinesis-ingestion.html#capacity-planning
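
For reference, the reading tasks are controlled from the supervisor's ioConfig: the supervisor runs taskCount * replicas reading tasks and distributes the shards among them, so taskCount should not be larger than the number of shards. A sketch with placeholder values ("my-stream", the counts, and the duration are examples only):

"ioConfig": {
  "type": "kinesis",
  "stream": "my-stream",
  "taskCount": 4,
  "replicas": 1,
  "taskDuration": "PT1H"
}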

Thanks and Regards,
Vaibhav