Druid Kafka Indexing Service causing memory leaks in Kafka

Hi,

I’m using Kafka as the messaging system in my data pipeline. I have a couple of producer processes, with Spark Streaming and Druid’s Kafka indexing service as the consumers. The indexing service spawns 40 new indexing tasks every 15 minutes. From Kafka’s GC logs, I see the heap space maxing out within a few hours. Heap usage on the brokers stays fairly constant for about an hour, after which it shoots up to the maximum allocated space. The GC logs seem to indicate a memory leak in Kafka. Find below the plots generated from the GC logs.
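For anyone who wants to check the same thing on their own brokers, here is a minimal sketch of how the post-GC heap occupancy could be pulled out of a GC log. It assumes JDK 8 style G1 output produced with -XX:+PrintGCDetails, where each collection prints a summary such as "Heap: 24.0M(256.0M)->13.9M(256.0M)"; other JVM versions log a different format and would need a different pattern. The function names and the 1.5x growth threshold are illustrative choices, not part of Kafka or Druid.

```python
import re

# Assumed log format (JDK 8, G1, -XX:+PrintGCDetails), for example:
#   [Eden: ... Heap: 24.0M(256.0M)->13.9M(256.0M)]
# Captures: used before, used after, and capacity after the collection.
HEAP_RE = re.compile(
    r"Heap:\s*([\d.]+)([BKMG])\([\d.]+[BKMG]\)"
    r"->([\d.]+)([BKMG])\(([\d.]+)([BKMG])\)"
)

# Conversion factors from the unit suffixes used in the log to megabytes.
UNIT_TO_MB = {"B": 1 / (1024 * 1024), "K": 1 / 1024, "M": 1.0, "G": 1024.0}


def to_mb(value, unit):
    """Convert a size such as ('13.9', 'M') to a float number of megabytes."""
    return float(value) * UNIT_TO_MB[unit]


def heap_after_gc(lines):
    """Yield (used_after_mb, capacity_mb) for every heap summary line found."""
    for line in lines:
        match = HEAP_RE.search(line)
        if match:
            used_after = to_mb(match.group(3), match.group(4))
            capacity = to_mb(match.group(5), match.group(6))
            yield used_after, capacity


def floor_is_rising(samples, window=3, factor=1.5):
    """Crude heuristic: is the mean post-GC occupancy of the last `window`
    samples more than `factor` times that of the first `window` samples?"""
    used = [u for u, _ in samples]
    if len(used) < 2 * window:
        return False  # not enough collections to compare
    first = sum(used[:window]) / window
    last = sum(used[-window:]) / window
    return last > first * factor
```

Passing the lines of a broker's GC log to heap_after_gc and the resulting list to floor_is_rising gives a rough yes/no on whether the occupancy left after each collection keeps climbing. A steadily rising floor is what a leak typically looks like, whereas a sawtooth that returns to the same baseline usually is not one.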

Kafka Deployment:

5 nodes, with 3 topics and 64 partitions per topic

Kafka Runtime JVM parameters:

8GB Heap Memory

1GB swap memory

Using G1GC

MaxGCPauseMillis=20
InitiatingHeapOccupancyPercent=35

Kafka Versions Used:

I’ve tried Kafka versions 0.10.0, 0.11.0.2 and 1.0.0 and see similar behavior on all of them.

Questions:

  1. Is this a memory leak on the Kafka side or a misconfiguration of my Kafka cluster? Can Kafka stably handle a large number of consumers being added periodically?

  2. As a knock-on effect, we also notice Kafka partitions going offline periodically after some time, with the following error:

ERROR [ReplicaFetcherThread-18-2], Error for partition [topic1,2] to broker 2:org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server does not host this topic-partition. (kafka.server.ReplicaFetcherThread)

Can someone shed some light on the behavior I’m seeing in my cluster? I ran experiments with only the producer processes (no Druid), and those show pretty healthy GC activity on Kafka. This is more of a Kafka-related question than a Druid one, but I was wondering if anyone else using Druid’s Kafka indexing service has seen similar behavior and solved the issue.

Please let me know if more details are needed to root-cause this behavior.

Thanks in advance.

Avinash

Hi Avinash,

I haven’t seen anything similar. It could be a good question for the Kafka lists. As far as I know, Druid isn’t doing anything too crazy. We aren’t even using Kafka’s built-in consumer offset tracking (Druid tracks its own offsets).