We are facing an issue with Kafka ingestion: the supervisor reports a healthy state but is not ingesting any data, and the aggregateLag just keeps piling up. No exceptions are thrown.
We have tried a hard reset and resubmitting the supervisor, but neither worked.
Is there any particular error we should look for in the logs?
Make sure the topic has enough partitions for better throughput. Kafka broker memory might also be an issue; try increasing the heap. Change this line:
export KAFKA_HEAP_OPTS="-Xmx1G -Xms1G"
to something like:
export KAFKA_HEAP_OPTS="-Xmx32G -Xms1G"
Is the supervisor spawning subtasks? If it is, those subtask logs are where I would start.
If they’re not being spawned, do you have enough worker capacity?
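To check both of those, you can query the Overlord APIs directly. A sketch, assuming a live cluster; adjust the host and port (8090 is the standalone Overlord default) and replace `my_datasource` with your supervisor id:

```shell
# List running ingestion tasks -- Kafka indexing subtasks should appear here
curl -s http://OVERLORD_HOST:8090/druid/indexer/v1/runningTasks

# Check worker capacity across middle managers (capacityUsed vs. capacity)
curl -s http://OVERLORD_HOST:8090/druid/indexer/v1/workers

# Detailed supervisor status, including per-partition offsets and lag
curl -s http://OVERLORD_HOST:8090/druid/indexer/v1/supervisor/my_datasource/status
```

If runningTasks shows no Kafka tasks while workers are at full capacity, the supervisor is likely waiting on free task slots rather than failing.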
This issue generally happens when there is a mismatch between the Kafka offsets Druid has recorded and what is actually in Kafka. A hard reset should work.
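One way to make that mismatch visible is to compare Druid's recorded offsets against the latest Kafka offsets per partition from the supervisor status. A minimal sketch, assuming a status payload containing `latestOffsets` and `currentOffsets` maps (exact field names and nesting vary by Druid version; the JSON is inlined here rather than fetched from the status endpoint):

```python
# Sketch: spot partitions where Druid's recorded offset is ahead of Kafka's
# latest offset (negative lag) -- the kind of mismatch a hard reset clears.
# In practice this payload would come from /druid/indexer/v1/supervisor/<id>/status;
# sample data is inlined for illustration.
status_payload = {
    "latestOffsets": {"0": 1500, "1": 2000, "2": 900},
    "currentOffsets": {"0": 1400, "1": 2000, "2": 950},
}

def partition_lag(payload):
    """Return {partition: lag}; negative lag means Druid is ahead of Kafka."""
    latest = payload["latestOffsets"]
    current = payload["currentOffsets"]
    return {p: latest[p] - current[p] for p in latest}

lags = partition_lag(status_payload)
aggregate_lag = sum(lags.values())
mismatched = [p for p, lag in lags.items() if lag < 0]

print(lags)           # {'0': 100, '1': 0, '2': -50}
print(aggregate_lag)  # 50
print(mismatched)     # ['2'] -- partition 2 has an offset mismatch
```

A steadily growing aggregate lag with no negative partitions points at stalled tasks or capacity instead, which is why checking the subtask logs first is worthwhile.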