InterruptedException during Kafka Ingestion

Hi,

I’ve just deployed a Druid cluster on AWS using Docker Swarm, alongside a Kafka cluster with 5 nodes. I’ve created an ingestion spec (attached) that reads a topic from Kafka and indexes the data into Druid. Using the Druid web console, I can see that Druid reaches the Kafka brokers and reads event data from the topic. However, when I submit the spec, the peons start to fail with the error below (full peon log is attached):

2020-03-12T21:50:01,724 INFO [Thread-70] org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner - Starting graceful shutdown of task[index_kafka_alper-test_0026dab62793c60_biolnedj].
2020-03-12T21:50:01,724 INFO [Thread-70] org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner - Stopping forcefully (status: [READING])
2020-03-12T21:50:01,725 ERROR [task-runner-0-priority-0] org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner - Encountered exception in run() before persisting.
org.apache.kafka.common.errors.InterruptException: java.lang.InterruptedException
	at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.maybeThrowInterruptException(ConsumerNetworkClient.java:493) ~[?:?]
	at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:281) ~[?:?]
	at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:236) ~[?:?]
	at org.apache.kafka.clients.consumer.KafkaConsumer.pollForFetches(KafkaConsumer.java:1256) ~[?:?]
	at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1200) ~[?:?]
	at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1176) ~[?:?]
	at org.apache.druid.indexing.kafka.KafkaRecordSupplier.poll(KafkaRecordSupplier.java:124) ~[?:?]
	at org.apache.druid.indexing.kafka.IncrementalPublishingKafkaIndexTaskRunner.getRecords(IncrementalPublishingKafkaIndexTaskRunner.java:111) ~[?:?]
	at org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner.runInternal(SeekableStreamIndexTaskRunner.java:633) [druid-indexing-service-0.17.0.jar:0.17.0]
	at org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner.run(SeekableStreamIndexTaskRunner.java:278) [druid-indexing-service-0.17.0.jar:0.17.0]
	at org.apache.druid.indexing.seekablestream.SeekableStreamIndexTask.run(SeekableStreamIndexTask.java:164) [druid-indexing-service-0.17.0.jar:0.17.0]
	at org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:419) [druid-indexing-service-0.17.0.jar:0.17.0]
	at org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:391) [druid-indexing-service-0.17.0.jar:0.17.0]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_232]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_232]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_232]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_232]
Caused by: java.lang.InterruptedException
	... 17 more

Note that I’ve removed the transform part from the ingestion spec to check whether it was causing my events to be dropped, but it doesn’t seem to be. I enabled DEBUG logs to see whether there was an issue I was missing at the current log level, but I couldn’t find anything useful. I can verify that data is being streamed into the topic, and since I can see events in Druid’s UI (while designing the spec), I don’t think this is a connectivity issue with the brokers. I also tried reading from the latest offset, but I still get the same error. I’ve also tested with P3D for both lateMessageRejectionPeriod and earlyMessageRejectionPeriod, but that didn’t work either.
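For reference, the rejection-period part of my supervisor spec’s ioConfig looked roughly like this (other fields omitted; the topic name here is illustrative):

```json
{
  "type": "kafka",
  "ioConfig": {
    "topic": "alper-test",
    "useEarliestOffset": false,
    "lateMessageRejectionPeriod": "P3D",
    "earlyMessageRejectionPeriod": "P3D"
  }
}
```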

Could this be because of a poll timeout? What am I missing?

events_clientSessionStart.json (4.21 KB)

task-log-index_kafka_alper-test_de5ab1312aa10e1_iagoecih.log (3.71 MB)

Hi Alper kanat,

It seems the Overlord issued a kill signal to this task. You can grep for the task id in the Overlord log for more details.
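For example, something along these lines. Here I create a sample log line just to illustrate the idea; in your cluster, point grep at the actual Overlord log file, whose path depends on your deployment:

```shell
# Task id taken from the peon log in the original post.
TASK_ID='index_kafka_alper-test_0026dab62793c60_biolnedj'
LOG=./overlord-sample.log   # replace with your real Overlord log path

# Sample line standing in for a real Overlord log entry (for illustration only).
echo "2020-03-12T21:50:01,700 INFO ... TaskQueue - Asking taskRunner to clean up ${TASK_ID}" > "$LOG"

# Lines explaining why a task was interrupted typically mention shutdown/kill/clean up.
grep -iE 'shutdown|kill|clean up' "$LOG" | grep "$TASK_ID"
```

This should surface the Overlord-side reason (e.g. a shutdown request or task timeout) that caused the peon’s consumer thread to be interrupted.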

Please also read through the following documentation, which will help you tune the cluster/ingestion further once you know the exact issue:

https://druid.apache.org/docs/latest/development/extensions-core/kafka-ingestion.html

Thanks and Regards,

Vaibhav